Machine Learning Approaches for Investigating Breast Cancer

Sumit Das; Subhodip Koley; Tanusree Saha

Volume 20, number 4

Views: (Visited 220 times, 1 visits today) PDF Downloads: 163

Das S, Koley S, Saha T. Machine Learning Approaches for Investigating Breast Cancer. Biotech Res Asia 2023;20(4).

Manuscript received on : 05-05-2023
Manuscript accepted on : 18-11-2023
Published online on: 04-12-2023

Plagiarism Check: Yes

Reviewed by: Dr Geetha Lakshmi

Second Review by: Dr. Syamdas Bandyopadhyay

Final Approval by: Dr. Eugene A. Silow

How to Cite | Publication History | PlumX Article Matrix

Machine Learning Approaches for Investigating Breast Cancer

Sumit Das^*, Subhodip Koley and Tanusree Saha

JIS College of Engineering, Kalyani, 741235, India.

Corresponding Author E-mail: sumit.das@jiscollege.ac.in

DOI : http://dx.doi.org/10.13005/bbra/3163

ABSTRACT: This study aims to predict whether the case is malignant or benign and concentrate on the anticipated diagnosis; if the case is malignant, it is advised to admit the patient to the hospital for treatment. The primary goal of this work is to put together models in two distinct datasets to predict breast cancer more accurately, faster, and with fewer errors than before. Then contrast the techniques that produced datasets with the highest accuracy. In this study, the datasets were processed using Support Vector Machine, Logistic Regression, Decision Tree, K-Nearest Neighbours, Artificial Neural Network, Nave Bayes, Stochastic Gradient Descent (SGD),Gradient boosting classifiers(GBC), Stochastic Gradient Boosting (SGB), Extreme Gradient Boosting (XGBoost),and Random Forest. Two datasets—the Wisconsin Diagnostic Breast Cancer dataset and the Breast Cancer dataset—are used to test these methods. to evaluate the findings and choose the algorithm that is more adept in predicting breast cancer. Seven algorithms that operate on both datasets in the AI platform were used to build the article. Breast cancer prediction has gotten much harder because so many people die from the disease in its early stages. Consequently, using two real-time datasets, one for Wisconsin diagnosis and the other for research on breast cancer. The same methods are applied to both datasets, and it is found that SVM provides the best accuracy in the shortest time and with the lowest error rate.

KEYWORDS: Artificial Neural Network; Brest Cancer; Logistic Regression; Machine Learning; Support Vector Machine

Download this article as:

Copy the following to cite this article:

Das S, Koley S, Saha T. Machine Learning Approaches for Investigating Breast Cancer. Biotech Res Asia 2023;20(4).

Copy the following to cite this URL:

Das S, Koley S, Saha T. Machine Learning Approaches for Investigating Breast Cancer. Biotech Res Asia 2023;20(4). Available from: https://bit.ly/3THv4qN

Introduction

Breast cancer is a type of cancer that develops in the breast cells. Breast cancer affects between 5% and 10% of females due to genetics and family history. A considerable percentage of girls also smoke and consume alcohol. Females who use hormone replacement therapy may be at a higher risk of developing breast cancer. Females who have previously received radiation therapy, particularly to the neck, head, and chest, may be at increased risk of developing breast cancer. Breast cancer cells are essentially a tumour that can be visualised using an x-ray or realised as a wafer. Metastatic breast cancer occurs when breast cancer spreads to the liver, lungs, or brain. Breast cancer cells are limited to increase the number of healthy cells. The purpose of this job is to predict whether the case is malignant or benign, and to concentrate on the expected diagnosis; if malignant, then advise admission to a hospital for treatment.

Breast cancer is a serious condition that develops when breast cancer cells multiply out of control. Breast cancer is brought on by malignant cells. The connective tissue, ducts, and lobules make up the three regions of the breast. The lobule is a gland that makes milk. Milk travels through the duct from the nipples to the lobules. The fibrous and fatty tissue that the connective tissue forms surrounds and binds everything together. Most breast cancers start in the lobules or ducts. Invasive ductal carcinoma and invasive lobular carcinoma are the two most prevalent types of breast cancer. 70 to 80 percent of women will develop invasive ductal carcinoma. Other breast cancer types include Paget’s disease, medullary, mucinous, and inflammatory breast cancer. Several symptoms, such as altered breast size and shape, numerous dimpling on the breast, the appearance of a newly inverted nipple, and alterations in skin tone, such as the look of orange, are displayed by women who have been diagnosed with breast cancer.The stages of breast cancer are frequently categorised into Stages 0 through Stage IV, with each stage having a distinct severity level and spectrum of treatment options. The most used approach for staging breast cancer is the TNM method, which stands for tumour, lymph nodes, and metastasis. The relative risk of breast cancer was found to increase with increasing intake of alcohol, both in never-smokers and in ever-smokers¹.

The primary purpose of this work is to develop models in two distinct datasets to predict breast cancer with greater accuracy, in less time, and with fewer errors. Then, compare the approaches that produced the most accurate results in datasets. Support Vector Machine, Logistic Regression, Decision Tree, K-Nearest Neighbours, Artificial Neural Network, Nave Bayes, SGD, GBC, SGB, XGBoost, and Random Forest were utilised in datasets in this study. These methods are tested on two datasets: Wisconsin Diagnostic breast cancer dataset² and Breast cancer dataset³. To examine the results and determine whether algorithm is better at predicting this cancer or not. The article is produced with eleven algorithms that run in the AI platform on both datasets.

Various stages of brest cancer are mention as follows:

Stage 0: This stage is frequently referred to as cancer in situ. At this stage, abnormal cells are present in the breast duct lining but have not yet spread to the surrounding tissues.

Stage I: The tumour has not yet progressed to the body’s lymph nodes or other organs and is quite small (less than 2 cm in diameter).

Stage II is divided into the following two categories:

Phase IIA: If the tumour measures less than 2 centimetres and hasn’t spread to any lymph nodes, or if it measures between 2 and 5 centimetres and has spread to one to three nearby lymph nodes.

Stage IIB: The tumour is larger than 5 centimetres and has spread to one to three nearby lymph nodes, or it is between 2 and 5 centimetres and has not migrated to any lymph nodes.

Stage III: This stage has two subcategories as well:

Stage IIIA: The tumour has grown to a diameter of more than 5 cm and has spread to one to three nearby lymph nodes or lymph nodes around the breastbone.

Stage IIIB: The disease has progressed to the skin, chest wall, or lymph nodes above or below the collarbone.

Stage IV: In this stage, it has progressed to other organs. The cancer has now spread to several organs, including the lungs, liver, bones, or brain.

About 30% of women are affected by breast cancer every year. The WHO estimates that 2.3 million women worldwide are affected by breast cancer, and 685,000 will pass away from the disease by 2020. One of the most serious and fatal diseases in the world is breast cancer. Consequently, a model that forecasts this cancer for that has been eliminated using machine learning and AI techniques. The main objective is to compare the algorithms that provided the highest levels of accuracy across both datasets and to construct models using two separate datasets to predict this cancer with a higher degree of accuracy, with less time and error.

The creation of a prediction model and the used algorithms are explained in the methodology section. This work uses feature selection strategies such as the correlation ranking base approach and mutual information method to build more accurate models. The Wisconsin Diagnostic dataset², which has 31 columns and an output column named “diagnostic,” lists benign and malignant illnesses. Generally speaking, benign cells grow slowly and do not spread, but malignant cells grow quickly and spread throughout the body by attacking and obliterating nearby healthy cells. Both malignant and benign tumours are described in the output column of the other breast cancer dataset³, class. There are 11 columns in it.

This cancer prediction model was developed utilising a total of eleven algorithms, including Random Forest, Logistic Regression, SVM, KNN, ANN, Decision Tree, SGD, GBC, SGB, XGBoost, and Naive Bayes, using the Wisconsin Diagnostic dataset². On additional breast cancer datasets, ten algorithms—including Random Forest, Logistic Regression, SVM, KNN, Decision Tree, SGD, GBC, SGB, XGBoost,and Naive Bayes—were applied³. Locate the algorithms that provide the best or most accurate results for the two datasets. The final component is the outcome, which describes the accuracy of each algorithm. The Wisconsin Diagnostic dataset²was subjected to the Random Forest, Logistic Regression, Naive Bayes, KNN, Decision Tree, SVM, SGD, GBC, SGB, XGBoost,and ANN algorithms; the resulting accuracy scores were 96.49%, 95.61%, 93.85%, 96.49%, 93.85%, 98.24%,96.49%, 97.36%, 96.49%, 96.49%, and 95.61% respectively. The accuracy of the algorithms Random Forest, Logistic Regression, Naive Bayes, KNN, Decision Tree, SGD, GBC, SGB, XGBoost,and SVM was determined using a different dataset³. 97.08%, 95.62%, 94.16%, 94.89%, 95.62%,94.89%, 95.62%, 95.62%, 97.08%, and 95.62% were the outcomes respectively. Each algorithm’s accuracy is described in the results section using a table including data from both datasets.

In this emerging article used a multitask learning architecture to determine the histological grade and ki-67 proliferation status in order to predict this cancers. The dataset comprises of 203 biopsy samples that were collected from the affiliated hospital of Zhejiang Chinese Medical University. Among the techniques used in this include SVM, logistic regression, and MTC. This study aims to improve tumour radiomic analysis’s ability to forecast this cancer. Different radiomics from the MRI are combined for better prediction⁴. Several factors, such as ER (Oestrogen Receptor), PGR (Progesterone Receptor), and HER2 (Human Epidermal Growth-Factor Receptor 2), affect the diagnosis of that cancer. Consequently, DNA methylation, gene expression, and miRNA (Multimodel Autoencoders) were used to create MAE. The methods decision tree, SVM, KNN, naive bayes, gradient boosting tree, random forest, and logistic regression were used to create the model. However, the ER platform’s accuracy result was the highest, followed by the PGR platform’s accuracy values of 91% and 86%⁵.

Using deep learning, machine learning, and data mining approaches, the primary objective of this work is to accurately forecast the massive dataset^6–10. A multi-layer perceptron, KNN, SVM, a classification and regression tree, and gaussian naive bayes were used to create the model. MLP has a 96.70% accuracy rate, SVM has a 97.59% accuracy rate, Naive Bayes has a 92.6% accuracy rate, Classification And Regression Tree (CART) has a 92.9% accuracy rate, and KNN has a 93.6% accuracy rate¹¹.A number of ideas are developed and assessed in this study to demonstrate how well machine learning models can forecast the spread of this cancer. The entreaty has led to improved categorization models for more reliable and transparent model interpretations, which has also inspired interest in biology. We employed a number of feature types, including LR, NN, ISVM, rSVM, and RF¹².In order to find genes linked to breast cancer, this study introduces CapsNetMND, a deep learning technique that models multi-omic data based on the capsule network. The feature matrix genes, which incorporate CANs, DNA methylation, and mRNA expression as well as a z-score for mRNA expression, were constructed using the TCGA dataset. In this instance, the techniques XGBoost, SVM, KNN, NN, and Adaboost are used¹³.

In order to improve prediction, this study will investigate the breast cancer GE dataset utilising three classification algorithms. Analysing two more types—DM and a composite dataset made up of GE and DM—was the strategy employed in this article. Techniques like decision trees, SVM, and random forests were used in this study. The highest level of accuracy available with SVM is 99.68%¹⁴.The advanced hybrid model used in this study examines the use of thresholding, gaussian mixture, k-means and GMM in combination, gaussian mixture, SVM techniques, and the Growth region FCM-GA selection process. The Gaussian mixture technique has the highest accuracy (93.80%) while the FCM-GA selection strategy has the highest error rate (50%) in this model. The combinations of K-means and GMM (95.5%), gaussian mixture (93.8%), thresholding (86%), and SVM (56.33%) are other methods that deliver accuracy in a variety of ways¹⁵.Women suffer significant suffering as a result of breast cancer and mortality. To make this cancer prediction model that is as accurate and reliable as possible with the least amount of error. constructed the model utilising the KNN, random forest, SVM, and logistic regression techniques. Put precision first by predicting cancer before diagnosis, then this cancer diagnosis, and finally treatment. They compile a dataset about this cancer and do data mining to remove any unnecessary columns. then utilise the wrapper technique to select features. Divide this dataset into two sections: training data (80%) and tasting data (20%). To obtain the best accuracy number, combine LR, SVM, KNN, and RFC; SVM then provides 97% accuracy in 0.07 seconds¹⁶.

Women are impacted by breast cancer every year. Consequently, develop a model that classifies patients into benign or malignant groupings using ML and AI techniques. Finding this cancer as fast and safely as feasible is the goal. Combine SVM, decision trees, logistic regression, KNN, and naive Bayes approaches to create a model. Check the highest accuracy values after the model has been constructed. 75 percent of the data were used for training, and 25 percent for testing. The random forest classifier has a 96.5% accuracy rate.¹⁷.Women are impacted by this cancer every year. So, develop a model that divides patients into categories that are malignant or benign using ML and AI. The goal of this project is to create a model that can more quickly and accurately diagnose this cancer. Create a model to determine if breast cancer is malignant or benign by using decision trees, naive bayes, logistic regression, KNN, and SVM techniques. made a prediction using Wisconsin’s breast cancer diagnostic data. Seventy-five percent of the dataset is used for training, while 25 percent is used for testing. 97.2%, 96.5%, 93.7%, 95.8%, and 95.1% of the results are provided by SVM, Random Forest, KNN, Logistic Regression, and Decision Tree, respectively¹⁸.

According to a study, 50% of breast tumours are not discovered when they are first developing. Utilising AI and machine learning, develop a model that can forecast breast cancer. Demographic, mammographic, and lab risk factors are all of this cancer risk factors. As a result, a model for predicting this cancer is developed using machine learning and AI. Gradient boosting tree, genetic algorithm, random forest, and multi-layer perceptron were used to build this model. The goal is to forecast of this cancer using a variety of machine learning techniques, with very accurate results. Gradient boosting, random forest, multi-layer perceptrons, and gradient boosting all have accuracy rates of 80%, 74%, 73%, and 86%, respectively. However, the random forest model provides the most sensitivity and has a 95% accuracy rate¹⁹. Machine learning and AI technology are highly valued in the medical sector since they can predict and detect any type of cancer. The 1580 datasets were divided into four groups for this project: 50, 100, 150, and 200 sequences. The prediction of breast cancer is a three-step process that involves feature selection, machine learning algorithms, and performance evaluation. The linear discriminant analysis model, logistic regression, decision tree, KNN, SVM, naive bayes, AdaBoost, gradient boosting, and random forest were only a few of the nine supervised machine learning methods used in this work. All supervised machine learning techniques save one use decision trees, and its accuracy is 94.03%²⁰.

This paper proposes a comparison of various machine learning techniques^15–19, including data mining, ensemble method, blood analysis, etc., using six different machine learning techniques on the Wisconsin diagnostic breast cancer dataset, including ANN, SVM, KNN, decision tree, random forest, and naive bayes. The dataset was divided into a training component and a testing component in order to employ machine learning techniques. As a result, overall accuracy is 97.47%, whereas PCi-ANN accuracy is 99.63%²⁴.According to estimates, there were 246660 new cases of this cancer in the US in 2016 and 40450 deaths among women. Utilising the Wisconsin diagnostic dataset and a variety of machine learning methods, including decision trees, KNN, SNM, and naive bayes, create a model. In the dataset, the system predicts both malignant and benign this cancer. The main objective is to create the best accurate model in the quickest time. Currently, SVM delivers 97.13% accuracy with the lowest error rate of 0.02%, whereas KNN and naive bayes offer 95.28%, 95.12%, 0.06, and 0.03 error rates, respectively²⁵.

Methodologies

Figure 1 depicts the process for creating a breast cancer prediction model. Seven methods are used in total to build the model and choose the best accuracy. If algorithms fail, feature selection will be used once again. Following the use of the spit dataset, K-Fold cross-validation, a subset of cross-validation on KNN, SVM, decision trees, Navia Bayes, random forests, and logistic regression is used. After that, use hyperparameter adjustment to determine the accuracy that works best.

Figure 1: Before apply K-Fold Technique

share this article

Follow us on:

Search Website

Member of

Journal archived in

Visitor’s Insight