A CNN to Detect Melanomas in Skin Lesions

Introduction

According to the Skin Cancer Foundation, skin cancer is the most common cancer worldwide, affecting one in five Americans, and melanomas cause up to 75% of skin-cancer-related deaths. Melanoma arises from melanocytes, the cells that produce the pigment responsible for skin color. Clusters of melanocytes, known as nevi and seen as moles or pigmented lesions, can grow at an abnormal rate or undergo a transformation, resulting in melanoma. Detecting melanoma at an early stage is paramount to treatment and can significantly increase the survival rate. A dermatologist visually examines the region with the birthmark or mole, often with technological support to capture dermatoscopic images. Detection is typically reinforced through the ABCDE technique for assessing a mole: asymmetry, border irregularity, color variation, diameter, and evolving lesion. If the melanoma is diagnosed early, surgery is the preferred treatment; larger lesions require a lymph node biopsy and immunotherapy. Even so, early diagnosis through biopsies, histological examinations, and pathological interpretations remains time-consuming and expensive.

Deep neural networks, now widely used in medical imaging and diagnostics, hold great promise for healthcare. At the core of any machine learning model is access to a large amount of data, such as population demographics or medical images. Deep learning is a prominent part of computer vision: convolutional neural networks (CNNs), a type of deep learning model, have surpassed human performance on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). CNNs have been developed for neurodevelopment and survival-with-disease prediction, aiding surgeons and physicians in their decisions (Kawahara et al., 2016; Nie et al., 2016). With an influx of patient data, healthcare applications employ artificial intelligence to analyze CT scans, predict cardiac arrests, diagnose diseases, and consolidate electronic health records.

Convolutional neural networks have previously been used to classify dermoscopic images of skin lesions as benign or malignant (Esteva et al., 2017; Brinker et al., 2019), and such models have the potential to offer accurate diagnoses efficiently. Prior work on CNNs for melanoma detection has classified clinical images into binary categories: malignant carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi. Esteva et al. (2017) trained a CNN on 129,450 clinical images and tested it against 21 board-certified dermatologists on these binary categories; its performance was comparable to that of the dermatologists. Brinker et al. (2019) reported a similar result: a CNN trained to distinguish nevi from melanomas classified the images better than the dermatologists did. Such networks have performed on par with expert dermatologists, and better in some scenarios.

In this work, we developed a CNN to detect the type of cancer in dermoscopic images of skin lesions. To classify the data accurately into the various diagnostic categories, several techniques were required to enhance the model, including data augmentation, pre-training, and early stopping. The best model classified these categories with an accuracy of 82%. Future work is required to further improve the model's performance and validate it in prospective settings.

Methods

Data

Images of pigmented skin lesions were retrieved from the HAM10000 dataset, part of the International Skin Imaging Collaboration (ISIC). These 10,015 dermoscopic images were collected over 20 years at sites in Austria and Australia. The Austrian segment includes patients with a large number of nevi (birthmarks or moles) and a hereditary tie to melanoma, while the Australian images come from patients in an area with a high incidence of skin cancer. A clinician distinguishes between malignant and benign lesions, as well as between the diagnostic categories. The dataset includes: actinic keratoses and Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (vasc). For more than 50% of the lesions, the ground truth was confirmed through histopathology; the rest were confirmed through follow-up examinations, expert consensus, or confocal microscopy. The dataset was split into a training set (~64%) to learn model parameters, a validation set (~16%) to tune hyperparameters, and a test set (~20%) to evaluate the best model, with a sketch of a lesion-level split shown below. Statistics on each split are shown in Table I, example metadata rows are shown in Table II, and example images are shown in Figure I.
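
The splitting code itself is not published here; the following is a minimal sketch of how a lesion-level split with these proportions could be reproduced, assuming the public HAM10000 metadata CSV with its lesion_id column. Grouping by lesion keeps all images of one lesion in the same split, avoiding leakage between sets.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Load the HAM10000 metadata; column names follow the public CSV release.
meta = pd.read_csv("HAM10000_metadata.csv")

# Hold out ~20% of lesions as the test set, grouping by lesion_id so that
# all images of a given lesion land in the same split (no leakage).
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
trainval_idx, test_idx = next(gss.split(meta, groups=meta["lesion_id"]))
trainval, test = meta.iloc[trainval_idx], meta.iloc[test_idx]

# Split the remaining ~80% into training (~64%) and validation (~16%).
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, val_idx = next(gss2.split(trainval, groups=trainval["lesion_id"]))
train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]
```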

Table I: HAM10000 Dataset

The number of images and lesions in the training, validation, and test sets derived from the HAM10000 dataset, as well as the age and sex of the patients from whom the images were collected.

                     Training        Validation      Test
Number of images     6,444           1,571           2,000
Number of lesions    4,780           1,196           1,494
Age (mean ± SD)      51.8 (± 16.1)   51.9 (± 15.4)   50.9 (± 16.4)
Sex (% female)       45%             47%             46%

Table II: HAM10000 Metadata Snippet

Example rows and columns of the HAM10000 dataset's CSV file, showing the lesion ID, image ID, type of diagnosis, sex, and location of the lesion.

Figure I: Images

Examples of images from the HAM10000 dataset.

Data Preprocessing

We used Google Colab for the initial parts of our experiment. The HAM10000 dataset was loaded into a Google Drive directory and linked to the Colab notebook. The images in each category were transformed into PyTorch tensors normalized to the range -1 to 1, which has been shown to improve neural network optimization. Images were resized to 224 by 224 pixels to match the input size expected by the network. Furthermore, data augmentation was applied so that the network sees natural perturbations of the images, effectively increasing the size of the dataset and reducing overfitting. In particular, images were randomly flipped horizontally, randomly flipped vertically, or randomly rotated before being input to the model; a sketch of this pipeline is shown below.
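
The following is a minimal sketch of this preprocessing pipeline using torchvision transforms. The rotation angle is not stated in the text and is an assumption here.

```python
import torchvision.transforms as T

# Training transforms: resize, random flips and rotation (augmentation),
# then convert to a tensor and normalize each channel into [-1, 1].
train_transform = T.Compose([
    T.Resize((224, 224)),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomRotation(degrees=20),      # rotation angle is an assumption
    T.ToTensor(),                      # scales pixel values to [0, 1]
    T.Normalize(mean=[0.5, 0.5, 0.5],  # (x - 0.5) / 0.5 maps to [-1, 1]
                std=[0.5, 0.5, 0.5]),
])

# Validation/test transforms: same resize and scaling, no augmentation.
eval_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```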

CNN Architectures and Loss Function

Modern computer vision is built on deep neural networks, and several pre-built convolutional architectures have demonstrated good performance on a variety of tasks. The CNN architecture used in this study was DenseNet-121 (Huang et al., 2016). Densely Connected Convolutional Networks, or DenseNets, increase the depth of CNNs by adding more layers, where each layer within a block is directly connected to every subsequent layer; this dense connectivity leads to better regularization. Compared to ResNet, another common CNN architecture, DenseNets require fewer parameters while integrating many more contributing layers. The model was defined through PyTorch packages and run on a GPU. A key component of any model is the loss function, which represents the error of the predictions; parameters are learned to minimize the loss using gradient descent, and predictions that deviate far from the ground truth produce a large loss. We used a cross-entropy loss function and the Adam optimizer with a learning rate of 0.001. A sketch of this setup is shown below.
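
This is a minimal sketch of the model setup, assuming the standard torchvision DenseNet-121 with its final classifier replaced to output the seven HAM10000 classes (the text does not describe the classifier head explicitly, so that replacement is an assumption).

```python
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DenseNet-121, initialized with ImageNet weights when pre-training is used.
model = models.densenet121(pretrained=True)

# Replace the final classifier to output the seven HAM10000 classes.
model.classifier = nn.Linear(model.classifier.in_features, 7)
model = model.to(device)

# Cross-entropy loss and the Adam optimizer with the stated learning rate.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```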

Training Procedure

Training the network is the core of the algorithm: the model runs through many iterations in search of the best parameters to map inputs to outputs. Training data is passed through the model defined above to obtain predictions; these predictions are compared to the expected labels, the loss is computed, and the parameters are updated using gradient descent. After each epoch over the entire training data, the network is run on the validation set to compute accuracy on held-out data. Whenever the validation accuracy reaches a new best, the network's weights are saved. This method, known as early stopping, makes the model less likely to overfit the training data. The model was trained for 10 epochs using a batch size of 2000; a sketch of the loop is shown below.
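
The following sketch shows this training loop with early stopping; train_loader and val_loader are assumed to be PyTorch DataLoaders built from the splits and transforms described above.

```python
import copy
import torch

best_acc, best_state = 0.0, None

for epoch in range(10):
    # One pass over the training data.
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()   # compute gradients of the loss
        optimizer.step()  # update parameters via gradient descent

    # Compute accuracy on the held-out validation set.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    val_acc = correct / total

    # Early stopping: keep the weights with the best validation accuracy.
    if val_acc > best_acc:
        best_acc, best_state = val_acc, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)
```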

Experiments

The model was run with and without several features, namely class weights, data augmentation, and pre-training, to find the most accurate configuration. Data augmentation reduces overfitting and should yield higher accuracy. Class weights modify the loss so that errors on rare classes count more when predicting class labels; this can make the model more useful in practice, but can also hinder overall accuracy. Pre-training strengthens the image classification network because feature extraction has already been learned on a large dataset, so the model can learn the new task more efficiently. Several trials were conducted by removing each of these features and assessing the final accuracy of the model; a sketch of the weighted loss is shown below.
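
The text does not specify how the class weights were computed; a common choice is inverse-frequency ("balanced") weights passed to the cross-entropy loss, sketched below with a hypothetical train_labels tensor of integer class indices.

```python
import torch
import torch.nn as nn

# Inverse-frequency ("balanced") class weights: rarer diagnoses get larger
# weights, so errors on them contribute more to the loss. `train_labels`
# is a hypothetical 1-D tensor of integer class indices (0-6) for the
# training images.
counts = torch.bincount(train_labels, minlength=7).float()
class_weights = counts.sum() / (len(counts) * counts)

weighted_criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
```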

Results

A series of experiments was run to identify the most accurate model. As each feature (class weights, data augmentation, pre-training) was removed from the algorithm in a subsequent experiment, the accuracy varied. The evaluation metrics were the accuracy of the model and the F1 score, which combines precision and recall. Model A included data augmentation and pre-training but no class weights; it produced the highest accuracy, 0.82, and F1 score, 0.67. The remaining experiments each added or removed one feature, with Model A serving as the baseline. As shown, the model with only augmentation and pre-training achieved the best accuracy. The full set of results on the validation set is shown in Table III.

The model was then run on the test set to evaluate whether it could achieve similar accuracy on a new set of images. As shown in Table V, Model A again generated the highest accuracy, 0.81, and F1 score, 0.66, similar to its result on the validation set. Figure II depicts the confusion matrix of Model A: the number of true positives was 991 for the “nv” diagnosis, 87 for “mel,” 73 for “bkl,” and 141 for “akiec, bcc, df, and vasc” combined. In other words, the model's predictions matched the ground truth for these diagnoses. A sketch of how such an evaluation could be computed is shown below.
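
This is a minimal sketch of computing the classification report and confusion matrix with scikit-learn; test_loader is an assumed DataLoader over the test images.

```python
import torch
from sklearn.metrics import classification_report, confusion_matrix

# Collect predictions over the test set with the best saved model.
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        all_preds.extend(preds.tolist())
        all_labels.extend(labels.tolist())

# Per-class precision, recall, and F1, plus the confusion matrix.
print(classification_report(all_labels, all_preds))
print(confusion_matrix(all_labels, all_preds))
```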

Table III: Results of the experiments on the validation set

Models run with and without class weights, data augmentation, and pre-training, with their respective F1 scores and accuracies. These experiments were run on the validation set.

Model   Experiment             Weights   Data Augmentation   Pre-training   F1     Accuracy
A       Total                  False     True                True           0.67   0.82
B       With weights           True      True                True           0.65   0.81
C       Without augmentation   False     False               True           0.64   0.81
D       Without pre-training   False     True                False          0.57   0.77

Table IV: Results of Model A on the validation set

The classification report for Model A on the validation set.

Figure II: Confusion Matrix of Model A

The number of true positives for each diagnosis (“nv,” “mel,” “bkl,” and “others”) is 991, 87, 73, and 141, respectively. The model's predictions can be compared against the ground-truth diagnoses.

Table V: Results of the experiments on the test set

Models run with and without class weights, data augmentation, and pre-training, with their respective F1 scores and accuracies. These experiments were run on the test set.

Model   Experiment             Weights   Data Augmentation   Pre-training   F1     Accuracy
A       Total                  False     True                True           0.66   0.81
B       With weights           True      True                True           0.64   0.81
C       Without augmentation   False     False               True           0.62   0.80
D       Without pre-training   False     True                False          0.54   0.75

Table VI: Results of Model A on the test set

The classification report for Model A on the test set.

Conclusion

To facilitate early diagnosis of melanoma in skin lesions, a convolutional neural network was developed to predict the diagnostic category of each image. A DenseNet-121 model was used for classification with a cross-entropy loss function. To improve accuracy, the training images were augmented by resizing, flipping, and rotation, and class weights were also explored. Overall, the model that incorporated data augmentation and pre-training achieved an accuracy of 82% on the validation set and 81% on the test set. For future work, the model can be run on larger datasets and experimented with under different architectures, such as ResNet, and other data augmentation methods can be implemented in pursuit of the highest attainable accuracy. As a supporting tool for doctors diagnosing pigmented skin lesions, such a convolutional neural network could catch skin cancer at an early onset and give way to applications that enhance treatment.

References

  1. “Skin Cancer Facts & Statistics.” The Skin Cancer Foundation, 4 Dec. 2020, www.skincancer.org/skin-cancer-information/skin-cancer-facts/.
  2. Berrios-Colon, Eva. “Melanoma Review: Background and Treatment.” U.S. Pharmacist, 23 Apr. 2012, www.uspharmacist.com/article/melanoma-review-background-and-treatment.
  3. “What Is Melanoma Skin Cancer?” American Cancer Society, www.cancer.org/cancer/melanoma-skin-cancer/about/what-is-melanoma.html.
  4. Phillips, Michael, et al. “Detection of Malignant Melanoma Using Artificial Intelligence: An Observational Study of Diagnostic Accuracy.” Dermatology Practical & Conceptual, 31 Dec. 2019, www.ncbi.nlm.nih.gov/pmc/articles/PMC6936633/.
  5. Tschandl, Philipp, et al. “The HAM10000 Dataset, a Large Collection of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions.” Scientific Data, Nature Publishing Group, 14 Aug. 2018, www.nature.com/articles/sdata2018161.
  6. Brinker, Titus Josef, et al. “Skin Cancer Classification Using Convolutional Neural Networks: Systematic Review.” Journal of Medical Internet Research, 17 Oct. 2018, www.ncbi.nlm.nih.gov/pmc/articles/PMC6231861/.
  7. Esteva, A., Kuprel, B., Novoa, R., et al. “Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks.” Nature 542, 115–118 (2017). https://doi.org/10.1038/nature21056
  8. Huang, Gao, et al. “Densely Connected Convolutional Networks.” arXiv, 28 Jan. 2018, arxiv.org/abs/1608.06993.