Diagnosis of Lung Cancer Using Histopathological Images
Diagnosing Lung Cancer Using Deep Learning
This notebook introduces some foundational machine learning and deep learning concepts by exploring the problem of diagnosing lung cancer.
1. Problem Definition
In our case, the problem we will be exploring is multi-class classification.
This is because we're going to be using histopathological images of a person's lung tissue to predict whether they have:
- Lung benign tissue
- Lung adenocarcinoma
- Lung squamous cell carcinoma
In a statement,
Given histopathological images of a patient, can we diagnose the type of lung cancer?
The original data comes from Academic Torrents (Cornell University).
However, you can download the dataset from Kaggle.
Original Article :
Borkowski AA, Bui MM, Thomas LB, Wilson CP, DeLand LA, Mastorides SM. Lung and Colon Cancer Histopathological Image Dataset (LC25000). arXiv:1912.12142v1 [eess.IV], 2019
The evaluation metric is something you might define at the start of a project.
Since deep learning is very experimental, you might say something like,
IF we can reach 95% accuracy at predicting whether or not a patient has lung cancer during the proof of concept, we'll pursue this project.
Lung cancer may not cause any symptoms, especially in the early stages of the disease.
Therefore, it may first be detected on X-rays, CT scans or other kinds of tests being done to check on another condition.
We will define each type of lung cancer, as well as benign lung tissue.
Our data set contains 3 types of Lung conditions:
1 Lung benign tissue ==> lung_n :
A benign lung tumor is an abnormal growth of tissue that serves no purpose and is found not to be cancerous. Benign lung tumors may grow from many different structures in the lung.
Determining whether a nodule is a benign tumor or an early stage of cancer is very important. That's because early detection and treatment of lung cancer can greatly enhance your survival.
2 Lung adenocarcinoma ==> lung_aca :
- Lung adenocarcinoma is a subtype of non-small cell lung cancer (NSCLC). Lung adenocarcinoma is categorized as such by how the cancer cells look under a microscope. Lung adenocarcinoma starts in glandular cells, which secrete substances such as mucus, and tends to develop in smaller airways, such as alveoli. Lung adenocarcinoma is usually located more along the outer edges of the lungs. Lung adenocarcinoma tends to grow more slowly than other lung cancers.
Lung adenocarcinoma accounts for 40% of all lung cancers. It is found more often in women. Younger people (aged 20-46) who have lung cancer are more likely to have lung adenocarcinoma than other lung cancers. Most lung cancers in people who have never smoked are adenocarcinomas.
Note that lung adenocarcinoma may not cause any symptoms, especially early on in its development and that the signs and symptoms are not specific to lung adenocarcinoma and may be caused by other conditions.
3 Lung squamous cell carcinoma ==> lung_scc :
Squamous cell lung cancer, or squamous cell carcinoma of the lung, is one type of non-small cell lung cancer (NSCLC). Squamous cell lung cancer is categorized as such by how the cells look under a microscope. Squamous cell lung cancer begins in the squamous cells—thin, flat cells that look like fish scales when seen under a microscope. They line the inside of the airways in the lungs. Squamous cell lung cancer is also called epidermoid carcinoma.
Squamous cell lung tumors usually occur in the central part of the lung or in one of the main airways (left or right bronchus). The tumor’s location is responsible for symptoms such as cough, trouble breathing, chest pain, and blood in the sputum. If the tumor grows to a large size, a chest X-ray or computed tomography (CT or CAT ) scan may detect a cavity in the lung. A cavity is a gas- or fluid-filled space within a tumor mass or nodule and is a classic sign of squamous cell lung cancer. Squamous cell lung cancer can spread to multiple sites, including the brain, spine and other bones, adrenal glands, and liver.
About 30% of all lung cancers are classified as squamous cell lung cancer. It is more strongly associated with smoking than any other type of non-small cell lung cancer. Other risk factors for squamous cell lung cancer include age, family history, and exposure to secondhand smoke, mineral and metal dust, asbestos, or radon.
Note that squamous cell lung cancer may not cause any symptoms, especially early on in its development and that the signs and symptoms are not specific to squamous cell lung cancer and may be caused by other conditions.
5. Preparing the tools
- pandas for data analysis.
- NumPy for numerical operations.
- Matplotlib/seaborn for plotting or data visualization.
- Scikit-Learn for machine learning modelling and evaluation.
- TensorFlow, a deep learning framework that bundles together machine learning and deep learning models and algorithms.
import numpy as np   # np is short for numpy
import pandas as pd  # pandas is so commonly used, it's shortened to pd
import matplotlib.pyplot as plt
import seaborn as sns  # seaborn gets shortened to sns
import tensorflow as tf
import os

# We want our plots to appear in the notebook
%matplotlib inline

## Models
from tensorflow.keras.models import Sequential

## Layers
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPool2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout  # for the overfitting problem

## Preprocessing
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.callbacks import EarlyStopping  # for the overfitting problem
from tensorflow.keras.utils import plot_model
from matplotlib.image import imread
import matplotlib.image as mpimg
from tensorflow.keras.preprocessing import image

## Model evaluators
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score

## Load model
from tensorflow.keras.models import load_model
6. Load Data
- This dataset contains 15,000 histopathological images with 3 classes. All images are 768 x 768 pixels in size and are in jpeg file format.
- The images were generated from an original sample of HIPAA compliant and validated sources, consisting of 750 total images of lung tissue (250 benign lung tissue, 250 lung adenocarcinomas, and 250 lung squamous cell carcinomas) and augmented to 15,000 using the Augmentor package.
- There are three classes in the dataset, each with 5,000 images, being :
- Lung benign tissue ==> lung_n
- Lung adenocarcinoma ==> lung_aca
- Lung squamous cell carcinoma ==> lung_scc
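The loader below assumes one sub-folder per class inside each split directory (train/val/test). As a sanity check before training, it can help to count the images per class; the sketch below builds a tiny throw-away tree in a temporary directory (not the real dataset, whose path and counts differ) just to illustrate the expected layout:

```python
import os
import tempfile

# Class sub-folder names as used by this dataset
classes = ['lung_n', 'lung_aca', 'lung_scc']

def count_images_per_class(split_dir):
    """Return {class_name: number_of_files} for one split directory."""
    return {c: len(os.listdir(os.path.join(split_dir, c))) for c in classes}

# Build a toy tree: <tmp>/train/<class>/img_*.jpeg (3 files per class)
root = tempfile.mkdtemp()
train_dir = os.path.join(root, 'train')
for c in classes:
    os.makedirs(os.path.join(train_dir, c))
    for i in range(3):
        open(os.path.join(train_dir, c, f'img_{i}.jpeg'), 'w').close()

counts = count_images_per_class(train_dir)
print(counts)  # {'lung_n': 3, 'lung_aca': 3, 'lung_scc': 3}
```

Run against the real `train_data_dir`, each class should report 4,000–5,000 files depending on how the 15,000 images were split.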
data_dir = 'E:\\Final Project\\lung_data_split'
os.listdir(data_dir)

train_data_dir = data_dir + '\\train\\'
val_data_dir = data_dir + '\\val\\'
test_data_dir = data_dir + '\\test\\'
os.listdir(train_data_dir)
7. Preparing the Data
7.1 Data Augmentation
- We will increase the size of the image training dataset artificially by performing some Image Augmentation techniques.
# Create Image Data Generator for the train set
image_gen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

# Create Image Data Generator for the test/validation sets
test_data_gen = ImageDataGenerator(rescale=1./255)
7.2 Loading the Images
train = image_gen.flow_from_directory(
    train_data_dir,
    target_size=(150, 150),
    color_mode='rgb',
    class_mode='categorical',
    batch_size=32
)

test = test_data_gen.flow_from_directory(
    test_data_dir,
    target_size=(150, 150),
    color_mode='rgb',
    class_mode='categorical',
    batch_size=32,
    shuffle=False  # keep order so we can later compare with predicted values without an indexing problem
)

valid = test_data_gen.flow_from_directory(
    val_data_dir,
    target_size=(150, 150),
    color_mode='rgb',
    class_mode='categorical',
    batch_size=32
)
Look at some of the train set images
lung_aca_names = os.listdir(train_data_dir + '\\lung_aca\\')
lung_aca_names[:10]

lung_n_names = os.listdir(train_data_dir + '\\lung_n\\')
lung_n_names[:10]

lung_scc_names = os.listdir(train_data_dir + '\\lung_scc\\')
lung_scc_names[:10]
# Parameters for the graph; the images will be shown in an 8x8 grid
nrows = 8
ncols = 8
pic_index = 0  # index for iterating over images

# Set up the matplotlib figure
fig = plt.gcf()
fig.set_size_inches(ncols * 4, nrows * 4)
pic_index += 8

# The right-hand sides of these assignments were missing in the original cell;
# the standard pattern is to build full paths for the next batch of file names:
lung_aca_pic = [train_data_dir + '\\lung_aca\\' + fname for fname in lung_aca_names[pic_index-8:pic_index]]
lung_n_pic = [train_data_dir + '\\lung_n\\' + fname for fname in lung_n_names[pic_index-8:pic_index]]
lung_scc_pic = [train_data_dir + '\\lung_scc\\' + fname for fname in lung_scc_names[pic_index-8:pic_index]]

for i, img_path in enumerate(lung_aca_pic + lung_n_pic + lung_scc_pic):
    # Set up a subplot; subplot indices start at 1
    sub = plt.subplot(nrows, ncols, i + 1)
    sub.axis("off")  # don't show the axes
    img = mpimg.imread(img_path)
    plt.imshow(img)

plt.show()
# 1 - Lung adenocarcinoma
# 2 - Lung benign tissue
# 3 - Lung squamous cell carcinoma
8. Convolutional Neural Network
- A Convolutional Neural Network (CNN) is an artificial neural network with the ability to pinpoint or detect patterns in images.
- (Image: an example CNN architecture, illustrating how a CNN works.)
# Define the model layers
model = Sequential()

# First convolution
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(MaxPool2D(pool_size=(2, 2)))

# Second convolution
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))

# Third convolution
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))

# Flatten before feeding into the Dense neural network
model.add(Flatten())

# 512 neurons in the hidden layer
model.add(Dense(512, activation='relu'))
model.add(Dropout(rate=0.4))

# 3 = 3 different categories
# softmax turns the raw outputs into a probability distribution; the class with
# the highest probability (e.g. 0.5 in [0.1, 0.4, 0.5]) becomes the prediction
model.add(Dense(3, activation='softmax'))

# Compiling the model
optmz = tf.keras.optimizers.Adam(learning_rate=3e-4, beta_1=0.9, beta_2=0.999,
                                 epsilon=1e-07, amsgrad=False, name='Adam')
model.compile(optimizer=optmz, loss="categorical_crossentropy", metrics=['accuracy'])
Getting a summary of the model built above.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 148, 148, 32)      896
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 74, 74, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 72, 72, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 36, 36, 64)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 34, 34, 64)        36928
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 17, 17, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 18496)             0
_________________________________________________________________
dense (Dense)                (None, 512)               9470464
_________________________________________________________________
dropout (Dropout)            (None, 512)               0
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 1539
=================================================================
Total params: 9,528,323
Trainable params: 9,528,323
Non-trainable params: 0
_________________________________________________________________
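The Param # column in the summary can be verified by hand: a Conv2D layer has kernel_height × kernel_width × in_channels × out_channels weights plus one bias per filter, and a Dense layer has in × out weights plus one bias per output neuron. A quick check:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    # weights for every (kernel position, input channel) pair, plus one bias per filter
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def dense_params(in_features, out_features):
    # fully connected weight matrix, plus one bias per output neuron
    return in_features * out_features + out_features

print(conv2d_params(3, 3, 3, 32))       # first Conv2D: 896
print(conv2d_params(3, 3, 32, 64))      # second Conv2D: 18496
print(conv2d_params(3, 3, 64, 64))      # third Conv2D: 36928
print(dense_params(17 * 17 * 64, 512))  # Dense after Flatten: 9470464
print(dense_params(512, 3))             # output layer: 1539

total = (conv2d_params(3, 3, 3, 32) + conv2d_params(3, 3, 32, 64)
         + conv2d_params(3, 3, 64, 64) + dense_params(17 * 17 * 64, 512)
         + dense_params(512, 3))
print(total)  # 9528323, matching "Total params" in the summary
```

The pooling, flatten, and dropout layers contribute no parameters, which is why their Param # is 0.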
8.1 Fit the model
early = EarlyStopping(monitor='val_loss', patience=2, mode='min')
# Note: `early` is defined but never passed via callbacks=[early],
# so early stopping does not actually trigger and all 25 epochs run
model.fit_generator(train, epochs=25, validation_data=valid)
Epoch 2/25
329/329 [==============================] - 289s 878ms/step - loss: 0.2608 - accuracy: 0.8943 - val_loss: 0.2166 - val_accuracy: 0.9113
Epoch 3/25
329/329 [==============================] - 270s 821ms/step - loss: 0.2037 - accuracy: 0.9188 - val_loss: 0.1838 - val_accuracy: 0.9210
Epoch 4/25
329/329 [==============================] - 277s 841ms/step - loss: 0.1859 - accuracy: 0.9284 - val_loss: 0.1579 - val_accuracy: 0.9370
Epoch 5/25
329/329 [==============================] - 284s 863ms/step - loss: 0.1673 - accuracy: 0.9350 - val_loss: 0.1489 - val_accuracy: 0.9413
Epoch 6/25
329/329 [==============================] - 308s 935ms/step - loss: 0.1607 - accuracy: 0.9356 - val_loss: 0.1804 - val_accuracy: 0.9240
Epoch 7/25
329/329 [==============================] - 308s 936ms/step - loss: 0.1313 - accuracy: 0.9491 - val_loss: 0.1405 - val_accuracy: 0.9377
Epoch 8/25
329/329 [==============================] - 307s 932ms/step - loss: 0.1245 - accuracy: 0.9526 - val_loss: 0.1126 - val_accuracy: 0.9527
Epoch 9/25
329/329 [==============================] - 303s 920ms/step - loss: 0.1251 - accuracy: 0.9510 - val_loss: 0.1431 - val_accuracy: 0.9450
Epoch 10/25
329/329 [==============================] - 303s 921ms/step - loss: 0.1064 - accuracy: 0.9587 - val_loss: 0.0765 - val_accuracy: 0.9717
Epoch 11/25
329/329 [==============================] - 300s 911ms/step - loss: 0.0875 - accuracy: 0.9657 - val_loss: 0.0680 - val_accuracy: 0.9760
Epoch 12/25
329/329 [==============================] - 384s 1s/step - loss: 0.0925 - accuracy: 0.9648 - val_loss: 0.0904 - val_accuracy: 0.9637
Epoch 13/25
329/329 [==============================] - 366s 1s/step - loss: 0.0723 - accuracy: 0.9730 - val_loss: 0.0613 - val_accuracy: 0.9760
Epoch 14/25
329/329 [==============================] - 249s 756ms/step - loss: 0.0788 - accuracy: 0.9678 - val_loss: 0.0723 - val_accuracy: 0.9740
Epoch 15/25
329/329 [==============================] - 201s 611ms/step - loss: 0.0582 - accuracy: 0.9794 - val_loss: 0.0633 - val_accuracy: 0.9753
Epoch 16/25
329/329 [==============================] - 201s 611ms/step - loss: 0.0617 - accuracy: 0.9758 - val_loss: 0.0596 - val_accuracy: 0.9790
Epoch 17/25
329/329 [==============================] - 204s 619ms/step - loss: 0.0482 - accuracy: 0.9835 - val_loss: 0.0713 - val_accuracy: 0.9753
Epoch 18/25
329/329 [==============================] - 202s 613ms/step - loss: 0.0477 - accuracy: 0.9825 - val_loss: 0.0637 - val_accuracy: 0.9737
Epoch 19/25
329/329 [==============================] - 202s 615ms/step - loss: 0.0387 - accuracy: 0.9853 - val_loss: 0.0260 - val_accuracy: 0.9920
Epoch 20/25
329/329 [==============================] - 203s 618ms/step - loss: 0.0358 - accuracy: 0.9866 - val_loss: 0.0504 - val_accuracy: 0.9833
Epoch 21/25
329/329 [==============================] - 203s 616ms/step - loss: 0.0410 - accuracy: 0.9856 - val_loss: 0.0635 - val_accuracy: 0.9720
Epoch 22/25
329/329 [==============================] - 202s 613ms/step - loss: 0.0391 - accuracy: 0.9858 - val_loss: 0.0294 - val_accuracy: 0.9890
Epoch 23/25
329/329 [==============================] - 202s 615ms/step - loss: 0.0321 - accuracy: 0.9895 - val_loss: 0.0441 - val_accuracy: 0.9807
Epoch 24/25
329/329 [==============================] - 202s 613ms/step - loss: 0.0318 - accuracy: 0.9884 - val_loss: 0.0293 - val_accuracy: 0.9873
Epoch 25/25
329/329 [==============================] - 202s 613ms/step - loss: 0.0376 - accuracy: 0.9875 - val_loss: 0.0265 - val_accuracy: 0.9910
After the 25th epoch: loss 3.76%, accuracy 98.75%, val_loss 2.65%, and val_accuracy 99.1%.
Display the history DataFrame to show the per-epoch results:

losses = pd.DataFrame(model.history.history)
losses
        loss  accuracy  val_loss  val_accuracy
0   0.406556  0.818762  0.282786      0.874000
1   0.260829  0.894286  0.216612      0.911333
2   0.203748  0.918762  0.183814      0.921000
3   0.185943  0.928381  0.157870      0.937000
4   0.167282  0.934952  0.148858      0.941333
5   0.160665  0.935619  0.180409      0.924000
6   0.131286  0.949143  0.140499      0.937667
7   0.124453  0.952571  0.112573      0.952667
8   0.125083  0.951048  0.143113      0.945000
9   0.106354  0.958667  0.076468      0.971667
10  0.087470  0.965714  0.067990      0.976000
11  0.092488  0.964762  0.090435      0.963667
12  0.072257  0.972952  0.061263      0.976000
13  0.078837  0.967809  0.072340      0.974000
14  0.058178  0.979429  0.063255      0.975333
15  0.061739  0.975810  0.059614      0.979000
16  0.048189  0.983524  0.071324      0.975333
17  0.047651  0.982476  0.063689      0.973667
18  0.038731  0.985333  0.026008      0.992000
19  0.035842  0.986571  0.050398      0.983333
20  0.041019  0.985619  0.063487      0.972000
21  0.039074  0.985810  0.029373      0.989000
22  0.032058  0.989524  0.044130      0.980667
23  0.031831  0.988381  0.029324      0.987333
24  0.037558  0.987524  0.026539      0.991000
Display some graphs to see whether the model trained well.
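A minimal sketch of those graphs: the loss and accuracy curves per epoch. A made-up five-epoch history stands in for `model.history.history` here so the snippet is self-contained, and the non-interactive Agg backend is used so it also runs outside a notebook:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; in a notebook, %matplotlib inline suffices
import matplotlib.pyplot as plt

# Stand-in history (illustrative values, not the real training run);
# in the notebook this would be: losses = pd.DataFrame(model.history.history)
losses = pd.DataFrame({
    'loss':         [0.41, 0.26, 0.20, 0.10, 0.04],
    'val_loss':     [0.28, 0.22, 0.18, 0.08, 0.03],
    'accuracy':     [0.82, 0.89, 0.92, 0.96, 0.99],
    'val_accuracy': [0.87, 0.91, 0.92, 0.97, 0.99],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
losses[['loss', 'val_loss']].plot(ax=ax1, title='Loss per epoch')
losses[['accuracy', 'val_accuracy']].plot(ax=ax2, title='Accuracy per epoch')
fig.savefig('training_curves.png')
```

If the validation curves track the training curves closely (as they do in the table above), the model is not badly overfitting; a widening gap would suggest more dropout or augmentation is needed.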
9.1 Evaluating our model on data it hasn't seen before
We can make a classification report using classification_report(), passing it the true labels as well as our model's predicted labels.
A classification report will also give us information of the precision and recall of our model for each class.
# Print classification report
test_steps_per_epoch = np.math.ceil(test.samples / test.batch_size)
predictions = model.predict_generator(test, steps=test_steps_per_epoch)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test.classes
class_labels = list(test.class_indices.keys())
report = classification_report(true_classes, predicted_classes, target_names=class_labels)
print(report)
              precision    recall  f1-score   support

    lung_aca       0.98      0.99      0.99       500
      lung_n       1.00      1.00      1.00       500
    lung_scc       0.99      0.98      0.98       500

    accuracy                           0.99      1500
   macro avg       0.99      0.99      0.99      1500
weighted avg       0.99      0.99      0.99      1500
- Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
- Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
- F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
- Support - The number of samples each metric was calculated on.
- Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
- Macro avg - Short for macro average; the unweighted average precision, recall and F1 score across classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.
- Weighted avg - Short for weighted average; the average precision, recall and F1 score across classes, weighted by how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples).
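These definitions can be checked on a tiny made-up example (the labels below are illustrative, not from the real test set). With classes 0 = lung_aca, 1 = lung_n, 2 = lung_scc, classes 0 and 2 each get one prediction right and one wrong, while class 1 is perfect:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: per class, precision/recall are 0.5, 1.0, 0.5 respectively
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 2, 1, 1, 2, 0]

# macro = unweighted mean over classes: (0.5 + 1.0 + 0.5) / 3
print(precision_score(y_true, y_pred, average='macro'))   # 0.666...
print(recall_score(y_true, y_pred, average='macro'))      # 0.666...

# weighted = mean weighted by support; equal supports here, so same value
print(f1_score(y_true, y_pred, average='weighted'))       # 0.666...
```

With equal supports the macro and weighted averages coincide; they diverge exactly when the classes are imbalanced, which is why the text above recommends watching the macro average in that case.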
A confusion matrix is a visual way to show where your model made the right predictions and where it made the wrong predictions (or in other words, got confused).
plt.subplots(figsize=(7, 7))
cm = confusion_matrix(true_classes, predicted_classes)
cm_df = pd.DataFrame(cm,
                     columns=['Predicted Lung ACA', 'Predicted Lung BT', 'Predicted Lung SCC'],
                     index=['Actual Lung ACA', 'Actual Lung BT', 'Actual Lung SCC'])
sns.heatmap(cm_df, annot=True, cmap="Blues", fmt='.0f');
# Save the model
model.save("lung_cancer.h5")
9.2 Predict Some Images :
- Index 0 =Lung adenocarcinoma ==> lung_aca
- Index 1 =Lung benign tissue ==> lung_n
- Index 2 =Lung squamous cell carcinoma ==> lung_scc
# Index 1 = Lung benign tissue ==> lung_n
single_Lung_hist_n = "E:\\Final Project\\lung_data_split\\test\\lung_n\\lungn242.jpeg"
img = image.load_img(single_Lung_hist_n, target_size=(150, 150))
array = image.img_to_array(img)
x = np.expand_dims(array, axis=0)

# predict() returns the softmax probabilities (predict_proba was removed
# from tf.keras in newer TensorFlow versions)
test_image = model.predict(x)
result = test_image.argmax(axis=-1)

# The comparison values were missing in the original cell; they follow the
# class indices listed above (0 = lung_aca, 1 = lung_n, 2 = lung_scc)
if result == 2:
    print('Lung squamous cell carcinoma')
elif result == 0:
    print('Lung adenocarcinoma')
elif result == 1:
    print('Lung benign tissue')
Lung benign tissue
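As an alternative to the if/elif chain, the argmax index can be mapped straight to its label with a dictionary. A minimal sketch, assuming the index order listed at the top of this section:

```python
import numpy as np

# Index order as listed above: 0 = lung_aca, 1 = lung_n, 2 = lung_scc
index_to_label = {0: 'Lung adenocarcinoma',
                  1: 'Lung benign tissue',
                  2: 'Lung squamous cell carcinoma'}

def label_from_probs(probs):
    """Return the class label for one softmax output vector."""
    return index_to_label[int(np.argmax(probs))]

print(label_from_probs([0.1, 0.8, 0.1]))  # Lung benign tissue
```

This keeps the label mapping in one place, so adding a class (or reordering the generator's class_indices) only requires changing the dictionary.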
10. Load Model
# Load model
model_dir = 'lung_cancer.h5'

# Loading using the .h5 file
new_lung_model = tf.keras.models.load_model("lung_cancer.h5", custom_objects=None, compile=True)
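A quick way to convince yourself that saving and reloading preserves the network is a round trip on a tiny stand-in model (not the real lung_cancer.h5): the reloaded model should produce identical predictions for the same input.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model, purely for the save/load round-trip check
tiny = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])

x = np.random.rand(2, 8).astype('float32')
before = tiny.predict(x)

tiny.save('tiny_model.h5')  # same HDF5 format used for lung_cancer.h5
reloaded = tf.keras.models.load_model('tiny_model.h5')
after = reloaded.predict(x)

print(np.allclose(before, after))  # True: weights survived the round trip
```

The same check can be run against `new_lung_model` and the original `model` before ending the session, to confirm the saved file is good.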
Great job! We've just created a CNN model that can classify histopathological images as lung benign tissue, lung adenocarcinoma, or lung squamous cell carcinoma with 99.1% validation accuracy after 25 epochs.