Machine Learning and Medical Imaging

CNN for Pneumonia Diagnosis

Project Details

As part of my "Classes Préparatoires" studies in the MP* class of Lycée Saint-Louis (Paris), I used machine learning techniques to build an algorithm that diagnoses pneumonia from chest radiographs. The project is inspired by a Kaggle notebook by Gaurav Dutta. It combines medical imaging with convolutional neural networks and addresses the 2021-2022 national theme common to all CPGE students: Health and Prevention. In this national contest, I received a mark of 19/20.



Libraries and modules


First, we need to import all the necessary libraries, modules, and parameters for the project.

import matplotlib.pyplot as plt
import keras.backend as K
import tensorflow as tf
import pandas as pd 
import numpy as np
import random
import cv2
import os

from keras.models import Model, Sequential
from keras.layers import Input, Dense, Flatten, Dropout, BatchNormalization
from keras.layers import Conv2D, SeparableConv2D, MaxPool2D, LeakyReLU, Activation
from tensorflow.keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

input_path = 'C:/Users/theog/chest_xray/'

Pseudo-random numbers


Computers are deterministic, not truly random. Pseudo-random number generators produce sequences that look random but are entirely determined by their input parameters (the seed). We use pseudo-random numbers because they make the code reproducible: we keep an element of apparent randomness to check that the code behaves well, while keeping the same input values so that the output analyses remain coherent and rigorous.

seed = 232
np.random.seed(seed)
tf.random.set_seed(seed)

Data Sampling


Then, I split the dataset into three subsets: train, val, and test, which contain the training, validation, and test data, respectively.

for _set in ['train', 'val', 'test']:
    n_normal = len(os.listdir(input_path + _set + '/NORMAL'))
    n_infect = len(os.listdir(input_path + _set + '/PNEUMONIA'))
    print('Set: {} / Normal scans: {} / Pneumonia scans: {}'.format(_set, n_normal, n_infect))

Image preprocessing


For the preprocessing phase, I defined a function that processes the data and loads all training and test values into the network, and I created labels for the images. Since our dataset is small, we use a method called Data Augmentation to increase the volume of labelled images, which modern deep-learning algorithms need in large quantities.

Data Augmentation


In data analysis, this method is used to increase the quantity of data, either by adding slightly modified copies of existing data or by creating new synthetic data (we use the first approach here). Note: in the code, "datagen" refers to the data-generator objects. Data Augmentation acts as a regularizer and helps reduce overfitting when training a machine learning model. The method is sometimes compared to sampling in music (building a piece by layering several recorded snippets on top of one another with a sampler). In our project, Data Augmentation enlarges the dataset (the training set, to be precise) by altering the appearance of the images without changing their semantics (their essence).

Once the training set is more diversified, the network can take a broader view of the data during its training phase. In the code, we use the following transformations (a small preview sketch follows the list):

  • rescale=1./255: pixel values are rescaled from [0, 255] to [0, 1]
  • zoom_range=0.3: a random zoom of up to 30% is applied
  • vertical_flip=True: images are randomly flipped vertically
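
To see what these transformations actually do, here is a minimal preview sketch (not part of the original pipeline) that displays a few augmented versions of a single training image; the file name is hypothetical and only the three parameters above are assumed.

import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

# Preview sketch: show four augmented copies of one (hypothetical) training radiograph
preview_datagen = ImageDataGenerator(rescale=1./255, zoom_range=0.3, vertical_flip=True)

img = img_to_array(load_img(input_path + 'train/NORMAL/example.jpeg', target_size=(150, 150)))
batch = np.expand_dims(img, axis=0)        # shape (1, 150, 150, 3)

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, augmented in zip(axes, preview_datagen.flow(batch, batch_size=1)):
    ax.imshow(augmented[0])
    ax.axis('off')
plt.show()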

Data Generators


We then define two data generators:

  • train_gen for the training data
  • test_gen for the validation data

These generators load the required data directly from the source folder and convert it into batches that are fed to the model.
def process_data(img_dims, batch_size):

    # Augmentation is applied to the training images only; test/validation images are just rescaled
    train_datagen = ImageDataGenerator(rescale=1./255, zoom_range=0.3, vertical_flip=True)
    test_val_datagen = ImageDataGenerator(rescale=1./255)

    train_gen = train_datagen.flow_from_directory(directory=input_path+'train',
        target_size=(img_dims, img_dims), batch_size=batch_size, class_mode='binary', shuffle=True)

    test_gen = test_val_datagen.flow_from_directory(directory=input_path+'test',
        target_size=(img_dims, img_dims), batch_size=batch_size, class_mode='binary', shuffle=True)

    # Load the test images into memory with explicit labels (0 = normal, 1 = pneumonia)
    test_data = []
    test_labels = []

    for cond in ['/NORMAL/', '/PNEUMONIA/']:
        for img in (os.listdir(input_path + 'test' + cond)):
            img = plt.imread(input_path+'test'+cond+img)
            img = cv2.resize(img, (img_dims, img_dims))
            img = np.dstack([img, img, img])      # replicate the grayscale channel to get 3 channels
            img = img.astype('float32') / 255
            if cond=='/NORMAL/':
                label = 0
            elif cond=='/PNEUMONIA/':
                label = 1
            test_data.append(img)
            test_labels.append(label)

    test_data = np.array(test_data)
    test_labels = np.array(test_labels)

    return train_gen, test_gen, test_data, test_labels

Network Hyperparameters


In this part, I simply defined a few constants that will be useful later:

  • img_dims = 150 : dimension of the images (for the preprocessing phase)
  • epochs = 10 : an epoch is one pass of the learning algorithm through the entire training data set. In general, the data are grouped into batches (especially when the amount of data is large), and the term 'iteration' refers to injecting one batch into the model (1 batch = 1 iteration).
  • batch_size = 32 : the batch size usually takes a value between 32 and 128; this value can be varied depending on the computing power available and the level of precision expected from the model.

"batch_size" is the number of examples that are "shown" to the algorithm before the weights/parameters of the model are recalculated (the gradient is computed on each subset of size batch_size rather than on all the data at once, to circumvent memory and processing problems). This is equivalent to adjusting the model every n examples, where n = batch_size, instead of adjusting it after every single example, which tends to make learning easier and to avoid certain drifts that can complicate it. For example, if the batch size is equal to the size of the training set, then the number of epochs is the number of iterations.

The general relationship is d * e = b * i (a small numeric check of this relation follows the hyperparameter code below), where:

  • d = size of the dataset
  • e = number of epochs
  • i = number of iterations
  • b = size of a batch

img_dims = 150
epochs = 10
batch_size = 32
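
As a quick sanity check of the relation d * e = b * i, here is a small sketch; the dataset size below is purely illustrative (in practice it would be train_gen.samples).

# Illustration of the relation d * e = b * i
d = 5216                              # hypothetical number of training images
steps_per_epoch = d // batch_size     # iterations per epoch: 163
i = steps_per_epoch * epochs          # total number of iterations: 1630
print(d * epochs == batch_size * i)   # True (up to the last partial batch in general)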

Training and Testing Data


We call the preprocessing function to obtain our training and test data generators, as well as the test data and its labels.

train_gen, test_gen, test_data, test_labels = process_data(img_dims, batch_size)

Convolutional Neural Network Architecture


Then comes the crucial part: building the convolutional neural network (CNN) model. A CNN architecture is formed by stacking processing layers:

  • The convolution layer (CONV), which processes data from a receptive field
  • The pooling layer (POOL), which compresses the information by reducing the size of the intermediate image (often by subsampling)
  • The rectification layer, often loosely called "ReLU" after its activation function (Rectified Linear Unit)
  • The "fully connected" layer (FC), which is a perceptron-type layer
  • The loss layer (LOSS)

For this project, I assembled 5 convolutional blocks, each composed of convolution layers, a pooling layer, and a batch-normalization layer.

Convolution Layer and Hyperparameters


The convolution layer is the basic building block of a convolutional neural network. To size the output volume (also called the "convolution layer volume"), we define 3 hyperparameters: depth, stride, and padding (in machine learning, a hyperparameter is a parameter whose value is used to control the learning process). A small numeric sketch of the resulting output size follows the note below.

  • The depth of the layer corresponds to the number of neurons associated with the same receptive field* (i.e. the number of convolution kernels).
  • The stride specifies the overlap between receptive fields. The smaller it is, the more the receptive fields overlap and the larger the output volume.
  • The padding (with zeros), also called "zero padding", consists of placing zeros at the border of the input volume. It controls the spatial dimension of the output volume, i.e. the number of receptive fields to manage; indeed, it is sometimes desirable to keep the same surface as the input volume.

* Receptive field of a neuron: the volume of space that modifies the response of this neuron when a sufficiently powerful and rapid stimulus occurs within it. Such receptive fields have been identified in the visual, auditory, and somatosensory systems. Thus, the receptive field of a neuron in the visual system is the portion of the visual field that, when a light stimulus is presented within it, modifies the response of this neuron. This definition has been extended to more abstract spaces that describe the possible parameters of a stimulation. For example, some neurons in the visual system are only excited by certain wavelengths. We can therefore define the receptive field of a neuron as the subset of stimulation parameters that modify its activity.
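
To make the role of these hyperparameters concrete, here is a small sketch of the standard output-size formula for a convolution layer; the numbers are illustrative.

# Spatial output size of a convolution layer: (W - K + 2P) / S + 1
# W = input width, K = kernel size, P = zero padding, S = stride
def conv_output_size(W, K, P, S):
    return (W - K + 2 * P) // S + 1

print(conv_output_size(150, 3, 1, 1))   # 150: 'same' padding keeps the input size
print(conv_output_size(150, 3, 0, 1))   # 148: no padding shrinks the image slightly
print(conv_output_size(150, 3, 0, 2))   # 74: a stride of 2 roughly halves it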

Pooling Layer


Pooling is a form of subsampling of the image. The input image is divided into a series of non-overlapping cells of n pixels per side. The output signal of a cell is defined based on the values taken by the different pixels in the cell. Pooling decreases the spatial size of an intermediate image, which can reduce the number of parameters and the amount of computation in the network. It can therefore reduce overfitting and also create a form of translation invariance. The most common form is a pooling layer with 2x2 cells where the output value is the maximum input value; in this case, it is called "Max-Pool 2x2" (compression by a factor of 4).
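
As a minimal illustration of Max-Pool 2x2 (not part of the project code), here is a numpy sketch that compresses a 4x4 image into a 2x2 one by keeping the maximum of each cell.

import numpy as np

# Max-Pool 2x2: each non-overlapping 2x2 cell is replaced by its maximum value
image = np.array([[1, 3, 2, 0],
                  [4, 6, 1, 2],
                  [0, 2, 7, 5],
                  [1, 1, 3, 8]])

pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 2]
                #  [2 8]]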

Batch Normalization Layer


Batch normalization is a method that makes an artificial neural network faster and more stable by normalizing the layer inputs, i.e. centering and rescaling the data they receive. Even today, its effectiveness remains largely empirical and the theoretical workings of the method are not completely understood. The most common explanation is that it reduces "covariate shift", the variation in the distribution of a network's activations caused by changes in the training parameters. In a neural network, the output of the first layer is fed to the input of the second, and so on; when the parameters of a layer change, the distribution of inputs to the following layers changes as well. These variations in the distribution of inputs can become problematic for the neural network, especially when there is a large number of layers.
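
A minimal numpy sketch of what the layer computes at training time (gamma and beta are the learnable scale and shift, left at their initial values here; eps is the usual numerical-stability constant; the input values are illustrative):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch dimension, then rescale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 200.0],
                  [2.0, 220.0],
                  [3.0, 240.0]])
print(batch_norm(batch))   # each column now has (approximately) zero mean and unit variance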

Rectification Layer


To improve the performance of the neural network, we use a rectification layer that applies a mathematical function (called the activation function) to the output of the layers. Several functions exist:

  • Hyperbolic tangent rectification: f(x) = tanh(x)
  • Saturating hyperbolic tangent rectification: f(x) = |tanh(x)|
  • Sigmoid function rectification: f(x) = (1+e^(-x))^(-1)
  • ReLU (Rectified Linear Unit) rectification: f(x) = max(0, x)

The ReLU rectification is often preferable, as it results in faster neural network training without making a significant difference to generalization accuracy*. This function, also called a "non-saturating activation function", increases the non-linear properties of the decision function and of the whole network without affecting the receptive fields of the convolution layer.

* Source: "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems, A. Krizhevsky, I. Sutskever, and G. E. Hinton

inputs = Input(shape=(img_dims, img_dims, 3))

# 1st convolutional block
x = Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same')(inputs)
x = Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = MaxPool2D(pool_size=(2, 2))(x)

# 2nd convolutional block
x = SeparableConv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)

# 3rd convolutional block
x = SeparableConv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)

# 4th convolutional block
x = SeparableConv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Dropout(rate=0.2)(x)

# 5th convolutional block
x = SeparableConv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = SeparableConv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Dropout(rate=0.2)(x)

Fully Connected Layer


After constructing the 5 convolutional blocks, I applied so-called "fully connected" layers. These layers connect every neuron of one layer to every neuron of the next layer, and they typically constitute the last layers of a neural network. Such a layer receives a vector as input and produces a new vector as output, obtained by applying a linear combination (and possibly an activation function) to the values of the input vector.

This layer returns a vector of size N, where N is the number of classes in our image classification problem. Each element of the vector indicates the probability that the input image belongs to a class. For example, if the first element represents the probability of having a "normal" scan and the second represents the probability of having a "pneumonia" scan, the final vector [0.3 ; 0.7] means that the image has a 70% chance of belonging to the "pneumonia" class.

Each value of the input vector "votes" in favor of a class. The votes do not all have the same importance: the layer assigns them weights that depend on the element of the vector and on the class. To compute the probabilities, the fully connected layer multiplies each input element by a weight, sums the results, and then applies an activation function. This process is equivalent to multiplying the input vector by the matrix containing the weights. The fact that each input value is connected to all output values explains the term "fully connected".

How do we know the value of these weights? The convolutional neural network learns them in the same way as it learns the filters of the convolution layers: during the training phase, by backpropagation of the gradient. The fully connected layer determines the link between the position of features in the image and a class. Indeed, the input vector comes from the previous layer and corresponds to an activation map for a given feature: high values indicate the location (more or less precise, depending on the pooling) of this feature in the image. If the location of a feature at a certain place in the image is characteristic of a certain class, then a high weight is assigned to the corresponding value in the vector.
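
To make the "votes" picture concrete, here is a small numpy sketch of what a fully connected layer computes: a matrix-vector product followed by an activation. The weights and input values below are illustrative, not those learned by the network.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative flattened feature vector (e.g. output of the convolutional blocks)
features = np.array([0.2, 0.9, 0.0, 0.4])

# One row of weights per output neuron: here 2 outputs ("normal", "pneumonia")
weights = np.array([[ 0.5, -1.0,  0.3,  0.2],
                    [-0.4,  1.2,  0.1,  0.6]])
bias = np.array([0.1, -0.2])

logits = weights @ features + bias      # each input value "votes" for each class
probs = sigmoid(logits)                 # the activation turns the votes into scores
print(probs)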

Dropout Method


In each block, I applied the dropout method. This is a regularization technique that reduces overfitting in neural networks; it can be seen as a way of averaging over an ensemble of models (*https://arxiv.org/abs/1207.0580*). The term "dropout" refers to the temporary removal of neurons from the network (*https://jmlr.org/papers/v15/srivastava14a.html*). Thus, during the training phase, the neural network is partially deprived of some of its neurons (their output is set to 0), and they are reactivated when the model is evaluated.
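
A minimal numpy sketch of the idea (inverted dropout, as used by Keras: kept activations are rescaled so that their expected value is unchanged; the rate and the input values are illustrative):

import numpy as np

def dropout(x, rate, training=True):
    if not training:
        return x                                  # at test time, all neurons are active
    mask = (np.random.rand(*x.shape) >= rate)     # randomly deactivate a fraction `rate` of neurons
    return x * mask / (1.0 - rate)                # rescale so the expected activation is unchanged

activations = np.array([0.5, 1.2, 0.3, 0.8, 0.1])
print(dropout(activations, rate=0.2))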

    # Couche "entièrement connectée"
    x = Flatten()(x)
    x = Dense(units=512, activation='relu')(x)
    x = Dropout(rate=0.7)(x)
    x = Dense(units=128, activation='relu')(x)
    x = Dropout(rate=0.5)(x)
    x = Dense(units=64, activation='relu')(x)
    x = Dropout(rate=0.3)(x)

Sigmoid Function


So far, I have been using the ReLU function. However, for the final layer of the network, I applied the sigmoid function, simply because our problem is fundamentally binary (NORMAL or PNEUMONIA). The sigmoid function produces a value between 0 and 1, which can be interpreted as a probability. It is commonly used in binary classification, when a model has to decide between only two labels, which is our case. The sigmoid function is defined as f(x) = (1+e^(-x))^(-1).
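
As a small sketch, the sigmoid squashes any real-valued score into (0, 1), and the final decision is obtained by thresholding at 0.5 (which is what np.round does on the predictions later on); the scores below are illustrative.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 2.0])
probs = sigmoid(scores)
print(probs)                       # approximately [0.047 0.378 0.5 0.881]
print((probs >= 0.5).astype(int))  # 0 = NORMAL, 1 = PNEUMONIA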

# Output layer
output = Dense(units=1, activation='sigmoid')(x)

Adam Optimizer


The Adam optimizer is an extension of stochastic gradient descent that adapts the update of each parameter using running estimates of the first and second moments of the gradients. The name "Adam" stands for "adaptive moment estimation". Adam is efficient and requires little memory, making it well suited to processing large amounts of data.
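
A minimal sketch of a single Adam update step on one parameter vector, with the usual default coefficients (this illustrates the update rule only; Keras performs these updates internally, and the gradient below is illustrative):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (1st moment) and squared gradient (2nd moment)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([0.5, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.1, -0.2])              # illustrative gradient
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)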

Loss Layer


The loss layer quantifies the gap between the expected and obtained results. It is typically placed at the end of the network. In this case, we use the binary cross-entropy loss fed by a sigmoid activation. Binary cross-entropy is a loss function designed for binary problems, such as our medical diagnosis case: combined with a sigmoid output, it treats the prediction as a probability in [0, 1] of belonging to the positive class (more generally, with K sigmoid outputs it can predict K independent probabilities).
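
A minimal numpy sketch of the binary cross-entropy loss on a few illustrative predictions (it penalizes confident wrong answers much more than hesitant ones):

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])              # 1 = PNEUMONIA, 0 = NORMAL
y_pred = np.array([0.9, 0.2, 0.6, 0.4])      # sigmoid outputs of the network
print(binary_crossentropy(y_true, y_pred))   # approximately 0.34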

# Model creation and compilation
model = Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Callback Functions


A callback function is a function that is passed as an argument to another function. Callback functions are commonly used when working with asynchronous code: in synchronous execution, the code runs line by line, each line waiting for the previous one to complete, whereas in asynchronous code the next line does not wait for the asynchronous call to finish, and a callback is invoked when it does. In Keras, callbacks are invoked by the training loop at specific moments, typically at the end of each epoch:

  • ModelCheckpoint: training a network to good results takes time and runs over multiple epochs; this callback saves the best-performing model at the end of each epoch.
  • ReduceLROnPlateau: reduces the learning rate when the monitored metric stops improving, which helps the model settle into a minimum.
  • EarlyStopping: stops training when the gap between the training results and the validation results starts to grow significantly, indicating overfitting.

# Callback functions
checkpoint = ModelCheckpoint(filepath='best_weights.hdf5', save_best_only=True, save_weights_only=True)
lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=2, verbose=2, mode='min')
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.1, patience=1, mode='min')

Model Training


I then trained the model for 10 epochs with a batch size of 32. It is worth noting that a larger batch size tends to yield better results but also consumes significant computational resources. There is generally an optimal batch size that achieves good results without excessive resource consumption, and finding it involves a more in-depth exploration of hyperparameter tuning. That study could be a separate topic of investigation in itself (such as a TIPE project), so for now we stick with our chosen parameters.

hist = model.fit_generator(
            train_gen, steps_per_epoch=train_gen.samples // batch_size,
            epochs=epochs, validation_data=test_gen,
            validation_steps=test_gen.samples // batch_size, callbacks=[checkpoint, lr_reduce])

To evaluate the performance of the model, we can compute and visualize various metrics. Here are the common metrics used for performance evaluation (a small worked example follows the list):

  • Accuracy: measures the overall correctness of the model's predictions as the ratio of correct predictions to the total number of predictions.
  • Precision: the proportion of true positive predictions (correctly predicted positives) out of all positive predictions made by the model. Precision matters when the cost of false positives is high.
  • Recall (Sensitivity): the proportion of true positive predictions out of all actual positive instances in the dataset. Recall matters when the cost of false negatives (missed positives) is high.
  • F1 Score: combines precision and recall into a single metric, the harmonic mean of the two, providing a balanced measure of the model's performance.
  • Confusion Matrix: a table of the counts of true positives, true negatives, false positives, and false negatives, giving a detailed breakdown of the model's predictions.
  • ROC Curve (Receiver Operating Characteristic): a plot that illustrates the performance of a binary classifier as the decision threshold varies, showing the trade-off between the true positive rate (TPR) and the false positive rate (FPR).

These metrics and visualizations assess the model's performance from different angles and provide insights into its strengths and weaknesses.
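
As a small worked example with made-up counts (not the project's results):

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 80, 10, 20, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 0.85
precision = tp / (tp + fp)                            # ~0.889
recall = tp / (tp + fn)                               # 0.80
f1 = 2 * precision * recall / (precision + recall)    # ~0.842
print(accuracy, precision, recall, f1)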

from sklearn.metrics import accuracy_score, confusion_matrix

preds = model.predict(test_data)
accuracy = accuracy_score(test_labels, np.round(preds))*100
cm = confusion_matrix(test_labels, np.round(preds))
tn, fp, fn, tp = cm.ravel()

print('CONFUSION MATRIX -------------------------')
print(cm)

print('\nTEST METRICS ---------------------------')
precision = tp/(tp+fp)*100
recall = tp/(tp+fn)*100
print('Accuracy: {}%'.format(accuracy))
print('Precision: {}%'.format(precision))
print('Recall: {}%'.format(recall))
print('F1-score: {}'.format(2*precision*recall/(precision+recall)))

print('\nTRAINING METRICS -----------------------')
print('Train accuracy: {}'.format(np.round((hist.history['accuracy'][-1])*100, 2)))
plt.show()

Conclusion


In conclusion, the model reaches an accuracy of 91.02% in 10 epochs, which is quite promising given the limited amount of training data available compared with what companies and medical institutions typically have access to. With more powerful computers, it would be possible to increase the number of epochs and the batch size, which could further improve the results.