Real Estate Price Prediction Algorithm across France

Project Details

  • Category: Machine Learning
  • Language: Python
  • Client: CNRS and Inria
  • Date: February 2023
  • GitHub: Project Repository

The objective is to develop a Machine Learning model capable of accurately predicting the price of a house in France. The model uses a set of characteristic parameters that play a crucial role in determining the market value of a property. The code structure and dataset are inspired by a project by Prasanna Sattigeri. The features include the number of rooms, which reflects the size and capacity of the house; the local crime rate, which strongly affects the desirability of a neighborhood; and the level of education, a statistic often indicative of the socioeconomic status of the surrounding area.

Introduction


The study proceeds in several steps:
  • Step 1: Importing libraries and modules
  • Step 2: Loading and pre-processing data
  • Step 3: Training a Gaussian process regression model
  • Step 4: Evaluating prediction intervals
  • Step 5: Recalibration via the UCC (Uncertainty Characteristics Curve)
  • Step 6: Adding a parameter to reduce aleatoric uncertainty
  • Step 7: Augmenting the data to reduce epistemic uncertainty
  • Step 8: Interpreting and accounting for uncertainties

    Step 1: Importing libraries and modules


    # Import required libraries
    import matplotlib.pyplot as plt
    import numpy as np
    
    from matplotlib.pyplot import figure
    from sklearn.model_selection import train_test_split
    
    # Import UQ360 components for uncertainty quantification
    from uq360.algorithms.homoscedastic_gaussian_process_regression import HomoscedasticGPRegression
    from uq360.algorithms.ucc_recalibration import UCCRecalibration
    from uq360.metrics import (picp, mpiw, compute_regression_metrics,
                               plot_uncertainty_distribution,
                               plot_uncertainty_by_feature, plot_picp_by_feature)
    

    Step 2: Loading and pre-processing data


    # Initialize parameters
    nb_maisons = 1000
    features = ['nb_pieces', 'taux_criminalite', 'taux_education']
    
    # Generate synthetic data
    nb_pieces = np.clip(np.ceil(10.0 * np.random.rand(nb_maisons)), 1, 10)
    taux_criminalite = np.random.rand(nb_maisons)
    taux_education = np.random.rand(nb_maisons)
    
    # Define target variable as a function of input features
    y = 0.5 * ( taux_education - taux_criminalite ) + 0.1 * ( nb_pieces + np.random.randn(nb_maisons))
    y += 1.0
    y = (250.0 / y.mean()) * y
    X = np.hstack([nb_pieces.reshape(-1,1), taux_criminalite.reshape(-1,1), taux_education.reshape(-1,1)])
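
    Before plotting, a quick numeric sanity check (a minimal sketch using the arrays defined above) confirms the construction: by design, the mean price is rescaled to 250 k€.

    # Sanity check on the generated data
    print("X shape:", X.shape)                       # expected: (1000, 3)
    print("Mean price: {:.1f} k€".format(y.mean()))  # expected: 250.0 by construction
    print("Price range: [{:.1f}, {:.1f}] k€".format(y.min(), y.max()))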
    

    To verify that the generated data are consistent, we now visualize how property prices vary with the crime rate and the education level.

    # Visualize the generated data
    fig = plt.figure(figsize=(10, 10))
    graphe = plt.scatter(taux_education, taux_criminalite, linewidths=0.5, alpha=0.7, edgecolor='black', s = 250, c=y)
    plt.xlabel("Education rate")
    plt.ylabel("Neighborhood crime rate")
    plt.title("Data visualization")
    graphe = plt.colorbar(graphe)
    graphe.set_label("House price (in k€)")
    plt.show()
    
    # Split the dataset into train and test subsets
    x_train_full, x_test_full, y_train_full, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
    

    The test_size parameter is the proportion of the data set to include in the test split: here, 40% of the houses are held out for testing.
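
    A quick check of the resulting split sizes (a small sketch, assuming nb_maisons = 1000 as above):

    # With test_size=0.4, 40% of the samples are held out for testing
    print(len(x_train_full), len(x_test_full))   # expected: 600 400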

    # Further split the training data for calibration
    x_train_full, x_train_calib_full, y_train_full, y_train_calib = train_test_split(x_train_full, y_train_full, test_size=0.4, random_state=0)	
    
    # Keep only houses with more than one room
    x_train_full, y_train_full = x_train_full[x_train_full[:,0]>1], y_train_full[x_train_full[:,0]>1]
    

    IMPORTANT: To understand how the data influence the model's predictions, we deliberately degrade the training set as follows:
  • we reduce the number of houses with more than 5 rooms
  • we reduce the number of houses whose price exceeds 300 k€
  • we initially ignore the education level

    # Further split the training data
    x_train_keep, x_train_discard, y_train_keep, y_train_discard = train_test_split(x_train_full, y_train_full, test_size=0.9, random_state=0)
    
    # Concatenate the kept and selected discarded data
    idxs = x_train_discard[:,0] < 6
    x_train = np.concatenate([x_train_keep, x_train_discard[idxs]])
    y_train = np.concatenate([y_train_keep, y_train_discard[idxs]])
    
    # Apply a filter to the target variable
    idxs = y_train < 300
    x_train = x_train[idxs]
    y_train = y_train[idxs]
    
    # Keep only the first two features (number of rooms and crime rate)
    x_train_two_features = x_train[:,:2]
    x_test_two_features = x_test_full[:,:2]
    x_train_calib_two_features = x_train_calib_full[:,:2]
    

    # Visualize the training data
    fig = plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    plt.scatter(x_train[:,0], y_train, linewidths=0.5, alpha=0.7, edgecolor='black', s = 40)
    plt.xlabel(features[0])
    plt.ylabel('House price (in k€)')
    plt.title('Train data')
    
    plt.subplot(1, 2, 2)
    plt.scatter(x_train[:,1], y_train, linewidths=0.5, alpha=0.7, edgecolor='black', s = 40)
    plt.xlabel(features[1])
    plt.ylabel('House price (in k€)')
    plt.title('Train data')
    


    Here we clearly see the two types of variables used:
  • a discrete variable: the number of rooms (with few houses having more than 5 rooms)
  • a continuous variable: the crime rate (with few houses priced above 300 k€)

    Logically, we observe that the price increases with the number of rooms and the education rate, and decreases with the crime rate.

    # Visualize the test data
    fig = plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    plt.scatter(x_test_two_features[:,0], y_test, linewidths=0.5, alpha=0.7, edgecolor='black', s = 40)
    plt.xlabel(features[0])
    plt.ylabel('House price (in k€)')
    plt.title('Test data')
    
    plt.subplot(1, 2, 2)
    plt.scatter(x_test_two_features[:,1], y_test, linewidths=0.5, alpha=0.7, edgecolor='black', s = 40)
    plt.xlabel(features[1])
    plt.ylabel('House price (in k€)')
    plt.title('Test data')
    


    Step 3: Training a Gaussian process regression model



    The choice of UQ method depends on several parameters. In our case, we want to:
  • train a new model (hence an intrinsic UQ algorithm)
  • estimate epistemic uncertainty (model uncertainty)
  • work with a relatively small data set

    Based on these criteria, the most judicious choice is to implement a Gaussian process, which provides predictions with 95% prediction intervals as well as information about the uncertainties.

    # Fit a Gaussian Process model to the data
    gp = HomoscedasticGPRegression()
    gp.fit(x_train_two_features, y_train.reshape(-1,1))
    

    # Perform predictions with the model
    y_test_mean, y_test_lower_total, y_test_upper_total, y_test_lower_epistemic, y_test_upper_epistemic, y_dists = gp.predict(x_test_two_features, return_epistemic=True, return_dists=True)	
    

    We now visualize the mean, the prediction intervals, and the uncertainties as a function of the number of rooms.

    # Plot uncertainty by feature
    plot_uncertainty_by_feature(x_test_two_features[:, 0], y_test_mean, y_test_lower_total, y_test_upper_total, y_test_lower_epistemic, y_test_upper_epistemic, xlabel=features[0], ylabel='House price (in k€)');
    

    Step 4: Evaluating Prediction Intervals


    The Prediction Interval Coverage Probability (PICP) is used to evaluate the calibration of a UQ method; here we use the PICP metric provided by UQ360. It is defined as the proportion of samples (usually test data) covered by the prediction intervals. In our case, we want PICP ≈ 95% (or higher).

    The Mean Prediction Interval Width (MPIW) measures the average width of the prediction intervals (here, computed on the test data).
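
    To make these two definitions concrete, here is a minimal NumPy sketch of both metrics (equivalent in spirit to the picp and mpiw functions provided by UQ360, which we use below):

    # Sketch: PICP and MPIW computed directly from the interval bounds
    def picp_manual(y_true, y_lower, y_upper):
        # fraction of observations that fall inside their prediction interval
        return np.mean((y_true >= y_lower) & (y_true <= y_upper))

    def mpiw_manual(y_lower, y_upper):
        # average width of the prediction intervals
        return np.mean(y_upper - y_lower)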

    IMPORTANT: When using recalibration methods to improve the PICP, the MPIW can increase. To find a model's "operating point", you therefore need to consider PICP and MPIW together; this is exactly what the Uncertainty Characteristics Curve (UCC) tool does, and we will use it during the recalibration in step 5.

    # Compute regression metrics
    res = compute_regression_metrics(y_test, y_test_mean, y_test_lower_total, y_test_upper_total)
    


    # Plot the Prediction Interval Coverage Probability (PICP) by feature before calibration for test data
    plot_picp_by_feature(x_test_two_features[:, 0], y_test,
    						y_test_lower_total, y_test_upper_total,
    						xlabel=features[0],
    						title="Before recalibration \nTest Data: PICP={:.2f} and MPIW={:.2f}".format(res["picp"], res["mpiw"]));	
    

    For houses with more than 5 rooms, the PICP is very low; this corresponds to the values removed earlier and confirms how strongly that modification affects the data set.

    plot_picp_by_feature(x_test_two_features[:, 1], y_test,
    						y_test_lower_total, y_test_upper_total,
    						xlabel=features[1],
    						title="Before recalibration \nTest Data: PICP={:.2f} and MPIW={:.2f}".format(res["picp"], res["mpiw"]));
    


    The displayed values are statistical information about the model's predictions:
  • RMSE: Root Mean Square Error
  • NLL: Negative Log-Likelihood
  • AUUCC_GAIN: gain in the Area Under the Uncertainty Characteristics Curve (relative to a baseline)
  • R2: coefficient of determination (we want R2 ≈ 1)


    # Print the PICP for houses with different numbers of rooms
    for nb in np.unique(x_test_two_features[:,0]):
    	coverage = picp(y_test[x_test_two_features[:,0]==nb], 
    				y_test_lower_total[x_test_two_features[:,0]==nb], 
    				y_test_upper_total[x_test_two_features[:,0]==nb])
    	print("The PICP for houses with nb_pieces = {} is {}".format(nb, coverage))
    

    Step 5: Recalibration via the UCC (Uncertainty Characteristics Curve)



    To improve the PICP values, we recalibrate the prediction intervals using the Uncertainty Characteristics Curve and then re-evaluate the model.

    # Fit the calibration model
    gp_option_a = UCCRecalibration(base_model=gp)
    gp_option_a = gp_option_a.fit(x_train_calib_two_features, y_train_calib)
    calib_y_test_mean, calib_y_test_lower_total, calib_y_test_upper_total = gp_option_a.predict(x_test_two_features, missrate=0.05)	
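
    Conceptually, this recalibration searches for a scaling of the interval half-widths that achieves the target coverage (here, missrate=0.05, i.e. 95% coverage) on the calibration set. The following sketch illustrates that idea only; it is not UCCRecalibration's actual implementation, and the candidate grid is a hypothetical choice:

    # Illustrative sketch: smallest interval scaling reaching 95% coverage
    # on the calibration set (not UQ360's actual algorithm)
    def recalibration_scale(y_true, y_mean, y_lower, y_upper, target=0.95):
        half_width = (y_upper - y_lower) / 2.0
        for s in np.linspace(0.5, 5.0, 100):   # hypothetical candidate grid
            coverage = np.mean(np.abs(y_true - y_mean) <= s * half_width)
            if coverage >= target:
                return s
        return 5.0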
    


    # Compute regression metrics after calibration
    res_calibrated = compute_regression_metrics(y_test, calib_y_test_mean, calib_y_test_lower_total, calib_y_test_upper_total)	
    


    # Print the calibrated PICP for houses with different numbers of rooms
    for nb in np.unique(x_test_two_features[:,0]):
    	coverage = picp(y_test[x_test_two_features[:,0]==nb], 
    				calib_y_test_lower_total[x_test_two_features[:,0]==nb], 
    				calib_y_test_upper_total[x_test_two_features[:,0]==nb])
    	print("The calibrated PICP for houses with nb_pieces = {} is {}".format(nb, coverage))	
    


    # Plot the uncertainty by feature and PICP by feature before and after calibration
    fig, axs = plt.subplots(2, 2,figsize=(15,10))
    
    plot_uncertainty_by_feature(x_test_two_features[:, 0], y_test_mean,
    							y_test_lower_total, y_test_upper_total,
    							xlabel=features[0], ylabel='House price (in k€)',
    							ax=axs[0,0],
    							title="Before recalibration \nTest Data: PICP={:.2f} and MPIW={:.2f}".format(
    							res["picp"], res["mpiw"]));
    
    plot_uncertainty_by_feature(x_test_two_features[:, 0], y_test_mean,
    							calib_y_test_lower_total, calib_y_test_upper_total,
    							xlabel=features[0], ylabel='House price (in k€)',
    							ax=axs[0,1],
    							title="After calibration \nTest Data: PICP={:.2f} and MPIW={:.2f}".format(
    							res_calibrated["picp"], res_calibrated["mpiw"]));
    
    plot_picp_by_feature(x_test_two_features[:, 0], y_test,
    						y_test_lower_total, y_test_upper_total,
    						xlabel=features[0],
    						title="",
    						ax=axs[1,0],
    						ylims=[0.6,1.1]);
    
    plot_picp_by_feature(x_test_two_features[:, 0], y_test,
    						calib_y_test_lower_total, calib_y_test_upper_total,
    						xlabel=features[0],
    						title="",
    						ax=axs[1,1],
    						ylims=[0.6,1.1]);
    


    # Create a subplot of 1 row and 2 columns with a specified figure size
    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot the Prediction Interval Coverage Probability (PICP) before recalibration for test data,
    # y_test_lower_total and y_test_upper_total are the lower and upper prediction intervals respectively
    plot_picp_by_feature(x_test_two_features[:, 1], y_test,
    						y_test_lower_total, y_test_upper_total,
    						xlabel=features[1],
    						ax=axs[0],
    						title="Before recalibration \nTest data: PICP={:.2f} and MPIW={:.2f}".format(res["picp"], res["mpiw"]),
    						ylims=[0.6,1.1])
    
    # Plot the PICP after recalibration for test data
    plot_picp_by_feature(x_test_two_features[:, 1], y_test,
    						calib_y_test_lower_total, calib_y_test_upper_total,
    						xlabel=features[1],
    						ax=axs[1],
    						title="After recalibration \nTest data: PICP={:.2f} and MPIW={:.2f}".format(res_calibrated["picp"], res_calibrated["mpiw"]),
    						ylims=[0.6,1.1])
    


    We clearly observe that the PICP has increased thanks to the recalibration, particularly for houses with more than 5 rooms. However, be aware that improving calibration does not reduce uncertainty; it is often the opposite, since the MPIW is likely to increase in such situations.

    Step 6: Adding a Parameter to Reduce Aleatoric Uncertainty


    To improve the model's performance, we can seek to reduce the uncertainties. Aleatoric uncertainty is related to the data, so adding features to the data will increase the reliability of the model's predictions by reducing the width of the prediction intervals (i.e., the MPIW).

    We will first visualize the model's predictive uncertainty with Gaussian UQ methods. In particular, we want to distinguish aleatoric uncertainty from epistemic uncertainty: the two are different in nature and are not reduced in the same way.

    The aleatoric uncertainty of our model comes from the fact that its features do not fully explain why two houses with the same number of rooms can have very different prices. This variation is partly explained by the crime rate, but sometimes that is not enough and other parameters (i.e., other features of the houses) must be added to the model.
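
    To see this effect outside the GP, here is a small side experiment (a sketch using scikit-learn's LinearRegression on the full synthetic data X, y generated in step 2):

    # Sketch: residual spread with 2 vs. 3 features on the synthetic data
    from sklearn.linear_model import LinearRegression

    for cols, name in [([0, 1], "rooms + crime rate"), ([0, 1, 2], "all three features")]:
        model = LinearRegression().fit(X[:, cols], y)
        residual_std = np.std(y - model.predict(X[:, cols]))
        print("{}: residual std = {:.2f} k€".format(name, residual_std))
    # the residual spread shrinks once taux_education is included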

    In our case, we mentioned the education level earlier, and we now add it to our model, which so far has used only the first two features (number of rooms and crime rate).

    # Use all three features: x_train already contains them; x_test_full is the full test set
    x_train_three_features = x_train
    x_test_three_features = x_test_full
    


    # Initialize the homoscedastic GP regression model
    gp_expanded = HomoscedasticGPRegression()
    
    # Fit the model on the training data
    gp_expanded.fit(x_train_three_features, y_train.reshape(-1, 1))
    


    # Predict the test set results and compute uncertainty measures
    y_test_mean_expanded, y_test_lower_total_expanded, y_test_upper_total_expanded, y_test_lower_epistemic_expanded, y_test_upper_epistemic_expanded, y_epistemic_dists_expanded = gp_expanded.predict(x_test_three_features, return_epistemic=True, return_epistemic_dists=True)
    
    


    # Compute the regression metrics for the test set
    res_expanded = compute_regression_metrics(y_test, y_test_mean_expanded, y_test_lower_total_expanded, y_test_upper_total_expanded)	
    


    # Create a subplot of 2 rows and 2 columns with a specified figure size
    fig, axs = plt.subplots(2, 2, figsize=(15,10))
    
    # Plot the uncertainties before and after adding the education rate feature
    plot_uncertainty_by_feature(x_test_two_features[:, 0], y_test_mean,
    							y_test_lower_total, y_test_upper_total,
    							y_test_lower_epistemic, y_test_upper_epistemic,
    							xlabel=features[0], ylabel='House price (in k€)',
    							ax=axs[0,0],
    							title="Before adding education rate \nTest data: PICP={:.2f} and MPIW={:.2f}".format(res["picp"], res["mpiw"]))
    							
    plot_uncertainty_by_feature(x_test_three_features[:, 0], y_test_mean_expanded,
    							y_test_lower_total_expanded, y_test_upper_total_expanded,
    							y_test_lower_epistemic_expanded, y_test_upper_epistemic_expanded,
    							xlabel=features[0], ylabel='House price (in k€)',
    							ax=axs[0,1],
    							title="After adding education rate \nTest data: PICP={:.2f} and MPIW={:.2f}".format(res_expanded["picp"], res_expanded["mpiw"]))
    
    # Plotting PICP (Prediction Interval Coverage Probability) by feature for two features test data
    plot_picp_by_feature(x_test_two_features[:, 0], y_test,
    						y_test_lower_total, y_test_upper_total,
    						xlabel=features[0],
    						ax=axs[1,0],
    						ylims=[0.6,1.1],
    						title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    						.format(res["picp"], res["mpiw"]));
    
    # Plotting PICP by feature for three features test data (expanded)
    plot_picp_by_feature(x_test_three_features[:, 0], y_test,
    						y_test_lower_total_expanded, y_test_upper_total_expanded,
    						xlabel=features[0],
    						ax=axs[1,1],
    						ylims=[0.6,1.1],
    						title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    						.format(res_expanded["picp"], res_expanded["mpiw"]));
    


    We clearly see better results after adding the education level, especially for the aleatoric uncertainty.

    # Repeat the process for each feature in three features test data
    plot_picp_by_feature(x_test_three_features[:, 1], y_test,
    						y_test_lower_total_expanded, y_test_upper_total_expanded,
    						xlabel=features[1],
    						title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    						.format(res_expanded["picp"], res_expanded["mpiw"]));	
    


    plot_picp_by_feature(x_test_three_features[:, 2], y_test,
    					y_test_lower_total_expanded, y_test_upper_total_expanded,
    					num_bins=5,
    					xlabel=features[2],
    					title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    					.format(res_expanded["picp"], res_expanded["mpiw"]));
    


    # Calculate and print the PICP coverage for each unique value in the first feature of the three features test data
    for nb in np.unique(x_test_three_features[:,0]):
    	coverage = picp(y_test[x_test_three_features[:,0]==nb], 
    				y_test_lower_total_expanded[x_test_three_features[:,0]==nb], 
    				y_test_upper_total_expanded[x_test_three_features[:,0]==nb])
    	print("The PICP for houses with nb_pieces = {} is {}".format(nb, coverage))	
    

    Thus, adding the taux_education feature helped reduce the MPIW.

    Step 7: Data Augmentation to Reduce Epistemic Uncertainty


    Epistemic uncertainty (i.e., uncertainty related to the model) can be reduced by increasing the amount of data used to train the model. This diversifies the training set and thus improves the model's likelihood.
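
    As a standalone illustration of this effect (a sketch using scikit-learn's GaussianProcessRegressor on a toy 1-D problem, independent of our data set):

    # Sketch: GP predictive standard deviation shrinks as the training set grows
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.RandomState(0)
    x_demo = np.sort(rng.rand(200, 1), axis=0)
    y_demo = np.sin(6 * x_demo).ravel() + 0.1 * rng.randn(200)

    for n in [10, 50, 200]:
        # alpha accounts for the observation noise on the toy data
        gp_demo = GaussianProcessRegressor(alpha=0.01).fit(x_demo[:n], y_demo[:n])
        _, std = gp_demo.predict(x_demo, return_std=True)
        print("n = {:3d}: mean predictive std = {:.3f}".format(n, std.mean()))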

    In our case, we deliberately reduced the diversity of the data by removing most houses with more than 5 rooms and those priced above 300 k€. The effect of this modification was visible in the evaluation metrics, which means the same diagnostics can reveal this kind of problem even when the data set has not been deliberately degraded.

    This time, we will therefore use all the data generated during the second step of our project.

    # Assigning the full train and calibration data to new variables
    x_train_new, x_train_calib_new, y_train_new, y_train_calib_new = x_train_full, x_train_calib_full, y_train_full, y_train_calib	
    


    # Plotting training data
    fig = plt.figure(figsize=(10, 5))
    
    plt.subplot(1, 2, 1)
    plt.scatter(x_train_new[:,0], y_train_new, linewidths=0.5, alpha=0.7, edgecolor='black', s = 40)
    plt.xlabel(features[0])
    plt.ylabel('House price (in k€)')
    plt.title('Train data')
    
    plt.subplot(1, 2, 2)
    plt.scatter(x_train_new[:,1], y_train_new, linewidths=0.5, alpha=0.7, edgecolor='black', s = 40)
    plt.xlabel(features[1])
    plt.ylabel('House price (in k€)')
    plt.title('Train data')
    


    We clearly see that the data are now distributed more "homogeneously", with no cases left out.

    # Fitting Homoscedastic Gaussian Process Regression model
    gp_new = HomoscedasticGPRegression()
    gp_new.fit(x_train_new, y_train_new.reshape(-1,1))
    


    # Predicting test data using the model and also retrieving epistemic uncertainty information
    sortie_modele = gp_new.predict(x_test_three_features, return_epistemic=True, return_epistemic_dists=True)
    
    y_test_mean_new = sortie_modele.y_mean
    y_test_lower_total_new = sortie_modele.y_lower
    y_test_upper_total_new = sortie_modele.y_upper
    y_test_lower_epistemic_new = sortie_modele.y_lower_epistemic
    y_test_upper_epistemic_new = sortie_modele.y_upper_epistemic
    y_epistemic_dists_new = sortie_modele.y_epistemic_dists
    


    res_full = compute_regression_metrics(y_test, y_test_mean_new, y_test_lower_total_new, y_test_upper_total_new)
    


    # Calculate and print the PICP coverage for each unique value in the first feature of the three features test data after fitting new model
    for nb in np.unique(x_test_three_features[:,0]):
    	coverage = picp(y_test[x_test_three_features[:,0]==nb], 
    				y_test_lower_total_new[x_test_three_features[:,0]==nb], 
    				y_test_upper_total_new[x_test_three_features[:,0]==nb])
    	print("The PICP for houses with nb_pieces = {} is {}".format(nb, coverage))	
    


    # Plotting uncertainty by feature for three features test data before and after adding training data
    fig, axs = plt.subplots(2, 2,figsize=(15,10))
    
    plot_uncertainty_by_feature(x_test_three_features[:, 0], y_test_mean_expanded,
    							y_test_lower_total_expanded, y_test_upper_total_expanded,
    							y_test_lower_epistemic_expanded, y_test_upper_epistemic_expanded,
    							xlabel=features[0], ylabel='House price (in k€)',
    							ax=axs[0,0],
    							title="Before adding training data \nTest data: PICP={:.2f} and MPIW={:.2f}"\
    							.format(res_expanded["picp"], res_expanded["mpiw"]));
    
    plot_uncertainty_by_feature(x_test_three_features[:, 0], y_test_mean_new,
    							y_test_lower_total_new, y_test_upper_total_new,
    							y_test_lower_epistemic_new, y_test_upper_epistemic_new,
    							xlabel=features[0], ylabel='House price (in k€)',
    							ax=axs[0,1],
    							title="After adding training data \nTest data: PICP={:.2f} and MPIW={:.2f}"\
    							.format(res_full["picp"], res_full["mpiw"]));
    
    # Plotting PICP by feature for the test data before and after adding training data
    plot_picp_by_feature(x_test_three_features[:, 0], y_test,
    						y_test_lower_total_expanded, y_test_upper_total_expanded,
    						xlabel=features[0],
    						ax=axs[1,0],
    						ylims=[0.6,1.1],
    						title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    						.format(res_expanded["picp"], res_expanded["mpiw"]));
    
    plot_picp_by_feature(x_test_three_features[:, 0], y_test,
    						y_test_lower_total_new, y_test_upper_total_new,
    						xlabel=features[0],
    						ax=axs[1,1],
    						ylims=[0.6,1.1],
    						title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    						.format(res_full["picp"], res_full["mpiw"]));
    


    We can see several things:
  • the PICP fluctuates around 95% (which is what we wanted; in particular, it did not decrease)
  • the MPIW remains at 65.02 (in particular, it did not increase)
  • the objective is achieved: epistemic uncertainty has been greatly reduced (the orange portion of the prediction interval)



    Step 8: Interpreting and Accounting for Uncertainties



    Prices are predicted with a 95% confidence interval. There are many ways to account for the uncertainties (a minimal sketch of the first option follows this list):
  • Range of interval (verbal): easy to read at a glance, but may miss the details of how the possible values are distributed within the range.
  • Probability density plot: gives detailed information, with a visualization of how the possible values are distributed within the prediction interval.
  • Quantile dot plot: shows the distribution with a visualization that makes it easier to judge the relative likelihood of where the possible values may fall.
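
    For the first option, a minimal sketch that turns the interval arrays from step 7 into verbal ranges:

    # Sketch: verbal reporting of 95% prediction intervals
    lo = np.ravel(y_test_lower_total_new)
    hi = np.ravel(y_test_upper_total_new)
    for i in range(3):
        print("House {}: between {:.0f} and {:.0f} k€ (95% interval)".format(i, lo[i], hi[i]))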

    # Re-plotting the PICP for the second feature of the two-feature test data, but now with the new prediction results
    plot_picp_by_feature(x_test_two_features[:, 1], y_test,
    						y_test_lower_total_new, y_test_upper_total_new,
    						xlabel=features[1],
    						title="Test data: PICP={:.2f} and MPIW={:.2f}"\
    						.format(res_full["picp"], res_full["mpiw"]));
    
    # Sorting the expanded and new epistemic distributions based on their standard deviation
    sorted_expanded = np.argsort([dist.std() for dist in y_epistemic_dists_expanded])
    sorted_new = np.argsort([dist.std() for dist in y_epistemic_dists_new])
    
    # Initialize a subplot grid
    fig, axs = plt.subplots(2, 2,figsize=(15,10))
    
    # Plotting uncertainty distribution of the most certain prediction before and after adding training data
    plot_uncertainty_distribution(y_epistemic_dists_expanded[sorted_expanded[0]], show_quantile_dots=True, 
    								qd_sample=20, qd_bins=7, ax=axs[0,0],
    								xlabel = "Real estate price (in k€)",
    								ylabel = "Probability density",
    								title="Before: Most certain (with a standard deviation of {:.2f} k€)"\
    								.format(y_epistemic_dists_expanded[sorted_expanded[0]].std()));
    
    plot_uncertainty_distribution(y_epistemic_dists_new[sorted_new[0]], show_quantile_dots=True, 
    								qd_sample=20, qd_bins=7, ax=axs[0,1],
    								xlabel = "Real estate price (in k€)",
    								ylabel = "Probability density",
    								title="After: Most certain (with a standard deviation of {:.2f} k€)"\
    								.format(y_epistemic_dists_new[sorted_new[0]].std()));
    
    # Plotting uncertainty distribution of the most uncertain prediction before and after adding training data
    plot_uncertainty_distribution(y_epistemic_dists_expanded[sorted_expanded[-1]], show_quantile_dots=True, 
    								qd_sample=20, qd_bins=7, ax=axs[1,0],
    								xlabel = "Real estate price (in k€)",
    								ylabel = "Probability density",
    								title="Before: Most uncertain (with a standard deviation of {:.2f} k€)"\
    								.format(y_epistemic_dists_expanded[sorted_expanded[-1]].std()));
    
    plot_uncertainty_distribution(y_epistemic_dists_new[sorted_new[-1]], show_quantile_dots=True, 
    								qd_sample=20, qd_bins=7, ax=axs[1,1],
    								xlabel = "Real estate price (in k€)",
    								ylabel = "Probability density",
    								title="After: Most uncertain (with a standard deviation of {:.2f} k€)"\
    								.format(y_epistemic_dists_new[sorted_new[-1]].std()));	
    


    We see that recalibration and the reduction of aleatoric and epistemic uncertainties help reduce the standard deviations of our predictions.

    # Creating a new figure
    figure(figsize=(15,8), dpi=80)
    
    # Initializing lists to store true prices, predicted prices, and indices
    Y_prix_vrais = []
    Y_prix_preds = []
    X_index = []
    
    # Initializing variables to calculate average true price and average predicted price
    moyenne_vraie = 0
    moyenne_preds = 0
    
    # For the first 400 predictions, store true prices, predicted prices, and calculate averages
    for i in range(400):
    	X_index.append(i)
    	Y_prix_vrais.append(y_test[sorted_new[i]])
    	Y_prix_preds.append(y_test_mean_new[sorted_new[i]])
    	moyenne_vraie += y_test[sorted_new[i]]
    	moyenne_preds += y_test_mean_new[sorted_new[i]]
    
    # Printing average true price and average predicted price
    print("Average of true prices = ", moyenne_vraie/400)
    print("Average of predicted prices = ", moyenne_preds/400)
    
    # Plotting true prices and predicted prices
    plt.plot(X_index, Y_prix_vrais, color='red', alpha=0.5, linewidth=0.7, label='True prices')
    plt.plot(X_index, Y_prix_preds, color='blue', alpha=0.5, linewidth=0.7, label='Predicted prices')
    plt.legend()
    plt.show()
    
    # Printing the true price, predicted price, and prediction uncertainty for the first 10 predictions
    print("For 10 houses:\n")
    for i in range(10):
    	print("True price: {:.2f} k€, Prediction: {:.2f} +/- {:.2f} k€, "
    		  "number of rooms: {:.2f}, crime rate: {:.2f}, education rate: {:.2f}".format(
    			y_test[sorted_new[i]], y_test_mean_new[sorted_new[i]],
    			2.0 * y_epistemic_dists_new[sorted_new[i]].std(),
    			x_test_full[sorted_new[i],0], x_test_full[sorted_new[i],1], x_test_full[sorted_new[i],2]))