GitHub - ipshitag/Life-Expectancy-prediction: Predicting Life expectancy, based on different parameters using linear regression, in DLTK_AI framework.

Life Expectancy Prediction using Linear Regression

Aim:

The main aim of the project is to predict the life expectancy of people according to various parameters.

Data:

The dataset consisted of 22 columns and 2056 rows. The different columns were as follows:

Cleaning Dataset:

a. Null Values

For cleaning nulls, three methods were used: imputation, grouped mean and dropping.

Imputation

For columns such as Schooling, Alcohol, Income, BMI, Population and thinness, missing values were estimated from other columns that were positively correlated with them.

Group mean

Columns without a clearly visible correlation were filled with the mean of the values grouped by country. If the number of nulls of a given group was less than 10, the entire column's mean was used as filler.

Dropping

Specific columns, like Life expectancy and Adult Mortality, had a low number of nulls, so the rows containing those nulls were dropped.
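The grouped-mean and dropping steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the helper name `fill_with_group_mean`, the toy values, and the simplified fallback rule (the column mean is used only when a country has no observed value at all, rather than the threshold of 10 described above) are all assumptions.

```python
import pandas as pd


def fill_with_group_mean(df, column, group_col="Country"):
    """Fill nulls with the mean of the same group, then with the column mean.

    Hypothetical helper: the fallback to the column mean is applied only
    when a group has no observed values (a simplification of the README rule).
    """
    group_mean = df.groupby(group_col)[column].transform("mean")
    return df[column].fillna(group_mean).fillna(df[column].mean())


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Polio": [80.0, None, 90.0, None, None, 60.0],
    "Life expectancy": [70.0, 71.0, None, 65.0, 66.0, 60.0],
})

# Grouped mean: country A's gap becomes 85.0, country B falls back to the column mean.
df["Polio"] = fill_with_group_mean(df, "Polio")

# Dropping: remove the few rows where the target itself is missing.
cleaned = df.dropna(subset=["Life expectancy"]).reset_index(drop=True)
print(cleaned)
```

Filling before dropping means that a row which is later removed can still contribute to the group mean; the order used in the notebook is not stated here, so treat it as a design choice of this sketch.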

b. Outliers

The Z-score of the values in each column was calculated; if the absolute value of a Z-score was greater than the threshold, the value was replaced with the country mean.
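A possible pandas sketch of this outlier step is shown below. The threshold value is not stated above, so it is a parameter here; the function name, the `GDP` toy column, and the choice to compute the country mean from the non-outlier values only are assumptions made for illustration.

```python
import pandas as pd


def replace_outliers_with_group_mean(df, column, group_col="Country", threshold=3.0):
    """Replace values whose |Z-score| exceeds `threshold` with the group mean.

    Hypothetical helper: the Z-score is computed over the whole column, and
    the group mean ignores the flagged outliers (an assumption).
    """
    values = df[column]
    z = (values - values.mean()) / values.std(ddof=0)
    is_outlier = z.abs() > threshold
    # Mask the outliers so that they do not distort the replacement value.
    group_mean = values.mask(is_outlier).groupby(df[group_col]).transform("mean")
    return values.where(~is_outlier, group_mean)


# Toy data, made up for illustration only; 100.0 is the outlier.
df = pd.DataFrame({
    "Country": ["A"] * 5 + ["B"] * 5,
    "GDP": [10.0, 11.0, 9.0, 10.0, 100.0, 20.0, 21.0, 19.0, 20.0, 22.0],
})
result = replace_outliers_with_group_mean(df, "GDP", threshold=2.5)
print(result.tolist())
```

With only ten rows, a Z-score cannot reach 3, which is why the example passes a lower threshold than the default.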

Exploratory Data Analysis:

Boxplots of all the columns,

HeatMap between different values,

It can be observed that there are a few very high positive correlations, such as between 'Expenditure' and 'GDP'.

For 'Life expectancy', the correlations with 'Schooling' and 'Income composition of resources' are very high.
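The numbers behind such a heatmap are pairwise correlation coefficients. A minimal sketch with pandas, assuming Pearson correlation and using made-up toy values, ranks the columns by their correlation with the target:

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "Life expectancy": [60.0, 65.0, 70.0, 75.0, 80.0],
    "Schooling": [8.0, 10.0, 12.0, 14.0, 16.0],
    "Adult Mortality": [300.0, 250.0, 220.0, 150.0, 100.0],
})

# Pearson correlation matrix (the pandas default); a heatmap visualises this table.
corr = df.corr()

# Correlation of every other column with the target, strongest positive first.
ranked = (
    corr["Life expectancy"]
    .drop("Life expectancy")
    .sort_values(ascending=False)
)
print(ranked)
```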

Important Distribution Plots:

PairPlot of the dataset:

The image can be seen at https://ibb.co/BK6CLTg

From the image, we can see that many values are linearly dependent.

The target variable, 'Life expectancy', is linearly related to 'Income composition of resources', 'Schooling', 'HIV/AIDS' and 'Adult Mortality'.

These features will be essential factors in the model.

Model 1 (Baseline Model)

Task: Regression

Library: Weka

Algorithm: Linear Regression

R2 Score: 0.7663755802699652 (Test)

For the baseline model, the features chosen were,

'Adult Mortality',

'Alcohol',

' BMI ',

'under-five deaths ',

'Polio',

' HIV/AIDS',

' thinness 1-19 years',

'Schooling',

'Income composition of resources'
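The R2 scores reported for each model are read here as the standard coefficient of determination, 1 - SS_res / SS_tot, computed on the test split. That reading is an assumption, since the exact scoring routine is not shown. A small plain-Python sketch with made-up values:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration only.
y_true = [60.0, 70.0, 80.0]
y_pred = [62.0, 70.0, 78.0]
score = r2_score(y_true, y_pred)
print(round(score, 4))  # 0.96
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than predicting the mean of the test targets.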

Model 2

Task: Regression

Library: Weka

Algorithm: Random Forest

R2 Score: 0.9494861514474349 (Test)

The features that were used were:

'Adult Mortality',

'Alcohol',

' BMI ',

'under-five deaths ',

'Polio',

' HIV/AIDS',

' thinness 1-19 years',

'infant deaths',

'Schooling',

'Total expenditure',

'Measles ',

'Diphtheria ',

'Income composition of resources'

Model 3

Task: Regression

Library: Weka

Algorithm: Random Forest

R2 Score: 0.9535801444908933 (Test)

The features used were

'Income composition of resources',

'Schooling',

' thinness 1-19 years',

' HIV/AIDS',

'Adult Mortality'

Model 4

Task: Regression

Library: Weka

Algorithm: Random Forest

R2 Score: 0.962142599298509 (Test)

The features used were,

'Measles ',

'percentage expenditure',

'infant deaths',

'Diphtheria ',

'Total expenditure',

'Population',

' HIV/AIDS',

'Schooling',

'Hepatitis B'

Model 5

Task: Regression

Library: Weka

Algorithm: Random Forest

R2 Score: 0.9673519085487334 (Test)

The features used for the model were,

'Adult Mortality',

'Hepatitis B',

' thinness 1-19 years',

' HIV/AIDS',

'Diphtheria ',

'Income composition of resources',

'infant deaths',

'Alcohol'

Model 6

Task: Regression

Library: Weka

Algorithm: Random Forest

R2 Score: 0.9702330510703981 (Test)

The features used were,

'Adult Mortality',

'Hepatitis B',

' thinness 1-19 years',

'Year',

'under-five deaths ',

'Polio',

' HIV/AIDS',

'Diphtheria ',

'Income composition of resources',

'infant deaths',

'Alcohol'

Summary:

First, the dataset was cleaned and scaled. For scaling, a min-max scaler was used.

Null values were handled using three different techniques (imputation, grouped mean and dropping), and outliers were replaced with the country mean based on their Z-scores.

Specific attributes were found to be highly correlated to the target variable, which was 'Life Expectancy'.

Models were built by varying the algorithm and the set of input features to obtain the highest test score.

The highest R2 score reached was 0.9702 on test data (Model 6).
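The min-max scaling mentioned in this summary maps each column linearly onto a fixed range. The sketch below assumes the common target range of 0 to 1 and uses made-up values; the function name is hypothetical.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy values, made up for illustration only.
scaled = min_max_scale([50.0, 65.0, 80.0])
print(scaled)  # [0.0, 0.5, 1.0]
```

To avoid leaking information, the minimum and maximum would normally be taken from the training split and reused for the test split; whether the notebook does this is not stated here.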

Future works:

Deploying the model

- Deploying the model will be a huge step, making interactions between the model and end users easier. Additional data can also be collected. Using AWS or other platforms to deploy the model will be useful.

Use of better algorithms

- The only algorithms used are Random Forest and Linear Regression; trying other algorithms may improve the results.

Trying to improve imputation

- Since there are many null values, a better way to fill in those values may result in higher accuracy.

Business Idea:

The project can be used in two different domains: healthcare and insurance.

Healthcare Domain:

In the healthcare domain, the model can help governments and hospitals understand the significant features that affect life expectancy and how they can be addressed. Governments can conduct specific awareness programs to improve life expectancy.

Insurance Company:

Insurance companies can plan their packages based on the life expectancy of individuals. Different kinds of offers and packages can be given accordingly so that the person is more likely to buy them.

Contributors

ThomasKutty Reji

Ipshita Ghosh
