[ANN] To Deposit or Not Deposit? --

Hi guys, good to see you again! This is my third project: a neural network model that predicts whether a customer will invest in a term deposit at a bank, which can help guide or adjust a company's business strategy. Credits to Kunal Gupta for the dataset! Without further ado, let's begin:

[MODEL UPDATE]

Applying SMOTE+ENN resamples the data by oversampling the minority class (subscribing to a term deposit) and undersampling the majority class (not subscribing), bringing the output classes to a balanced proportion:

Although the improvement in overall accuracy is mild (90% to 91%), the precision and recall scores for positive cases improved significantly:

[Figures: classification report and confusion matrix (smenn_conmat); accuracy plot and loss plot]

Model Accuracy

This model achieved 90% accuracy (1 = subscribes to a term deposit, 0 = does not subscribe):

[Figures: classification report and confusion matrix]

Performance Plot

The accuracy and loss plots are displayed on TensorBoard, as shown below:

Model Architecture

The model is built with 2 hidden layers (64 and 32 nodes), with Dropout and BatchNormalization added to both layers:
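A minimal Keras sketch of such an architecture could look like the following. The layer sizes come from the description above; the ReLU activations, the dropout rate of 0.3, the ordering of BatchNormalization and Dropout, the number of input features and the helper name `build_model` are assumptions for illustration only.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(n_features, n_classes=2, dropout_rate=0.3):
    """Two hidden layers (64 and 32 nodes), each followed by
    BatchNormalization and Dropout. dropout_rate is an assumed value."""
    return keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        layers.Dense(32, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        # Two softmax outputs, matching a one-hot encoded binary target.
        layers.Dense(n_classes, activation="softmax"),
    ])


# n_features=6 is a placeholder, not the project's actual feature count.
model = build_model(n_features=6)
model.summary()
```

The two-unit softmax output is chosen here because the target is expanded into two columns during preprocessing; a single sigmoid unit would be an equally valid design for a binary label.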

The optimizer, validation method and callback parameters are as follows:

Dataset Summary

The dataset consists of 31647 rows of observations and 18 columns:

Index(['id', 'customer_age', 'job_type', 'marital', 'education', 'default', 'balance', 'housing_loan', 'personal_loan', 'communication_type', 'day_of_month', 'month', 'last_contact_duration', 'num_contacts_in_campaign', 'days_since_prev_campaign_contact', 'num_contacts_prev_campaign', 'prev_campaign_outcome', 'term_deposit_subscribed'], dtype='object')

Null Values

'days_since_prev_campaign_contact' has the highest number of null values, followed by 'customer_age', 'balance', 'last_contact_duration', 'marital', 'personal_loan' and 'num_contacts_in_campaign'. NaNs account for 82% of the 'days_since_prev_campaign_contact' column, so this column is dropped.
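The null-value check can be sketched with pandas as below. The toy DataFrame and the 50% drop threshold are assumptions for the demo; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (assumption) reusing a few of the dataset's column names.
df = pd.DataFrame({
    "customer_age": [35, np.nan, 41, 29, 52],
    "balance": [1200.0, -50.0, np.nan, 300.0, 8000.0],
    "days_since_prev_campaign_contact": [np.nan, np.nan, np.nan, np.nan, 90.0],
})

# Fraction of missing values per column, largest first.
null_share = df.isna().mean().sort_values(ascending=False)
print(null_share)

# Drop columns that are mostly empty (threshold is a hypothetical choice).
threshold = 0.5
to_drop = null_share[null_share > threshold].index.tolist()
df = df.drop(columns=to_drop)
```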

Negative Values

Negative values are observed in the 'balance' column; these are acceptable, since they correspond to overdrawn accounts.

Duplicates

No duplicate observations are found in the dataset.
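Both sanity checks (negative balances and duplicate rows) take one line each in pandas. The sketch below uses a small made-up DataFrame, so the counts it prints are for the toy data and not for the real dataset.

```python
import pandas as pd

# Toy data (assumption) reusing a few of the dataset's column names.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "balance": [1500.0, -200.0, 0.0, 320.0],
    "housing_loan": ["yes", "no", "no", "yes"],
})

# Rows with a negative balance (overdrawn accounts).
n_negative = int((df["balance"] < 0).sum())

# Rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

print("negative balances:", n_negative)
print("duplicate rows:", n_duplicates)
```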

Label Proportions

Dataset is not balanced:
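The class proportions can be checked with `value_counts(normalize=True)`. The label values below are made up for the demo and do not reflect the real ratio in the dataset.

```python
import pandas as pd

# Toy target column (assumption): 1 = subscribed, 0 = not subscribed.
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="term_deposit_subscribed")

# Share of each class, as a fraction of all rows.
proportions = target.value_counts(normalize=True)
print(proportions)
```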

Observations

Some early observations pulled from graphs:

Participants are mostly in the 30-40 age range.
Most participants have a balance of less than 10000.
The highest success rate appears to be in May-Aug.
Most participants who subscribed to a term deposit have secondary/tertiary education.
Participants with a housing/personal loan are not likely to have a term deposit.
Promotion via cellular contact has the highest success rate for term deposit subscriptions.

Data Cleaning

A few steps were taken to clean the data:

Encode categorical columns as numbers using LabelEncoder()
Drop 'id' and 'days_since_prev_campaign_contact' columns
Impute null values using IterativeImputer()
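The three cleaning steps can be sketched as follows. The toy DataFrame, the choice to leave missing categories as NaN before imputation, and `random_state=0` are assumptions for illustration; the project's own code may differ.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to import IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import LabelEncoder

# Toy data (assumption) reusing a few of the dataset's column names.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "customer_age": [35.0, np.nan, 41.0, 29.0, 52.0, 47.0],
    "marital": ["married", "single", None, "married", "divorced", "single"],
    "balance": [1200.0, -50.0, 300.0, np.nan, 8000.0, 150.0],
    "days_since_prev_campaign_contact": [np.nan, np.nan, 90.0, np.nan, np.nan, np.nan],
})

# 1. Encode categorical columns, leaving missing entries as NaN
#    so that the imputer can fill them later.
for col in ["marital"]:
    mask = df[col].notna()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    encoded.loc[mask] = LabelEncoder().fit_transform(df.loc[mask, col].astype(str))
    df[col] = encoded

# 2. Drop the identifier and the mostly empty column.
df = df.drop(columns=["id", "days_since_prev_campaign_contact"])

# 3. Impute the remaining nulls; each feature is modelled from the others.
imputer = IterativeImputer(random_state=0)
clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(clean)
```

Note that imputed values for an encoded categorical column are continuous, so they may need rounding to the nearest valid code before being treated as categories again.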

Feature Selection

Logistic Regression and Cramer's V are used to analyse the correlation of numeric and categorical features, respectively, with the target column:

Logistic Regression: All numeric features listed in `'num_data'` reached at least 89% accuracy, suggesting that the numeric features are well correlated with the target feature.

Cramer's V: Since all categorical features show negligible to weak correlation with the output, they are discarded.
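To make the two measures concrete, here is a minimal sketch of each. The helper names `single_feature_accuracy` and `cramers_v` are hypothetical, the Cramer's V shown is the plain version without bias correction, and the demo data are synthetic; the project's `CramersV.py` may implement this differently.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.linear_model import LogisticRegression


def single_feature_accuracy(x, y):
    """Training accuracy of a logistic regression fitted on one numeric feature."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    model = LogisticRegression().fit(X, y)
    return model.score(X, y)


def cramers_v(a, b):
    """Plain (not bias-corrected) Cramer's V between two categorical series."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    if k == 0:  # a constant column carries no association
        return 0.0
    return float(np.sqrt(chi2 / (n * k)))


# Synthetic demo data (assumption): one related numeric feature,
# one unrelated categorical feature.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=200))
duration = y * 3.0 + rng.normal(size=200)
job = pd.Series(rng.choice(["a", "b", "c"], size=200))

print("accuracy from 'duration':", single_feature_accuracy(duration, y))
print("Cramer's V for 'job':", cramers_v(job, y))
```

Cramer's V ranges from 0 (no association) to 1 (perfect association), which is what makes a "negligible to weak" reading possible for the categorical features.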

Preprocessing

Preprocessing involves two steps:

Expand the target column into two columns using OneHotEncoder(), since the output is binary.
Create train and test datasets using train_test_split()
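The two preprocessing steps can be sketched as below. The random feature matrix, the 70/30 split and `random_state=123` are assumptions for the demo rather than the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic features and binary labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# 1. Expand the binary target into two one-hot columns.
#    OneHotEncoder expects a 2-D input and returns a sparse matrix by default.
encoder = OneHotEncoder()
y_onehot = encoder.fit_transform(y.reshape(-1, 1)).toarray()

# 2. Split into train and test sets (test_size is a hypothetical choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=0.3, random_state=123
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```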

Discussion

Much to my surprise, 'communication_type', 'education', 'housing_loan' and 'personal_loan' are not important features for the model. Instead, 'day_of_month' turned out to be one of the important features, which is beyond my comprehension. 😅

Nevertheless, the result suggests that the day of the month plays an important part in securing a term deposit subscription. The weak correlation of the other features may be an effect of the unbalanced dataset, hence it is suggested to:

Apply the SMOTE method to give the classes equal proportions in the dataset.

KTong06/-ANN-Term_Deposit_Predict-