Bank Loan Case Study: Default Risk & Credit Analysis

Loan defaults cost financial institutions billions of dollars each year, yet many approval decisions still rely on rules of thumb rather than rigorous data analysis. This project examines a real-world bank loan dataset to understand what drives defaults, identify the highest-risk borrower profiles, and build a predictive model that can support more consistent, data-backed credit decisions. The goal is not to replace underwriters but to give them better information.

Dataset overview

The dataset contains records of individual loan applications and their outcomes, with each row representing one applicant. The available fields are:

Field	Description
Applicant ID	Anonymized unique identifier
Age	Applicant age in years
Income	Annual gross income
Employment Status	Employed, Self-employed, or Unemployed
Employment Length	Years at current employer
Loan Amount	Total amount requested
Loan Term	Repayment period in months (36 or 60)
Interest Rate	Annual percentage rate assigned
Loan Purpose	Stated reason (debt consolidation, home improvement, etc.)
Debt-to-Income Ratio (DTI)	Monthly debt payments as a proportion of monthly income
Credit History Length	Number of years since oldest credit account opened
Number of Open Accounts	Total active credit lines
Derogatory Marks	Number of negative items on credit report
Default Status	Binary target — 1 if the loan defaulted, 0 if repaid

The dataset is moderately imbalanced: defaults account for roughly 20–25% of records, which required careful handling during model training to avoid a classifier that simply predicts “no default” for everything.
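To illustrate why the imbalance matters, here is a minimal sketch using synthetic labels with an assumed 22% default rate (an illustrative value inside the range above, not the dataset itself). A baseline that always predicts "no default" scores well on accuracy while catching no defaults at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 1 = default, 0 = repaid. The 22% rate is an assumption
# chosen to fall inside the 20-25% range described above.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.22).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# "most_frequent" always predicts the majority class (no default).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

accuracy = accuracy_score(y, pred)
default_recall = recall_score(y, pred, pos_label=1, zero_division=0)
print(f"accuracy={accuracy:.3f}, recall on defaults={default_recall:.3f}")
```

Accuracy alone would make this baseline look acceptable, which is why the evaluation focuses on metrics for the default class.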

Methodology

Problem definition

The business problem was framed before any data was touched: predict the probability that a given applicant will default, and identify which features most strongly predict that outcome. This framing drove every subsequent modeling and evaluation choice. The task was treated as binary classification with a focus on recall for the positive (default) class, because a missed default costs the bank more than a rejected good applicant.
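One way to act on a recall-first objective is to choose the decision threshold from the precision-recall curve instead of defaulting to 0.5. The sketch below is illustrative only: the helper `threshold_for_recall`, the 0.80 recall target, and the synthetic scores are all assumptions, not values from the project.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_score, target_recall):
    """Return (threshold, precision, recall) for the highest score
    threshold whose recall on the positive class is >= target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall carry one extra trailing point (recall = 0) that has
    # no matching threshold, so drop it to align the three arrays.
    precision, recall = precision[:-1], recall[:-1]
    ok = np.where(recall >= target_recall)[0]
    best = ok[-1]  # recall is non-increasing as the threshold rises
    return thresholds[best], precision[best], recall[best]

# Synthetic scores from a hypothetical model: defaults tend to score higher.
rng = np.random.default_rng(42)
y_true = (rng.random(5_000) < 0.22).astype(int)
noise = rng.normal(0, 0.15, size=5_000)
y_score = np.clip(0.25 + 0.25 * y_true + noise, 0, 1)

thr, prec, rec = threshold_for_recall(y_true, y_score, target_recall=0.80)
flagged = y_score >= thr
achieved_recall = flagged[y_true == 1].mean()
print(f"threshold={thr:.3f}, precision={prec:.3f}, recall={rec:.3f}")
```

Lowering the threshold raises recall at the expense of precision, so the target should reflect the relative cost of a missed default versus a declined good applicant.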

Data cleaning & EDA

The raw data was inspected for missing values, impossible values, and outliers. Key cleaning actions:

Imputed missing interest rate values using the median rate within each loan grade bucket
Capped extreme income outliers at the 99th percentile to prevent them from distorting scale-sensitive computations such as distance calculations
Verified that no target leakage existed — features that would only be known after a loan was issued were excluded
Produced univariate distributions and default rate breakdowns for every feature to understand their individual predictive signal
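The first two cleaning actions can be sketched in pandas roughly as follows. The column names (`loan_grade`, `interest_rate`, `income`) and the toy values are hypothetical stand-ins; the real column names are not given in this write-up.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are hypothetical stand-ins.
df = pd.DataFrame({
    "loan_grade": ["A", "A", "A", "B", "B", "B"],
    "interest_rate": [6.0, np.nan, 8.0, 12.0, 14.0, np.nan],
    "income": [40_000, 52_000, 61_000, 38_000, 45_000, 2_500_000],
})

# Fill each missing rate with the median rate of its own grade bucket.
grade_median = df.groupby("loan_grade")["interest_rate"].transform("median")
df["interest_rate"] = df["interest_rate"].fillna(grade_median)

# Cap income at the 99th percentile (upper tail only).
income_cap = df["income"].quantile(0.99)
df["income"] = df["income"].clip(upper=income_cap)

print(df)
```

In a modeling pipeline, the medians and the cap would normally be computed on the training split only and then applied to validation data, so that no information leaks across the split.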

Risk factor identification

Before modeling, a set of risk segmentation analyses was conducted to build intuition:

Default rate by DTI bucket (low / medium / high / very high)
Default rate by employment status and length
Default rate by loan purpose
Default rate by credit history length quartile
Correlation analysis between numeric features and the default flag

These analyses directly informed feature selection and the business recommendations developed later.
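A segmentation table of the kind listed above can be produced with `pd.cut` and a grouped aggregation. This is a minimal sketch: the bucket edges, column names, and ten toy rows are illustrative assumptions, not the study's actual cut points or data.

```python
import pandas as pd

# Toy data; "dti" is a fraction of income and "default" is the 0/1 target.
df = pd.DataFrame({
    "dti": [0.08, 0.12, 0.18, 0.22, 0.27, 0.31, 0.36, 0.41, 0.44, 0.52],
    "default": [0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
})

# Bucket edges are illustrative assumptions.
bins = [0.0, 0.15, 0.25, 0.35, 1.0]
labels = ["low", "medium", "high", "very high"]
df["dti_bucket"] = pd.cut(df["dti"], bins=bins, labels=labels, include_lowest=True)

# Applicant count and default rate per bucket.
summary = (
    df.groupby("dti_bucket", observed=False)["default"]
      .agg(applicants="size", default_rate="mean")
)
print(summary)
```

Reporting the applicant count next to each rate helps avoid over-reading buckets that contain only a handful of records.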

Model building

Four classification models were trained and compared using stratified 5-fold cross-validation:

Logistic Regression (interpretable baseline)
Decision Tree Classifier
Random Forest Classifier
Gradient Boosting Classifier

Class imbalance was addressed with class_weight='balanced' for the logistic regression and SMOTE oversampling for the three tree-based models. Models were evaluated on ROC AUC, together with precision, recall, and F1 score for the positive (default) class.
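The comparison loop can be sketched with scikit-learn alone, as below. This is not the project's actual code: the data comes from `make_classification` with an assumed 22% positive rate, the hyperparameters are placeholders, and the SMOTE step is deliberately left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data: roughly 22% positives (assumed rate).
X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=5,
    weights=[0.78, 0.22], random_state=0,
)

# Hyperparameters are placeholders, not tuned values from the project.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1_000),
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

To add SMOTE to the tree-based models, one common approach is to wrap the sampler and the classifier in the `Pipeline` class from the imbalanced-learn package, so that oversampling is fitted only on the training portion of each fold and the validation fold keeps its original class balance.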

Business recommendations

The final model’s feature importances and the risk segmentation analysis were translated into actionable policy recommendations for the loan approval process. These are presented in the business impact section below.

Key findings

High-risk factors to watch: Applicants with a debt-to-income ratio above 35% default at nearly three times the rate of applicants with a DTI below 15%. Applicants who combine a credit history shorter than two years with more than two derogatory marks on their report form the single highest-risk segment in the dataset, with a default rate above 45%. Both risk factors should trigger enhanced scrutiny in any approval workflow.

Loan purpose matters: Debt consolidation loans — the most common purpose in the dataset — have a higher-than-average default rate despite often going to borrowers with above-average income. This suggests that applicants using loans to roll over existing debt may already be in financial distress at the time of application.

Additional findings:

Interest rate is highly correlated with default, but this is partly a reflection of the bank’s own risk-based pricing: higher-risk applicants are already charged more. This makes interest rate a useful signal but a potentially circular one for modeling purposes.
Employment length shows a non-linear relationship with default risk — applicants with less than one year of employment and those with more than ten years both show lower default rates than the 2–5 year group, which contains a mix of career-changers and mid-career earners.
Loan term is a significant predictor: 60-month loans default at roughly 1.6× the rate of 36-month loans even after controlling for loan amount and borrower income.

Business impact

The predictive model and risk segmentation analysis support several concrete improvements to the loan approval process:

Tiered review thresholds

Introduce a three-tier review system — auto-approve, human review, and auto-decline — based on model probability scores rather than binary cut-offs.
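A three-tier rule of this kind reduces to two probability cut-offs. In the sketch below, the function name `review_tier` and the cut-offs 0.10 and 0.40 are hypothetical placeholders; real values would be set from the bank's loss tolerance and the model's calibration.

```python
def review_tier(p_default, approve_below=0.10, decline_above=0.40):
    """Map a predicted default probability to a review tier.

    The default cut-offs are hypothetical placeholders, not values
    recommended by the analysis.
    """
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be a probability in [0, 1]")
    if p_default < approve_below:
        return "auto-approve"
    if p_default > decline_above:
        return "auto-decline"
    return "human review"

tiers = [review_tier(p) for p in (0.03, 0.25, 0.62)]
print(tiers)  # ['auto-approve', 'human review', 'auto-decline']
```

Widening the gap between the two cut-offs sends more applications to underwriters, which trades review workload against the risk of automated mistakes.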

DTI hard caps

Implement a hard DTI cap at 40% for unsecured personal loans, with exceptions only after manual underwriter review.

Credit history minimums

Require a minimum credit history length of 18 months for loan amounts above a defined threshold.

Loan purpose flags

Flag debt consolidation applications for additional income verification, given their elevated default rate relative to stated income.
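The three rule-based recommendations above (DTI cap, credit history minimum, and loan purpose flag) could be expressed as a simple pre-screening function. This is a sketch under stated assumptions: the `Application` fields and flag names are invented for illustration, every application is assumed to be an unsecured personal loan, and `LARGE_LOAN_THRESHOLD` is a hypothetical value because the recommendation leaves that threshold undefined.

```python
from dataclasses import dataclass

@dataclass
class Application:
    dti: float                  # debt-to-income ratio as a fraction, e.g. 0.42
    credit_history_months: int
    loan_amount: float
    purpose: str

DTI_CAP = 0.40                  # 40% cap from the recommendation
MIN_HISTORY_MONTHS = 18         # 18-month minimum from the recommendation
LARGE_LOAN_THRESHOLD = 15_000   # hypothetical; not specified in the analysis

def policy_flags(app):
    """Return the list of policy flags raised by one application."""
    flags = []
    if app.dti > DTI_CAP:
        flags.append("dti_above_cap")            # needs manual underwriter review
    if (app.loan_amount > LARGE_LOAN_THRESHOLD
            and app.credit_history_months < MIN_HISTORY_MONTHS):
        flags.append("credit_history_below_minimum")
    if app.purpose.strip().lower() == "debt consolidation":
        flags.append("verify_income")            # additional income verification
    return flags

risky = Application(dti=0.45, credit_history_months=12,
                    loan_amount=20_000, purpose="Debt Consolidation")
clean = Application(dti=0.20, credit_history_months=60,
                    loan_amount=5_000, purpose="home improvement")
print(policy_flags(risky))
print(policy_flags(clean))
```

Keeping these rules separate from the model score makes each decision easier to explain to applicants and auditors.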

Technologies used

Python

End-to-end analysis, feature engineering, modeling, and evaluation.

Pandas

Data cleaning, aggregation, risk segmentation tables, and feature preparation.

Scikit-learn

Classification models, cross-validation, and evaluation metrics, with SMOTE integration (SMOTE itself is provided by the companion imbalanced-learn package, not by scikit-learn).

Matplotlib

ROC curves, feature importance charts, and risk segmentation bar plots.

Seaborn

Default rate heatmaps, distribution comparisons, and correlation matrices.