Loan defaults cost financial institutions billions of dollars each year, yet many approval decisions still rely on rules of thumb rather than rigorous data analysis. This project examines a real-world bank loan dataset to understand what drives defaults, identify the highest-risk borrower profiles, and build a predictive model that can support more consistent, data-backed credit decisions. The goal is not to replace underwriters but to give them better information.
Dataset overview
The dataset contains records of individual loan applications and their outcomes, with each row representing one applicant. The available fields are:

| Field | Description |
|---|---|
| Applicant ID | Anonymized unique identifier |
| Age | Applicant age in years |
| Income | Annual gross income |
| Employment Status | Employed, Self-employed, or Unemployed |
| Employment Length | Years at current employer |
| Loan Amount | Total amount requested |
| Loan Term | Repayment period in months (36 or 60) |
| Interest Rate | Annual percentage rate assigned |
| Loan Purpose | Stated reason (debt consolidation, home improvement, etc.) |
| Debt-to-Income Ratio (DTI) | Monthly debt payments as a proportion of monthly income |
| Credit History Length | Number of years since oldest credit account opened |
| Number of Open Accounts | Total active credit lines |
| Derogatory Marks | Number of negative items on credit report |
| Default Status | Binary target — 1 if the loan defaulted, 0 if repaid |
Methodology
Problem definition
The business problem was framed precisely before touching any data: predict the probability that a given applicant will default, and identify which features most strongly predict that outcome. This framing drove every subsequent modeling and evaluation choice — binary classification with a focus on recall for the positive (default) class, since the cost of a missed default is higher than the cost of a rejected good applicant.
Data cleaning & EDA
The raw data was inspected for missing values, impossible values, and outliers. Key cleaning actions:
- Imputed missing interest rate values using the median rate within each loan grade bucket
- Capped extreme income outliers at the 99th percentile to prevent them from dominating distance-based features
- Verified that no target leakage existed — features that would only be known after a loan was issued were excluded
- Produced univariate distributions and default rate breakdowns for every feature to understand their individual predictive signal
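The first two cleaning actions above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the toy data, the column names (`loan_grade`, `interest_rate`, `income`), and the helper `clean_loans` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data; column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "loan_grade": ["A", "A", "A", "B", "B", "B"],
    "interest_rate": [7.0, np.nan, 9.0, 14.0, 16.0, np.nan],
    "income": [40_000, 52_000, 61_000, 45_000, 58_000, 2_500_000],
})


def clean_loans(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing rates with the grade median and cap income at the 99th percentile."""
    out = df.copy()
    # Median interest rate within each grade, broadcast back to every row.
    grade_median = out.groupby("loan_grade")["interest_rate"].transform("median")
    out["interest_rate"] = out["interest_rate"].fillna(grade_median)
    # Cap only the upper tail of income; lower values are left untouched.
    income_cap = out["income"].quantile(0.99)
    out["income"] = out["income"].clip(upper=income_cap)
    return out


cleaned = clean_loans(raw)
print(cleaned)
```

Using `transform("median")` keeps the imputation aligned row by row, so each missing rate is filled from its own grade rather than from the overall median.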
Risk factor identification
Before modeling, a set of risk segmentation analyses was conducted to build intuition:
- Default rate by DTI bucket (low / medium / high / very high)
- Default rate by employment status and length
- Default rate by loan purpose
- Default rate by credit history length quartile
- Correlation analysis between numeric features and the default flag
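A segmentation like the DTI breakdown above might look like the sketch below. The bucket edges are hypothetical (the write-up names the buckets but not their boundaries), and the toy data and column names (`dti`, `default`) are assumptions for illustration.

```python
import pandas as pd

# Toy records; column names and values are illustrative assumptions.
loans = pd.DataFrame({
    "dti": [0.08, 0.15, 0.22, 0.27, 0.33, 0.38, 0.45, 0.52],
    "default": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Hypothetical bucket edges; each bucket includes its lower edge and excludes its upper edge.
edges = [0.0, 0.20, 0.30, 0.40, float("inf")]
labels = ["low", "medium", "high", "very high"]
loans["dti_bucket"] = pd.cut(loans["dti"], bins=edges, labels=labels, right=False)

# The mean of a 0/1 flag is the default rate within each bucket.
summary = (
    loans.groupby("dti_bucket", observed=True)["default"]
    .agg(applicants="count", defaults="sum", default_rate="mean")
)
print(summary)
```

The same pattern applies to the other breakdowns: swap the bucketing step for the categorical column (employment status, loan purpose) or for `pd.qcut` when quartiles are wanted.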
Model building
Four classification models were trained and compared using stratified 5-fold cross-validation:
- Logistic Regression (interpretable baseline)
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
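A comparison of these four models under stratified 5-fold cross-validation could be set up along the following lines. This is a sketch under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the loan records), the hyperparameters are arbitrary, and resampling is left out so the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data, with roughly 15% positives (defaults).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Hyperparameters here are placeholders, not the project's tuned values.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Stratification keeps the default rate similar across the five folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five held-out folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

The precision, recall, and F1 scorers default to the positive class (label 1), which matches the focus on the default class.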
Class imbalance was addressed with class_weight='balanced' for linear models and SMOTE oversampling for tree-based models. Models were evaluated on AUC-ROC, precision, recall, and F1 score on the positive (default) class.
Key findings
Loan purpose matters: Debt consolidation loans — the most common purpose in the dataset — have a higher-than-average default rate despite often going to borrowers with above-average income. This suggests that applicants using loans to roll over existing debt may already be in financial distress at the time of application.
- Interest rate is highly correlated with default, but this is partly a reflection of the bank’s own risk-based pricing: higher-risk applicants are already charged more. This makes interest rate a useful signal but a potentially circular one for modeling purposes.
- Employment length shows a non-linear relationship with default risk — applicants with less than one year of employment and those with more than ten years both show lower default rates than the 2–5 year group, which contains a mix of career-changers and mid-career earners.
- Loan term is a significant predictor: 60-month loans default at roughly 1.6× the rate of 36-month loans even after controlling for loan amount and borrower income.
Business impact
The predictive model and risk segmentation analysis support several concrete improvements to the loan approval process:
Tiered review thresholds
Introduce a three-tier review system — auto-approve, human review, and auto-decline — based on model probability scores rather than binary cut-offs.
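As a rough sketch, the three tiers could be derived from the model's predicted default probability as shown below. The function name `review_tier` and both cut-off values are hypothetical; in practice the thresholds would be set from the relative cost of missed defaults and rejected good applicants.

```python
def review_tier(
    p_default: float,
    approve_below: float = 0.10,   # hypothetical cut-off
    decline_above: float = 0.40,   # hypothetical cut-off
) -> str:
    """Map a predicted default probability to one of three review tiers."""
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be a probability between 0 and 1")
    if approve_below >= decline_above:
        raise ValueError("approve_below must be smaller than decline_above")
    if p_default < approve_below:
        return "auto-approve"
    if p_default >= decline_above:
        return "auto-decline"
    # Everything between the two cut-offs goes to an underwriter.
    return "human review"


for p in (0.05, 0.25, 0.60):
    print(p, review_tier(p))
```

Widening the gap between the two cut-offs sends more applications to underwriters. Narrowing it automates more decisions, which raises the cost of model errors.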
DTI hard caps
Implement a hard DTI cap at 40% for unsecured personal loans, with exceptions only after manual underwriter review.
Credit history minimums
Require a minimum credit history length of 18 months for loan amounts above a defined threshold.
Loan purpose flags
Flag debt consolidation applications for additional income verification, given their elevated default rate relative to stated income.
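The three rule-based recommendations above (DTI cap, credit history minimum, loan purpose flag) could be expressed as a simple pre-screening check, sketched below. The `Application` class, the `policy_flags` helper, the flag names, and the loan amount threshold are all hypothetical. Only the 40% DTI cap and the 18-month minimum come from the recommendations, and credit history is assumed to be available in months.

```python
from dataclasses import dataclass
from typing import List

DTI_CAP = 0.40                 # 40% cap from the recommendation
MIN_HISTORY_MONTHS = 18        # 18-month minimum from the recommendation
LARGE_LOAN_THRESHOLD = 10_000  # hypothetical; the "defined threshold" is not specified


@dataclass
class Application:
    dti: float                  # debt-to-income ratio as a fraction, e.g. 0.35
    credit_history_months: int
    loan_amount: float
    loan_purpose: str


def policy_flags(app: Application) -> List[str]:
    """Return the policy flags an application triggers (empty list if none)."""
    flags = []
    if app.dti > DTI_CAP:
        flags.append("dti_above_cap")            # exception needs manual underwriter review
    if (app.loan_amount > LARGE_LOAN_THRESHOLD
            and app.credit_history_months < MIN_HISTORY_MONTHS):
        flags.append("credit_history_too_short")
    if app.loan_purpose.strip().lower() == "debt consolidation":
        flags.append("verify_income")            # additional income verification
    return flags


example = Application(dti=0.45, credit_history_months=12,
                      loan_amount=15_000, loan_purpose="Debt consolidation")
print(policy_flags(example))
```

Returning a list of flags rather than a single approve/decline decision keeps the rules separate from the model score, so underwriters can see which policy triggered a review.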
Technologies used
Python
End-to-end analysis, feature engineering, modeling, and evaluation.
Pandas
Data cleaning, aggregation, risk segmentation tables, and feature preparation.
Scikit-learn
Classification models, cross-validation, SMOTE integration, and evaluation metrics.
Matplotlib
ROC curves, feature importance charts, and risk segmentation bar plots.
Seaborn
Default rate heatmaps, distribution comparisons, and correlation matrices.