Loan defaults cost financial institutions billions of dollars each year, yet many approval decisions still rely on rules of thumb rather than rigorous data analysis. This project examines a real-world bank loan dataset to understand what drives defaults, identify the highest-risk borrower profiles, and build a predictive model that can support more consistent, data-backed credit decisions. The goal is not to replace underwriters but to give them better information.

Dataset overview

The dataset contains records of individual loan applications and their outcomes, with each row representing one applicant. The available fields are:

| Field | Description |
| --- | --- |
| Applicant ID | Anonymized unique identifier |
| Age | Applicant age in years |
| Income | Annual gross income |
| Employment Status | Employed, Self-employed, or Unemployed |
| Employment Length | Years at current employer |
| Loan Amount | Total amount requested |
| Loan Term | Repayment period in months (36 or 60) |
| Interest Rate | Annual percentage rate assigned |
| Loan Purpose | Stated reason (debt consolidation, home improvement, etc.) |
| Debt-to-Income Ratio (DTI) | Monthly debt payments as a proportion of monthly income |
| Credit History Length | Number of years since oldest credit account opened |
| Number of Open Accounts | Total active credit lines |
| Derogatory Marks | Number of negative items on credit report |
| Default Status | Binary target: 1 if the loan defaulted, 0 if repaid |

The dataset is moderately imbalanced: defaults account for roughly 20–25% of records, which required careful handling during model training to avoid a classifier that simply predicts “no default” for everything.

Methodology

1. Problem definition

The business problem was framed precisely before touching any data: predict the probability that a given applicant will default, and identify which features most strongly predict that outcome. This framing drove every subsequent modeling and evaluation choice — binary classification with a focus on recall for the positive (default) class, since the cost of a missed default is higher than the cost of a rejected good applicant.

2. Data cleaning & EDA

The raw data was inspected for missing values, impossible values, and outliers. Key cleaning actions:
  • Imputed missing interest rate values using the median rate within each loan grade bucket
  • Capped extreme income outliers at the 99th percentile so that a handful of very large values would not dominate scale-sensitive models such as logistic regression
  • Verified that no target leakage existed — features that would only be known after a loan was issued were excluded
  • Produced univariate distributions and default rate breakdowns for every feature to understand their individual predictive signal
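
The first two cleaning actions can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`loan_grade`, `interest_rate`, `income`) and the toy rows are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_loans(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing rates per grade bucket and cap extreme incomes.

    Column names are hypothetical; adapt them to the real schema.
    """
    out = df.copy()
    # Median rate of each grade bucket, aligned row by row (NaNs are skipped).
    grade_median = out.groupby("loan_grade")["interest_rate"].transform("median")
    out["interest_rate"] = out["interest_rate"].fillna(grade_median)
    # Cap incomes at the 99th percentile; values below the cap are unchanged.
    cap = out["income"].quantile(0.99)
    out["income"] = out["income"].clip(upper=cap)
    return out


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "loan_grade": ["A", "A", "A", "B", "B", "B"],
    "interest_rate": [6.0, 8.0, np.nan, 14.0, np.nan, 16.0],
    "income": [40_000, 52_000, 61_000, 38_000, 45_000, 2_000_000],
})
cleaned = clean_loans(toy)
print(cleaned)
```

In a modeling workflow, the medians and the income cap would normally be computed on the training split only and then reused on validation data, so that the cleaning step itself does not leak information.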

3. Risk factor identification

Before modeling, a set of risk segmentation analyses was conducted to build intuition:
  • Default rate by DTI bucket (low / medium / high / very high)
  • Default rate by employment status and length
  • Default rate by loan purpose
  • Default rate by credit history length quartile
  • Correlation analysis between numeric features and the default flag
These analyses directly informed feature selection and the business recommendations developed later.
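
A default-rate breakdown of this kind can be sketched with `pd.cut` and a grouped mean. The bucket edges below are assumptions chosen to line up with the 15% and 35% DTI levels mentioned in the findings; the column names (`dti`, `default`) and the toy rows are likewise hypothetical.

```python
import pandas as pd


def default_rate_by_dti(df: pd.DataFrame,
                        edges=(0.0, 0.15, 0.25, 0.35, float("inf"))) -> pd.DataFrame:
    """Count applicants and compute the default rate per DTI bucket.

    `edges` are illustrative cut points, with DTI expressed as a proportion.
    """
    labels = ["low", "medium", "high", "very high"]
    buckets = pd.cut(df["dti"], bins=list(edges), labels=labels,
                     include_lowest=True)
    # The mean of a 0/1 default flag is the default rate within the bucket.
    return (df.groupby(buckets, observed=False)["default"]
              .agg(applicants="count", default_rate="mean"))


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "dti":     [0.10, 0.12, 0.20, 0.30, 0.40, 0.45, 0.50, 0.05],
    "default": [0,    0,    0,    1,    1,    0,    1,    0],
})
table = default_rate_by_dti(toy)
print(table)
```

The same pattern applies to the other segmentations by swapping the grouping key, for example employment status, loan purpose, or `pd.qcut` quartiles of credit history length.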

4. Model building

Four classification models were trained and compared using stratified 5-fold cross-validation:
  • Logistic Regression (interpretable baseline)
  • Decision Tree Classifier
  • Random Forest Classifier
  • Gradient Boosting Classifier
Class imbalance was addressed using class_weight='balanced' for linear models and SMOTE oversampling for tree-based models. Models were evaluated on AUC-ROC, precision, recall, and F1 score on the positive (default) class.
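
The evaluation loop for the logistic regression baseline might look like the sketch below. It runs on synthetic data from `make_classification` (an assumption standing in for the real loan features, with about 22% positives to mimic the imbalance) and covers only the `class_weight='balanced'` route. SMOTE is not part of scikit-learn; it is provided by the separate imbalanced-learn package and is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: roughly 22% positive (default) class.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.78, 0.22], random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["roc_auc", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Precision, recall, and F1 are reported for label 1 by default, which matches the positive (default) class used here. If an oversampler such as SMOTE is added, it is usually placed inside an imbalanced-learn pipeline so that resampling touches only the training folds and never the held-out fold.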

5. Business recommendations

The final model’s feature importances and the risk segmentation analysis were translated into actionable policy recommendations for the loan approval process. These are presented in the business impact section below.

Key findings

High-risk factors to watch: Applicants with a debt-to-income ratio above 35% default at nearly three times the rate of applicants below 15% DTI. Similarly, applicants with a credit history shorter than two years and more than two derogatory marks on their report represent the single highest-risk segment in the dataset — their default rate exceeds 45%. These two factors together should trigger enhanced scrutiny in any approval workflow.
Loan purpose matters: Debt consolidation loans — the most common purpose in the dataset — have a higher-than-average default rate despite often going to borrowers with above-average income. This suggests that applicants using loans to roll over existing debt may already be in financial distress at the time of application.
Additional findings:
  • Interest rate is highly correlated with default, but this is partly a reflection of the bank’s own risk-based pricing: higher-risk applicants are already charged more. This makes interest rate a useful signal but a potentially circular one for modeling purposes.
  • Employment length shows a non-linear relationship with default risk — applicants with less than one year of employment and those with more than ten years both show lower default rates than the 2–5 year group, which contains a mix of career-changers and mid-career earners.
  • Loan term is a significant predictor: 60-month loans default at roughly 1.6× the rate of 36-month loans even after controlling for loan amount and borrower income.

Business impact

The predictive model and risk segmentation analysis support several concrete improvements to the loan approval process:

Tiered review thresholds

Introduce a three-tier review system — auto-approve, human review, and auto-decline — based on model probability scores rather than binary cut-offs.
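
As a rough sketch, the three tiers reduce to two cut-offs on the predicted default probability. The function name and both threshold values below are placeholders chosen for illustration; real cut-offs would be tuned against portfolio loss and review-capacity targets.

```python
def review_tier(p_default: float,
                approve_below: float = 0.10,
                decline_above: float = 0.50) -> str:
    """Map a predicted default probability to a review tier.

    The default thresholds are hypothetical placeholders, not tuned values.
    """
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be a probability in [0, 1]")
    if approve_below >= decline_above:
        raise ValueError("approve_below must be smaller than decline_above")
    if p_default < approve_below:
        return "auto-approve"
    if p_default > decline_above:
        return "auto-decline"
    # Everything between the two cut-offs goes to an underwriter.
    return "human review"


for p in (0.04, 0.27, 0.81):
    print(p, review_tier(p))
```

Widening the gap between the two thresholds sends more applications to human review; narrowing it automates more decisions at the cost of less oversight on borderline cases.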

DTI hard caps

Implement a hard DTI cap at 40% for unsecured personal loans, with exceptions only after manual underwriter review.

Credit history minimums

Require a minimum credit history length of 18 months for loan amounts above a defined threshold.

Loan purpose flags

Flag debt consolidation applications for additional income verification, given their elevated default rate relative to stated income.
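
The three rule-based recommendations (DTI cap, credit history minimum, loan purpose flag) could be expressed as simple pre-screening checks like the sketch below. The `Application` class, the flag names, and the `large_loan_threshold` value are all hypothetical: the recommendation leaves the loan-amount threshold undefined, so the number here is only a placeholder. The sketch also applies the DTI cap to every application rather than to unsecured personal loans only.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """Minimal, hypothetical view of an application for rule checks."""
    dti: float                  # debt-to-income ratio as a proportion
    credit_history_months: int
    loan_amount: float
    purpose: str


def policy_flags(app: Application,
                 large_loan_threshold: float = 20_000.0) -> list[str]:
    """Return the policy flags an application triggers (empty if none)."""
    flags = []
    if app.dti > 0.40:
        # Hard cap: exceptions only after manual underwriter review.
        flags.append("dti_above_cap")
    if (app.loan_amount > large_loan_threshold
            and app.credit_history_months < 18):
        flags.append("short_credit_history")
    if app.purpose.strip().lower() == "debt consolidation":
        flags.append("verify_income")
    return flags


example = Application(dti=0.43, credit_history_months=12,
                      loan_amount=25_000, purpose="Debt consolidation")
print(policy_flags(example))
```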

Technologies used

Python

End-to-end analysis, feature engineering, modeling, and evaluation.

Pandas

Data cleaning, aggregation, risk segmentation tables, and feature preparation.

Scikit-learn

Classification models, cross-validation, and evaluation metrics, plus SMOTE integration (SMOTE itself comes from the companion imbalanced-learn package, not from scikit-learn).

Matplotlib

ROC curves, feature importance charts, and risk segmentation bar plots.

Seaborn

Default rate heatmaps, distribution comparisons, and correlation matrices.