This project builds a machine learning regression model to predict the market price of used cars. Because used-car pricing is often opaque, the goal is to give buyers and sellers a data-driven estimate based on objective vehicle attributes. Trained on a dataset of real listings, the model learns which features matter most and how they interact to drive price.

Dataset overview

The dataset aggregates used car listings and captures a wide range of vehicle attributes. Each row represents a single listing and includes the following features:

| Feature | Description |
| --- | --- |
| Make / Model | Manufacturer and model name (e.g., Toyota Camry) |
| Year | Model year of manufacture |
| Mileage | Odometer reading in miles or kilometers |
| Condition | Seller-reported condition (Excellent, Good, Fair, Poor) |
| Fuel Type | Petrol, Diesel, Electric, or Hybrid |
| Transmission | Automatic or Manual |
| Engine Size | Displacement in litres |
| Color | Exterior color |
| Number of Owners | How many previous owners the car has had |
| Price | Target variable: listed sale price |

The 2023 Car Features List asset (2023-Car-Features-List.jpg) provided additional reference for standardizing make/model names and aligning feature categories across listing sources, ensuring consistency during the cleaning phase.

Methodology

1. Data collection & cleaning

Raw listing data was scraped and compiled from multiple sources. Cleaning steps included:
  • Removing duplicate listings and entries with missing prices
  • Standardizing make/model strings using the 2023 Car Features List reference
  • Imputing missing mileage values using the median within each make/model/year group
  • Converting categorical fields (fuel type, transmission, condition) to consistent encodings
  • Clipping extreme outlier prices beyond three standard deviations from the mean
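Three of the cleaning steps above (de-duplication, group-median imputation, and three-sigma clipping) can be sketched with pandas. This is a minimal illustration, not the project's actual code: the function name `clean_listings` and the column names (`make`, `model`, `year`, `mileage`, `price`) are assumptions, and the toy rows are invented for the example. Make/model standardization and categorical encoding are left out because they depend on the reference list.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a subset of the cleaning steps to a raw listings table.

    Column names are assumed for illustration only.
    """
    # Remove exact duplicate listings and rows without a price.
    out = df.drop_duplicates().dropna(subset=["price"]).copy()

    # Impute missing mileage with the median of the same make/model/year group.
    group_median = out.groupby(["make", "model", "year"])["mileage"].transform("median")
    out["mileage"] = out["mileage"].fillna(group_median)

    # Clip prices to within three standard deviations of the mean.
    mean, std = out["price"].mean(), out["price"].std()
    out["price"] = out["price"].clip(lower=mean - 3 * std, upper=mean + 3 * std)
    return out.reset_index(drop=True)


# Invented toy data: one duplicate row, one missing mileage, one missing price.
raw = pd.DataFrame({
    "make": ["Toyota", "Toyota", "Toyota", "Toyota", "Honda"],
    "model": ["Camry", "Camry", "Camry", "Camry", "Civic"],
    "year": [2018, 2018, 2018, 2018, 2020],
    "mileage": [40000.0, 60000.0, np.nan, 40000.0, 15000.0],
    "price": [18000.0, 16500.0, 17000.0, 18000.0, np.nan],
})
cleaned = clean_listings(raw)
print(cleaned)
```

A group whose mileage values are all missing would stay missing after this step, so a fallback (for example, the overall median) may be needed on real data.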

2. Exploratory data analysis

EDA examined the distribution of the target variable (price) and the relationships between individual features and price. Key observations were logged and visualized before any modeling began to avoid data leakage and guide feature selection.

3. Feature engineering

New features were derived to better capture the car’s effective age and value depreciation:
  • car_age: current year minus model year
  • mileage_per_year: mileage divided by car age
  • brand_tier: grouping brands into budget, mid-range, and premium tiers based on average listing prices
  • One-hot encoding for fuel type, transmission, and condition
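The derived features above could be computed roughly as follows. This is a sketch under stated assumptions rather than the project's implementation: the function name `add_features`, the column names, and the reference year are hypothetical; brand tiers are approximated here by splitting per-make mean prices into three quantile bins; and cars with an age of zero are treated as one year old to avoid dividing by zero.

```python
import pandas as pd


def add_features(df: pd.DataFrame, current_year: int) -> pd.DataFrame:
    """Add car_age, mileage_per_year, brand_tier, and one-hot columns."""
    out = df.copy()
    out["car_age"] = current_year - out["year"]

    # Assumption: treat age 0 as 1 year so the ratio stays finite.
    out["mileage_per_year"] = out["mileage"] / out["car_age"].clip(lower=1)

    # Assumption: tiers are terciles of the mean listing price per make.
    brand_mean = out.groupby("make")["price"].mean()
    tiers = pd.qcut(brand_mean, q=3, labels=["budget", "mid-range", "premium"]).astype(str)
    out["brand_tier"] = out["make"].map(tiers)

    # One-hot encode the categorical fields.
    return pd.get_dummies(out, columns=["fuel_type", "transmission", "condition"])


# Invented toy listings for illustration.
listings = pd.DataFrame({
    "make": ["Kia", "Kia", "Toyota", "Toyota", "BMW", "BMW"],
    "year": [2015, 2020, 2018, 2024, 2019, 2021],
    "mileage": [90000, 30000, 60000, 5000, 40000, 20000],
    "fuel_type": ["Petrol", "Petrol", "Hybrid", "Hybrid", "Diesel", "Petrol"],
    "transmission": ["Manual", "Automatic", "Automatic", "Automatic", "Automatic", "Manual"],
    "condition": ["Fair", "Good", "Good", "Excellent", "Good", "Excellent"],
    "price": [7000, 14000, 16000, 30000, 32000, 41000],
})
features = add_features(listings, current_year=2024)
print(features[["make", "car_age", "mileage_per_year", "brand_tier"]])
```

Because `brand_tier` is derived from the target variable, computing the tiers on the training split only and then mapping them onto the test split keeps test prices from leaking into the features.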

4. Model training

Several regression algorithms were trained and compared:
  • Linear Regression (baseline)
  • Ridge and Lasso Regression (regularized baselines)
  • Random Forest Regressor
  • Gradient Boosting Regressor (XGBoost)
Hyperparameters were tuned using 5-fold cross-validation with GridSearchCV. The Gradient Boosting Regressor produced the best results and was selected as the final model.
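A 5-fold grid search of this kind might look like the sketch below. It uses scikit-learn's own `GradientBoostingRegressor` as a stand-in for the boosted model, and the synthetic data, the parameter grid, and the MAE-based scoring are all assumptions made for illustration; the project's actual grid and scoring are not stated here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: price falls with age and mileage, plus noise.
rng = np.random.default_rng(0)
n = 200
car_age = rng.uniform(0, 15, n)
mileage = car_age * rng.uniform(5000, 15000, n)
price = 35000 - 1200 * car_age - 0.05 * mileage + rng.normal(0, 1000, n)
X = np.column_stack([car_age, mileage])

# Hypothetical grid; a real search would likely cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring="neg_mean_absolute_error",  # assumed metric; higher (closer to 0) is better
)
search.fit(X, price)

print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

After fitting, `search.best_estimator_` holds the model refit on all of the data passed to `fit`, using the winning parameters.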

5. Evaluation

Models were evaluated on a held-out test set (20% of data) using:
  • R² score — proportion of variance explained
  • Mean Absolute Error (MAE) — average absolute difference between predicted and actual prices
  • Root Mean Squared Error (RMSE) — penalizes large errors more heavily
The final Gradient Boosting model achieved an R² of approximately 0.89 on the test set, with an MAE of roughly 7% of the average listing price.
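The three metrics, plus MAE expressed as a share of the average price, can be computed as in this small sketch. The `y_true` and `y_pred` arrays are invented values chosen for illustration and are not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted prices for five test listings.
y_true = np.array([12000.0, 18500.0, 9500.0, 30000.0, 22000.0])
y_pred = np.array([12500.0, 17500.0, 10000.0, 27000.0, 22500.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# MAE relative to the average listing price, in percent.
mae_pct = 100 * mae / y_true.mean()

print(f"R2: {r2:.3f}")
print(f"MAE: {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
print(f"MAE as % of mean price: {mae_pct:.1f}%")
```

RMSE is never smaller than MAE on the same predictions; a large gap between the two, as with the single 3,000 miss in this toy data, points to a few large errors rather than uniformly moderate ones.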

Key findings

Top predictors: Mileage and car age are consistently the strongest predictors of used car price, accounting for the bulk of the model’s explained variance. Brand tier is the next most important feature — premium brand vehicles depreciate more slowly. Condition rating has an outsized effect at the extremes: cars rated “Poor” sell for significantly less than the mileage/age baseline would suggest.
Additional findings from the analysis:
  • Diesel vehicles command a modest price premium over equivalent petrol models in the mid-range mileage band (30,000–80,000 miles).
  • Electric vehicles show a different depreciation curve — early-year models depreciate sharply, while newer models hold value better due to improved battery technology and range.
  • Color has a small but statistically significant effect: white, black, and silver cars sell faster and at slightly higher prices than less common colors.
  • First-owner cars carry a meaningful premium over second- or third-owner equivalents, even controlling for mileage and age.

Visualizations

The following plots were produced during the analysis. You can reproduce them by running the notebook in the project repository.
  • Price distribution histogram — showed a right-skewed distribution; a log transformation improved model fit significantly.
  • Correlation heatmap — confirmed the strong negative correlation between price and both mileage and car age.
  • Feature importance bar chart — ranked all input features by their contribution to the Gradient Boosting model.
  • Prediction vs. actual scatter plot — points clustered tightly around the diagonal for mid-range prices, with wider spread at the high-price end (luxury vehicles with fewer training examples).
  • Residuals plot — checked for heteroscedasticity; residuals were approximately normally distributed with no strong pattern.
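The effect of a log transformation on a right-skewed price distribution can be checked numerically, as in the sketch below. The prices are drawn from a synthetic log-normal distribution purely for illustration, and the use of `np.log1p` with `np.expm1` as its inverse is an assumption; the project may have used a different log variant.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal), standing in for real listings.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=9.6, sigma=0.6, size=5000), name="price")

# Log-transform the target; log1p computes log(1 + x).
log_prices = np.log1p(prices)

print("Skewness before:", prices.skew())
print("Skewness after: ", log_prices.skew())

# Predictions made on the log scale are mapped back with the inverse transform.
recovered = np.expm1(log_prices)
```

A model trained on the log-scale target predicts log prices, so error metrics such as MAE should be computed after converting predictions back to the original price scale.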

Technologies used

  • Python: Core programming language for all data processing, modeling, and visualization work.
  • Pandas: Data loading, cleaning, transformation, and feature engineering pipelines.
  • Scikit-learn: Model training, cross-validation, hyperparameter tuning, and evaluation metrics.
  • Matplotlib: Static plots including histograms, scatter plots, and residual diagnostics.
  • Seaborn: Correlation heatmaps, distribution plots, and styled statistical visualizations.
  • Jupyter Notebook: Interactive development environment combining code, output, and narrative.

You can browse the full project source code, notebooks, and data samples on GitHub: Sumit-SC/Sumit-SC.github.io