This project builds a machine learning regression model to predict the market price of used cars. Given how opaque used car pricing can be, the goal is to give buyers and sellers a data-driven estimate based on objective vehicle attributes. By training on a rich dataset of real listings, the model learns which features matter most and how they interact to drive price.
Dataset overview
The dataset aggregates used car listings and captures a wide range of vehicle attributes. Each row represents a single listing and includes the following features:

| Feature | Description |
|---|---|
| Make / Model | Manufacturer and model name (e.g., Toyota Camry) |
| Year | Model year of manufacture |
| Mileage | Odometer reading in miles or kilometers |
| Condition | Seller-reported condition (Excellent, Good, Fair, Poor) |
| Fuel Type | Petrol, Diesel, Electric, or Hybrid |
| Transmission | Automatic or Manual |
| Engine Size | Displacement in litres |
| Color | Exterior color |
| Number of Owners | How many previous owners the car has had |
| Price | Target variable — listed sale price |

The 2023 Car Features List reference provided additional guidance for standardizing make/model names and aligning feature categories across listing sources, ensuring consistency during the cleaning phase.
Methodology
Data collection & cleaning
Raw listing data was scraped and compiled from multiple sources. Cleaning steps included:
- Removing duplicate listings and entries with missing prices
- Standardizing make/model strings using the 2023 Car Features List reference
- Imputing missing mileage values using the median within each make/model/year group
- Converting categorical fields (fuel type, transmission, condition) to consistent encodings
- Clipping extreme outlier prices beyond three standard deviations from the mean
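The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function name `clean_listings`, the column names (`make`, `model`, `year`, `mileage`, `fuel_type`, `transmission`, `condition`, `price`), and the sample rows are all assumptions, and make/model standardization is reduced to whitespace and case normalization because the reference list itself is not reproduced here.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw listings frame (hypothetical schema)."""
    out = df.copy()

    # Normalize text fields first so that near-identical rows become exact duplicates.
    # Make/model: simple strip + title case stands in for the reference-list lookup.
    for col in ["make", "model"]:
        out[col] = out[col].str.strip().str.title()
    # Categorical fields: one consistent lowercase spelling per category.
    for col in ["fuel_type", "transmission", "condition"]:
        out[col] = out[col].str.strip().str.lower()

    # Remove entries with missing prices, then duplicate listings.
    out = out.dropna(subset=["price"]).drop_duplicates()

    # Impute missing mileage with the median of each make/model/year group.
    group_median = out.groupby(["make", "model", "year"])["mileage"].transform("median")
    out["mileage"] = out["mileage"].fillna(group_median)

    # Clip prices to within three standard deviations of the mean.
    mean, std = out["price"].mean(), out["price"].std()
    out["price"] = out["price"].clip(lower=mean - 3 * std, upper=mean + 3 * std)

    return out.reset_index(drop=True)


# Made-up sample rows for illustration only.
raw = pd.DataFrame({
    "make": ["Toyota", " toyota", "Toyota", "Toyota", "Honda"],
    "model": ["Camry", "camry ", "Camry", "Camry", "Civic"],
    "year": [2018, 2018, 2018, 2018, 2020],
    "mileage": [40000.0, 40000.0, np.nan, 60000.0, 20000.0],
    "fuel_type": ["Petrol", "Petrol", "petrol ", "Petrol", "Hybrid"],
    "transmission": ["Automatic"] * 5,
    "condition": ["Good", "Good", "Good", "Fair", "Excellent"],
    "price": [18000.0, 18000.0, 17500.0, 15000.0, np.nan],
})
clean = clean_listings(raw)
print(clean)
```

Normalizing strings before dropping duplicates is a deliberate ordering choice in this sketch: listings that differ only in spacing or capitalization would otherwise survive deduplication.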
Exploratory data analysis
EDA examined the distribution of the target variable (price) and the relationships between individual features and price. Key observations were logged and visualized before any modeling began to avoid data leakage and guide feature selection.
Feature engineering
New features were derived to better capture the car’s effective age and value depreciation:
- `car_age`: current year minus model year
- `mileage_per_year`: mileage divided by car age
- `brand_tier`: grouping brands into budget, mid-range, and premium tiers based on average listing prices
- One-hot encoding for fuel type, transmission, and condition
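A rough sketch of how these derived features could be computed with pandas is shown below. The function name `add_features`, the column names, and the sample rows are assumptions for illustration. Two details are also assumptions rather than the project's documented choices: car age is floored at 1 when dividing so that current-year models do not cause division by zero, and brand tiers are formed by splitting brands into three equal-sized groups ranked by average listing price.

```python
import pandas as pd


def add_features(df: pd.DataFrame, current_year: int) -> pd.DataFrame:
    """Derive car_age, mileage_per_year, brand_tier and one-hot columns (hypothetical schema)."""
    out = df.copy()

    # car_age: current year minus model year.
    out["car_age"] = current_year - out["year"]

    # mileage_per_year: mileage divided by car age (age floored at 1, an assumption).
    out["mileage_per_year"] = out["mileage"] / out["car_age"].clip(lower=1)

    # brand_tier: rank brands by average listing price, then cut into three groups.
    brand_mean = out.groupby("make")["price"].mean()
    tiers = pd.qcut(
        brand_mean.rank(method="first"),
        q=3,
        labels=["budget", "mid-range", "premium"],
    ).astype(str)
    out["brand_tier"] = out["make"].map(tiers)

    # One-hot encode fuel type, transmission, and condition.
    return pd.get_dummies(out, columns=["fuel_type", "transmission", "condition"])


# Made-up sample rows for illustration only.
listings = pd.DataFrame({
    "make": ["Dacia", "Toyota", "BMW", "BMW"],
    "year": [2015, 2018, 2020, 2023],
    "mileage": [90000, 50000, 30000, 5000],
    "fuel_type": ["Petrol", "Hybrid", "Diesel", "Electric"],
    "transmission": ["Manual", "Automatic", "Automatic", "Automatic"],
    "condition": ["Fair", "Good", "Excellent", "Excellent"],
    "price": [6000, 17000, 32000, 48000],
})
features = add_features(listings, current_year=2023)
print(features[["make", "car_age", "mileage_per_year", "brand_tier"]])
```

In a real pipeline the brand tiers would need to be computed on the training split only, since they are derived from the target variable.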
Model training
Several regression algorithms were trained and compared:
- Linear Regression (baseline)
- Ridge and Lasso Regression (regularized baselines)
- Random Forest Regressor
- Gradient Boosting Regressor (XGBoost)
Hyperparameters were tuned with GridSearchCV. The Gradient Boosting Regressor produced the best results and was selected as the final model.
Evaluation
Models were evaluated on a held-out test set (20% of data) using:
- R² score — proportion of variance explained
- Mean Absolute Error (MAE) — average absolute difference between predicted and actual prices
- Root Mean Squared Error (RMSE) — penalizes large errors more heavily
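The tuning and evaluation workflow described above might look roughly like the sketch below. Several things here are assumptions: the data is synthetic rather than the project's listings, the parameter grid is a small illustrative one (the actual grid is not documented), and scikit-learn's `GradientBoostingRegressor` is used in place of XGBoost to keep the example dependency-light.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the project's listings): price falls with age and mileage.
rng = np.random.default_rng(0)
n = 400
car_age = rng.integers(0, 15, size=n)
mileage = car_age * 10_000 + rng.normal(0, 5_000, size=n)
price = 30_000 - 1_200 * car_age - 0.05 * mileage + rng.normal(0, 1_000, size=n)
X = np.column_stack([car_age, mileage])

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Small illustrative grid; the project's real search space is an unknown.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out test set.
y_pred = search.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(f"best params: {search.best_params_}")
print(f"R2={r2:.3f}  MAE={mae:.0f}  RMSE={rmse:.0f}")
```

Cross-validation runs only on the training split, so the test set stays untouched until the final metrics are computed. RMSE is never smaller than MAE; a large gap between the two points to a few large errors.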
Key findings
Top predictors: Mileage and car age are consistently the strongest predictors of used car price, accounting for the bulk of the model’s explained variance. Brand tier is the next most important feature — premium brand vehicles depreciate more slowly. Condition rating has an outsized effect at the extremes: cars rated “Poor” sell for significantly less than the mileage/age baseline would suggest.
- Diesel vehicles command a modest price premium over equivalent petrol models in the mid-range mileage band (30,000–80,000 miles).
- Electric vehicles show a different depreciation curve — early-year models depreciate sharply, while newer models hold value better due to improved battery technology and range.
- Color has a small but statistically significant effect: white, black, and silver cars sell faster and at slightly higher prices than less common colors.
- First-owner cars carry a meaningful premium over second- or third-owner equivalents, even controlling for mileage and age.
Visualizations
The following plots were produced during the analysis. You can reproduce them by running the notebook in the project repository.
- Price distribution histogram — showed a right-skewed distribution; a log transformation improved model fit significantly.
- Correlation heatmap — confirmed the strong negative correlation between price and both mileage and car age.
- Feature importance bar chart — ranked all input features by their contribution to the Gradient Boosting model.
- Prediction vs. actual scatter plot — points clustered tightly around the diagonal for mid-range prices, with wider spread at the high-price end (luxury vehicles with fewer training examples).
- Residuals plot — checked for heteroscedasticity; residuals were approximately normally distributed with no strong pattern.
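The effect of the log transformation noted for the price distribution can be illustrated with a small sketch. The prices below are synthetic log-normal draws, not the project's data, and the use of `log1p`/`expm1` is an assumption about how such a transform might be applied.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal), standing in for real listing prices.
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=5_000), name="price")

# log1p compresses the long right tail; expm1 inverts it so that
# predictions made on the log scale can be reported as prices again.
log_price = np.log1p(price)
recovered = np.expm1(log_price)

print(f"skew before: {price.skew():.2f}")
print(f"skew after:  {log_price.skew():.2f}")
```

When a model is trained on the log-scale target, its predictions need the inverse transform before computing MAE or RMSE in price units.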
Technologies used
Python
Core programming language for all data processing, modeling, and visualization work.
Pandas
Data loading, cleaning, transformation, and feature engineering pipelines.
Scikit-learn
Model training, cross-validation, hyperparameter tuning, and evaluation metrics.
Matplotlib
Static plots including histograms, scatter plots, and residual diagnostics.
Seaborn
Correlation heatmaps, distribution plots, and styled statistical visualizations.
Jupyter Notebook
Interactive development environment combining code, output, and narrative.