This project performs a comprehensive exploratory data analysis (EDA) of IMDB movie data to uncover the patterns that separate well-received, commercially successful films from the rest. Rather than building a predictive model, the emphasis here is on asking the right questions of the data and letting visualizations tell the story — which genres dominate ratings, whether release timing affects performance, and which directors consistently produce highly rated work.

Dataset overview

The dataset contains metadata for thousands of movies sourced from IMDB, covering several decades of releases. Each record includes:

| Field | Description |
| --- | --- |
| Title | Movie title |
| Genre | One or more genre tags (e.g., Drama, Action, Comedy) |
| Director | Name of the primary director |
| Lead Actor/Actress | Primary cast member |
| Runtime | Film length in minutes |
| Release Year | Year the film was released |
| Release Month | Month of theatrical release |
| IMDB Rating | Average user rating on a 10-point scale |
| Number of Votes | Total votes contributing to the rating |
| Gross Revenue | Worldwide box office gross in USD |
| Language | Primary spoken language |
| Country | Country of production |

The dataset required careful handling: gross revenue figures were missing for a significant portion of films (particularly older or non-English titles), and multi-genre entries needed to be expanded for genre-level analysis.

Methodology

1. Data loading & cleaning

The raw CSV was loaded with Pandas and inspected for missing values, duplicate entries, and formatting inconsistencies. Key cleaning steps:
  • Dropped records with a missing IMDB rating, as well as records whose vote count fell below a minimum threshold (to filter out obscure entries with unreliable ratings)
  • Parsed multi-genre strings into individual genre tags using string splitting and explode()
  • Converted gross revenue to a consistent numeric format, replacing missing values with NaN rather than zero to avoid skewing revenue statistics
  • Standardized director name formatting to handle punctuation and encoding issues
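
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the snake_case column names (`imdb_rating`, `votes`, `gross_revenue`, `director`, `genre`), the comma delimiter for genres, and the `MIN_VOTES` value are all assumptions, since the source does not state them.

```python
import pandas as pd

MIN_VOTES = 1000  # hypothetical threshold; the value used in the project is not stated


def clean_movies(raw: pd.DataFrame, min_votes: int = MIN_VOTES) -> pd.DataFrame:
    """Deduplicate, normalize, and filter a raw movie table (assumed column names)."""
    df = raw.drop_duplicates().copy()

    # Revenue: strip "$" and "," then coerce; missing or unparseable values become NaN, not 0.
    df["gross_revenue"] = pd.to_numeric(
        df["gross_revenue"].astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )
    df["votes"] = pd.to_numeric(df["votes"], errors="coerce")

    # Director names: Unicode-normalize, trim, and collapse repeated whitespace.
    df["director"] = (
        df["director"]
        .str.normalize("NFKC")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )

    # Drop rows without a rating, then rows with too few votes.
    df = df.dropna(subset=["imdb_rating"])
    df = df[df["votes"] >= min_votes]
    return df.reset_index(drop=True)


def explode_genres(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (film, genre) pair, assuming comma-separated genre strings."""
    out = df.assign(genre=df["genre"].str.split(",")).explode("genre")
    out["genre"] = out["genre"].str.strip()
    return out.reset_index(drop=True)
```

Keeping the exploded table separate from the cleaned one avoids double-counting films in statistics that are not genre-specific.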

2. Univariate analysis

Each variable was analyzed in isolation before examining relationships. This step included:
  • Distribution plots for ratings, runtime, and gross revenue
  • Frequency counts for genre, language, and country
  • Year-on-year counts to understand how the dataset’s coverage changes over time
  • Identifying the central tendency and spread of key numeric variables
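
As a rough sketch of the numeric side of this step, the helper below computes central tendency, spread, frequency counts, and per-year coverage in one pass. The function name `univariate_summary` and the column names passed to it are hypothetical; the plots themselves are omitted here.

```python
import pandas as pd


def univariate_summary(df, numeric_cols, categorical_cols, year_col="release_year"):
    """Summarize each variable in isolation (column names are assumptions)."""
    # Central tendency and spread for numeric columns; NaN values are skipped.
    numeric = df[numeric_cols].agg(["count", "mean", "median", "std", "min", "max"]).T
    numeric["iqr"] = df[numeric_cols].quantile(0.75) - df[numeric_cols].quantile(0.25)

    # Frequency counts for categorical columns such as genre, language, country.
    counts = {col: df[col].value_counts() for col in categorical_cols}

    # Number of films per release year, in chronological order.
    per_year = df[year_col].value_counts().sort_index()
    return numeric, counts, per_year
```

Reporting the median and interquartile range alongside the mean and standard deviation is useful for a heavily skewed variable like gross revenue.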

3. Bivariate analysis

Pairs of variables were examined to identify meaningful correlations and patterns:
  • Rating vs. gross revenue (scatter plot with a log scale on revenue)
  • Runtime vs. rating (binned analysis to check whether longer films rate differently)
  • Release month vs. average rating and revenue (checking seasonal release windows)
  • Vote count vs. rating (popularity does not always equal quality — this tension was explored explicitly)
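
The comparisons above could be computed along these lines. This is a sketch under assumptions: the column names are hypothetical, the runtime bin edges are illustrative, and Pearson correlation on log-scaled revenue is one reasonable choice rather than a method the source confirms.

```python
import numpy as np
import pandas as pd


def rating_revenue_corr(df: pd.DataFrame) -> float:
    """Pearson correlation between rating and log10(revenue), ignoring missing or zero revenue."""
    pairs = df[["imdb_rating", "gross_revenue"]].dropna()
    pairs = pairs[pairs["gross_revenue"] > 0]  # log is undefined at zero
    return pairs["imdb_rating"].corr(np.log10(pairs["gross_revenue"]))


def rating_by_runtime_bin(df: pd.DataFrame, edges=(0, 90, 120, 150, np.inf)) -> pd.DataFrame:
    """Mean rating and film count per runtime bin (illustrative edges, in minutes)."""
    bins = pd.cut(df["runtime"], bins=list(edges), right=False)
    return df.groupby(bins, observed=True)["imdb_rating"].agg(["mean", "count"])


def monthly_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Average rating and revenue per release month; NaN revenue is skipped by mean()."""
    return df.groupby("release_month")[["imdb_rating", "gross_revenue"]].mean()
```

Returning the count next to each binned mean makes it easy to spot bins that contain too few films to support a conclusion.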

4. Genre & director analysis

Genre-level aggregations revealed which categories consistently earn high ratings and strong box office returns. Director-level analysis identified the filmmakers with the highest median ratings among those with at least five films, which filters out directors with only a handful of credits and keeps comparisons fair.
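
A minimal sketch of both aggregations is shown below. It assumes a long-format table with one row per (film, genre) pair for the genre summary and a one-row-per-film table for directors; the column names and function names are hypothetical.

```python
import pandas as pd


def genre_summary(long_df: pd.DataFrame) -> pd.DataFrame:
    """Per-genre film count, median rating, and median gross (input: one row per film-genre pair)."""
    return (
        long_df.groupby("genre")
        .agg(
            films=("title", "nunique"),
            median_rating=("imdb_rating", "median"),
            median_gross=("gross_revenue", "median"),
        )
        .sort_values("median_rating", ascending=False)
    )


def top_directors(df: pd.DataFrame, min_films: int = 5, n: int = 20) -> pd.DataFrame:
    """Directors ranked by median rating, restricted to those with at least min_films films."""
    stats = df.groupby("director").agg(
        films=("title", "count"),
        median_rating=("imdb_rating", "median"),
        total_votes=("votes", "sum"),
    )
    eligible = stats[stats["films"] >= min_films]
    return eligible.sort_values("median_rating", ascending=False).head(n)
```

The median is less sensitive than the mean to a single outlier film, which matters when a director has only five or six credits.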

5. Visualization

All findings were consolidated into a set of clear, labeled visualizations. Each chart was designed to answer a specific question and annotated with the key takeaway directly on the figure, making the notebook readable as a standalone analytical report.
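
One way to put the takeaway directly on a figure is a text box positioned in axes coordinates, as in the sketch below. The helper `annotated_bar` is hypothetical and not taken from the notebook; the `Agg` backend line is only there so the example runs without a display and would normally be dropped inside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt


def annotated_bar(labels, values, title, ylabel, takeaway):
    """Draw a labeled bar chart with the key takeaway written on the figure itself."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.bar(labels, values, color="steelblue")
    ax.set_title(title)
    ax.set_ylabel(ylabel)

    # transform=ax.transAxes places the note relative to the axes (0-1), not the data.
    ax.text(
        0.02, 0.95, takeaway,
        transform=ax.transAxes, va="top", ha="left",
        bbox={"boxstyle": "round", "facecolor": "white", "alpha": 0.8},
    )
    fig.tight_layout()
    return fig, ax
```

Positioning the note in axes coordinates keeps it in the same corner regardless of the data range, so the same helper can be reused across charts.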

Key findings

Genre impact on ratings: Documentary and Biography films earn consistently higher average IMDB ratings than Action or Horror films, but they attract far fewer votes and significantly lower gross revenue. Drama sits in the middle — reliably rated, reliably watched. The highest-grossing genre, Action/Adventure, averages a full point lower in rating than Documentary.
Seasonal release patterns: Films released in the November–December window average notably higher ratings and gross revenues, aligning with the awards season strategy studios use for prestige releases. The summer window (June–August) dominates in sheer volume and total box office gross but shows lower average ratings, reflecting the blockbuster vs. prestige split.
Further findings from the analysis:
  • Vote count as a proxy for reach: Films with more than 100,000 votes have a much tighter rating distribution (7.0–8.5), while films under 10,000 votes span nearly the full 1–10 range. High vote counts signal mainstream appeal and coincide with a much narrower spread of ratings.
  • Director consistency: A small group of directors (Christopher Nolan, Denis Villeneuve, and a handful of others) maintain median ratings above 8.0 across five or more films — a rare achievement that distinguishes them from directors with a single breakout film.
  • Revenue vs. rating disconnect: The correlation between IMDB rating and gross revenue is positive but moderate. Many highly rated films underperform commercially, while several blockbusters with average ratings gross over $1 billion. This suggests that factors outside the dataset, such as marketing budget and franchise recognition, matter more for revenue than audience ratings do.
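
The vote-count comparison above can be checked with a tiered summary like the following sketch. The tier boundaries mirror the 10,000 and 100,000 figures quoted in the finding; the column names and tier labels are assumptions.

```python
import numpy as np
import pandas as pd


def rating_spread_by_vote_tier(df: pd.DataFrame) -> pd.DataFrame:
    """Compare the spread of ratings across vote-count tiers (assumed column names)."""
    # right=False gives half-open bins: [0, 10k), [10k, 100k), [100k, inf).
    tiers = pd.cut(
        df["votes"],
        bins=[0, 10_000, 100_000, np.inf],
        labels=["<10k", "10k-100k", "100k+"],
        right=False,
    )
    grouped = df.groupby(tiers, observed=True)["imdb_rating"]
    out = grouped.agg(["count", "min", "max", "std"])
    out["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return out
```

A smaller `max - min` range and interquartile range in the top tier than in the bottom tier would be consistent with the tighter distribution described above.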

Visualizations

Running the project notebook produces the following charts:
  • Rating distribution histogram: roughly bell-shaped, centered around 6.5–7.0, with a slight left skew (a longer tail toward low ratings)
  • Genre popularity bar chart: ranked by number of films and separately by median rating
  • Top 20 directors by median rating (minimum 5 films): horizontal bar chart with vote-count annotations
  • Revenue vs. rating scatter plot: log-scaled revenue axis, colored by genre, highlighting the blockbuster/prestige divide
  • Monthly release heatmap: average rating and revenue by release month across years
  • Runtime distribution by genre: box plots showing that Drama and Biography skew longer while Horror and Comedy skew shorter
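
As an illustration of the data behind the monthly release heatmap, the sketch below builds a year-by-month grid of averages. It assumes `release_month` is stored as an integer from 1 to 12 and uses hypothetical column names; the notebook may prepare this differently.

```python
import pandas as pd


def month_year_grid(df: pd.DataFrame, value: str = "imdb_rating") -> pd.DataFrame:
    """Average of `value` per (release_year, release_month); rows are years, columns are months."""
    grid = df.pivot_table(
        index="release_year",
        columns="release_month",
        values=value,
        aggfunc="mean",
    )
    # Force all twelve month columns so sparse years still line up; gaps stay NaN.
    return grid.reindex(columns=range(1, 13))
```

The resulting frame can be passed to a heatmap function such as `seaborn.heatmap`, and calling the helper again with `value="gross_revenue"` gives the revenue version of the same grid.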

Technologies used

Python

Core language for all data manipulation and visualization tasks.

Pandas

Data loading, cleaning, aggregation, and multi-genre expansion with explode().

Matplotlib

Histograms, scatter plots, bar charts, and annotated figure exports.

Seaborn

Box plots, heatmaps, and statistical visualizations with consistent styling.

Jupyter Notebook

Self-contained analytical narrative combining code, outputs, and commentary.