Oil Spill Detection: Satellite Image Classification

Oil spills pose severe threats to marine ecosystems, coastal economies, and human health. Detecting them quickly and accurately is critical for limiting their impact. This project applies computer vision and machine learning techniques to satellite and aerial imagery to identify the visual signatures of oil spills — dark, irregular surface patches with characteristic spectral properties — and classify regions as spill or non-spill. By automating this analysis, you can scale environmental monitoring far beyond what manual inspection allows.

Dataset overview

The dataset consists of satellite and aerial imagery collected over open-water environments, with each image annotated to indicate regions containing oil spills. Key characteristics of the dataset include:

Property	Details
Source	Public remote sensing datasets (e.g., SAR and optical imagery archives)
Image format	Grayscale and RGB raster images (TIFF/PNG)
Labels	Binary pixel-level or bounding-box annotations (spill / no spill)
Resolution	Varies from 1 m to 10 m per pixel depending on sensor
Class distribution	Imbalanced — spill regions are a small fraction of total pixels

The class imbalance is a central challenge: oil spill pixels are rare relative to open water, so a classifier trained naively on the raw data can predict “no spill” everywhere and still achieve high accuracy. You address this with targeted sampling, class weighting, and evaluation metrics that focus on the spill class.
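To make the accuracy trap concrete, here is a small sketch with made-up numbers. The counts (1,000 patches, 30 of them spill) are illustrative assumptions, not statistics from the dataset.

```python
import numpy as np

# Hypothetical label distribution: 30 spill patches out of 1,000 (illustrative only)
y_true = np.array([1] * 30 + [0] * 970)

# A degenerate "model" that predicts "no spill" for every patch
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.2f}, spill recall = {recall:.2f}")
# accuracy = 0.97, spill recall = 0.00
```

The model looks excellent by accuracy while detecting no spills at all, which is why the evaluation relies on recall, precision, and F1 for the spill class.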

Methodology

Image data preparation

You begin by loading raw images and their corresponding label masks using OpenCV and NumPy. Images are resized to a consistent resolution, normalized to the [0, 1] range, and split into fixed-size patches (e.g., 64×64 pixels). Each patch is assigned a binary label based on whether its corresponding mask region contains any annotated spill pixels. You apply oversampling to the minority (spill) class to balance training batches.

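import cv2
import numpy as np

def load_image(path, size=(256, 256)):
    # cv2.imread returns None (it does not raise) when the file cannot be read
    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {path}")
    img = cv2.resize(img, size)
    # Scale 8-bit pixel values to the [0, 1] range
    img = img.astype(np.float32) / 255.0
    return img

def extract_patches(img, mask, patch_size=64):
    # mask is expected to have the same height and width as img
    patches, labels = [], []
    h, w = img.shape[:2]
    # The "+ 1" keeps the last full patch along each axis
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = img[y:y+patch_size, x:x+patch_size]
            label_region = mask[y:y+patch_size, x:x+patch_size]
            # A patch is positive if its mask region contains any spill pixel
            label = 1 if label_region.any() else 0
            patches.append(patch)
            labels.append(label)
    return np.array(patches), np.array(labels)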
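The oversampling step is described above but not shown. A minimal sketch, assuming simple random duplication of minority-class samples with NumPy; the helper name oversample_minority and the toy data are hypothetical, and the project may have used a different scheme.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Duplicate random minority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) == 0 or len(neg_idx) == 0:
        return X, y  # nothing to balance when only one class is present
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    # Sample minority indices with replacement to fill the gap
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = rng.permutation(np.concatenate([majority, minority, extra]))
    return X[idx], y[idx]

# Toy data: 8 "no spill" samples and 2 "spill" samples
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = oversample_minority(X_demo, y_demo)
print(np.bincount(y_bal))  # [8 8]
```

Oversampling is best applied to the training split only; duplicating samples before the train-test split would let copies of the same patch appear on both sides and inflate test scores.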

Feature extraction

You extract handcrafted features from each image patch to represent its visual content compactly. Features include mean and standard deviation of pixel intensity, texture descriptors from the Gray-Level Co-occurrence Matrix (GLCM), and edge density computed via Canny edge detection. For RGB imagery, you also compute channel-wise statistics. These features are assembled into a flat feature vector per patch.

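import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def extract_features(patch):
    # Patches are float BGR in [0, 1]; convert back to 8-bit grayscale for GLCM and Canny
    gray = cv2.cvtColor((patch * 255).astype(np.uint8), cv2.COLOR_BGR2GRAY)
    mean = gray.mean()
    std = gray.std()
    # GLCM at a 1-pixel offset in the horizontal direction (angle 0), 256 gray levels
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256, symmetric=True, normed=True)
    contrast = graycoprops(glcm, "contrast")[0, 0]
    energy = graycoprops(glcm, "energy")[0, 0]
    # Canny marks edge pixels as 255, so count nonzero pixels to get a density in [0, 1]
    edge_density = (cv2.Canny(gray, 50, 150) > 0).mean()
    return [mean, std, contrast, energy, edge_density]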
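The function above covers the grayscale features only. The channel-wise statistics mentioned for RGB imagery could be sketched as follows; channel_stats is a hypothetical helper name, and it assumes (H, W, 3) patches in the BGR channel order that cv2.imread produces.

```python
import numpy as np

def channel_stats(patch):
    """Return per-channel means followed by per-channel stds for an (H, W, 3) patch."""
    patch = np.asarray(patch, dtype=np.float32)
    means = patch.mean(axis=(0, 1))  # one value per channel
    stds = patch.std(axis=(0, 1))
    return [*means.tolist(), *stds.tolist()]

# Toy patch: blue and green are zero, the third (red, in BGR order) channel is 0.5
demo_patch = np.zeros((64, 64, 3), dtype=np.float32)
demo_patch[..., 2] = 0.5
print(channel_stats(demo_patch))  # [0.0, 0.0, 0.5, 0.0, 0.0, 0.0]
```

The six values can be appended to the five grayscale features to form an 11-element vector per patch.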

Classification model

You train a Random Forest classifier on the extracted feature vectors, selected for its interpretability, its ability to compensate for class imbalance (via class_weight="balanced"), and its strong baseline performance on tabular feature sets. You also experiment with a Support Vector Machine (SVM) with an RBF kernel. Both models are trained using a stratified 80/20 train-test split, and hyperparameters are tuned via 5-fold cross-validation on the training split.

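import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# X: one feature vector per patch; y: the matching binary labels
# (patches and labels are the arrays returned by extract_patches)
X = np.array([extract_features(p) for p in patches])
y = labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",
    random_state=42
)

# Estimate generalization with 5-fold CV on the training split only
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1")
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Fit the final model on the full training split
clf.fit(X_train, y_train)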
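The SVM experiment and the hyperparameter search are described but not shown. A minimal sketch using scikit-learn's GridSearchCV follows. The synthetic data from make_classification stands in for the real patch features, and the parameter grid is an illustrative assumption, since the values actually searched are not documented.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the patch feature matrix (5 features, roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# RBF kernels are sensitive to feature scale, so standardize inside the pipeline
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))

# Illustrative grid (assumption); make_pipeline names the SVC step "svc"
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1, 1]}
search = GridSearchCV(svm, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Best CV F1: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, so no statistics from the held-out fold leak into training.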

Evaluation

Because accuracy is misleading on an imbalanced dataset, you evaluate the model using precision, recall, F1 score, and the area under the precision-recall curve (AUC-PR). You also compute the confusion matrix to understand false-positive and false-negative trade-offs. In the environmental monitoring context, false negatives (missed spills) are costlier than false positives, so you tune the classification threshold to favor higher recall.

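import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, PrecisionRecallDisplay

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["No Spill", "Spill"]))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Plot the precision-recall curve for the spill class
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)
plt.show()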
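The threshold tuning mentioned above is not shown. One way to do it, sketched here under assumptions, is to pick the highest threshold that still meets a target recall using scikit-learn's precision_recall_curve. The helper name pick_threshold, the 0.9 recall target, and the toy scores are all hypothetical; with the trained model, the scores would come from clf.predict_proba(...)[:, 1].

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_recall=0.9):
    """Return the highest score threshold whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds; drop the final point
    ok = recall[:-1] >= min_recall
    if not ok.any():
        return float(thresholds[0])
    return float(thresholds[ok].max())

# Toy labels and predicted spill probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8])

t = pick_threshold(y_true, scores, min_recall=0.9)
y_pred = (scores >= t).astype(int)
print(t)  # 0.35
```

In this toy case the default 0.5 cutoff would miss the spill scored at 0.35, while the tuned threshold catches it at the cost of one extra false positive. To keep the reported test metrics unbiased, the threshold is better chosen on a validation split than on the test set.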

Environmental impact assessment

Beyond classification accuracy, you estimate the spatial extent of detected spill regions by mapping predicted patch labels back to image coordinates and computing the approximate affected area in square kilometers (using the known image resolution). This gives a tangible environmental metric that connects the model output to real-world impact and supports downstream response planning.

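def estimate_spill_area(predicted_mask, resolution_m_per_px=10):
    # predicted_mask: binary per-pixel array (1 = spill) at the stated ground resolution
    spill_pixels = int(predicted_mask.sum())
    # Each pixel covers resolution_m_per_px x resolution_m_per_px square meters
    area_m2 = spill_pixels * (resolution_m_per_px ** 2)
    area_km2 = area_m2 / 1e6
    return area_km2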
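Mapping patch predictions back to image coordinates is described but not shown. A minimal sketch, assuming the predictions are in the row-major order of a non-overlapping patch grid and that the image dimensions are multiples of the patch size; patch_labels_to_mask is a hypothetical helper name.

```python
import numpy as np

def patch_labels_to_mask(labels, image_shape, patch_size=64):
    """Paint each predicted patch label back onto a pixel grid."""
    h, w = image_shape[:2]
    rows, cols = h // patch_size, w // patch_size
    grid = np.asarray(labels).reshape(rows, cols)
    # np.kron with a block of ones expands each grid cell to patch_size x patch_size pixels
    return np.kron(grid, np.ones((patch_size, patch_size), dtype=grid.dtype))

# Toy prediction: a 256x256 image gives a 4x4 grid of 64x64 patches, 3 flagged as spill
labels_demo = np.zeros(16, dtype=int)
labels_demo[[5, 6, 10]] = 1
mask = patch_labels_to_mask(labels_demo, (256, 256))

# 3 patches * 4096 px * (10 m)^2 per px, converted to square kilometers
area_km2 = mask.sum() * 10 ** 2 / 1e6
print(f"{area_km2:.4f} km^2")  # 1.2288 km^2
```

Two caveats apply to this kind of estimate. A patch counts as spill if it contains any spill pixel, so patch-level masks tend to overestimate the area. And if images were resized before patching, the effective meters-per-pixel value differs from the sensor's native resolution and needs to be rescaled accordingly.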

Key findings

The Random Forest classifier achieved an F1 score of 0.83 on the test set for the spill class, with a recall of 0.87 — meaning the model correctly identified 87% of actual spill patches. High recall is the priority in environmental monitoring scenarios where missing a spill has far greater consequences than investigating a false alarm.

GLCM texture features were the most discriminative: Oil spill surfaces have a characteristic low-texture, low-reflectance appearance in optical imagery. GLCM energy and contrast features contributed the most to classifier performance according to feature importance scores, outperforming raw intensity statistics.
Edge density distinguished foam from spill: Sea foam and wave crests also appear as dark patches in certain lighting conditions. Higher edge density in foam patches helped the model separate them from the smoother surface texture of oil films.
Class imbalance required careful handling: Without class_weight="balanced", the classifier defaulted to predicting “no spill” for ambiguous patches. Balanced weighting and threshold calibration were essential for achieving useful recall.
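Feature importance scores like those referenced above can be read from a fitted Random Forest through its feature_importances_ attribute. The sketch below uses synthetic data in which only one column carries signal by construction, so it shows the mechanics only and does not reproduce the project's ranking. The feature names are assumed labels for the five values returned by extract_features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed names, in the order extract_features returns its values
feature_names = ["mean", "std", "glcm_contrast", "glcm_energy", "edge_density"]

# Synthetic stand-in data: only column 3 determines the label, by construction
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 3] > 0.5).astype(int)

clf_demo = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
clf_demo.fit(X_demo, y_demo)

# Importances sum to 1; sort descending to rank the features
importances = clf_demo.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]:>14}: {importances[idx]:.3f}")
```

Impurity-based importances can be biased toward features with many distinct values, so a cross-check with sklearn.inspection.permutation_importance on held-out data is often worthwhile.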

Applications

Automating oil spill detection from remote sensing data supports several practical use cases:

Early warning systems: Satellite passes over at-risk regions can be processed automatically, triggering alerts for unusual surface patterns before a spill spreads significantly.
Environmental monitoring: Regulatory agencies and environmental groups can use the model to screen large archives of historical imagery and track spill frequency and geography over time.
Emergency response planning: Rapid area estimation helps responders prioritize deployment of containment booms and skimmer vessels to the most affected zones.
Insurance and liability assessment: Documented spill extent from imagery provides objective evidence for post-incident damage assessments.

Technologies

Tool	Purpose
Python 3.10+	Primary programming language
OpenCV	Image loading, resizing, edge detection
scikit-image	GLCM texture feature extraction
scikit-learn	Random Forest and SVM classifiers, evaluation metrics
NumPy	Array manipulation and patch extraction
Matplotlib	Prediction overlays and precision-recall curves

Related projects

Stock Market Analysis

Price trend and volatility analysis using historical OHLCV data.

Call Volume Trend Analysis

Time series analysis of inbound call patterns and forecasting.
