Why a two-parameter model beat Random Forest

In our MATH4069 challenge, we tried Random Forests, per-type Random Forests, and feature-enriched ensembles. None of them worked as well as a simple linear regression with one predictor per impurity type.

The reason came down to structure. When you stratify by type, each channel–percentage relationship is near-perfectly linear. The within-type R² averaged 0.906 across all thirteen types. A Random Forest with 500 trees on five training points doesn’t stand a chance — it memorises, it doesn’t generalise.
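To make the point concrete, here is the two-parameter fit on one hypothetical impurity type (the numbers below are illustrative, not the challenge data, and the real analysis was done per type across all thirteen):

```python
import numpy as np

# Illustrative channel readings vs. impurity percentage for ONE type.
# The real challenge had only a handful of points per type, like this.
percentage = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
channel = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# The two-parameter model: channel ≈ intercept + slope * percentage
slope, intercept = np.polyfit(percentage, channel, 1)

# Within-type R², the quantity that averaged 0.906 across types
pred = intercept + slope * percentage
ss_res = np.sum((channel - pred) ** 2)
ss_tot = np.sum((channel - channel.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(slope, 2), round(intercept, 2), round(r2, 3))  # 1.99 0.09 0.999
```

With relationships this close to linear, the two fitted parameters capture essentially all the signal; a forest's extra flexibility has nothing left to model but noise.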

The lesson: model selection must be guided by the structure of the data, not by the default flexibility of popular methods. A two-parameter linear model outperformed a Random Forest with thousands of implicit parameters — because the problem’s structure was linear.

We scored 98 on the challenge's error metric, where lower is better; the next best group scored 113. Sometimes the simplest answer is the right one, but only if you’ve done the EDA to prove it.

Deploying DeltaConvert: Docker, SSL, and the joy of Nginx

DeltaConvert started as a simple Flask app. Getting it live on a Hetzner CX43 with Docker Compose, Certbot SSL, and proper file validation turned out to be the real engineering challenge.

The ACCEPT_MAP system for file type validation, lazy-loading EasyOCR to prevent startup crashes, rewriting nginx.conf via Python scripts because bash special characters are evil — each problem taught me something that no tutorial covers.
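The ACCEPT_MAP idea is simple enough to sketch in a few lines. This is a minimal illustration of the concept, not the actual DeltaConvert code, and all the route names and magic bytes here are stand-ins: each conversion route accepts only the extensions and file signatures it can actually handle, so a renamed executable can't sneak in as a PDF.

```python
# Hypothetical ACCEPT_MAP-style validation sketch. Each route lists the
# file extensions it accepts and the magic bytes the upload must start with.
ACCEPT_MAP = {
    "pdf-to-docx": {"ext": {".pdf"}, "magic": b"%PDF"},
    "png-to-jpg": {"ext": {".png"}, "magic": b"\x89PNG"},
}

def is_accepted(route: str, filename: str, head: bytes) -> bool:
    """Check both the claimed extension and the real file signature."""
    rule = ACCEPT_MAP.get(route)
    if rule is None:
        return False
    ext_ok = any(filename.lower().endswith(e) for e in rule["ext"])
    return ext_ok and head.startswith(rule["magic"])

print(is_accepted("pdf-to-docx", "report.pdf", b"%PDF-1.7"))   # True
print(is_accepted("pdf-to-docx", "report.pdf", b"MZ\x90\x00")) # False: EXE bytes
```

Checking magic bytes as well as the extension is the part that matters: the extension is what the user claims, the signature is what the file is.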

The biggest lesson: production deployment is a completely different discipline from writing application code. The gap between “it works on my machine” and “it works on the internet” is where engineering lives.

Increment vs. absolute: reframing the prediction target

For the train delay challenge, we spent three weeks trying to predict the final Nottingham delay directly. The breakthrough came when we stopped doing that.

Instead, we modelled the Sheffield–Nottingham increment — the delay change after the last observed station. The distribution was tighter, the variance lower, and the XGBoost model could focus on what actually mattered: downstream delay evolution, not re-predicting what we already knew.
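The reframing itself is a two-line transformation. Here is a sketch with made-up delay values (the column names and numbers are hypothetical; the real pipeline fed the increment target to XGBoost):

```python
import numpy as np

# Hypothetical delays in minutes for four journeys.
sheffield_delay = np.array([4.0, 10.0, 2.0, 7.0])   # last observed delay
nottingham_delay = np.array([6.0, 9.0, 5.0, 12.0])  # final delay we want

# Reframed target: the delay CHANGE downstream of the last observed station.
increment = nottingham_delay - sheffield_delay      # [2, -1, 3, 5]

# The regressor (XGBoost in our case) is trained on `increment`. At
# prediction time, the final delay is reconstructed from what we already know.
predicted_increment = increment  # stand-in for the model's output
predicted_final = sheffield_delay + predicted_increment
print(predicted_final)
```

The known Sheffield delay is added back deterministically, so the model spends its capacity on the only genuinely uncertain quantity: what happens after the last observation.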

Sometimes the best modelling decision isn’t about the model. It’s about the question.

Our MSE dropped from 32,505 in Week 2 to 28,078 in Week 5. The model class change (Random Forest to XGBoost) helped, but the target reformulation was the real driver.

Training 29 LSTM models inside Docker

Stochastix forecasts 30-day exchange rates for 29 currency pairs. Each pair gets its own LSTM model trained on ECB historical data.

Training all 29 models locally inside Docker was an exercise in patience and memory management. The trick was sequential training with explicit garbage collection between models — loading one dataset, training, saving weights, clearing memory, then moving to the next pair.
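The loop is worth sketching, because the ordering is the whole trick. This is a simplified stand-in (the training function here is a dummy; the real code fits a Keras LSTM on ECB data and writes weights to disk):

```python
import gc

PAIRS = ["EURUSD", "EURGBP", "EURJPY"]  # 29 pairs in the real system

def train_one(pair: str) -> str:
    # Stand-in for the real work: load that pair's ECB history,
    # fit the LSTM, save weights, and return the weights path.
    return f"weights/{pair}.h5"

saved = []
for pair in PAIRS:
    path = train_one(pair)  # one pair fully trained and persisted
    saved.append(path)
    gc.collect()            # reclaim dataset/model memory before the next pair
    # With Keras, tf.keras.backend.clear_session() here also frees graph state.
print(saved)
```

Training strictly one model at a time, persisting it, and then clearing memory keeps peak usage at roughly one model's footprint instead of twenty-nine.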

A daily cron job now auto-resolves prediction audits against live ECB rates. The system scores itself every morning before I wake up.
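For flavour, the scheduling side is a single crontab line. The paths and script name below are illustrative, not the actual deployment:

```shell
# Hypothetical crontab entry: every day at 07:00, resolve pending
# prediction audits against the latest ECB reference rates.
0 7 * * * /usr/bin/python3 /opt/stochastix/resolve_audits.py >> /var/log/stochastix/audit.log 2>&1
```

Appending both stdout and stderr to a log file means a silent cron failure still leaves a trail to debug.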

The bias-variance tradeoff is not just theory

In Week 3 of our chemicals challenge, we trained per-type Random Forests. Each type had 5 to 10 observations. The score went from 147 to 161. Worse.

The problem was textbook: with n = 5 training points, each bootstrap sample contains about 3.4 distinct observations on average, leaving roughly 1.6 points out-of-bag. The variance term in the bias-variance decomposition completely dominated.
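The expectation is easy to verify. The expected number of distinct points in a bootstrap sample of size n drawn from n observations is n(1 − (1 − 1/n)ⁿ), and the rest fall out-of-bag:

```python
# Expected distinct observations in a bootstrap sample of size n from n points.
n = 5
distinct = n * (1 - (1 - 1 / n) ** n)  # 5 * (1 - 0.8**5)
oob = n - distinct                     # expected out-of-bag count
print(round(distinct, 2), round(oob, 2))  # 3.36 1.64
```

As n grows this converges to the familiar 63.2% in-bag / 36.8% out-of-bag split; at n = 5 the exact numbers are even less forgiving than the asymptotics suggest.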

R even warned us: “The response has five or fewer unique values. Are you sure you want to do regression?”

We should have listened. The fix was embarrassingly simple — replace the forest with a two-parameter linear model. Bias went up slightly. Variance collapsed. Net result: a 63-point improvement.

The bias-variance tradeoff is real. It’s not just a diagram in a textbook. It’s the difference between first place and last.

From Delhi to Nottingham

Moving from a pure mathematics degree to an MSc in Data Science felt like switching from theory to practice overnight. The mathematical foundations are the same — probability, linear algebra, optimisation — but the emphasis shifts entirely.

In Delhi, we proved theorems. In Nottingham, we build things that work. Both matter. The best data scientists I’ve met do both.

The South Asia Postgraduate Excellence Award helped make the move possible. The INSPIRE Scholarship before that kept me going through three years of mathematics in India. I don’t take either for granted.