Project · Predict house prices end-to-end

Module project · Difficulty: hard

Ship something real. Submit your work when you're done.

Brief

Build a complete ML pipeline on the Ames Housing dataset (or California Housing): load raw data, handle missing values, encode categoricals, scale numerics, train a Ridge regression inside a scikit-learn Pipeline, evaluate with 5-fold CV, and save the fitted pipeline to disk. The final model must beat a mean-prediction baseline by at least 20% on RMSE.
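The steps in the brief can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data is a small synthetic stand-in for the raw CSV, the column names (`area`, `rooms`, `zone`, `price`) are hypothetical, and `alpha=1.0` and the imputation strategies are arbitrary starting points. It also assumes pandas is available at training time.

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the raw CSV; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.normal(150, 40, n),
    "rooms": rng.integers(1, 8, n).astype(float),
    "zone": rng.choice(["A", "B", "C"], n),
})
df["price"] = 1000 * df["area"] + 5000 * df["rooms"] + rng.normal(0, 10000, n)
# Knock out some values so the imputers have work to do.
df.loc[rng.choice(n, 20, replace=False), "area"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "zone"] = np.nan

numeric_cols = ["area", "rooms"]
categorical_cols = ["zone"]

# Imputation, scaling, and encoding all live inside the pipeline,
# so they are re-fitted on the training part of every CV fold.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", Ridge(alpha=1.0))])

X, y = df.drop(columns="price"), df["price"]
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    pipeline, X, y, cv=cv,
    scoring=("neg_root_mean_squared_error", "r2"),
)
cv_rmse = -scores["test_neg_root_mean_squared_error"].mean()
cv_r2 = scores["test_r2"].mean()
print(f"CV RMSE: {cv_rmse:.0f}  CV R2: {cv_r2:.3f}")

# Refit on all data and persist the whole pipeline, preprocessing included.
pipeline.fit(X, y)
joblib.dump(pipeline, "pipeline.pkl")
```

Keeping every transformer inside the `Pipeline` means cross-validation never sees statistics computed on held-out rows, and the saved file carries everything needed at prediction time.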

Deliverables

A Jupyter notebook or `train.py` script that runs from raw CSV to a saved `pipeline.pkl` with no manual steps.
A `predict.py` that loads `pipeline.pkl` and prints a prediction for a single JSON sample (passed as CLI argument or stdin).
A `results.md` reporting: CV RMSE, CV R², baseline RMSE (predicting the training-set mean for every test sample), and the percentage improvement over that baseline.
The saved `pipeline.pkl` committed alongside the code.
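One possible shape for `predict.py` is sketched below. It assumes the pipeline was fitted on a plain numeric array whose columns follow the California Housing feature order; `FEATURE_ORDER`, `sample_to_row`, and `main` are illustrative names, not part of any library. If the pipeline was instead fitted on a DataFrame with named columns, the loader would likely need pandas as well.

```python
import json
import sys

import joblib

# Assumed column order used at training time (California Housing names).
FEATURE_ORDER = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]


def sample_to_row(sample, feature_order=FEATURE_ORDER):
    """Turn one JSON object into a single-row 2D list.

    Missing or null features become NaN so the pipeline's imputer,
    not this script, decides how to fill them.
    """
    row = []
    for name in feature_order:
        value = sample.get(name)
        row.append(float("nan") if value is None else float(value))
    return [row]


def main(argv, model_path="pipeline.pkl"):
    # The sample comes from the first CLI argument, else from stdin.
    raw = argv[1] if len(argv) > 1 else sys.stdin.read()
    pipeline = joblib.load(model_path)
    prediction = pipeline.predict(sample_to_row(json.loads(raw)))[0]
    print(float(prediction))
    return 0
```

Ending the file with `if __name__ == "__main__": sys.exit(main(sys.argv))` would let it run as `python predict.py '{"MedInc": 8.3, "HouseAge": 41}'`, or with the JSON piped in on stdin. The script does no scaling or encoding of its own; it only orders the values and hands them to the saved pipeline.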

How we grade it

Pipeline includes at minimum: imputation, encoding, scaling, and a regularized linear model — all inside a single `Pipeline` object.
5-fold CV R² > 0.70 on California Housing (or > 0.80 on Ames Housing).
RMSE beats the mean-baseline by at least 20%.
`predict.py` runs in a fresh virtualenv with only scikit-learn + joblib installed, so no preprocessing may live outside the pipeline.
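The baseline criterion can be checked with a sketch like the one below. It assumes scikit-learn's `DummyRegressor(strategy="mean")` as the mean predictor, evaluated on the same folds as the model, and reads "beats by 20%" as `1 - model_rmse / baseline_rmse >= 0.20`. The data is synthetic and the helper `cv_rmse` is a hypothetical name.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=300)

# Same folds for both estimators so the comparison is like for like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_rmse(estimator):
    """Mean RMSE over the folds (scikit-learn reports it negated)."""
    scores = cross_val_score(
        estimator, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


# DummyRegressor predicts the mean of each training fold on its test fold.
baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))
model_rmse = cv_rmse(make_pipeline(StandardScaler(), Ridge(alpha=1.0)))

improvement = 1.0 - model_rmse / baseline_rmse
print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"model RMSE:    {model_rmse:.3f}")
print(f"improvement:   {improvement:.1%}")
print("meets 20% bar:", improvement >= 0.20)
```

The three printed numbers are the ones `results.md` asks for, alongside CV R².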
