Build a complete ML pipeline on the Ames Housing dataset (or California Housing): load raw data, handle missing values, encode categoricals, scale numerics, train a Ridge regression inside a scikit-learn Pipeline, evaluate with 5-fold CV, and save the fitted pipeline to disk. The final model must beat a mean-prediction baseline by at least 20% on RMSE.
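The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the data frame, its column names (`area`, `age`, `zone`), the imputation strategies, and the use of joblib for saving are all assumptions made so the example runs without downloading anything. Swap in the real dataset and column lists.

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data: two numeric columns, one categorical column.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "age": rng.integers(0, 50, n).astype(float),
    "zone": rng.choice(["A", "B", "C"], n),
})
y = 0.1 * X["area"] - 0.5 * X["age"] + rng.normal(0, 5, n)

# Inject a few missing values so the imputers have something to do.
X.loc[::17, "area"] = np.nan
X.loc[::23, "zone"] = np.nan

numeric_cols = ["area", "age"]
categorical_cols = ["zone"]

# Numeric columns: median imputation, then standardization.
# Categorical columns: most-frequent imputation, then one-hot encoding.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([("preprocess", preprocess), ("ridge", Ridge(alpha=1.0))])

# 5-fold CV; scikit-learn reports RMSE as a negated score, so flip the sign.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmse = -cross_val_score(model, X, y, cv=cv,
                        scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {rmse.mean():.3f} +/- {rmse.std():.3f}")

# Refit on all data and save the whole pipeline (preprocessing + model).
model.fit(X, y)
joblib.dump(model, "pipeline.pkl")
```

Because preprocessing lives inside the Pipeline, each CV fold fits the imputers and scaler on its own training split only, which avoids leaking statistics from the validation split.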
Use ColumnTransformer to apply different preprocessing to numeric and categorical columns simultaneously.
MedInc (median income) is by far the most predictive feature; verify this with SelectKBest before adding polynomial features.
For the baseline, use DummyRegressor(strategy='mean'); you can evaluate it with the same cross_val_score call.

$ python train.py
Loading data... 20640 samples, 8 features
Fitting pipeline (5-fold CV)...
CV RMSE: 0.548 ± 0.012
CV R²: 0.617 ± 0.009
Baseline: 1.154 ± 0.022
Improvement over baseline: 52.5%
Saved pipeline.pkl (245 KB)
$ python predict.py '{"MedInc":3.5,"HouseAge":20,"AveRooms":5.2,"AveBedrms":1.0,"Population":1200,"AveOccup":3.1,"Latitude":34.05,"Longitude":-118.24}'
Predicted median house value: $178,400
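The "Improvement over baseline" figure in the sample output above is consistent with a relative RMSE reduction, (baseline − model) / baseline. The sketch below assumes that definition and shows one way to score the model and the DummyRegressor baseline with the same cross_val_score call. The synthetic data from make_regression and the helper names `cv_rmse` and `improvement_pct` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_rmse(estimator, X, y, cv):
    """Return the per-fold RMSE of an estimator (hypothetical helper)."""
    scores = cross_val_score(estimator, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores


def improvement_pct(model_rmse, baseline_rmse):
    """Relative RMSE reduction over the baseline, in percent (assumed definition)."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Synthetic stand-in data with 8 features; replace with the real dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

model_rmse = cv_rmse(make_pipeline(StandardScaler(), Ridge(alpha=1.0)), X, y, cv)
baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"), X, y, cv)

print(f"CV RMSE:  {model_rmse.mean():.3f} +/- {model_rmse.std():.3f}")
print(f"Baseline: {baseline_rmse.mean():.3f} +/- {baseline_rmse.std():.3f}")
print(f"Improvement over baseline: "
      f"{improvement_pct(model_rmse.mean(), baseline_rmse.mean()):.1f}%")

# Check against the numbers in the sample output: (1.154 - 0.548) / 1.154
print(f"{improvement_pct(0.548, 1.154):.1f}%")  # 52.5%
```

Using the same KFold object for both estimators keeps the comparison on identical splits.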
Swap in GradientBoostingRegressor and re-run CV. How much does R² improve? How much longer does training take?
Add a feature_importances_ step: after fitting, print the top-5 most predictive features selected by SelectKBest and their F-scores.
Add GridSearchCV over Ridge alpha (0.01, 0.1, 1.0, 10.0) and SelectKBest k (4, 6, 8). Report the best hyperparameters.
Wrap predict.py in a minimal FastAPI endpoint: POST /predict → JSON body → JSON prediction response. Test it with curl.
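As a starting point for the tuning and feature-score tasks above, here is a rough sketch of a grid search over a Pipeline that contains SelectKBest and Ridge. The step names (`scale`, `select`, `ridge`), the choice of f_regression as the score function, and the synthetic data with placeholder feature names are assumptions. Note that SelectKBest exposes its F-scores through the `scores_` attribute; `feature_importances_` is an attribute of tree-based estimators such as GradientBoostingRegressor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features; replace with the real dataset.
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("ridge", Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "select__k": [4, 6, 8],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)

# scores_ holds one F-score per input feature (selected or not);
# sort descending and keep the five largest.
selector = search.best_estimator_.named_steps["select"]
top5 = np.argsort(selector.scores_)[::-1][:5]
for i in top5:
    print(f"{feature_names[i]}: F = {selector.scores_[i]:.1f}")
```

Putting SelectKBest inside the Pipeline means feature selection is refit on each training fold, so the reported CV score is not inflated by selecting features on the full dataset.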