This project, which is actually a competition on Kaggle, focuses on predicting residential home prices in Ames, Iowa. Using a dataset with 79 explanatory variables, I implemented a full data science pipeline, from advanced imputation and feature engineering to a comparative analysis of linear and ensemble models.
The "business" goal is therefore to build a model that accurately predicts the sale price of a home in Ames, Iowa.
The dataset presented a high-dimensional space where the number of features was significant relative to the number of observations. The main challenge was distinguishing between meaningful signals and statistical noise.
I worked in Python inside a Jupyter Notebook, for its simplicity and versatility in cleaning, engineering, and visualizing data, but above all for its machine learning libraries.
Data Engineering Strategy
Instead of standard, one-size-fits-all data cleaning, I applied domain-specific logic to preserve information (a short code sketch of these steps follows the list):
* Strategic Imputation: Leveraging the data documentation, I identified that NaN values in certain categorical features (e.g., PoolQC, Alley) represent the absence of that feature rather than missing data. These were explicitly encoded as "None".
* Neighborhood-Based Imputation: For LotFrontage, I imputed the median value per neighborhood, on the assumption that geographical proximity is a strong predictor of lot frontage.
* Feature Engineering:
  * TotalSF: Consolidated multiple square footage variables into a single "Total Surface Area" feature.
  * HouseAge: Transformed the construction year into the age of the house at the time of sale.
* Log-Transformation: Applied `np.log1p` to the target variable (SalePrice) to reduce the skew of its distribution and move closer to homoscedasticity.
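A minimal sketch of these preprocessing steps, assuming the Kaggle `train.csv` is in the working directory. The exact list of columns encoded as "None" and the square-footage columns summed into TotalSF are my assumptions, not necessarily the precise choices used in the notebook:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle "House Prices" training set

# 1. NaN in these categorical columns means "feature not present", not "missing"
#    (assumed subset; the data documentation lists the full set).
none_cols = ["PoolQC", "Alley", "Fence", "FireplaceQu", "MiscFeature"]
df[none_cols] = df[none_cols].fillna("None")

# 2. Neighborhood-based imputation: median lot frontage per neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# 3. Engineered features.
#    TotalSF: basement + first + second floor area (assumed combination).
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
#    HouseAge: age of the house at the time of sale.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]

# 4. Log-transform the target to reduce skew.
y = np.log1p(df["SalePrice"])
```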
Model Performance & Benchmarking
I compared several algorithms, evaluating RMSE on the log-transformed target, to find the best balance between bias and variance. Surprisingly, the simpler models outperformed the complex ensembles in this specific context (a sketch of the evaluation setup follows the list):
* Linear Regression: RMSE = 0.1259. The top performer; it captured the strong linear relationships effectively.
* Ridge Regression: RMSE = 0.1328. Higher error due to excessive regularization (added bias).
* XGBoost: RMSE = 0.1344. Slight overfitting; the model struggled with the noise-to-signal ratio.
* Random Forest: RMSE = 0.1456. Less effective at capturing global linear trends.
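A minimal sketch of how such a comparison can be run, building on the `df` and `y` from the preprocessing sketch above. The feature matrix here is numeric-only for brevity, and the hyperparameters are illustrative defaults, not the tuned values behind the numbers reported:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Simple numeric-only design matrix for illustration
# (the real pipeline also encodes the categorical columns).
X = df.select_dtypes(include="number").drop(columns=["SalePrice", "Id"]).fillna(0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=10.0),  # illustrative alpha
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42),
}

for name, model in models.items():
    # y is already log-transformed, so this RMSE is directly the log-scale metric.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE (log scale) = {-scores.mean():.4f}")
```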
Critical Insights
* The Simplicity Paradox: In "Low-Row, High-Feature" datasets, Linear Regression is a formidable baseline. Its rigidity prevents it from "hallucinating" patterns in the noise, a problem that did affect XGBoost and Random Forest here.
* The Importance of "Small" Data: Reducing the dataset to only high-correlation features (r > 0.5) significantly worsened performance (RMSE = 0.1572). This indicates that house prices are driven by an accumulation of minor details rather than just a few dominant factors.
* Data Quality > Algorithm Tuning: Removing a handful of specific outliers (e.g., massive houses sold at unusually low prices) improved accuracy more than hyperparameter optimization did (see the sketch after this list).
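A minimal sketch of the two experiments behind the last two insights, again building on the `df`, `X`, and `y` defined above. The 0.5 correlation threshold comes from the text; the exact outlier rule (very large living area at a low sale price) is my assumption of a typical Ames filter, not necessarily the precise one used:

```python
# Experiment 1: keep only features strongly correlated with SalePrice (|r| > 0.5).
corr = df.select_dtypes(include="number").corr()["SalePrice"]
strong_features = corr[corr.abs() > 0.5].index.drop("SalePrice")
X_small = X[strong_features.intersection(X.columns)]
# Re-running the benchmark above on X_small illustrates the feature-selection
# ablation: performance degrades rather than improves.

# Experiment 2: drop extreme outliers -- very large houses sold at low prices.
mask = ~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000))  # assumed thresholds
X_clean, y_clean = X[mask], y[mask]
# Re-fitting on X_clean / y_clean corresponds to the outlier-removal step
# described above.
```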