Predicting real estate prices is one of the most practical applications of machine learning. In this guide, we walk through a complete workflow used to build a price prediction model for apartments in Mexico City using Python, pandas, and scikit-learn.
Whether you are a beginner in data science or strengthening your applied machine learning skills, this article explains each stage clearly and professionally.
By the end, you will understand how to clean real estate data, explore patterns, build regression models, evaluate performance, and interpret feature importance.
What Is Machine Learning in Simple Terms?
Machine learning is a way of teaching computers to find patterns in data and use those patterns to make predictions.
For example:
If we show a computer thousands of apartments and their prices, it can learn:
- bigger apartments usually cost more
- some neighbourhoods are more expensive than others
- location affects property value
After learning this, the computer can estimate the price of a new apartment it has never seen before.
That is exactly what we built in this project.
What Was Our Goal?
Our goal was simple:
Build a system that predicts the price of an apartment in Mexico City using information like:
- apartment size
- location
- neighbourhood
This is similar to how real estate platforms estimate property values automatically.
Where Did the Data Come From?
To teach our model, we used a dataset containing real apartment listings.
Each listing included information such as:
- price in US dollars
- apartment size in square meters
- latitude and longitude (map location)
- neighbourhood name
- property type
Think of this dataset as a spreadsheet where each row represents one apartment.
Before we could build a prediction system, we first had to prepare this data.
Why Real Estate Price Prediction Matters
Property price prediction helps:
- investors identify opportunities
- buyers understand fair market value
- agencies estimate listing prices
- analysts detect market trends
- governments assess housing conditions
Machine learning allows us to automate this process using real data rather than assumptions.
Step 1: Importing and Preparing the Dataset
We started by loading apartment listing data stored across multiple CSV files. These files contained information such as:
- apartment size
- geographic coordinates
- neighbourhood location
- property type
- listing price in USD
However, raw datasets always contain noise. Before modelling, data must be cleaned carefully.
This step is called data wrangling.
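A minimal sketch of this loading step, assuming the listings are split across several CSV files matched by a glob pattern (the file names here are invented for illustration; here two tiny stand-in files are written to a temporary folder so the snippet runs end to end):

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in files: in the real project the listing CSVs already exist on disk.
tmp = tempfile.mkdtemp()
pd.DataFrame({"price_aprox_usd": [95000, 72000]}).to_csv(
    os.path.join(tmp, "mexico-city-real-estate-1.csv"), index=False
)
pd.DataFrame({"price_aprox_usd": [55000]}).to_csv(
    os.path.join(tmp, "mexico-city-real-estate-2.csv"), index=False
)

# Collect every listing file and stack them into one DataFrame.
files = sorted(glob.glob(os.path.join(tmp, "mexico-city-real-estate-*.csv")))
frames = [pd.read_csv(f) for f in files]
df = pd.concat(frames, ignore_index=True)

print(len(files), len(df))  # 2 files, 3 rows total
```

In practice you would point the glob pattern at your own data directory and apply the cleaning function described next to each file as it is read.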
Step 2: Cleaning the Dataset with a Wrangle Function
Instead of manually editing each file, we created a reusable function called wrangle() to automate cleaning.
The function performed several essential operations:
Filtering relevant properties
We kept only:
- apartments
- located in Mexico City
- priced below $100,000
This ensured our model focused on a consistent housing segment.
Removing extreme outliers
Outliers distort predictions. We removed apartments outside the middle 80 percent range of apartment size.
This improved model stability.
Extracting location coordinates
The dataset stored latitude and longitude inside a single column:
lat-lon
We split it into two numeric features:
lat
lon
These help the model learn geographic price patterns.
Creating neighbourhood features
Location strongly affects property price. We extracted borough information from hierarchical place names.
This created a categorical feature called:
borough
Removing irrelevant columns
We dropped columns that:
- contained mostly missing values
- had only one unique value
- had unique values for every row
- leaked information about the target variable
Examples included:
currency
operation
price
price_per_m2
properati_url
Removing leakage is critical because it prevents unrealistic model accuracy.
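The cleaning steps above can be condensed into a sketch of such a wrangle() function. The column names (place_with_parent_names, lat-lon, surface_covered_in_m2, price_aprox_usd) follow the Properati-style listings format, and the borough's position inside the pipe-separated place hierarchy is an assumption — adjust both to your files. The demo rows are invented for illustration:

```python
import pandas as pd

def wrangle(df):
    """Condensed sketch of the cleaning steps described above."""
    # Keep apartments in Mexico City priced below $100,000.
    mask_apt = df["property_type"] == "apartment"
    mask_city = df["place_with_parent_names"].str.contains("Distrito Federal")
    mask_price = df["price_aprox_usd"] < 100_000
    df = df[mask_apt & mask_city & mask_price].copy()

    # Drop outliers outside the middle 80 percent of apartment size.
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    df = df[df["surface_covered_in_m2"].between(low, high)]

    # Split the combined "lat-lon" column into two numeric features.
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)

    # Extract the borough from the hierarchical place name.
    # Position 3 matches the "|Country|State|Borough|" format assumed here.
    df["borough"] = df["place_with_parent_names"].str.split("|", expand=True)[3]

    # Drop leaky and low-information columns.
    return df.drop(columns=["lat-lon", "place_with_parent_names", "property_type"])

# Tiny invented demo: five qualifying apartments plus one house to filter out.
demo = pd.DataFrame({
    "property_type": ["apartment"] * 5 + ["house"],
    "place_with_parent_names": ["|México|Distrito Federal|Benito Juárez|"] * 6,
    "price_aprox_usd": [80000, 85000, 90000, 95000, 99000, 50000],
    "surface_covered_in_m2": [50, 60, 70, 80, 90, 100],
    "lat-lon": ["19.40,-99.17"] * 6,
})
clean = wrangle(demo)
print(clean.columns.tolist())
```

Because every step lives in one function, the same cleaning can be applied identically to each CSV file as it is loaded.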
Step 3: Exploring the Data
After cleaning, we explored price distribution visually.
Histogram of apartment prices
The histogram showed that most apartments cluster at lower price levels, while a small number appear at higher prices.
This indicates a right-skewed distribution, which is typical for real estate markets.
Higher-priced properties are fewer in number but significantly influence averages.

Scatter plot: price versus apartment size
Next, we plotted apartment size against price.
The chart revealed:
- larger apartments generally cost more
- some apartments with similar size had different prices
- location plays an important role beyond size alone
This confirms that multiple features are needed for accurate prediction.
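Both plots can be produced with a few lines of matplotlib. The data below is synthetic, generated only so the snippet runs on its own; in the project the cleaned listings supply the price and size columns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data: right-skewed prices, loosely tied to size.
rng = np.random.default_rng(42)
size = rng.uniform(40, 120, 500)
price = size * 700 + rng.lognormal(9, 0.5, 500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: histogram of prices (reveals the right skew).
ax1.hist(price, bins=30)
ax1.set_xlabel("Price (USD)")
ax1.set_ylabel("Count")
ax1.set_title("Distribution of Apartment Prices")

# Right panel: scatter of size versus price.
ax2.scatter(size, price, alpha=0.3)
ax2.set_xlabel("Area (m²)")
ax2.set_ylabel("Price (USD)")
ax2.set_title("Price vs. Apartment Size")

fig.savefig("price_eda.png")
```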
Step 4: Creating Feature Matrix and Target Vector
Machine learning models require separating predictors from outcomes.
We defined:
X_train
as the feature matrix containing:
- surface_covered_in_m2
- lat
- lon
- borough
and
y_train
as the target variable:
price_aprox_usd
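The split itself is two lines of pandas once the feature and target names are listed. A tiny stand-in DataFrame replaces the cleaned data here so the snippet is self-contained:

```python
import pandas as pd

# Stand-in for the cleaned listings DataFrame produced by wrangle().
df = pd.DataFrame({
    "surface_covered_in_m2": [60.0, 75.0, 90.0],
    "lat": [19.40, 19.36, 19.43],
    "lon": [-99.17, -99.16, -99.14],
    "borough": ["Benito Juárez", "Coyoacán", "Cuauhtémoc"],
    "price_aprox_usd": [85000.0, 92000.0, 99000.0],
})

# Separate predictors (feature matrix) from the outcome (target vector).
target = "price_aprox_usd"
features = ["surface_covered_in_m2", "lat", "lon", "borough"]
X_train = df[features]
y_train = df[target]

print(X_train.shape, y_train.shape)  # (3, 4) (3,)
```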
Step 5: Building a Baseline Model
Before training advanced models, we created a baseline predictor.
The baseline model simply predicts the average apartment price for every property.
We calculated baseline error using:
Mean Absolute Error (MAE)
This tells us how far predictions are from actual values on average.
Any machine learning model must perform better than this baseline to be useful.
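The baseline calculation can be sketched as follows, with stand-in target values in place of the real y_train:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Stand-in target values; in the project these come from the cleaned data.
y_train = np.array([80000.0, 90000.0, 100000.0, 70000.0])

# The baseline predicts the mean price for every apartment.
y_mean = y_train.mean()
y_pred_baseline = np.full(len(y_train), y_mean)

baseline_mae = mean_absolute_error(y_train, y_pred_baseline)
print(f"Mean apartment price: {y_mean:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
```

If a trained model's MAE is not clearly below this number, it has learned nothing beyond the average.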
Step 6: Building a Machine Learning Pipeline
Instead of manually transforming data step by step, we created a pipeline.
The pipeline included:
OneHotEncoder
Converts categorical borough names into numeric features.
SimpleImputer
Fills missing values automatically.
Ridge Regression
A linear regression model that reduces overfitting using regularisation.
Pipeline example:
OneHotEncoder → SimpleImputer → Ridge Regression
This ensures the model processes training and test data consistently.
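One way to wire this up with plain scikit-learn is sketched below. Note an assumption: scikit-learn's own OneHotEncoder and SimpleImputer must be pointed at the right columns, so this sketch routes them through a ColumnTransformer rather than chaining them directly as the summary above lists them; the toy data is invented for illustration, with a NaN included to show the imputer at work:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny invented training data; the NaN exercises the imputer.
X_train = pd.DataFrame({
    "surface_covered_in_m2": [60.0, np.nan, 90.0, 75.0],
    "lat": [19.40, 19.36, 19.43, 19.38],
    "lon": [-99.17, -99.16, -99.14, -99.15],
    "borough": ["Benito Juárez", "Coyoacán", "Cuauhtémoc", "Coyoacán"],
})
y_train = pd.Series([85000.0, 92000.0, 99000.0, 88000.0])

numeric = ["surface_covered_in_m2", "lat", "lon"]
categorical = ["borough"]

# Impute numeric columns, one-hot encode the borough, then fit Ridge.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("ridge", Ridge(alpha=1.0)),
])

model.fit(X_train, y_train)
print(model.predict(X_train).round(0))
```

Because preprocessing lives inside the pipeline, calling fit and predict applies exactly the same transformations to training and test data.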
Step 7: Training the Model
Once the pipeline was defined, we trained it using:
model.fit(X_train, y_train)
The model then learned how price relates to:
- apartment size
- geographic location
- neighbourhood
Step 8: Generating Predictions
After training, we loaded a separate dataset containing unseen apartments:
mexico-city-test-features.csv
The model predicted prices using:
model.predict(X_test)
These predictions simulate a real-world deployment scenario.
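A sketch of this prediction step, with two assumptions so it runs on its own: a trivial Ridge model on two numeric features stands in for the fitted pipeline, and a tiny test CSV is written to a temporary folder because in the project the file already ships with the dataset:

```python
import os
import tempfile

import pandas as pd
from sklearn.linear_model import Ridge

# Stand-in model; in the project the fitted pipeline plays this role.
X_train = pd.DataFrame({"surface_covered_in_m2": [60.0, 75.0, 90.0],
                        "lat": [19.40, 19.36, 19.43]})
y_train = pd.Series([85000.0, 92000.0, 99000.0])
model = Ridge().fit(X_train, y_train)

# Write a tiny stand-in test file so the snippet runs end to end.
path = os.path.join(tempfile.mkdtemp(), "mexico-city-test-features.csv")
pd.DataFrame({"surface_covered_in_m2": [70.0, 85.0],
              "lat": [19.41, 19.39]}).to_csv(path, index=False)

# Load the unseen apartments and predict their prices.
X_test = pd.read_csv(path)
y_pred = pd.Series(model.predict(X_test), name="price_aprox_usd")
print(y_pred)
```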
Step 9: Understanding Feature Importance
Machine learning models are more valuable when interpretable.
We extracted coefficients from the regression model to measure feature influence.
Important insights included:
- apartment size strongly increases price
- some boroughs increase predicted value
- other boroughs reduce predicted value
- geographic coordinates capture spatial price variation
This helps explain how the model makes decisions.
Step 10: Visualising the Most Influential Features
Finally, we created a horizontal bar chart showing the ten most impactful predictors.
This visualisation highlights:
- strongest positive influences on price
- strongest negative influences on price
- relative importance of each location feature
Such charts are useful for stakeholders and analysts alike.
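The chart itself takes only a few lines of pandas and matplotlib. The coefficient values below are stand-ins invented for illustration, not real results from the project; in practice they come from the fitted Ridge model:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in coefficients; real values come from the fitted model.
coefs = pd.Series({
    "surface_covered_in_m2": 650.0,
    "borough_Benito Juárez": 12000.0,
    "borough_Iztapalapa": -9000.0,
    "borough_Coyoacán": 4000.0,
    "lat": 1500.0,
    "lon": -800.0,
})

# Keep the ten largest coefficients by absolute size, smallest first
# so the strongest influences end up at the top of the chart.
top = coefs.reindex(coefs.abs().sort_values().index).tail(10)
top.plot(kind="barh", figsize=(8, 5))
plt.xlabel("Coefficient (USD)")
plt.title("Most Influential Features")
plt.tight_layout()
plt.savefig("feature_importance.png")
```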
Key Lessons from This Project
This project demonstrates the full applied machine learning workflow:
Data preparation matters most
Clean data produces reliable predictions.
Location is critical in real estate models
Latitude, longitude, and borough information strongly influence price.
Baseline models are essential
They provide a benchmark for improvement.
Pipelines improve reliability
They ensure transformations remain consistent between training and prediction.
Feature importance improves trust
Understanding model decisions increases transparency.
Practical Applications of This Approach
This workflow can be adapted for:
- housing market forecasting
- rental price prediction
- investment opportunity screening
- valuation automation tools
- property recommendation systems
It forms the foundation of many commercial property analytics platforms.
Final Thoughts
Predicting apartment prices using machine learning combines statistics, domain knowledge, and programming into one powerful workflow.
By cleaning data carefully, exploring relationships visually, building regression pipelines, and interpreting model outputs, we created a realistic predictive system for Mexico City apartments.
This same approach can be extended to other cities, datasets, and property markets.
Mastering this workflow is a strong step toward becoming a professional data scientist in real-world analytics.
