Predicting Apartment Prices Using Machine Learning: A Step-by-Step Practical Guide

Predicting real estate prices is one of the most practical applications of machine learning. In this guide, we walk through a complete workflow used to build a price prediction model for apartments in Mexico City using Python, pandas, and scikit-learn.

Whether you are a beginner in data science or strengthening your applied machine learning skills, this article takes you through each stage in turn.

By the end, you will understand how to clean real estate data, explore patterns, build regression models, evaluate performance, and interpret feature importance.

What Is Machine Learning in Simple Terms?

Machine learning is a way of teaching computers to find patterns in data and use those patterns to make predictions.

For example:

If we show a computer thousands of apartments and their prices, it can learn:

  • bigger apartments usually cost more
  • some neighbourhoods are more expensive than others
  • location affects property value

After learning this, the computer can estimate the price of a new apartment it has never seen before.

That is exactly what we built in this project.

What Was Our Goal?

Our goal was simple:

Build a system that predicts the price of an apartment in Mexico City using information like:

  • apartment size
  • location
  • neighbourhood

This is similar to how real estate platforms estimate property values automatically.

Where Did the Data Come From?

To teach our model, we used a dataset containing real apartment listings.

Each listing included information such as:

  • price in US dollars
  • apartment size in square meters
  • latitude and longitude (map location)
  • neighbourhood name
  • property type

Think of this dataset as a spreadsheet where each row represents one apartment.

Before we could build a prediction system, we first had to prepare this data.

Why Real Estate Price Prediction Matters

Property price prediction helps:

  • investors identify opportunities
  • buyers understand fair market value
  • agencies estimate listing prices
  • analysts detect market trends
  • governments assess housing conditions

Machine learning allows us to automate this process using real data rather than assumptions.

Step 1: Importing and Preparing the Dataset

We started by loading apartment listing data stored across multiple CSV files. These files contained information such as:

  • apartment size
  • geographic coordinates
  • neighbourhood location
  • property type
  • listing price in USD

However, raw listing data usually contains noise such as missing values, inconsistent entries, and irrelevant records. Before modelling, the data must be cleaned carefully.

This step is called data wrangling.
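A minimal sketch of the loading step, assuming the CSV files share the same columns and sit in one folder. The helper name load_listings is hypothetical and not taken from the project.

```python
from glob import glob

import pandas as pd


def load_listings(pattern):
    """Read every CSV matching `pattern` and stack the rows into one DataFrame."""
    paths = sorted(glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

A call might look like load_listings("data/listings-*.csv"), where the folder and file pattern are placeholders for wherever the files actually live.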

Step 2: Cleaning the Dataset with a Wrangle Function

Instead of manually editing each file, we created a reusable function called wrangle() to automate cleaning.

The function performed several essential operations:

Filtering relevant properties

We kept only:

  • apartments
  • located in Mexico City
  • priced below $100,000

This ensured our model focused on a consistent housing segment.
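The three filters can be sketched as boolean masks. The column names property_type and place_with_parent_names, and the use of "Distrito Federal" to identify Mexico City, are assumptions about the dataset; the rows below are made up for illustration.

```python
import pandas as pd

# Toy rows made up for illustration; real listings have many more columns.
df = pd.DataFrame({
    "property_type": ["apartment", "house", "apartment", "apartment"],
    "place_with_parent_names": [
        "|México|Distrito Federal|Benito Juárez|",
        "|México|Distrito Federal|Coyoacán|",
        "|México|Jalisco|Guadalajara|",
        "|México|Distrito Federal|Tlalpan|",
    ],
    "price_aprox_usd": [85000.0, 60000.0, 70000.0, 250000.0],
})

mask_apt = df["property_type"] == "apartment"
# Assumed: Mexico City listings carry "Distrito Federal" in the place hierarchy
mask_city = df["place_with_parent_names"].str.contains("Distrito Federal", na=False)
mask_price = df["price_aprox_usd"] < 100_000

# Keep only rows that pass all three filters
df_filtered = df[mask_apt & mask_city & mask_price]
```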

Removing extreme outliers

Outliers can distort predictions. We removed apartments whose size fell outside the middle 80 percent of the data, that is, below the 10th percentile or above the 90th percentile of apartment size.

This improved model stability.
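One way to express this trimming, sketched on made-up sizes. The column name surface_covered_in_m2 is the one used later in the feature matrix.

```python
import pandas as pd

# Made-up apartment sizes, including one very small and one very large listing
df = pd.DataFrame(
    {"surface_covered_in_m2": [20, 45, 50, 55, 60, 65, 70, 75, 80, 400]}
)

# 10th and 90th percentiles bound the middle 80 percent of the sizes
low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
mask_area = df["surface_covered_in_m2"].between(low, high)

df_trimmed = df[mask_area]
```

Here the 20 m² and 400 m² listings fall outside the bounds and are dropped.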

Extracting location coordinates

The dataset stored latitude and longitude inside a single column:

lat-lon

We split it into two numeric features:

lat
lon

These help the model learn geographic price patterns.
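A short sketch of the split, assuming each lat-lon value is a string of the form "latitude,longitude". The separator is an assumption, and the coordinates below are made up.

```python
import pandas as pd

df = pd.DataFrame({"lat-lon": ["19.432,-99.133", "19.36,-99.17"]})

# Split on the comma into two columns, then convert the text to floats
df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)

# The combined column is no longer needed
df = df.drop(columns="lat-lon")
```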

Creating neighbourhood features

Location strongly affects property price. We extracted borough information from hierarchical place names.

This created a categorical feature called:

borough
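The extraction might look like the sketch below. It assumes the place names live in a column called place_with_parent_names, are separated by "|", and put the borough in the fourth position; all three details are assumptions about the raw data.

```python
import pandas as pd

df = pd.DataFrame({
    "place_with_parent_names": [
        "|México|Distrito Federal|Benito Juárez|",
        "|México|Distrito Federal|Coyoacán|",
    ]
})

# Splitting "|a|b|c|" on "|" gives ["", "a", "b", "c", ""], so index 3 is the borough
df["borough"] = df["place_with_parent_names"].str.split("|").str[3]
df = df.drop(columns="place_with_parent_names")
```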

Removing irrelevant columns

We dropped columns that:

  • contained mostly missing values
  • had only one unique value
  • had unique values for every row
  • leaked information about the target variable

Examples included:

currency
operation
price
price_per_m2
properati_url

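Removing leakage is critical. A column such as price_per_m2 is derived from the price itself, so keeping it would make the model look far more accurate than it could be on new listings where the price is unknown.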
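The first three criteria can be checked programmatically, while leakage needs a judgement call about what each column means. The sketch below uses a made-up frame; the floor column and the 50 percent missing-value threshold are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price_aprox_usd": [85000.0, 60000.0, 70000.0, 90000.0],
    "surface_covered_in_m2": [60.0, 45.0, 52.0, 75.0],
    "floor": [None, None, None, 3.0],                  # mostly missing
    "currency": ["USD", "USD", "USD", "USD"],          # one unique value
    "properati_url": ["u1", "u2", "u3", "u4"],         # unique for every row
    "price_per_m2": [1416.7, 1333.3, 1346.2, 1200.0],  # derived from the target
})

# Columns where more than half of the values are missing
missing_share = df.isna().mean()
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

# Columns that hold a single value throughout
constant = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Non-numeric columns with a different value in every row (identifier-like)
text_cols = df.select_dtypes(exclude="number")
id_like = [c for c in text_cols.columns if text_cols[c].nunique() == len(df)]

# Leaky columns are listed by hand; no automatic check can find them reliably
leaky = ["price_per_m2"]

df_clean = df.drop(columns=mostly_missing + constant + id_like + leaky)
```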

Step 3: Exploring the Data

After cleaning, we explored the price distribution visually.

Histogram of apartment prices

The histogram showed that most apartments cluster at lower price levels, while a small number appear at higher prices.

This indicates a right-skewed distribution, which is typical for real estate markets.

Higher-priced properties are fewer, but they pull the mean price above the median.
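A quick numeric check can back up what the histogram shows. The prices below are made up; in a right-skewed sample the mean sits above the median and the skewness statistic is positive.

```python
import pandas as pd

# Made-up prices: most are clustered low, one is much higher
prices = pd.Series([40000, 45000, 50000, 52000, 55000, 60000, 95000])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive values indicate a longer right tail

print(f"mean={mean_price:,.0f}  median={median_price:,.0f}  skew={skewness:.2f}")
```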

Scatter plot: price versus apartment size

Next, we plotted apartment size against price.

The chart revealed:

  • larger apartments generally cost more
  • some apartments with similar size had different prices
  • location plays an important role beyond size alone

This suggests that size alone is not enough and that multiple features are needed for accurate prediction.

Step 4: Creating Feature Matrix and Target Vector

Machine learning models require separating predictors from outcomes.

We defined:

X_train

as the feature matrix containing:

  • surface_covered_in_m2
  • lat
  • lon
  • borough

and

y_train

as the target variable:

price_aprox_usd
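In pandas, the split is a pair of column selections. The rows below are made up; the column names are the ones listed above.

```python
import pandas as pd

df = pd.DataFrame({
    "price_aprox_usd": [82000.0, 74000.0, 68000.0],
    "surface_covered_in_m2": [50.0, 65.0, 80.0],
    "lat": [19.43, 19.36, 19.30],
    "lon": [-99.13, -99.17, -99.20],
    "borough": ["Benito Juárez", "Coyoacán", "Tlalpan"],
})

target = "price_aprox_usd"
features = ["surface_covered_in_m2", "lat", "lon", "borough"]

X_train = df[features]  # feature matrix: one row per apartment
y_train = df[target]    # target vector: the price to predict
```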

Step 5: Building a Baseline Model

Before training advanced models, we created a baseline predictor.

The baseline model simply predicts the mean price of the training apartments for every property.

We calculated baseline error using:

Mean Absolute Error (MAE)

This tells us how far predictions are from actual values on average.

Any machine learning model must perform better than this baseline to be useful.
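The baseline calculation can be sketched with scikit-learn's mean_absolute_error. The four prices are made up so the arithmetic is easy to follow.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

y_train = pd.Series([50000.0, 60000.0, 70000.0, 100000.0])

# The baseline predicts the same value, the training mean, for every apartment
y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)

baseline_mae = mean_absolute_error(y_train, y_pred_baseline)
print(f"Mean price: {y_mean:,.0f} USD")
print(f"Baseline MAE: {baseline_mae:,.0f} USD")
```

With a mean of 70,000, the absolute errors are 20,000, 10,000, 0 and 30,000, which average to 15,000.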

Step 6: Building a Machine Learning Pipeline

Instead of manually transforming data step by step, we created a pipeline.

The pipeline included:

OneHotEncoder

Converts categorical borough names into numeric features.

SimpleImputer

Fills missing values automatically.

Ridge Regression

A linear regression model that reduces overfitting using regularisation.

Pipeline example:

OneHotEncoder → SimpleImputer → Ridge Regression

Because the same fitted steps run at training time and at prediction time, training and test data are transformed in exactly the same way.
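A sketch of such a pipeline using scikit-learn only. The article does not say which OneHotEncoder implementation was used, so this version makes an assumption: scikit-learn's encoder is wrapped in a ColumnTransformer so that only borough is encoded and the numeric columns pass through unchanged. The function name build_model is hypothetical.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def build_model():
    """Return an unfitted pipeline: encode borough, impute, then Ridge."""
    encode_borough = ColumnTransformer(
        # handle_unknown="ignore" encodes boroughs unseen in training as all zeros
        [("borough", OneHotEncoder(handle_unknown="ignore"), ["borough"])],
        remainder="passthrough",  # keep surface_covered_in_m2, lat, lon as they are
        sparse_threshold=0,       # always hand a dense array to the next step
    )
    # SimpleImputer() fills missing values with each column's mean by default
    return make_pipeline(encode_borough, SimpleImputer(), Ridge())


model = build_model()
print([name for name, _ in model.steps])
```

make_pipeline names each step after its class in lower case, which is how the steps can be looked up later through model.named_steps.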

Step 7: Training the Model

Once the pipeline was defined, we trained it using:

model.fit(X_train, y_train)

The model then learned relationships between:

  • apartment size
  • geographic location
  • neighbourhood
  • price
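After fitting, a natural first check is to compare the model's error on the training data with the baseline MAE. The sketch below rebuilds a pipeline so that it runs on its own; the ColumnTransformer wrapper and the six made-up rows are assumptions for illustration, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    "surface_covered_in_m2": [50.0, 65.0, np.nan, 80.0, 55.0, 70.0],
    "lat": [19.43, 19.36, 19.40, np.nan, 19.30, 19.38],
    "lon": [-99.13, -99.17, -99.15, -99.16, -99.20, -99.14],
    "borough": ["Benito Juárez", "Coyoacán", "Benito Juárez",
                "Tlalpan", "Tlalpan", "Coyoacán"],
})
y_train = pd.Series([82000.0, 74000.0, 79000.0, 68000.0, 52000.0, 77000.0])

model = make_pipeline(
    ColumnTransformer(
        [("borough", OneHotEncoder(handle_unknown="ignore"), ["borough"])],
        remainder="passthrough",
        sparse_threshold=0,
    ),
    SimpleImputer(),
    Ridge(),
)
model.fit(X_train, y_train)

# Error of always predicting the mean versus error of the fitted model
baseline_mae = mean_absolute_error(y_train, [y_train.mean()] * len(y_train))
training_mae = mean_absolute_error(y_train, model.predict(X_train))
print(f"Baseline MAE: {baseline_mae:,.0f} USD")
print(f"Training MAE: {training_mae:,.0f} USD")
```

Training error is an optimistic measure, since the model has already seen these rows; error on held-out listings is the more honest comparison.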

Step 8: Generating Predictions

After training, we loaded a separate dataset containing unseen apartments:

mexico-city-test-features.csv

The model predicted prices using:

model.predict(X_test)

These predictions simulate real-world deployment, where the model must price listings it was not trained on.
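The prediction step can be sketched as follows. In the project X_test came from the CSV file named above; here two made-up rows stand in for it, one with a missing size and one from a borough absent from training, to show that the fitted pipeline handles both. The pipeline construction is an assumption, as are all values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    "surface_covered_in_m2": [50.0, 65.0, 60.0, 80.0, 55.0, 70.0],
    "lat": [19.43, 19.36, 19.40, 19.31, 19.30, 19.38],
    "lon": [-99.13, -99.17, -99.15, -99.16, -99.20, -99.14],
    "borough": ["Benito Juárez", "Coyoacán", "Benito Juárez",
                "Tlalpan", "Tlalpan", "Coyoacán"],
})
y_train = pd.Series([82000.0, 74000.0, 79000.0, 68000.0, 52000.0, 77000.0])

model = make_pipeline(
    ColumnTransformer(
        [("borough", OneHotEncoder(handle_unknown="ignore"), ["borough"])],
        remainder="passthrough",
        sparse_threshold=0,
    ),
    SimpleImputer(),
    Ridge(),
).fit(X_train, y_train)

# Stand-in for the unseen listings; same columns as X_train, no price column
X_test = pd.DataFrame({
    "surface_covered_in_m2": [np.nan, 72.0],
    "lat": [19.41, 19.35],
    "lon": [-99.14, -99.09],
    "borough": ["Benito Juárez", "Iztapalapa"],  # second borough unseen in training
})

# Keep the test index so each prediction can be matched to its listing
y_test_pred = pd.Series(model.predict(X_test), index=X_test.index, name="predicted_usd")
print(y_test_pred.round(0))
```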

Step 9: Understanding Feature Importance

Machine learning models are more valuable when interpretable.

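We extracted coefficients from the regression model to measure feature influence. Each coefficient is expressed in the units of its own feature (dollars per square meter, dollars per degree, dollars per borough indicator), so magnitudes are most directly comparable among features on the same scale, such as the boroughs.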

Important insights included:

  • apartment size strongly increases price
  • some boroughs increase predicted value
  • other boroughs reduce predicted value
  • geographic coordinates capture spatial price variation

This helps explain how the model makes decisions.

Step 10: Visualising the Most Influential Features

Finally, we created a horizontal bar chart showing the ten most impactful predictors.

This visualisation highlights:

  • strongest positive influences on price
  • strongest negative influences on price
  • relative importance of each location feature

Such charts are useful for stakeholders and analysts alike.
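A chart of this kind can be sketched with pandas and matplotlib. The coefficient values and feature names below are invented purely to make the example runnable; they are not results from the project.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented coefficients for illustration only
feat_imp = pd.Series({
    "surface_covered_in_m2": 290.0, "lat": 480.0, "lon": -2500.0,
    "borough_A": 13800.0, "borough_B": -13300.0, "borough_C": 10300.0,
    "borough_D": 9200.0, "borough_E": -5900.0, "borough_F": -5600.0,
    "borough_G": 3700.0, "borough_H": 3300.0, "borough_I": -350.0,
})

# Ten largest coefficients by absolute value, smallest of them first,
# so the strongest one ends up at the top of the horizontal chart
top10 = feat_imp.sort_values(key=abs).tail(10)

fig, ax = plt.subplots()
top10.plot(kind="barh", ax=ax)
ax.set_xlabel("Importance [USD]")
ax.set_ylabel("Feature")
ax.set_title("Ten most influential features")
fig.tight_layout()
```

Call plt.show() to display the figure in a script, or fig.savefig with a file name of your choice to write it to disk.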

Key Lessons from This Project

This project demonstrates the full applied machine learning workflow:

Data preparation matters most

Reliable predictions depend on clean data, and most of the work in this project went into preparing it.

Location is critical in real estate models

Latitude, longitude, and borough information strongly influence price.

Baseline models are essential

They provide a benchmark for improvement.

Pipelines improve reliability

They ensure transformations remain consistent between training and prediction.

Feature importance improves trust

Understanding model decisions increases transparency.

Practical Applications of This Approach

This workflow can be adapted for:

  • housing market forecasting
  • rental price prediction
  • investment opportunity screening
  • valuation automation tools
  • property recommendation systems

It forms the foundation of many commercial property analytics platforms.

Final Thoughts

Predicting apartment prices using machine learning combines statistics, domain knowledge, and programming into one powerful workflow.

By cleaning data carefully, exploring relationships visually, building regression pipelines, and interpreting model outputs, we created a realistic predictive system for Mexico City apartments.

This same approach can be extended to other cities, datasets, and property markets.

Mastering this workflow is a strong step toward becoming a professional data scientist in real-world analytics.