A data science blog documenting learning, projects, concepts, and how-tos of this incredible field.

Apartment Pricing: Model Development, Training, and Predictions

Welcome to the final part of the project series “Apartment Pricing: Advanced Regression Techniques.” 

I highly recommend visiting previous posts in this series:

You can find all the work shown in this series in my GitHub repository here.

Introduction

So, this is what we’ve been building up to: predicting prices.

To reach this stage, we covered the following stages of a data science project lifecycle:

  1. Business Understanding
  2. Data Acquisition
  3. Data Preparation
  4. Data Visualization & Exploratory Analysis

In this final part of the series, we will:

  1. Apply final tuning to our dataset
  2. Divide our dataset into training and testing samples
  3. Train our machine learning models
  4. Test those models
  5. Output our results

Final Tuning

Although our dataset was ready, we still had two categorical features that required further adjustment, as they were not helpful for model training in their current form.

These two features are neighborhood and quality. The neighborhood is a string feature containing the name of the locality in Dubai. The quality is also a string feature, which we engineered in the previous stage to rate apartments based on the number of amenities offered.

Machine learning models require their inputs and outputs to be in numeric format, so categorical data with string labels (e.g., High, Ultra) is not an acceptable form for model development.

There are a couple of ways to handle such a situation. One is to exclude the features entirely from the dataset at this stage if they are not relevant to your business need. But in our case, neighborhood and quality are critical features for determining the worth of a property.

The other way is to transform your categorical data into a numeric representation, for example by assigning 1 for Low, 2 for Medium, 3 for High, and 4 for Ultra. This technique is called label encoding.

The approach I took is called One-Hot Encoding, using pandas’ built-in get_dummies function.

In this methodology, every unique category is turned into an individual feature and assigned a binary value of 0 or 1. For example, our quality feature is split into four columns called Ultra, High, Medium, and Low, each holding a value of 0 or 1.
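As a minimal sketch of this encoding (the neighborhood and quality values below are hypothetical samples, not the real dataset):

```python
import pandas as pd

# Hypothetical sample of the two categorical features
df = pd.DataFrame({
    "neighborhood": ["Dubai Marina", "Jumeirah", "Dubai Marina"],
    "quality": ["High", "Ultra", "Low"],
})

# Each unique category becomes its own indicator column holding 0 or 1
encoded = pd.get_dummies(df, columns=["neighborhood", "quality"])
print(encoded.columns.tolist())
```

Each original column is replaced by one indicator column per unique category, e.g. quality_High and quality_Ultra. (Recent pandas versions emit these as boolean columns; the 0/1 semantics are the same.)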

Train/Test Split

Now that our dataset is ready, we can proceed with the next step of splitting the data into training and testing datasets.

When developing machine learning models, we train the model using a subset of our data, and once the model is trained, we test it using a different subset of the same dataset. This practice ensures that the training and testing data share identical features while the model is evaluated on examples it has never seen.

For our model, we manually split our dataset into a 70/30 division: seventy percent of the data is for training the model, and the rest is for testing and validating it.

We also made a copy of the test dataset and removed all the features except the ID and the original Price feature. We will use this dataset for matching our predictions against the original prices.

We then further split our training dataset into an 80/20 division.

Do note that, generally, the bigger the training sample, the more accurate your model’s outcomes will be, though you still need enough test data for a reliable evaluation.
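The splits described above can be sketched with scikit-learn’s train_test_split; the DataFrame below is a synthetic stand-in for the encoded apartment data, not the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded apartment dataset
data = pd.DataFrame({
    "size_sqft": range(100, 200),
    "quality_High": [i % 2 for i in range(100)],
    "price": [s * 1000 for s in range(100, 200)],
})

X = data.drop(columns=["price"])
y = data["price"]

# 70/30 split: 70% for training, 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Further 80/20 split of the training portion
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
print(len(X_train), len(X_test), len(X_tr), len(X_val))  # 70 30 56 14
```

Fixing random_state makes the splits reproducible across runs, which helps when comparing models.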

Model Training

We are using the following regression-based models for our price predictions:

GradientBoostingRegressor

RandomForestRegressor

XGBRegressor

LGBMRegressor

I’ve mostly configured my models with default settings, but you can adjust the attributes to tune your model further.
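All four models expose the same fit/predict interface. A sketch using the two scikit-learn models with default settings on hypothetical synthetic data (XGBRegressor and LGBMRegressor, from the xgboost and lightgbm packages, plug into the same loop):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Hypothetical synthetic data standing in for the apartment features
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Default settings, as in the post; only random_state is fixed
models = {
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}
for model in models.values():
    model.fit(X, y)
```

Keeping the models in a dict makes it easy to fit, score, and compare them in one loop.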

Model Testing

Using the above-trained models and our testing dataset, we tried to predict the prices. We measured our predictions using the R2 score, often called R-squared. By definition, the R2 score is the proportion of the variance in the dependent variable that is predictable from the independent variable(s) (source).
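Computing the R2 score with scikit-learn looks like this (synthetic data, so the score itself is only illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data in place of the apartment dataset
X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
score = r2_score(y_te, model.predict(X_te))  # 1.0 would be a perfect fit
```

A score near 1.0 means the model explains nearly all the variance in prices; a score near 0 means it does no better than predicting the mean.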

The following are our R2 scores:

GradientBoostingRegressor

RandomForestRegressor

XGBRegressor

LGBMRegressor

Predictions

Based on the above tests, the highest R2 score was achieved by GradientBoostingRegressor, i.e., 0.9992177542939731. Therefore, we will take the predictions from GradientBoostingRegressor and compare them with the actual prices in our testing dataset.

For this, we appended the predictions into our prediction dataset, which was a copy of the testing dataset, but only containing price and property id.

We also calculated the difference between actual prices and predicted prices and appended that to the prediction dataset.
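A minimal sketch of building that prediction dataset (the IDs and prices below are made up for illustration):

```python
import pandas as pd

# Hypothetical copy of the test dataset holding only the property ID and price
pred_df = pd.DataFrame({
    "id": [101, 102, 103],
    "price": [1_200_000, 850_000, 2_400_000],
})

# Illustrative model predictions for the same three properties
predictions = [1_195_000, 861_000, 2_380_000]

# Append the predictions and the actual-minus-predicted difference
pred_df["predicted_price"] = predictions
pred_df["difference"] = pred_df["price"] - pred_df["predicted_price"]
```

The difference column makes it easy to spot outliers where the model was far off.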

We did very well. You can see that the predicted prices are very close to the actual prices. The plot below also reflects this; you can hardly see the blue data points for the original prices:

Conclusion

This concludes our project series. We saw how we came up with our business need and, from there, moved forward and extracted the required information from the internet using web scraping techniques.

Once we had our data, we refined it for our use in performing visualizations and exploratory analysis. Ultimately, we used that very same data to train our model and extract predictions.

Moving further forward, we could roll out this model and interface it with a web application or a mobile app, letting users provide inputs and receive estimated prices for a property.

2 Comments

  1. johnny

    Hi, my name is Johnny.
    First, congratulations!! Your work on this project will be useful for my start in this area of data science; I learned a lot.
    I think I will need your help to progress in this area, if possible.

    • Waqas Ahmed

      Hi Johnny, thanks for the kind words.

      I will gladly help you. I would recommend going through Coursera and other MOOCs available online. That’s what helped me get started.

      Kaggle is also a very good resource for beginners.

      Thanks
      W.


© 2025 Data Regress.
