
Introduction
This project is inspired by the famous Kaggle competition House Prices: Advanced Regression Techniques. The original Kaggle project is based on the Ames Housing dataset and is an ideal project for newcomers to hone their skills on.
The Kaggle project gives you ample opportunity to practice data cleaning, exploratory analysis, and modeling. However, one aspect of the data science project lifecycle is missing, namely data scavenging and extraction. The data is already extracted for you, and you don’t know how the information was gathered.
My objective was to achieve the same goals as the original project, but using my own dataset. Instead of the Ames housing data, I looked into the Dubai real estate market and opted to use one of the prominent real estate portals for my data extraction.
This approach would let me scavenge and extract the data from the internet myself, and it would also add complexity to the task, as I could not simply mimic the models already submitted for the original Kaggle project.
The project is broken into phases, each representing a stage of the data science project lifecycle:
- Business Understanding
- Data Acquisition
- Data Preparation
- Exploratory Analysis
- Modeling
This is part one of the project series covering business understanding, data extraction, and data preparation.
All project artifacts are available on GitHub. You can view the Jupyter notebook for data cleaning here. The Python libraries used for data scraping and data cleaning are also on GitHub.
Business Understanding
All data science endeavors start with a question. In this case: how can I predict a property’s worth in Dubai? To answer this question, we first have to see what drives any property’s price.
Let’s say we are looking for a two-bedroom apartment in Dubai. When you look into the market, you will find different price ranges for a two-bedroom apartment within the same city. This difference is expected because of factors such as:
- Covered Area/Size of the Apartment
- Amenities
- Neighborhood
- Availability
Understanding the relationship between these (and other) factors and market pricing is the business understanding you need to predict the price. In the real world, a real estate broker or agency knows this relationship and can give you a price estimate if you ask about a two-bedroom apartment in an Emaar building in Downtown Dubai, because they understand the business.
This business understanding is what we want to teach our machine learning algorithms: to correlate these factors and predict an apartment’s estimated price.
Data Acquisition
It’s all about Data.

For this project, I required a data source offering currently listed properties along with all of their attributes, and the best place for this is a real estate portal or classifieds website.
I opted for a major real estate portal in Dubai and scraped the publicly available data of around 2,000 apartment units currently listed on that portal. Each property record comprised details such as:
- Price
- No of bedrooms
- No of bathrooms
- Covered Area
- Location details (Area name, latitude, and longitude)
- List of amenities

I split my scraping method into two steps. The first was to collect the URLs of all the property detail pages. Because the portal employed a pagination mechanism to navigate between listings, I used Selenium’s Chrome driver to harvest the URLs. Once I had all 2,000 property URLs, I saved them into a CSV file for the second step.
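As an illustration, a minimal sketch of this URL-harvesting step could look like the following; the listing URL, page count, and CSS selector are hypothetical placeholders, not the portal’s actual markup.

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical listing URL and selector -- adjust to the portal's actual markup.
BASE_URL = "https://example-portal.com/dubai/apartments-for-sale?page={}"

driver = webdriver.Chrome()
urls = []

for page in range(1, 101):  # walk the paginated listing pages
    driver.get(BASE_URL.format(page))
    # Each listing card is assumed to expose its detail page via an <a> tag.
    for link in driver.find_elements(By.CSS_SELECTOR, "a.property-card-link"):
        href = link.get_attribute("href")
        if href:
            urls.append(href)

driver.quit()

# Persist the harvested URLs for the second (detail-extraction) step.
with open("property_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)
```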

In the second step, I used pandas to load the CSV containing all 2,000 URLs and then used urllib + BeautifulSoup to extract the property details into another CSV file.
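A sketch of the second step is shown below; the tag names, class names, and the way the listing ID is derived are assumptions, as the real selectors depend on the portal’s markup.

```python
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

urls = pd.read_csv("property_urls.csv")["url"]
records = []

for url in urls:
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # The tags and class names below are placeholders for the portal's markup.
    records.append({
        "id": url.rstrip("/").split("-")[-1],  # listing id assumed to end the URL
        "title": soup.find("h1").get_text(strip=True),
        "price": soup.find("div", class_="price").get_text(strip=True),
        "bedrooms": soup.find("span", class_="bedrooms").get_text(strip=True),
        "bathrooms": soup.find("span", class_="bathrooms").get_text(strip=True),
        "size": soup.find("span", class_="size").get_text(strip=True),
        "amenities": ", ".join(
            li.get_text(strip=True) for li in soup.select("ul.amenities li")
        ),
    })

pd.DataFrame(records).to_csv("property_details.csv", index=False)
```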

There were challenges along the way: while I was harvesting URLs, random advertisements appeared in the property listings, among many other traps.
We have to remember that scraping is not an ideal affair, and it can get frustrating at times, like when the process timed out after harvesting 600 URLs because of an odd ad appearing. It’s an iterative activity where you have to refine your script every time you run it.
The whole scraping process was done with the help of Selenium and BeautifulSoup in Python.
Data Preparation
Our data acquisition phase resulted in a CSV file containing details of approximately 2,000 properties. Once we loaded the data into a pandas DataFrame, the following features were available to us:

The raw data in the DataFrame looked like this:

Duplicate Properties
The first thing I checked for was duplicate properties. For this, I queried the dataset for duplicates using the ID field.

It returned 112 duplicated rows, i.e., 56 pairs, meaning we had to remove 56 redundant properties from our dataset.
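In pandas, this check can be done as follows; here `df` is the loaded DataFrame and `id` is assumed to be the column holding the listing identifier.

```python
# Flag every row whose id appears more than once (keep=False marks all occurrences).
duplicates = df[df.duplicated(subset="id", keep=False)]
print(len(duplicates))  # 112 rows, i.e. 56 duplicated pairs

# Keep one copy of each listing and drop the 56 redundant rows.
df = df.drop_duplicates(subset="id", keep="first")
```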
Now, let’s look at each feature separately:
Title
The title field does not add value to the exploratory, visualization, or modeling phases. Therefore, we decided to remove it.
Price
When we ran value_counts on the price feature, we found the following:

We can see that 27 of the properties were marked as “Ask for price.” As we are aiming to predict the price of a property, all our training data should have the price defined. Therefore, I removed those 27 properties from the dataset.
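A sketch of this cleanup, assuming the raw price values are strings such as “1,200,000” or “Ask for price” (the exact format is an assumption):

```python
# Drop listings without an asking price, then convert the rest to integers.
df = df[df["price"] != "Ask for price"].copy()
df["price"] = (
    df["price"]
    .str.replace(",", "", regex=False)  # "1,200,000" -> "1200000"
    .str.strip()
    .astype(int)
)
```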
In a real-world project, you have to weigh your options in such circumstances. You can either remove the records or populate the missing prices with the median or any other value agreed upon with the stakeholders.
No of Bathrooms
We also found 19 apartments with 0 bathrooms listed. I retrieved the median of the bathroom feature in our dataset and updated those 19 apartments with the median value.
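A minimal sketch of this imputation, assuming the bathrooms column is already numeric and the median is taken over the non-zero values:

```python
# Treat 0 bathrooms as missing and replace it with the dataset's median.
median_bathrooms = df.loc[df["bathrooms"] > 0, "bathrooms"].median()
df.loc[df["bathrooms"] == 0, "bathrooms"] = median_bathrooms
```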

No of Bedrooms
Upon listing the value counts for the bedrooms feature:

Based on the above, I performed the following two actions on the bedrooms feature (a sketch follows the list):
- Remove the suffix “+ Maid” from the bedroom counts and add it as a True/False feature named maid_room, indicating whether the apartment has a maid’s room.
- Mark studio apartments as 0 bedrooms, since counting them as one-bedroom apartments would make the information inaccurate.
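A sketch of both actions, assuming the raw bedroom values look like “Studio”, “1”, or “2 + Maid”:

```python
# Flag the maid's room, then strip the "+ Maid" suffix from the bedroom count.
df["maid_room"] = df["bedrooms"].str.contains("Maid", case=False, na=False)
df["bedrooms"] = (
    df["bedrooms"]
    .str.replace("+ Maid", "", regex=False)
    .str.strip()
    .replace("Studio", "0")  # a studio has no separate bedroom
    .astype(int)
)
```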
Amenities
The amenities feature in the dataset was the most condensed one. It contained vital information in a comma-separated format. The following is an example from the dataset.
“Unfurnished, Balcony, Barbecue Area, Built in Wardrobes, Central A/C, Children’s Play Area, Children’s Pool, Concierge, Covered Parking, Kitchen Appliances, Lobby in Building, Maid Service, Networked, Pets Allowed, Private Garden, Private Gym, Private Jacuzzi, Private Pool, Security, Shared Gym, Shared Pool, Shared Spa, Study, Vastu-compliant, View of Landmark, View of Water, Walk-in Closet, “
Above is the amenities value from the property with the most amenities.
To make it useful, I split the list into individual True/False features, one per amenity, each set to True or False depending on whether the apartment has it.
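One way to do this in pandas is with `str.get_dummies`, sketched below; the trailing comma seen in the raw strings is stripped first.

```python
# Turn the comma-separated amenities string into one boolean column per amenity.
amenity_dummies = (
    df["amenities"]
    .str.strip(" ,")            # drop the trailing ", " seen in the raw data
    .str.get_dummies(sep=", ")  # one 0/1 column per distinct amenity
    .astype(bool)
)
df = df.join(amenity_dummies).drop(columns=["amenities"])
```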

Apartment Quality
Now that we had 28 additional features describing the apartment’s amenities, I engineered a quality index feature. The purpose was to develop a categorical feature that can be used to classify the apartment’s quality.
I used the following bands for marking the quality of the apartment:
- 01 – 07 Amenities: Low
- 08 – 14 Amenities: Medium
- 15 – 21 Amenities: High
- 22 – 28 Amenities: Ultra
There is a flaw in this calculation: an ultra-luxurious property on the market might get miscategorized because not all of its amenities were recorded in the listing.
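A sketch of the quality index using `pd.cut`, continuing from the amenity columns created above:

```python
import pandas as pd

# Count how many amenities each apartment has, then bucket the count.
amenity_cols = amenity_dummies.columns  # the 28 boolean amenity features
df["amenity_count"] = df[amenity_cols].sum(axis=1)
df["quality"] = pd.cut(
    df["amenity_count"],
    bins=[0, 7, 14, 21, 28],                    # 1-7, 8-14, 15-21, 22-28
    labels=["Low", "Medium", "High", "Ultra"],
)
```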
Size
The size feature was not usable in its raw form, so I ended up cleaning it. It contained the measurements in both SQFT and SQM; for our analysis, I preferred SQFT.

After cleaning up, I was left with an integer value for SQFT per apartment.
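A sketch of this cleanup, assuming raw size strings such as “1,200 sqft / 111 sqm” (the exact format is an assumption):

```python
# Keep only the square-foot figure and convert it to an integer.
df["size_sqft"] = (
    df["size"]
    .str.split("/").str[0]              # drop the sqm part
    .str.replace(",", "", regex=False)
    .str.extract(r"(\d+)")[0]           # keep the numeric portion
    .astype(int)
)
df = df.drop(columns=["size"])
```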

Price per SQFT
Now that we had the apartment size and price features in the required format, it was easy to calculate the price per SQFT. Price per SQFT is a significant number when assessing a neighborhood’s value and its properties.
It was calculated by dividing the apartment price by the size in SQFT.
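With the cleaned `price` and `size_sqft` columns from the earlier steps, this is a single line:

```python
# Price per square foot = asking price / covered area in SQFT.
df["price_per_sqft"] = df["price"] / df["size_sqft"]
```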

Completion Status
After examining this feature, I found that 165 properties were missing this information.

This feature was categorical and could have been helpful in an exploratory analysis, but I had to drop this feature because of a significant number of missing values.
In a real-world project, missing information in categorical features is usually replaced with the majority value or a value based on stakeholders’ feedback.
Location
The location feature comprised values in the format “Dubai, [Neighborhood], [Building/Project Name]” and, in some cases, “Dubai, [Neighborhood].”

To standardize the locations across the dataset, the following refinements were made (sketched after the list):
- Remove the “Dubai, ” prefix from the location, as it is established that the dataset only contains data from Dubai
- Remove the Building/Project name where it exists
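A sketch of both refinements on the location strings:

```python
# Strip the leading "Dubai, " and keep only the neighborhood part.
df["location"] = (
    df["location"]
    .str.replace("Dubai, ", "", regex=False)
    .str.split(",").str[0]   # drop the building/project name where present
    .str.strip()
)
```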

Now that our dataset is ready for exploratory analysis and visualization, keep in mind that the decisions we made above for rectifying the data might not be valid in a real-life project; you should involve stakeholders when cleaning and preparing the data.
Continue…
I’m splitting this project post into multiple parts. You can follow the links below for further reading:
Part 2: Apartment Pricing: Data Visualization & Exploratory Analysis
Part 3: Apartment Pricing: Model Development, Training, and Predictions