Introduction

For my first data science project, I am presenting a scenario for an entrepreneur who wants to open a Pakistani restaurant in Dubai.

I’ve chosen Dubai for two reasons. First, I’ve spent a fair amount of time there, and I know how its localities are segmented. Second, as a melting pot of 200+ nationalities, it offers world-class dining opportunities for people visiting or residing in the city.

Due to its world-class status as a tourism destination, Dubai is also a strong candidate for the Foursquare platform, which is an essential requirement for this project.

Data

In our hypothetical scenario, we are looking for the optimal place in Dubai to open our restaurant. For this purpose, we need to see how Dubai is segmented into different communities and neighborhoods.

To learn about the communities in Dubai, I needed a list of communities from a credible source. I came across the website of the Dubai Statistics Center, a government organization. Thanks to the Dubai government's OpenData policies, it publishes vital data on its website. I managed to find a list of communities in Dubai, along with population details circa 2019.

The above data can be found here.

Another critical piece of information required was location data for each community, i.e., its latitude and longitude. This information was not available at the source, so we had to use Nominatim through Geocoder to retrieve the latitude and longitude for each community.
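
As a rough illustration of this lookup, here is a minimal sketch. It assumes the geopy package as the geocoding wrapper (the project may have used a different one), and the function names, the user agent string, and the ", Dubai, United Arab Emirates" query suffix are hypothetical choices made for the example. The lookup function is passed in as a parameter so the loop can be tried without network access.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]


def make_nominatim_geocoder(user_agent: str = "dubai-restaurant-demo") -> Callable:
    """Return a rate-limited Nominatim lookup (assumes `pip install geopy`)."""
    # Imported lazily so the rest of the sketch works without geopy installed.
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent=user_agent)
    # Nominatim's public server asks for at most one request per second.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_communities(
    names: Iterable[str],
    geocode: Callable,
    suffix: str = ", Dubai, United Arab Emirates",  # assumed query suffix
) -> Dict[str, Optional[Coordinates]]:
    """Map each community name to (latitude, longitude), or None if not found."""
    coordinates: Dict[str, Optional[Coordinates]] = {}
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            # No match: the name probably needs a manual spelling check.
            coordinates[name] = None
        else:
            coordinates[name] = (location.latitude, location.longitude)
    return coordinates
```

With geopy installed, a call such as `geocode_communities(["Al Quoz"], make_nominatim_geocoder())` would query Nominatim once per name. Names that come back as None are the ones worth checking by hand.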

Methodology

To approach this scenario, I followed the methodology below:

  1. Extract all communities along with their population from the Dubai Statistics Center
  2. Clean the data based on the following set of rules:
    1. Remove all industrial locations from the list, as we only want to open a restaurant in residential or commercial localities. These locations were identified based on general knowledge of Dubai and are indicated in the data source with a suffix like Industrial, IND, etc.
    2. Keep only populated areas. The cleaned list was sorted by population, and only the top 100 communities were selected for this analysis.
  3. Shortlisted community names were then verified with Nominatim using its website here. This was necessary because the spelling in the primary data source differed from the listing in OpenStreetMap. For example, Al Quoz is spelled Al Goze in the data source, due to direct translation from an Arabic dialect.
  4. After the naming corrections, extract the coordinates for each community using Nominatim through geocoder libraries.
  5. Visualize the extracted coordinates on the map using Folium to verify the information retrieved.
  6. Access Foursquare APIs to retrieve a list of venues for each community. Perform exploratory analysis by:
    1. Visualizing the number of venues per community
    2. Summarizing each community based on the type or category of venues available
    3. Listing the venues for the targeted category, i.e., a Pakistani restaurant, for each community.
  7. Filter out noise from the data by excluding all venues belonging to categories other than Pakistani restaurant.
  8. Perform K-Means clustering to segment communities into clusters where Pakistani restaurants are listed and where they are not.
  9. List the clusters to explore which communities are ideal for opening a business.
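
Steps 2 and 7 above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the field names (`community`, `population`, `category`), the regular expression for industrial areas, and the function names are assumptions made for the example. Only the "Industrial"/"IND" markers, the top-100 cut-off, and the "Pakistani Restaurant" category come from the text.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumed pattern: the data source marks industrial areas with
# words such as "Industrial" or "IND" in the community name.
INDUSTRIAL_PATTERN = re.compile(r"\b(industrial|ind)\b", re.IGNORECASE)


def shortlist_communities(rows: List[Dict], top_n: int = 100) -> List[Dict]:
    """Drop industrial areas, then keep the top_n most populated communities."""
    non_industrial = [
        row for row in rows if not INDUSTRIAL_PATTERN.search(row["community"])
    ]
    ranked = sorted(non_industrial, key=lambda row: row["population"], reverse=True)
    return ranked[:top_n]


def count_category_per_community(
    venues: List[Dict], category: str = "Pakistani Restaurant"
) -> Counter:
    """Count venues of one category per community, ignoring all other venues."""
    return Counter(
        venue["community"] for venue in venues if venue["category"] == category
    )
```

The counter returned by `count_category_per_community` gives one number per community, which is the kind of feature the clustering in step 8 works on. Communities with no matching venue are simply absent from the counter and count as zero.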

Findings

Based on the approach above, we found that:

  1. Data wrangling and correction are crucial to achieving results. The key is to find the correct data source and transform it into a form that you can use to achieve your objectives.
  2. Only eight restaurants on Foursquare were categorized as Pakistani across the 100 communities in Dubai (possible causes are discussed in the Shortcomings section below).
  3. Because so few restaurants were listed, the majority of the clusters were rendered empty (red).
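
The effect of such sparse data on clustering can be sketched with a toy example. The code below is a small hand-rolled one-dimensional k-means (Lloyd's algorithm), not the library routine used in the project, and the per-community counts are made up. Only the totals, eight restaurants across 100 communities, come from the findings above.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kmeans_1d(
    values: Sequence[float], k: int, max_iter: int = 100
) -> Tuple[List[int], List[float]]:
    """Lloyd's algorithm on a single feature; returns (labels, centroids)."""
    distinct = sorted(set(values))
    n = len(distinct)
    k = min(k, n)  # cannot have more clusters than distinct values
    # Deterministic start: spread the initial centroids over the observed values.
    if k == 1:
        centroids = [float(distinct[0])]
    else:
        centroids = [float(distinct[(i * (n - 1)) // (k - 1)]) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Made-up distribution: 94 communities with no listing, 4 with one, 2 with two.
counts = [0] * 94 + [1] * 4 + [2] * 2
labels, centroids = kmeans_1d(counts, k=3)
cluster_sizes = Counter(labels)
print(cluster_sizes, centroids)
```

With a feature this sparse, nearly every community lands in the zero-restaurant cluster, so the clustering mostly separates "no listing" from "some listing" and says little about which communities are promising.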

Shortcomings

Certain shortcomings were identified during data extraction and analysis, which affected the results and decision making.

  • The only information available to us was each community's name and its population. However, when deciding where to open a specific ethnic restaurant, we also need to look at the demographics of each area. In our case, information about which communities have a high population of Pakistanis would have made a significant impact on decision making.
  • Another weakness in the data was identified when we retrieved the list of venues from Foursquare. We only managed to pull out eight places marked as ‘Pakistani Restaurant’ in our top 100 communities by total population. This low number could be due to miscategorization or simply to the fact that Pakistani restaurants are not extensively listed on Foursquare.

Conclusion

This scenario can easily be applied to any business you are planning to open in any part of the world. The approach will predominantly remain the same. The only changes required are in the data selection and in some parameters that depend on the geographical location.

So it could just as well be about opening a grocery store, a pharmacy, or even a barbershop.

All of the above work can be viewed at the links below: