The Battle of Neighborhoods

Introduction:

Toronto is one of the most densely populated areas in Canada. As a land of opportunity, it draws people from many different ethnic backgrounds. It is the largest city in Canada, with an estimated population of over 6 million in the Greater Toronto Area, so the diversity of its population is no surprise. This multiculturalism is visible in its neighborhoods, including Chinatown, Corso Italia, Little India, Kensington Market, Little Italy, Koreatown and many more. Downtown Toronto, the hub of interaction between these communities, offers many opportunities for entrepreneurs to start or grow a business. It is a place where people can try the best of each culture, whether they work there or are just passing through. Toronto is also well known for its great food.

The objective of this project is to use Foursquare location data and regional clustering of venue information to determine what might be the ‘best’ neighborhood in Toronto to open a restaurant. Pizza and pasta, both of Italian origin, are among the most popular dishes in Toronto. Toronto is the fourth-largest home to Italians, with a population of over 500k, so there are numerous opportunities to open a new Italian restaurant. Through this project we will find the most suitable location for an entrepreneur to open a new Italian restaurant in Toronto, Canada.

Target Audience:

• Entrepreneurs who want to open an Italian Restaurant in Toronto

Data Overview:

The data required is a combination of CSV files prepared for this analysis from multiple sources: the list of neighborhoods in Toronto (via Wikipedia), the geographical location of each neighborhood (via the Geocoder package) and venue data pertaining to Italian restaurants (via Foursquare). The venue data will help identify which neighborhood is best suited for a new Italian restaurant.

Methodology:

First, we will need to extract the data from the data sources:

Source 1: Toronto Neighborhoods via Wikipedia

The Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) provided almost all the information about the neighborhoods: the postal code, borough and neighborhood names in Toronto. Since the data was not in a format suitable for analysis, it was scraped from the page.

Source 2: Geographical Location data using Geocoder Package

Figure 2: Geographical data of Neighborhoods in Toronto

Figure 3: Conversion of file into dataframe

The second data source (https://cocl.us/Geospatial_data) provided the geographical coordinates of the neighborhoods along with their postal codes. The file was in CSV format, so loading it into a Pandas data frame was simple (shown in Figure 3).

Source 3: Venue Data using Foursquare

The location, name and category of the various venues in Toronto were retrieved through the Foursquare explore API. Obtaining the data required creating an account, which provides the ‘Client ID’ and ‘Secret Key’ needed to make requests.

Figure 4: Venue data pulled from Foursquare explore API

Figure 4 shows that the venues are grouped by neighborhood, which makes clustering easier later on.

After all the data was collected and put into data frames, it had to be cleansed and merged before analysis could begin. In the Wikipedia data, some boroughs and neighborhoods were listed as Not assigned; therefore, the following assumptions were made:

  1. Only the cells that have an assigned borough will be processed. Rows whose borough is Not assigned are ignored.
  2. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in Figure 2, row 4.
  3. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

After these assumptions were applied, the rows were grouped based on borough as shown below.
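The three assumptions above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (PostalCode, Borough, Neighborhood) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Illustrative rows in the shape of the Wikipedia table (sample values only).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A", "M3A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto",
                    "Queen's Park", "North York"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park",
                         "Not assigned", "Parkwoods"],
    }
)

# Assumption 1: keep only rows that have an assigned borough.
df = raw[raw["Borough"] != "Not assigned"].copy()

# Assumption 3: a Not assigned neighborhood takes the name of its borough.
mask = df["Neighborhood"] == "Not assigned"
df.loc[mask, "Neighborhood"] = df.loc[mask, "Borough"]

# Assumption 2: combine neighborhoods that share a postal code into one row.
toronto = (
    df.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
print(toronto)
```

The filter is applied first so that rows without a borough never reach the grouping step.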

Figure 5: Rows grouped together based on Borough

Using the Latitude and Longitude collected from the Geocoder package, we merged the two tables together based on Postal Code.

Figure 6: Merging table together based on Postal Code
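The merge on postal code described above might look like the sketch below. The CSV text stands in for the downloaded coordinates file; its column names and coordinate values are illustrative assumptions, not the real data.

```python
import io

import pandas as pd

# Stand-in for the coordinates CSV (illustrative values, assumed column names).
csv_text = """Postal Code,Latitude,Longitude
M3A,43.75,-79.33
M4A,43.73,-79.32
M5A,43.65,-79.36
"""
geo = pd.read_csv(io.StringIO(csv_text)).rename(
    columns={"Postal Code": "PostalCode"}
)

# Stand-in for the cleaned neighborhood table.
neighborhoods = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Victoria Village",
                         "Harbourfront, Regent Park"],
    }
)

# A left join keeps every neighborhood row, even if a coordinate is missing.
merged = neighborhoods.merge(geo, on="PostalCode", how="left")
print(merged)
```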

Next, the venue data pulled from the Foursquare API was merged with the table above, giving the local venues within a 500-meter radius of each neighborhood, as shown below.

Figure 7: Local Venues near the respective Neighborhood
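As a rough sketch of how a 500-meter venue query could be constructed, the helper below builds a request URL for the legacy Foursquare v2 explore endpoint. The function name build_explore_url, the version date, the result limit and the placeholder credentials are all assumptions for illustration; the endpoint may have changed since this project was written, and no request is actually sent here.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng,
                      radius=500, limit=100, version="20180605"):
    """Build a venue-search URL for one neighborhood (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (assumed value)
        "ll": f"{lat},{lng}",    # latitude,longitude of the neighborhood
        "radius": radius,        # search radius in meters
        "limit": limit,          # maximum number of venues (assumed value)
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; real ones come from the Foursquare account.
url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 43.65, -79.36)
print(url)
```

Calling the helper once per neighborhood, with that neighborhood's latitude and longitude, would yield one query per row of the merged table.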

With the data cleansed, the next step was to analyze it. We created a map using folium and color-coded each neighborhood according to the borough in which it is located.

Figure 8: Toronto Neighborhoods

Next, we used the Foursquare API to get a list of all the venues in Toronto, which included parks, schools, cafés, Asian restaurants, etc. This data was crucial for analyzing the number of Italian restaurants across Toronto: there were 45 in total. We then merged the Foursquare venue data with the neighborhood data, which gave us the nearby venues for each neighborhood.

Figure 9: Venue table merged with Neighborhood data

To analyze the data, we then applied one-hot encoding, a technique that transforms categorical data into numerical data for machine learning algorithms. Each venue category became its own column, with a 1 marking the category of each venue, so that the frequency of each category in a neighborhood could be computed.

Figure 10: One hot encoding
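A minimal sketch of one-hot encoding with pandas is shown below. The venue rows and the column names (Neighborhood, Venue Category) are made up for illustration.

```python
import pandas as pd

# Illustrative venue table: one row per venue.
venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B", "B"],
        "Venue Category": ["Italian Restaurant", "Park", "Coffee Shop",
                           "Park", "Park"],
    }
)

# One column per category; each row has a 1 in the column of its own category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood name back as the first column.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```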

We then grouped the rows by neighborhood, taking the average frequency of occurrence of each venue category.

Figure 11: Grouped Neighborhood by the average of the frequency of each Venue

Next, we created a new data frame that stored only the neighborhood names and the mean frequency of Italian restaurants in each neighborhood. This summarized the data by neighborhood and made it much simpler to analyze.

Figure 12: New dataframe storing Neighborhoods and the average Italian Restaurant in that neighborhood
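The grouping and selection steps might be sketched as follows, starting from a small one-hot table. The values and column names are illustrative assumptions.

```python
import pandas as pd

# Illustrative one-hot table: neighborhood A has 4 venues, B has 2.
onehot = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "A", "B", "B"],
        "Coffee Shop": [0, 0, 1, 0, 0, 0],
        "Italian Restaurant": [1, 0, 0, 0, 0, 0],
        "Park": [0, 1, 0, 1, 1, 1],
    }
)

# The mean of the 0/1 columns is the share of venues in each category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()

# Keep only the neighborhood name and the Italian restaurant frequency.
italian = grouped[["Neighborhood", "Italian Restaurant"]]
print(italian)
```

In this example one of the four venues in neighborhood A is an Italian restaurant, so its mean frequency is 0.25, while neighborhood B has none.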

To make the analysis more interesting, we clustered the neighborhoods so that those with similar average frequencies of Italian restaurants fell into the same group. To do this we used K-Means clustering. To find an optimum K value that neither overfits nor underfits the model, we used the Elbow Point technique: we ran the clustering with a range of K values, measured the error (distortion) for each, and chose the K at which the curve makes its sharpest turn. In our case the elbow point was at K = 4, which means we have a total of 4 clusters.

Figure 13: Finding the K vs Error Values

Figure 14: Finding the right K using Elbow Point technique
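The elbow search can be sketched with scikit-learn as below. This is an illustration on synthetic data, not the project's dataset: the four group centers, the noise level and the range of K values are assumptions chosen so that an elbow at K = 4 is visible. The distortion here is the KMeans inertia_ attribute (the within-cluster sum of squared distances).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic mean frequencies drawn around four assumed group centers.
rng = np.random.default_rng(0)
centers = [0.01, 0.05, 0.10, 0.20]
freq = np.concatenate([c + rng.normal(0, 0.002, 20) for c in centers])
X = freq.reshape(-1, 1)  # KMeans expects a 2-D array

# Fit one model per K and record its distortion.
ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"K={k}: distortion={inertia:.5f}")
```

Plotting the distortion against K shows a sharp drop up to the elbow and only small improvements afterwards.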

We fitted a model for each K and calculated its distortion score. The dotted line shows that the elbow is at K = 4. In K-Means clustering, objects that are similar with respect to a given variable are put into the same cluster, so neighborhoods with a similar mean frequency of Italian restaurants were divided into 4 clusters. The clusters were labelled 0 to 3, as label indexing begins at 0 rather than 1.

Figure 15: Appropriate cluster labels were added
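Adding cluster labels with K = 4 might look like the following sketch. The neighborhood names and frequencies are made up, and the column name Cluster Labels is an assumption. The label numbers that K-Means assigns are arbitrary, so the same group can receive a different number on a different run or seed.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative mean frequencies of Italian restaurants per neighborhood.
italian = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D", "E", "F", "G", "H"],
        "Italian Restaurant": [0.0, 0.0, 0.05, 0.06, 0.10, 0.11, 0.20, 0.21],
    }
)

# Cluster on the single frequency column and store the labels (0 to 3).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
italian["Cluster Labels"] = kmeans.fit_predict(italian[["Italian Restaurant"]])
print(italian)
```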

Next, we merged the venue data with the table above, creating a new table that served as the basis for analyzing opportunities for a new Italian restaurant in Toronto. We then created a map using the Folium package in Python, with each neighborhood colored by its cluster label. For example, cluster 2 was purple and cluster 3 was blue.

Figure 16: Map with different clusters

The map above shows the different clusters that had similar mean frequency of Italian restaurants.

Analysis:

We have a total of 4 clusters (labels 0, 1, 2, 3; referred to below as clusters 1 to 4). Before we analyze them one by one, let's check the number of neighborhoods in each cluster and the average frequency of Italian restaurants in each. From the bar graph made using Matplotlib (Figure 17), we can compare the number of neighborhoods per cluster. Cluster 1 has the fewest neighborhoods (1), while cluster 2 has the most (70). Cluster 3 has 14 neighborhoods and cluster 4 has only 8. We then compared the average Italian restaurants per cluster (Figure 18).

Figure 17: Number of Neighborhoods per cluster

Figure 18: Average Italian Restaurant in each Neighborhood
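The two per-cluster summaries can be computed in one step with a pandas group-by, as in the sketch below. The table is a small made-up example and the column names are assumptions.

```python
import pandas as pd

# Illustrative clustered table (not the project's data).
clustered = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D", "E", "F"],
        "Italian Restaurant": [0.0, 0.0, 0.0, 0.5, 0.125, 0.375],
        "Cluster Labels": [0, 0, 0, 1, 2, 2],
    }
)

# Count the neighborhoods and average the frequency within each cluster.
summary = clustered.groupby("Cluster Labels").agg(
    neighborhoods=("Neighborhood", "count"),
    mean_italian=("Italian Restaurant", "mean"),
)
print(summary)
```

Each column of the summary can then be drawn as a bar chart, for example with summary["neighborhoods"].plot.bar() when Matplotlib is installed.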

Discussion:

Most of the Italian restaurants are in cluster 1, shown in red on the map. The neighborhoods in the North York area with the highest average of Italian restaurants are Bedford Park and Lawrence Manor East. Although cluster 2 contains a large number of neighborhoods, it has little to no Italian restaurants. The Downtown Toronto area (cluster 3) has the second-lowest average of Italian restaurants. Looking at the nearby venues, the optimum place for a new Italian restaurant is Downtown Toronto, as there are many neighborhoods in the area but few Italian restaurants, which limits competition. The second-best opportunities are in neighborhoods such as Adelaide and King, Fairview, etc., which are in cluster 2. Having 70 neighborhoods with no Italian restaurants gives a good opportunity for opening a new restaurant.

This analysis has some drawbacks. The clustering is based entirely on data obtained from the Foursquare API. The analysis also does not take into account the Italian population of each neighborhood, which can be a major factor in choosing where to open a new Italian restaurant.

This concludes the findings of this project, which recommend that the entrepreneur open an authentic Italian restaurant in these locations with little to no competition.