Collin Cherubim

Fighting Crime with Art: Pittsburgh

This report was posted to fulfill the capstone requirement for the IBM Applied Data Science Specialization course offered by Coursera. The content is a bit of a departure from the typical topics I like to write about, but employs some pretty cool data science tools in machine learning and various Python libraries. Here's a link to the full report including all code: https://github.com/cjcollin37/Coursera_Capstone/blob/master/Battle_of_Neighborhoods_Project.ipynb

The purpose of this project is to propose an optimal neighborhood in Pittsburgh, Pennsylvania to create a community art piece with the intention of lowering crime. Data from Foursquare and Pittsburgh's Regional Data Center were leveraged to cluster Pittsburgh neighborhoods according to venue categories and crime rates. The optimal location for the art project was selected to maximize community accessibility in a relatively high-crime area.


The Iron City was once a smoggy, grey town built on the steel production industry. It is now home to a smattering of research universities and a thriving music and arts scene, and it has recently been considered the most livable city in the United States. Although Pittsburgh is considered one of the safest U.S. cities, with historically less crime than similarly sized cities, there is room for improvement.

Evidence suggests that murals and public art lower crime rates and foster pride in residents. The aim of this project is to identify a high-traffic area with a relatively high crime rate. Installing a public art project in such a location allows for frequent interaction with the piece and has the greatest potential to positively impact the community.

Pittsburgh mural by Nic Marlton and Logan Randolph

The target audience includes urban planners and local Pittsburgh visual artists. Many organizations such as Pulse and the Pittsburgh Cultural Trust are hard at work on projects with similar goals. Such groups can benefit from a data-driven approach to allocating their resources and planning positive transformation.


Foursquare data are employed to obtain venue categories such as restaurants, parks, libraries, etc. A k-means clustering algorithm is used to group neighborhoods according to venue category. Concurrently, crime data from the Regional Data Center are utilized for comparison with the clusters. Maps of the city are generated with the folium Python library, and the venue clusters are analyzed in conjunction with the crime distribution to identify the highest leverage location for the art project. Descriptive statistics are also presented regarding prevalence of violent and non-violent crime in each neighborhood and crime rates for each cluster.

The Nominatim tool in the geopy Python package is utilized to retrieve location data for each neighborhood. First, an official list of neighborhoods from visitpittsburgh.com was used to manually create a spreadsheet for geocoding. The Foursquare API is then called to cluster neighborhoods by venue category.

The crime data are imported from the Western Pennsylvania Regional Data Center, managed by the University of Pittsburgh in partnership with Allegheny County and the City of Pittsburgh. The selected dataset only contains information reported by City of Pittsburgh Police in the year 2010. It does not contain information about incidents that solely involve other police departments operating within the city (for example, campus police or Port Authority police). These data are read from the spreadsheet available on the Regional Data Center's website into a pandas dataframe for descriptive statistical analysis and subsequent geocoding. They are visualized using a geojson file containing boundaries for Pittsburgh neighborhoods, borrowed from David Blackman on GitHub.


This section contains the bulk of the code output and statistical analysis. See the link at the top of the page for all original coding. The data shown here are interpreted and discussed in the 'Results & Discussion' section.

A csv file was created in Excel using an official list of 90 Pittsburgh neighborhoods from visitpittsburgh.com. It was then read into a dataframe. The dataframe below displays the first 5 neighborhoods in the dataset. The next step was to geocode the neighborhoods (i.e. retrieve their geographical coordinates).

The neighborhoods were geocoded using Nominatim. The result is shown below.
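The geocoding step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's code: the function name `geocode_neighborhoods`, the query suffix, the `fake_lookup` stand-in, and the sample coordinates are all assumptions. The lookup is passed in as a callable so the sketch runs offline; with geopy, `Nominatim(user_agent=...).geocode` could fill that role, since it returns an object with `latitude` and `longitude` attributes, or `None` when nothing matches.

```python
from collections import namedtuple

# Stand-in for the location objects a geocoder returns (hypothetical helper).
Location = namedtuple("Location", ["latitude", "longitude"])


def geocode_neighborhoods(names, lookup, suffix=", Pittsburgh, PA"):
    """Return {name: (lat, lon)}; names with no match map to None."""
    coords = {}
    for name in names:
        result = lookup(name + suffix)
        coords[name] = None if result is None else (result.latitude, result.longitude)
    return coords


def fake_lookup(query):
    """Offline stand-in for a real geocoder; coordinates are illustrative only."""
    table = {
        "Bloomfield, Pittsburgh, PA": Location(40.46, -79.95),
        "Garfield, Pittsburgh, PA": Location(40.47, -79.94),
    }
    return table.get(query)


# "Nowhere Heights" is a made-up name used to exercise the no-match path.
coords = geocode_neighborhoods(["Bloomfield", "Garfield", "Nowhere Heights"], fake_lookup)
missing = [name for name, xy in coords.items() if xy is None]
print(missing)

# With geopy installed and network access, the real lookup would be:
#   from geopy.geocoders import Nominatim
#   lookup = Nominatim(user_agent="pgh-art-project").geocode
```

Collecting the unmatched names makes it easy to spot neighborhoods that need a manual fix before plotting.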

The Python tool, folium, was then used to plot each neighborhood in Pittsburgh as a blue dot on the map below.
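A map like this can be built along the following lines. The helper names, zoom level, marker styling, and coordinates are assumptions for illustration; only the use of folium and blue markers comes from the text. The folium import sits inside the function so the centering helper can be run without folium installed.

```python
def map_center(points):
    """Average latitude and longitude of (name, lat, lon) tuples."""
    lats = [lat for _, lat, _ in points]
    lons = [lon for _, _, lon in points]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]


def build_neighborhood_map(points, zoom_start=12):
    """Return a folium.Map with one blue circle marker per neighborhood."""
    import folium  # imported here so map_center works without folium installed

    fmap = folium.Map(location=map_center(points), zoom_start=zoom_start)
    for name, lat, lon in points:
        folium.CircleMarker(
            location=[lat, lon],
            radius=5,
            popup=name,
            color="blue",
            fill=True,
            fill_color="blue",
            fill_opacity=0.7,
        ).add_to(fmap)
    return fmap


# Illustrative coordinates only, not the notebook's geocoded values.
points = [("Bloomfield", 40.46, -79.95), ("Garfield", 40.47, -79.94)]
center = map_center(points)

# With folium installed, the map can be written to an HTML file:
#   build_neighborhood_map(points).save("pgh_neighborhoods.html")
```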

After the neighborhoods were plotted, Foursquare data were retrieved. Only venue category information was analyzed to cluster the neighborhoods. The clusters represent groups of neighborhoods in Pittsburgh with similar types of venues (e.g. bars, restaurants, parks, etc). The data were entered into another dataframe. In total, 1067 venues were retrieved within the city, and the code below shows that they fall into 218 unique categories.

print('There are {} unique categories.'.format(len(pgh_venues['Venue Category'].unique())))
There are 218 unique categories.

One hot encoding was then used to determine the frequency of each venue category. Then the dataframe was grouped by neighborhood to determine the most common types of venues within each neighborhood. The top 5 most common venue categories for the first three neighborhoods are displayed below as a sample.
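The encoding and grouping steps can be illustrated with a small dependency-free sketch. The venue pairs and the names `one_hot`, `frequency`, and `top_categories` are hypothetical; with pandas, the same result is typically obtained with `pd.get_dummies` followed by a `groupby(...).mean()`. The key point is that averaging one-hot rows within a neighborhood yields each category's share of that neighborhood's venues.

```python
from collections import defaultdict

# Hypothetical (neighborhood, venue category) pairs standing in for the venue table.
venues = [
    ("A", "Bar"), ("A", "Bar"), ("A", "Park"), ("A", "Cafe"),
    ("B", "Park"), ("B", "Park"), ("B", "Library"),
]

categories = sorted({cat for _, cat in venues})


def one_hot(category):
    """One row per venue: 1 in the venue's category column, 0 elsewhere."""
    return [1 if category == c else 0 for c in categories]


# Group the one-hot rows by neighborhood, then average each column.
rows_by_hood = defaultdict(list)
for hood, cat in venues:
    rows_by_hood[hood].append(one_hot(cat))

frequency = {
    hood: {c: sum(col) / len(rows) for c, col in zip(categories, zip(*rows))}
    for hood, rows in rows_by_hood.items()
}


def top_categories(hood, n=5):
    """Categories ranked by share, most common first (ties broken alphabetically)."""
    ranked = sorted(frequency[hood].items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, share in ranked[:n] if share > 0]


print(top_categories("A", n=2))
```

The per-neighborhood share vectors in `frequency` are the kind of feature rows a clustering step would consume.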

A final dataframe was created that displays the top 10 venue categories for each neighborhood and is shown below. These are the data used to cluster the neighborhoods.

K-means clustering

A k-means clustering algorithm was used to group the neighborhoods into 3 clusters. This was found to be the most sensible number of clusters for identifying similarities within clusters and noticeable differences between clusters.
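For readers unfamiliar with k-means, the sketch below shows the underlying assign-and-update loop (Lloyd's algorithm) on toy data. It is not the notebook's implementation, which most likely relies on a library routine; the feature vectors are hypothetical venue-category shares, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical venue shares per neighborhood: (bar, park, restaurant).
points = [
    (0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.1, 0.1, 0.8),
    (0.7, 0.2, 0.1), (0.2, 0.7, 0.1), (0.1, 0.2, 0.7),
]
labels, centroids = kmeans(points, k=3)
print(labels)
```

In practice, results depend on the seeding, so library implementations usually run several random initializations and keep the best one.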

After clustering the neighborhoods, four neighborhoods (Beltzhoover, Lincoln-Lemington-Belmar, New Homestead, and St. Clair) were found to have missing venue and/or coordinate data. They were excluded from the study.

The clusters were finally visualized by plotting them on a map generated with the folium library. The result is shown below. Each cluster corresponds to a unique color - green, purple, and red.

Pittsburgh Crime Data

The following data were obtained from the Western Pennsylvania Regional Data Center, managed by the University of Pittsburgh in partnership with Allegheny County and the City of Pittsburgh. It describes crime in Pittsburgh for the year of 2010. To simplify the analysis, only violent and property crimes were included. Aggravated assault, forcible rape, murder, and robbery are classified as violent while arson, burglary, larceny-theft, and motor vehicle theft are classified as property crimes. These crimes are collectively known as Index crimes; this name is used because the crimes are considered quite serious, tend to be reported more reliably than others, and are reported directly to the police and not to a separate agency. While this is likely a representative dataset to analyze crime, please note that it is not the full picture. It is limited to only the crimes described above, only reported crimes, and only in the year 2010.

The 'Crime Rate' column indicates the number of reported crimes per 100 residents. Each of the other columns indicates the total number of reported crimes in that category. A sample of the crime dataframe is shown below, displaying the first 5 neighborhoods.
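As a quick worked example of the 'Crime Rate' definition, the sketch below computes reported crimes per 100 residents. The function name and all counts are hypothetical and are not taken from the 2010 dataset. The second call shows how a neighborhood with very few residents can exceed 100 on this measure.

```python
def crime_rate_per_100(total_crimes, population):
    """Reported crimes per 100 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100 * total_crimes / population


# Hypothetical counts, not taken from the 2010 dataset.
typical = crime_rate_per_100(total_crimes=150, population=3000)
sparse = crime_rate_per_100(total_crimes=30, population=20)
print(typical, sparse)
```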

The data below give descriptive statistics showing the mean, standard deviation, minimum, quartiles, and maximum values for all neighborhoods in Pittsburgh.

Note: The Southshore and Chateau neighborhoods had outlier crime rates of 145.5 and 194.7, respectively. Both were capped at 25, the highest rate among the remaining neighborhoods (Strip District), so that the heat map displayed below remains readable.
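The capping described in this note amounts to clipping each rate at a ceiling. In the sketch below, the two outlier values and the cap of 25 come from the text, while the Bloomfield value and the variable names are hypothetical placeholders.

```python
CAP = 25.0  # highest non-outlier rate reported in the text (Strip District)

# Outlier rates are from the text; the Bloomfield value is a hypothetical placeholder.
rates = {"Southshore": 145.5, "Chateau": 194.7, "Strip District": 25.0, "Bloomfield": 6.2}

# Clip for plotting only; the raw dictionary is left untouched.
capped = {hood: min(rate, CAP) for hood, rate in rates.items()}
changed = sorted(hood for hood in rates if rates[hood] != capped[hood])
print(changed)
```

Keeping the raw values in a separate dictionary means the cap affects only the map's color scale, not any statistics computed later.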

In order to visualize the distribution of crime within the city, a choropleth map, or 'heat map,' was created to show the crimes committed per 100 residents in each neighborhood ('Crime Rate' column of the dataframe).

The heat map displaying crime in the city was then overlaid with the clusters below.

Finally, some further descriptive statistics were performed to analyze the crime rate for each cluster individually using boxplots.
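The numbers behind a per-cluster boxplot (minimum, quartiles, maximum) can be computed as in the sketch below. The cluster labels and rates are hypothetical, and the standard-library `statistics.quantiles` function stands in for whatever plotting or dataframe routine the notebook uses.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (cluster label, crime rate per 100 residents) pairs.
records = [
    (0, 4.0), (0, 6.0), (0, 8.0), (0, 10.0), (0, 22.0),
    (1, 3.0), (1, 5.0), (1, 7.0), (1, 9.0),
]

rates_by_cluster = defaultdict(list)
for cluster, rate in records:
    rates_by_cluster[cluster].append(rate)

summary = {}
for cluster, rates in sorted(rates_by_cluster.items()):
    # The "inclusive" method interpolates linearly between the observed values.
    q1, q2, q3 = quantiles(rates, n=4, method="inclusive")
    summary[cluster] = {
        "min": min(rates), "q1": q1, "median": q2, "q3": q3, "max": max(rates),
    }

for cluster, stats in summary.items():
    print(cluster, stats)
```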

Results & Discussion

In order to identify the highest leverage neighborhood for the art project, the clusters must be analyzed. By observing the most common venue categories for each, we can better understand the characteristics, personality, and ultimately the amount of foot and road traffic for each. The ideal location has the greatest amount of traffic and venues that encourage residents to move about the neighborhood.

The first 5 neighborhoods of Cluster 0 are displayed below. Cluster 0 is by far the largest cluster. It is not surprising that, as a result, it appears to have the most varied venue categories, from tea rooms to breweries, paintball fields to scenic lookouts, and Greek restaurants to sports bars. Compare this to the other two clusters, each of which shows far more similarity among its most common venue categories. That said, the most common venues in Cluster 0 are generally food and drink related - an assortment of restaurant styles, cafes, and bars.

The first six neighborhoods of Cluster 1 are displayed below. Cluster 1 is characterized by neighborhoods with the highest representation of bars among their venues. Overall, it seems to host venues mostly related to food and nightlife - bars, restaurants, markets - with some entertainment and health venues like rec centers, music venues, and gyms. Neighborhoods in this cluster appear to be ideal candidates for the art project with the potential for high community engagement with the piece.

All Cluster 2 neighborhoods are displayed below. The four neighborhoods share a striking similarity of hosting baseball fields, yoga studios, Ethiopian restaurants, flower shops, and flea and fish markets. With so many baseball fields, it may be that these neighborhoods have more open, green spaces on average compared to the other clusters. This suggests a suboptimal location for community engagement with the art piece.

Regarding the crime data, the results indicate that the most crime is committed closest to the center of the city - Downtown (Central Business District), Southshore, Northshore, Chateau, and the Strip District. These are all Cluster 0 neighborhoods. The boxplot also demonstrates that the most crime per capita is committed in Cluster 0. Note that this measure is the number of crimes committed per 100 residents, and therefore does not reflect a bias due to Cluster 0 being the largest.

Upon closer inspection of the types of venues in the neighborhoods with the most crime, it is evident that these neighborhoods do not contain as many permanent residents as most other neighborhoods. Rather they are characterized by a more transient population and tourism with venues like hotels, stadiums, baseball fields, etc.

Furthermore, the types of crimes committed in these areas (shown below) suggest that they may not be committed by the residents themselves. For example, robberies and burglaries make up the majority of the crimes in the downtown area. Downtown is an expensive area to live and does not host many permanent residents. Those committing robberies and burglaries are likely not living downtown.

Community art projects in such areas may be low impact. It is suggested that community art is linked with reducing crime in some cities by instilling in residents a sense of pride in their neighborhood. It is crucial that local residents themselves are involved in the project to make their home beautiful. Therefore, public art is likely a low-leverage solution for reducing crime in neighborhoods with high tourism. There are several better Cluster 0 candidates in the surrounding areas, such as Bloomfield, East Allegheny, Larimer, Garfield, and many more.

Two Cluster 1 neighborhoods - Homewood North and Homewood South - may be the highest-leverage neighborhoods in which to invest in a community art project. Both have relatively high crime rates, as seen below. There is a variety of crimes, and both neighborhoods are far from the city center. It therefore seems likely that a higher percentage of all crimes are committed by residents, unlike in the neighborhoods closer to the city center discussed above. A community art project may therefore serve to reduce crime committed by locals.

Furthermore, it seems that both neighborhoods may already have a relationship with the arts, as evidenced by the top three most common venues for each, shown below (bars, concert halls, music stores, and music venues). They may be more receptive to a community art project. Also, the high occurrence of bars in each neighborhood may lead to more frequent interaction and contact with the piece if there is more foot and automobile traffic. The venue data and crime data satisfy both of the major criteria laid out in the introduction.


This proposal set out to identify the optimal neighborhood in Pittsburgh, PA for a community art project intended to reduce crime. The key criteria are a high crime rate per capita and the potential for frequent interaction with the public art piece. A k-means clustering algorithm was employed in conjunction with Foursquare data in order to group all the neighborhoods into 3 clusters based on venue categories. Crime data from 2010 were used to generate a heat map of crime per capita in Pittsburgh. The two maps were analyzed together, and it was found that the highest occurrence of crime is in Cluster 0 - the largest cluster, for which there is too much variability in venue category to discern any sensible characteristics. While the highest crime rates were found to occur close to the city center, these Cluster 0 neighborhoods were deemed poor candidates for the project given the transient nature of their population.

Two neighborhoods - Homewood North and Homewood South - were identified as the optimal candidates. These neighborhoods belong to Cluster 1, characterized primarily by bars, restaurants, and entertainment. They are also home to many music venues and music stores, suggesting a potential receptiveness to a public art project. Homewood South is the stronger candidate of the two, given its higher crime rate of 10.4 reported crimes per 100 residents.

It is important to note that this proposal has many limitations. First, four neighborhoods were excluded from the analysis due to lack of access to reliable data: St. Clair, Beltzhoover, Lincoln-Lemington-Belmar, and New Homestead. Second, the crime heat map only reflects reported crimes in 2010. Third, as mentioned before, the Southshore and Chateau neighborhoods had anomalously high reported crime rates, which were capped at the maximum of 25 crimes per 100 residents for ease of visualization. Finally, many assumptions were made regarding the behavior of residents and criminal activity, as well as the number of permanent residents in various neighborhoods. These assumptions are not based on research. Rather, they are interpretations of the data informed by the experience of a former six-year Pittsburgh resident.