CalgaryCrime.Rmd

---
title: "Calgary Crime Analysis"
author: "Marc Boulet & Calgary R User Group"
date: "2017-10-17"
output: 
  html_document: 
    theme: spacelab
---

```{r setup, message=FALSE, error=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(knitr)
library(ggplot2)
```

## Calgary Crime Statistics - Summary

We will perform a data analysis on the Calgary Crime Statistics dataset, retrieved from the following website: http://www.calgary.ca/cps/Documents/statistical-reports/2017%20Community%20Crime%20Statistics.xls.


A few changes were made in Excel before loading into R:  
1. date column names were changed in Excel (collapsed two cells into one)  
2. blank lines were removed  

```{r dataload, error=FALSE, message=FALSE, cache=TRUE}
library(readr) # needed to load .csv files
CalgaryCrime <- read_csv("~/Dropbox/RStudio/CalgaryCrime/CalgaryCrime/2017 Community Crime Statistics by category.csv") # change URL to point to data
```

## Data conditioning

Before exploring the data, certain data conditioning steps were taken:  
1. Replace NA values with zeroes  
2. removed unused columns (Sept 2017 - December 2017).  

```{r NAreplacement, message=FALSE}
library(knitr) # used to display data using kable
library(dplyr) # used to wrangle data
CalgaryCrime[is.na(CalgaryCrime)] <- 0 # add zeroes to NA cells
CalgaryCrime <- select(CalgaryCrime, -SEP:-DEC) # remove unused columns
kable(head(CalgaryCrime)) # display resulting data
```


### Data wrangling using the Tidy Data methodology  
http://vita.had.co.nz/papers/tidy-data.html  
In a tidy data set:  
1. Each **variable** is in a column.  
2. Each **observation** is a row.  
3. Each **value** is a cell.  

*Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When using tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.*

```{r tidydata, message=FALSE}
library(tidyr)
library(knitr)
CalgaryCrimeTidy <- CalgaryCrime %>% gather(Date, Cases, 3:70) # move data columns into one column
CalgaryCrimeTidy$Date <- as.Date(CalgaryCrimeTidy$Date, format = "%m/%d/%Y") # convert Date column to date format
kable(head(CalgaryCrimeTidy)) # display resulting data

```

### Add Calgary census data

In order to perform more meaningful data analysis, the 2017 Civic Census Results, consisting of community populations from 2013-2017, were joined to the existing dataset. The raw data is located here: http://www.calgary.ca/CA/city-clerks/Documents/Election-and-information-services/Census2017/2017_Census_Results_Community_Tables.xlsx. 

```{r censusdata, error=FALSE, message=FALSE, cache=FALSE}
library(dplyr)
library(readr)
library(knitr)

CalgaryCensus <- read_csv("~/Dropbox/RStudio/CalgaryCrime/CalgaryCrime/2017_Census_Results_Community_Population.csv")
CalgaryCensus$AvgPop <- rowMeans(subset(CalgaryCensus, select = c(2:6))) # calc 5 yr pop average 
CalgaryData <- left_join(CalgaryCrimeTidy, CalgaryCensus, by = "Community") # add pop data to crime data
kable(head(CalgaryData)) # display resulting data
```

## Exploratory data analysis

We will generate a few plots to gain insight into the data.  

####Plot total crime stats by category  

```{r plottotalcategories, fig.width=12}
library(knitr)
library(ggplot2)
library(dplyr)
CatTotal <-   CalgaryData %>% 
                group_by(Category) %>% 
                summarise(TotalByCategory = sum(Cases)) %>%
                arrange(desc(TotalByCategory))

plot <- ggplot(CatTotal, aes(x=Category,y=TotalByCategory)) +
        geom_bar(stat="identity", fill="blue") +
        theme(axis.text.x = element_text(angle=45, hjust=1, size=10))
plot
```

#### Plot total crime stats by community 

```{r plottotalcases, fig.height=8, fig.width=12}
library(knitr)
library(ggplot2)
library(dplyr)
CasesTotal <-   CalgaryData %>% 
                group_by(Community) %>% 
                summarise(TotalByCommunity = sum(Cases)) %>%
                arrange(desc(TotalByCommunity))

plot <- ggplot(CasesTotal[1:25,], aes(x=reorder(Community, TotalByCommunity), y=TotalByCommunity)) +
        geom_bar(stat="identity", fill="red") +
        theme(axis.text.y = element_text(size=12)) +
        geom_text(aes(label=TotalByCommunity), hjust=1.2, size=5, colour="white") +
        coord_flip()
plot
```

#### Normalize total crime stats by population  

Next, we'll normalize the data according to average population in each community:

```{r normalize, fig.width=12, message=FALSE}

CasesTotal <- left_join(CasesTotal, CalgaryCensus, by = "Community")
CasesTotal$Per100 <- CasesTotal$TotalByCommunity / CasesTotal$AvgPop * 100
kable(head(CasesTotal, 15))
```

#### Outliers

However, there are a number of outliers that occur:  
1. data with NA values (missing population data)  
2. data with zero values (zero population data)    
3. data with exceptionally high normalized values (due to high crime rate & low average population, eg. industrial parks)  

```{r outliers}
Outliers <- CasesTotal %>% 
            filter(!complete.cases(Per100) | Per100 =="Inf" | Per100 > 500)
kable(head(Outliers, 10))
```

#### Plot total crime stats by population (removing outliers)

```{r plotnormalisedtotals, fig.height=12, fig.width=12, message=FALSE}
library(ggplot2)

CasesTotalClean <-  CasesTotal %>% 
                    filter(complete.cases(Per100) & Per100 != "Inf" & Per100 < 500 & AvgPop > 500) %>%
                    arrange(desc(Per100))

plot <- ggplot(CasesTotalClean[1:25,], aes(x=reorder(Community, Per100), y=Per100)) +
        geom_bar(stat="identity", fill = "darkblue") +
        theme(axis.text.y = element_text(size=12)) +
        geom_text(aes(label= round(Per100)), hjust=1.2, size=5, colour="white") +
        geom_text(aes(label=round(AvgPop)), hjust=-0.2) +
        coord_flip()
plot
kable(head(CasesTotalClean,25))
```

## Community investigation

Compare the total crime counts, by category, of one community to another community. The mean crime counts are also included. To compare other communities, just change the values for Comm1 and Comm2 (make sure the communities are in upper case).

```{r communitycompare, error=FALSE, fig.height=15, fig.width=12, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(lubridate)

CalgaryData$Date <- as.Date(CalgaryData$Date, format = "%m/%d/%Y")

Comm1 <- "BRIDGELAND/RIVERSIDE" # enter 1st community
Comm2 <- "BANFF TRAIL" # enter 2nd community

CommCat <-  CalgaryData %>% 
            filter(Community == Comm1) %>%
            group_by(Category, year(Date)) %>%
            summarise(sum(Cases))

CommCat2 <-  CalgaryData %>% 
            filter(Community == Comm2) %>%
            group_by(Category, year(Date)) %>%
            summarise(sum(Cases))

MeanCat <-  CalgaryData %>% 
            filter(complete.cases(AvgPop) & AvgPop != "Inf" & AvgPop > 500) %>%
            group_by(Category, year(Date)) %>%
            summarise(mean(Cases))
MeanCat$`mean(Cases)` <- MeanCat$`mean(Cases)` * 12 # convert data from monthly to yearly averages

TotalCat <- merge(CommCat ,CommCat2,by=c("Category","year(Date)")) # merge Community 1 and 2 together
TotalCat <- merge(TotalCat, MeanCat, by=c("Category","year(Date)")) # merge mean data

CommPlot <- ggplot(TotalCat, aes(x=`year(Date)`)) + 
            geom_line(aes(y=`mean(Cases)`, colour = "Average")) +
            geom_line(aes(y=`sum(Cases).x`, colour = Comm1)) +
            geom_line(aes(y=`sum(Cases).y`, colour = Comm2)) +
            geom_point(aes(y=`mean(Cases)`)) +
            geom_point(aes(y=`sum(Cases).x`)) +
            geom_point(aes(y=`sum(Cases).y`)) +
            facet_wrap(~ Category, ncol=2 ) +
            scale_y_log10() +
            scale_colour_manual(values=c("black", "red", "blue")) +
            theme_bw() + 
            theme(legend.position="top", legend.text=element_text(size=20),
                    strip.text = element_text(face="bold", size=rel(1.5)),
                    strip.background = element_rect(fill="lightblue", colour="black", size=1))
CommPlot
```


### Interactive map  

Load the Calgary communities shapefile and merge the crime stats to the shape file dataframe.

```{r shapefile, cache=TRUE, message=FALSE, warning=FALSE, error=FALSE}

# load shapefile data (thanks Chel!)
download.file(url="https://data.calgary.ca/api/geospatial/ab7m-fwn6?method=export&format=Shapefile", 
              destfile="~/community_boundary.zip", method="wget")
unzip(zipfil="~/community_boundary.zip", exdir="~/")
flist <- list.files("~/")
fname <- flist[grep(".shp", flist)]
cb <- maptools::readShapePoly(file.path("~/", fname))

# merge crime stats to shapefile dataframe (cb@data)
colnames(cb@data)[5] <- "Community" # rename column to match crime stats dataframe
cb.df <- left_join(cb@data, CasesTotal, by = "Community") # add pop data to crime data

```

This is an interactive map to show the total and normalized crime stats overlaid on all the communities. There is also a heat map indicating the areas of highest crime, normalized by population.


```{r generatemap, echo=TRUE, message=FALSE, warning=FALSE}
library(leaflet)

# make map labels
labels <-   sprintf("<strong>%s</strong><br/> Community structure: %s<br/>
                    Class: %s <br/> Population: %s <br/>
                    Total crime: %s, Per 100 people: %s",
                    cb.df$Community, cb@data$comm_struc, 
                    cb@data$class, cb.df$AvgPop, 
                    cb.df$TotalByCommunity, round(cb.df$Per100)) %>%
            lapply(htmltools::HTML)

# code to generate cloropleth (heat) colour palette
bins <- c(0, 50, 100, 200, 300, 400, 500)
pal <- colorBin("Reds", domain = round(cb.df$Per100), bins = bins)

# generate map using leaflet
leaflet(width="100%", height = 1000)  %>% 
    addTiles() %>% 
    setView(lng = -114.062019, lat=51.044270,zoom=11) %>% 
    addPolygons(data=cb,weight=2,col = 'blue', fillOpacity = 0) %>%
    addPolygons(data = cb, fillColor = ~pal(round(cb.df$Per100)),
        weight = 2,
        dashArray = "3",
        fillOpacity = 0.7,
        highlight = highlightOptions(
                    weight = 10,
                    color = "#666",
                    dashArray = "",
                    fillOpacity = 0,
                    bringToFront = TRUE),
        label = labels,
                labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto"))
```
 
## To do List
* Rewrite the *dataload* module to automate loading the Excel data from the Calgary Crime site.
* Recast the data to a more analysis-friendly format, according to the tidy data methodology (http://vita.had.co.nz/papers/tidy-data.html) **DONE**  
* Load Census data to normalize absolute crime rates: http://www.calgary.ca/CA/city-clerks/Pages/Election-and-information-services/Civic-Census/2017-Results.aspx **DONE**   
* Georeference data to a map of the city of Calgary by integrating the work of Chel Hee Lee, as shown at the September 27th Calgary R User Group (CRUG) meeting.  (http://people.ucalgary.ca/~chelhee.lee/pages/crug.html) 
    - dplyr join crime data to shapefile dataframe (cb@data) **DONE**
    - cloropleth for crime values (http://rstudio.github.io/leaflet/choropleths.html) **DONE** 
* Write Shiny app to allow interactive comparisons between communities, categories, years, etc.
* Investigate potential crime predictors (population size, aerial size, proximity to certain downtown, industrial parks, etc.)