-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathCalgaryCrime.Rmd
282 lines (221 loc) · 11.4 KB
/
CalgaryCrime.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Calgary Crime Analysis"
author: "Marc Boulet & Calgary R User Group"
date: "2017-10-17"
output:
html_document:
theme: spacelab
---
```{r setup, message=FALSE, error=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(knitr)
library(ggplot2)
```
## Calgary Crime Statistics - Summary
We will perform a data analysis on the Calgary Crime Statistics dataset, retrieved from the following website: http://www.calgary.ca/cps/Documents/statistical-reports/2017%20Community%20Crime%20Statistics.xls.
A few changes were made in Excel before loading into R:
1. date column names were changed in Excel (collapsed two cells into one)
2. blank lines were removed
```{r dataload, error=FALSE, message=FALSE, cache=TRUE}
library(readr) # needed to load .csv files
CalgaryCrime <- read_csv("~/Dropbox/RStudio/CalgaryCrime/CalgaryCrime/2017 Community Crime Statistics by category.csv") # change URL to point to data
```
## Data conditioning
Before exploring the data, certain data conditioning steps were taken:
1. Replace NA values with zeroes
2. removed unused columns (Sept 2017 - December 2017).
```{r NAreplacement, message=FALSE}
library(knitr) # used to display data using kable
library(dplyr) # used to wrangle data
CalgaryCrime[is.na(CalgaryCrime)] <- 0 # add zeroes to NA cells
CalgaryCrime <- select(CalgaryCrime, -SEP:-DEC) # remove unused columns
kable(head(CalgaryCrime)) # display resulting data
```
### Data wrangling using the Tidy Data methodology
http://vita.had.co.nz/papers/tidy-data.html
In a tidy data set:
1. Each **variable** is in a column.
2. Each **observation** is a row.
3. Each **value** is a cell.
*Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When using tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.*
```{r tidydata, message=FALSE}
library(tidyr)
library(knitr)
CalgaryCrimeTidy <- CalgaryCrime %>% gather(Date, Cases, 3:70) # move data columns into one column
CalgaryCrimeTidy$Date <- as.Date(CalgaryCrimeTidy$Date, format = "%m/%d/%Y") # convert Date column to date format
kable(head(CalgaryCrimeTidy)) # display resulting data
```
### Add Calgary census data
In order to perform more meaningful data analysis, the 2017 Civic Census Results, consisting of community populations from 2013-2017, were joined to the existing dataset. The raw data is located here: http://www.calgary.ca/CA/city-clerks/Documents/Election-and-information-services/Census2017/2017_Census_Results_Community_Tables.xlsx.
```{r censusdata, error=FALSE, message=FALSE, cache=FALSE}
library(dplyr)
library(readr)
library(knitr)
CalgaryCensus <- read_csv("~/Dropbox/RStudio/CalgaryCrime/CalgaryCrime/2017_Census_Results_Community_Population.csv")
CalgaryCensus$AvgPop <- rowMeans(subset(CalgaryCensus, select = c(2:6))) # calc 5 yr pop average
CalgaryData <- left_join(CalgaryCrimeTidy, CalgaryCensus, by = "Community") # add pop data to crime data
kable(head(CalgaryData)) # display resulting data
```
## Exploratory data analysis
We will generate a few plots to gain insight into the data.
####Plot total crime stats by category
```{r plottotalcategories, fig.width=12}
library(knitr)
library(ggplot2)
library(dplyr)
CatTotal <- CalgaryData %>%
group_by(Category) %>%
summarise(TotalByCategory = sum(Cases)) %>%
arrange(desc(TotalByCategory))
plot <- ggplot(CatTotal, aes(x=Category,y=TotalByCategory)) +
geom_bar(stat="identity", fill="blue") +
theme(axis.text.x = element_text(angle=45, hjust=1, size=10))
plot
```
#### Plot total crime stats by community
```{r plottotalcases, fig.height=8, fig.width=12}
library(knitr)
library(ggplot2)
library(dplyr)
CasesTotal <- CalgaryData %>%
group_by(Community) %>%
summarise(TotalByCommunity = sum(Cases)) %>%
arrange(desc(TotalByCommunity))
plot <- ggplot(CasesTotal[1:25,], aes(x=reorder(Community, TotalByCommunity), y=TotalByCommunity)) +
geom_bar(stat="identity", fill="red") +
theme(axis.text.y = element_text(size=12)) +
geom_text(aes(label=TotalByCommunity), hjust=1.2, size=5, colour="white") +
coord_flip()
plot
```
#### Normalize total crime stats by population
Next, we'll normalize the data according to average population in each community:
```{r normalize, fig.width=12, message=FALSE}
CasesTotal <- left_join(CasesTotal, CalgaryCensus, by = "Community")
CasesTotal$Per100 <- CasesTotal$TotalByCommunity / CasesTotal$AvgPop * 100
kable(head(CasesTotal, 15))
```
#### Outliers
However, there are a number of outliers that occur:
1. data with NA values (missing population data)
2. data with zero values (zero population data)
3. data with exceptionally high normalized values (due to high crime rate & low average population, eg. industrial parks)
```{r outliers}
Outliers <- CasesTotal %>%
filter(!complete.cases(Per100) | Per100 =="Inf" | Per100 > 500)
kable(head(Outliers, 10))
```
#### Plot total crime stats by population (removing outliers)
```{r plotnormalisedtotals, fig.height=12, fig.width=12, message=FALSE}
library(ggplot2)
CasesTotalClean <- CasesTotal %>%
filter(complete.cases(Per100) & Per100 != "Inf" & Per100 < 500 & AvgPop > 500) %>%
arrange(desc(Per100))
plot <- ggplot(CasesTotalClean[1:25,], aes(x=reorder(Community, Per100), y=Per100)) +
geom_bar(stat="identity", fill = "darkblue") +
theme(axis.text.y = element_text(size=12)) +
geom_text(aes(label= round(Per100)), hjust=1.2, size=5, colour="white") +
geom_text(aes(label=round(AvgPop)), hjust=-0.2) +
coord_flip()
plot
kable(head(CasesTotalClean,25))
```
## Community investigation
Compare the total crime counts, by category, of one community to another community. The mean crime counts are also included. To compare other communities, just change the values for Comm1 and Comm2 (make sure the communities are in upper case).
```{r communitycompare, error=FALSE, fig.height=15, fig.width=12, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(lubridate)
CalgaryData$Date <- as.Date(CalgaryData$Date, format = "%m/%d/%Y")
Comm1 <- "BRIDGELAND/RIVERSIDE" # enter 1st community
Comm2 <- "BANFF TRAIL" # enter 2nd community
CommCat <- CalgaryData %>%
filter(Community == Comm1) %>%
group_by(Category, year(Date)) %>%
summarise(sum(Cases))
CommCat2 <- CalgaryData %>%
filter(Community == Comm2) %>%
group_by(Category, year(Date)) %>%
summarise(sum(Cases))
MeanCat <- CalgaryData %>%
filter(complete.cases(AvgPop) & AvgPop != "Inf" & AvgPop > 500) %>%
group_by(Category, year(Date)) %>%
summarise(mean(Cases))
MeanCat$`mean(Cases)` <- MeanCat$`mean(Cases)` * 12 # convert data from monthly to yearly averages
TotalCat <- merge(CommCat ,CommCat2,by=c("Category","year(Date)")) # merge Community 1 and 2 together
TotalCat <- merge(TotalCat, MeanCat, by=c("Category","year(Date)")) # merge mean data
CommPlot <- ggplot(TotalCat, aes(x=`year(Date)`)) +
geom_line(aes(y=`mean(Cases)`, colour = "Average")) +
geom_line(aes(y=`sum(Cases).x`, colour = Comm1)) +
geom_line(aes(y=`sum(Cases).y`, colour = Comm2)) +
geom_point(aes(y=`mean(Cases)`)) +
geom_point(aes(y=`sum(Cases).x`)) +
geom_point(aes(y=`sum(Cases).y`)) +
facet_wrap(~ Category, ncol=2 ) +
scale_y_log10() +
scale_colour_manual(values=c("black", "red", "blue")) +
theme_bw() +
theme(legend.position="top", legend.text=element_text(size=20),
strip.text = element_text(face="bold", size=rel(1.5)),
strip.background = element_rect(fill="lightblue", colour="black", size=1))
CommPlot
```
### Interactive map
Load the Calgary communities shapefile and merge the crime stats to the shape file dataframe.
```{r shapefile, cache=TRUE, message=FALSE, warning=FALSE, error=FALSE}
# load shapefile data (thanks Chel!)
download.file(url="https://data.calgary.ca/api/geospatial/ab7m-fwn6?method=export&format=Shapefile",
destfile="~/community_boundary.zip", method="wget")
unzip(zipfil="~/community_boundary.zip", exdir="~/")
flist <- list.files("~/")
fname <- flist[grep(".shp", flist)]
cb <- maptools::readShapePoly(file.path("~/", fname))
# merge crime stats to shapefile dataframe (cb@data)
colnames(cb@data)[5] <- "Community" # rename column to match crime stats dataframe
cb.df <- left_join(cb@data, CasesTotal, by = "Community") # add pop data to crime data
```
This is an interactive map to show the total and normalized crime stats overlaid on all the communities. There is also a heat map indicating the areas of highest crime, normalized by population.
```{r generatemap, echo=TRUE, message=FALSE, warning=FALSE}
library(leaflet)
# make map labels
labels <- sprintf("<strong>%s</strong><br/> Community structure: %s<br/>
Class: %s <br/> Population: %s <br/>
Total crime: %s, Per 100 people: %s",
cb.df$Community, cb@data$comm_struc,
cb@data$class, cb.df$AvgPop,
cb.df$TotalByCommunity, round(cb.df$Per100)) %>%
lapply(htmltools::HTML)
# code to generate cloropleth (heat) colour palette
bins <- c(0, 50, 100, 200, 300, 400, 500)
pal <- colorBin("Reds", domain = round(cb.df$Per100), bins = bins)
# generate map using leaflet
leaflet(width="100%", height = 1000) %>%
addTiles() %>%
setView(lng = -114.062019, lat=51.044270,zoom=11) %>%
addPolygons(data=cb,weight=2,col = 'blue', fillOpacity = 0) %>%
addPolygons(data = cb, fillColor = ~pal(round(cb.df$Per100)),
weight = 2,
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 10,
color = "#666",
dashArray = "",
fillOpacity = 0,
bringToFront = TRUE),
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto"))
```
## To do List
* Rewrite the *dataload* module to automate loading the Excel data from the Calgary Crime site.
* Recast the data to a more analysis-friendly format, according to the tidy data methodology (http://vita.had.co.nz/papers/tidy-data.html) **DONE**
* Load Census data to normalize absolute crime rates: http://www.calgary.ca/CA/city-clerks/Pages/Election-and-information-services/Civic-Census/2017-Results.aspx **DONE**
* Georeference data to a map of the city of Calgary by integrating the work of Chel Hee Lee, as shown at the September 27th Calgary R User Group (CRUG) meeting. (http://people.ucalgary.ca/~chelhee.lee/pages/crug.html)
- dplyr join crime data to shapefile dataframe (cb@data) **DONE**
- cloropleth for crime values (http://rstudio.github.io/leaflet/choropleths.html) **DONE**
* Write Shiny app to allow interactive comparisons between communities, categories, years, etc.
* Investigate potential crime predictors (population size, aerial size, proximity to certain downtown, industrial parks, etc.)