forked from statsthinking21/statsthinking21-core
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-SummarizingData.Rmd
480 lines (351 loc) · 27.9 KB
/
03-SummarizingData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
---
output:
pdf_document: default
bookdown::gitbook:
lib_dir: "book_assets"
includes:
in_header: google_analytics.html
html_document: default
---
# Summarizing data
I mentioned in the Introduction that one of the big discoveries of statistics is the idea that we can better understand the world by throwing away information, and that's exactly what we are doing when we summarize a dataset.
In this Chapter we will discuss why and how to summarize data.
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(cowplot)
library(knitr)
options(digits = 2)
```
## Why summarize data?
When we summarize data, we are necessarily throwing away information, and one might plausibly object to this. As an example, let's go back to the PURE study that we discussed in Chapter 1. Are we not supposed to believe that all of the details about each individual matter, beyond those that are summarized in the dataset? What about the specific details of how the data were collected, such as the time of day or the mood of the participant? All of these details are lost when we summarize the data.
One reason that we summarize data is that it provides us with a way to *generalize* - that is, to make general statements that extend beyond specific observations. The importance of generalization was highlighted by the writer Jorge Luis Borges in his short story "Funes the Memorious", which describes an individual who loses the ability to forget. Borges focuses in on the relation between generalization (i.e. throwing away data) and thinking: "To think is to forget a difference, to generalize, to abstract. In the overly replete world of Funes, there were nothing but details."
Psychologists have long studied all of the ways in which generalization is central to thinking. One example is categorization: We are able to easily recognize different examples of the category of "birds" even though the individual examples may be very different in their surface features (such as an ostrich, a robin, and a chicken). Importantly, generalization lets us make predictions about these individuals -- in the case of birds, we can predict that they can fly and eat seeds, and that they probably can't drive a car or speak English. These predictions won't always be right, but they are often good enough to be useful in the world.
## Summarizing data using tables
A simple way to summarize data is to generate a table representing counts of various types of observations. This type of table has been used for thousands of years (see Figure \@ref(fig:salesContract)).
```{r salesContract,echo=FALSE,fig.cap="A Sumerian tablet from the Louvre, showing a sales contract for a house and field. Public domain, via Wikimedia Commons.",fig.width=4,fig.height=4,out.height='30%'}
knitr::include_graphics("images/Sales_contract_Shuruppak_Louvre_AO3760.jpg")
```
```{r LoadNHANES, echo=FALSE}
# load the NHANES data library
library(NHANES)
# drop duplicated IDs within the NHANES dataset
NHANES <-
NHANES %>%
distinct(ID, .keep_all = TRUE)
# open the help page for the dataset
# help(NHANES)
```
Let's look at some examples of the use of tables, using a more realistic dataset. Throughout this book we will use the [National Health and Nutrition Examination Survey (NHANES)](https://www.cdc.gov/nchs/nhanes/index.htm) dataset. This is an ongoing study that assesses the health and nutrition status of a sample of individuals from the United States on many different variables. We will use a version of the dataset that is available for the R statistical software package. For this example, we will look at a simple variable, called *PhysActive* in the dataset. This variable contains one of three different values: "Yes" or "No" (indicating whether or not the person reports doing "moderate or vigorous-intensity sports, fitness or recreational activities"), or "NA" if the data are missing for that individual. There are different reasons that the data might be missing; for example, this question was not asked of children younger than 12 years of age, while in other cases an adult may have declined to answer the question during the interview, or the interviewer's recording of the answer on their form might be unreadable.
### Frequency distributions {#frequency-distributions}
A *distribution* describes how data are divided between different possible values. For this example, let's look at how many people fall into each of the physical activity categories.
```{r MakePhysActiveTable, echo=FALSE, warning=FALSE}
# summarize physical activity data
PhysActive_table <- NHANES %>%
dplyr::select(PhysActive) %>% # select the variable
group_by(PhysActive) %>% # group by values of the variable
summarize(AbsoluteFrequency = n()) # count the values
```
```{r PhysActiveTable, echo=FALSE}
kable(PhysActive_table, digits=3, caption='Frequency distribution for PhysActive variable')
```
Table \@ref(tab:PhysActiveTable) shows the frequencies of each of the different values; there were `r I(PhysActive_table %>% subset(PhysActive=='No') %>% dplyr::select(AbsoluteFrequency))` individuals who responded "No" to the question, `r I(PhysActive_table %>% subset(PhysActive=='Yes') %>% dplyr::select(AbsoluteFrequency))` who responded "Yes", and `r I(PhysActive_table %>% subset(is.na(PhysActive)) %>% dplyr::select(AbsoluteFrequency))` for whom no response was given. We call this a *frequency distribution* because it tells us how frequent each of the possible values is within our sample.
This shows us the absolute frequency of the two responses, for everyone who actually gave a response. We can see from this that there are more people saying "Yes" than "No", but it can be hard to tell from absolute numbers how big the difference is in relative terms. For this reason, we often would rather present the data using *relative frequency*, which is obtained by dividing each frequency by the sum of all frequencies:
$$
relative\ frequency_i = \frac{absolute\ frequency_i}{\sum_{j=1}^N absolute\ frequency_j}
$$
The relative frequency provides a much easier way to see how big the imbalance is. We can also interpret the relative frequencies as percentages by multiplying them by 100. In this example, we will drop the NA values as well, since we would like to be able to interpret the relative frequencies of active versus inactive people. However, for this to make sense we have to assume that the NA values are missing "at random", meaning that their presence or absence is not related to the true value of the variable for that person. For example, if inactive participants were more likely to refuse to answer the question than active participants, then that would *bias* our estimate of the frequency of physical activity, meaning that our estimate would be different from the true value.
```{r echo=FALSE}
# compute percentages for physical activity categories
PhysActive_table_filtered <- NHANES %>%
drop_na(PhysActive) %>%
dplyr::select(PhysActive) %>%
group_by(PhysActive) %>%
summarize(AbsoluteFrequency = n()) %>%
mutate(
RelativeFrequency = AbsoluteFrequency / sum(AbsoluteFrequency),
Percentage = RelativeFrequency * 100
)
```
```{r PhysActiveTableFiltered, echo=FALSE}
kable(PhysActive_table_filtered, caption='Absolute and relative frequencies and percentages for PhysActive variable')
```
Table \@ref(tab:PhysActiveTableFiltered) lets us see that `r formatC(I(PhysActive_table_filtered %>% subset(PhysActive=='No') %>% dplyr::select(Percentage) %>% pull()), digits=1, format='f')` percent of the individuals in the NHANES sample said "No" and `r formatC(I(PhysActive_table_filtered %>% subset(PhysActive=='Yes') %>% dplyr::select(Percentage) %>% pull()), digits=1, format='f')` percent said "Yes".
### Cumulative distributions {#cumulative-distributions}
The *PhysActive* variable that we examined above only had two possible values, but often we wish to summarize data that can have many more possible values. When those values are quantitative, then one useful way to summarize them is via what we call a *cumulative* frequency representation: rather than asking how many observations take on a specific value, we ask how many have a value some specific value *or less*.
Let's look at another variable in the NHANES dataset, called *SleepHrsNight* which records how many hours the participant reports sleeping on usual weekdays. Table \@ref(tab:sleepTable) shows a frequency table created as we did above, after removing anyone with missing data for this question. We can already begin to summarize the dataset just by looking at the table; for example, we can see that most people report sleeping between 6 and 8 hours. To see this even more clearly, we can plot a *histogram* which shows the number of cases having each of the different values; see left panel of Figure \@ref(fig:sleepHist). We can also plot the relative frequencies, which we will often refer to as *densities* - see the right panel of Figure \@ref(fig:sleepHist).
```{r echo=FALSE}
# create summary table for relative frequency of different
# values of SleepHrsNight
sleepTable <- NHANES %>%
drop_na(SleepHrsNight) %>%
dplyr::select(SleepHrsNight) %>%
group_by(SleepHrsNight) %>%
summarize(AbsoluteFrequency = n()) %>%
mutate(
RelativeFrequency = AbsoluteFrequency / sum(AbsoluteFrequency),
Percentage = RelativeFrequency * 100
)
```
```{r sleepTable, echo=FALSE}
kable(sleepTable, caption='Frequency distribution for number of hours of sleep per night in the NHANES dataset')
```
```{r sleepHist,echo=FALSE,fig.cap="Left: Histogram showing the number (left) and proportion (right) of people reporting each possible value of the SleepHrsNight variable.",fig.width=8,fig.height=4,out.height='33%'}
SleepHrsNight_data_filtered <-
NHANES %>%
drop_na(SleepHrsNight) %>%
dplyr::select(SleepHrsNight)
# setup breaks for sleep variable
scalex <-
scale_x_continuous(
breaks = c(
min(NHANES$SleepHrsNight, na.rm = TRUE):max(NHANES$SleepHrsNight, na.rm = TRUE)
)
) # set the break points in the graph
p1 <- SleepHrsNight_data_filtered %>%
ggplot(aes(SleepHrsNight)) +
geom_histogram(binwidth = 1) +
scalex
p2 <- SleepHrsNight_data_filtered %>%
ggplot(aes(SleepHrsNight)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
scalex
plot_grid(p1,p2)
```
What if we want to know how many people report sleeping 5 hours or less? To find this, we can compute a *cumulative distribution*. To compute the cumulative frequency for some value j, we add up the frequencies for all of the values up to and including j:
$$
cumulative\ frequency_j = \sum_{i=1}^{j}{absolute\ frequency_i}
$$
```{r echo=FALSE}
# create cumulative frequency distribution of SleepHrsNight data
SleepHrsNight_cumulative <-
NHANES %>%
drop_na(SleepHrsNight) %>%
dplyr::select(SleepHrsNight) %>%
group_by(SleepHrsNight) %>%
summarize(AbsoluteFrequency = n()) %>%
mutate(CumulativeFrequency = cumsum(AbsoluteFrequency))
```
\newpage
```{r echo=FALSE}
kable(SleepHrsNight_cumulative, caption='Absolute and cumulative frequency distributions for SleepHrsNight variable')
```
Let's do this for our sleep variable, computing the absolute and cumulative frequency. In the left panel of Figure \@ref(fig:sleepAbsCumulRelFreq) we plot the data to see what these representations look like; the absolute frequency values are plotted in solid lines, and the cumulative frequencies are plotted in dashed lines We see that the cumulative frequency is *monotonically increasing* -- that is, it can only go up or stay constant, but it can never decrease. Again, we usually find the relative frequencies to be more useful than the absolute; those are plotted in the right panel of Figure \@ref(fig:sleepAbsCumulRelFreq). Importantly, the shape of the relative frequency plot is exactly the same as the absolute frequency plot -- only the size of the values has changed.
```{r sleepAbsCumulRelFreq,echo=FALSE,fig.cap="A plot of the relative (solid) and cumulative relative (dashed) values for frequency (left) and proportion (right) for the possible values of SleepHrsNight.",fig.width=8,fig.height=4,out.height='33%'}
p1 <- SleepHrsNight_cumulative %>%
ggplot(aes(SleepHrsNight, AbsoluteFrequency)) +
geom_line(size = 1.25) +
geom_line(
aes(SleepHrsNight, CumulativeFrequency),
linetype = "dashed",
size = 1.25
) +
scalex +
labs(y = "Frequency")
SleepHrsNight_cumulative <-
NHANES %>%
drop_na(SleepHrsNight) %>%
dplyr::select(SleepHrsNight) %>%
group_by(SleepHrsNight) %>%
summarize(AbsoluteFrequency = n()) %>%
mutate(
RelativeFrequency = AbsoluteFrequency / sum(AbsoluteFrequency),
CumulativeDensity = cumsum(RelativeFrequency)
)
p2 <- SleepHrsNight_cumulative %>%
ggplot(aes(SleepHrsNight, RelativeFrequency)) +
geom_line( size = 1.25) +
geom_line(
aes(SleepHrsNight, CumulativeDensity),
linetype = "dashed",
size = 1.25) +
scalex +
labs(
y = "Proportion"
)
plot_grid(p1,p2)
```
### Plotting histograms {#plotting-histograms}
```{r ageHist,echo=FALSE,fig.cap="A histogram of the Age (left) and Height (right) variables in NHANES.",fig.width=8,fig.height=4,out.height='33%'}
p1 <- NHANES %>%
ggplot(aes(Age)) +
geom_histogram(binwidth = 1) +
ggtitle('Age')
p2 <- NHANES %>%
select(Height) %>%
drop_na() %>%
ggplot(aes(Height)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
ggtitle('Height')
plot_grid(p1,p2)
```
The variables that we examined above were fairly simple, having only a few possible values. Now let's look at a more complex variable: Age. First let's plot the *Age* variable for all of the individuals in the NHANES dataset (see left panel of Figure \@ref(fig:ageHist)). What do you see there? First, you should notice that the number of individuals in each age group is declining over time. This makes sense because the population is being randomly sampled, and thus death over time leads to fewer people in the older age ranges. Second, you probably notice a large spike in the graph at age 80. What do you think that's about?
If were were to look up the information about the NHANES dataset, we would see the following definition for the *Age* variable: "Age in years at screening of study participant. Note: Subjects 80 years or older were recorded as 80." The reason for this is that the relatively small number of individuals with very high ages would make it potentially easier to identify the specific person in the dataset if you knew their exact age; researchers generally promise their participants to keep their identity confidential, and this is one of the things they can do to help protect their research subjects. This also highlights the fact that it's always important to know where one's data have come from and how they have been processed; otherwise we might interpret them improperly, thinking that 80-year-olds had been somehow overrepresented in the sample.
Let's look at another more complex variable in the NHANES dataset: Height. The histogram of height values is plotted in the right panel of Figure \@ref(fig:ageHist). The first thing you should notice about this distribution is that most of its density is centered around about 170 cm, but the distribution has a "tail" on the left; there are a small number of individuals with much smaller heights. What do you think is going on here?
You may have intuited that the small heights are coming from the children in the dataset. One way to examine this is to plot the histogram with separate colors for children and adults (left panel of Figure \@ref(fig:heightHistSep)). This shows that all of the very short heights were indeed coming from children in the sample. Let's create a new version of NHANES that only includes adults, and then plot the histogram just for them (right panel of Figure \@ref(fig:heightHistSep)). In that plot the distribution looks much more symmetric. As we will see later, this is a nice example of a *normal* (or *Gaussian*) distribution.
```{r heightHistSep,echo=FALSE,fig.cap="Histogram of heights for NHANES. A: values plotted separately for children (gray) and adults (black). B: values for adults only. C: Same as B, but with bin width = 0.1",fig.width=8,fig.height=8,out.height='50%'}
# first create a new variable in NHANES that tell us whether
# each individual is a child
NHANES <-
NHANES %>%
mutate(isChild = Age < 18)
NHANES_adult <-
NHANES %>%
drop_na(Age, Height) %>%
dplyr::filter(Age > 17)
p1 <- NHANES %>%
dplyr::select(Height, isChild) %>%
drop_na() %>%
ggplot(aes(Height, fill = isChild)) +
scale_fill_grey() +
geom_histogram(aes(y = ..density..), binwidth = 1) +
theme(legend.position = c(0,0.8)) +
ggtitle('A: All individuals')
p2 <- NHANES_adult %>%
ggplot(aes(Height)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
ggtitle('B: Adults only')
p3 <- NHANES_adult %>%
drop_na(Height) %>%
ggplot(aes(Height)) +
geom_histogram(aes(y = ..density..), binwidth = .1) +
ggtitle('C: Adults only (bin width=.1)')
plot_grid(p1,p2,p3,ncol=2)
```
### Histogram bins
In our earlier example with the sleep variable, the data were reported in whole numbers, and we simply counted the number of people who reported each possible value. However, if you look at a few values of the Height variable in NHANES (as shown in Table \@ref(tab:heightVals)), you will see that it was measured in centimeters down to the first decimal place.
```{r heightVals, echo=FALSE}
# take a slice of a few values from the full data frame
nhanes_slice <- NHANES_adult %>%
dplyr::select(Height) %>%
slice(45:50)
kable(nhanes_slice %>% mutate(Height=formatC(Height, digits=1, format='f')), caption='A few values of Height from the NHANES data frame.', digits=1)
```
Panel C of Figure \@ref(fig:heightHistSep) shows a histogram that counts the density of each possible value down the first decimal place. That histogram looks really jagged, which is because of the variability in specific decimal place values. For example, the value 173.2 occurs `r I(sum(NHANES_adult$Height==173.2,na.rm=TRUE))` times, while the value 173.3 only occurs `r I(sum(NHANES_adult$Height==173.3,na.rm=TRUE))` times. We probably don't think that there is really such a big difference between the prevalence of these two heights; more likely this is just due to random variability in our sample of people.
In general, when we create a histogram of data that are continuous or where there are many possible values, we will *bin* the values so that instead of counting and plotting the frequency of every specific value, we count and plot the frequency of values falling within specific ranges. That's why the plot looked less jagged above in Panel B of \@ref(fig:heightHistSep); in this panel we set the bin width to 1, which means that the histogram is computed by combining values within bins with a width of one; thus, the values 1.3, 1.5, and 1.6 would all count toward the frequency of the same bin, which would span from values equal to one up through values less than 2.
Note that once the bin size has been selected, then the number of bins is determined by the data:
$$
number\, of\, bins = \frac{range\, of\, scores}{bin\, width}
$$
There is no hard and fast rule for how to choose the optimal bin width. Occasionally it will be obvious (as when there are only a few possible values), but in many cases it would require trial and error. There are methods that try to find an optimal bin size automatically, such as the Freedman-Diaconis method that we will use in some later examples.
## Idealized representations of distributions
Datasets are like snowflakes, in that every one is different, but nonetheless there are patterns that one often sees in different types of data. This allows us to use idealized representations of the data to further summarize them. Let's take the adult height data plotted in \@ref(fig:heightHistSep), and plot them alongside a very different variable: pulse rate (heartbeats per minute), also measured in NHANES (see Figure \@ref(fig:NormalDistPlotsWithDist)).
```{r NormalDistPlotsWithDist, echo=FALSE,fig.cap='Histograms for height (left) and pulse (right) in the NHANES dataset, with the normal distribution overlaid for each dataset.',fig.width=8,fig.height=4,out.height='50%'}
# first update the summary to include the mean and standard deviation of each
# dataset
pulse_summary <-
NHANES_adult %>%
drop_na(Pulse) %>%
summarize(
nbins = nclass.FD(Pulse),
maxPulse = max(Pulse),
minPulse = min(Pulse),
meanPulse = mean(Pulse), #computing mean
sdPulse = sd(Pulse) #computing SD
)
height_summary <-
NHANES_adult %>%
drop_na(Height) %>%
summarize(
nbins = nclass.FD(Height),
maxHeight = max(Height),
minHeight = min(Height),
binwidth = (maxHeight - minHeight) / nbins,
meanHeight = mean(Height), #computing mean
sdHeight = sd(Height) #computing SD
)
# create data for plotting normal distribution curves data based on our computed means and SDs
heightDist <-
tibble(
x = seq(height_summary$minHeight, height_summary$maxHeight, 0.1)
) %>%
mutate(
y = dnorm(
x,
mean = height_summary$meanHeight,
sd = height_summary$sdHeight
)
)
pulseDist <-
tibble(
x = seq(pulse_summary$minPulse, pulse_summary$maxPulse, 0.1)
) %>%
mutate(
y = dnorm(
x,
mean = pulse_summary$meanPulse,
sd = pulse_summary$sdPulse)
)
#plot the normal distribution curves on top of histograms of the data
h1 <-
NHANES_adult %>%
drop_na(Height) %>%
ggplot(aes(Height)) +
geom_histogram(
aes(y = ..density..),
binwidth = height_summary$binwidth
) +
geom_line(
data = heightDist,
aes(x = x, y = y),
color = "blue",
size = 1.2
)
h2 <-
NHANES_adult %>%
drop_na(Pulse) %>%
ggplot(aes(Pulse)) +
geom_histogram(
aes(y = ..density..),
binwidth = 2
) +
geom_line(
data = pulseDist,
aes(x = x, y = y),
color = "blue",
size = 1.2
)
plot_grid(h1, h2)
```
While these plots certainly don't look exactly the same, both have the general characteristic of being relatively symmetric around a rounded peak in the middle. This shape is in fact one of the commonly observed shapes of distributions when we collect data, which we call the *normal* (or *Gaussian*) distribution. This distribution is defined in terms of two values (which we call *parameters* of the distribution): the location of the center peak (which we call the *mean*) and the width of the distribution (which is described in terms of a parameter called the *standard deviation*). Figure \@ref(fig:NormalDistPlotsWithDist) shows the appropriate normal distribution plotted on top of each of the histrograms.You can see that although the curves don't fit the data exactly, they do a pretty good job of characterizing the distribution -- with just two numbers!
As we will see later when we discuss the central limit theorem, there is a deep mathematical reason why many variables in the world exhibit the form of a normal distribution.
### Skewness
The examples in Figure \@ref(fig:NormalDistPlotsWithDist) followed the normal distribution fairly well, but in many cases the data will deviate in a systematic way from the normal distribution. One way in which the data can deviate is when they are asymmetric, such that one tail of the distribution is more dense than the other. We refer to this as "skewness". Skewness commonly occurs when the measurement is constrained to be non-negative, such as when we are counting things or measuring elapsed times (and thus the variable can't take on negative values).
An example of relatively mild skewness can be seen in the average waiting times at the airport security lines at San Francisco International Airport, plotted in the left panel of Figure \@ref(fig:SFOWaitTimes). You can see that while most wait times are less than 20 minutes, there are a number of cases where they are much longer, over 60 minutes! This is an example of a "right-skewed" distribution, where the right tail is longer than the left; these are common when looking at counts or measured times, which can't be less than zero. It's less common to see "left-skewed" distributions, but they can occur, for example when looking at fractional values that can't take a value greater than one.
```{r SFOWaitTimes,echo=FALSE,fig.cap="Examples of right-skewed and long-tailed distributions. Left: Average wait times for security at SFO Terminal A (Jan-Oct 2017), obtained from https://awt.cbp.gov/ . Right: A histogram of the number of Facebook friends amongst 3,663 individuals, obtained from the Stanford Large Network Database. The person with the maximum number of friends is indicated by the diamond.",fig.width=8,fig.height=4,out.height='50%', message=FALSE,warning=FALSE}
waittimes <-
read_csv("data/04/sfo_wait_times_2017.csv")
p1 <- waittimes %>%
ggplot(aes(waittime)) +
geom_histogram(binwidth = 1)
fbdata <-
read.table("data/04/facebook_combined.txt")
# count how many friends each individual has
friends_table <-
fbdata %>%
group_by(V1) %>%
summarize(nfriends = n())
p2 <- friends_table %>%
ggplot(aes(nfriends)) +
geom_histogram(aes(y = ..density..), binwidth = 2) +
xlab("Number of friends") +
annotate(
"point",
x = max(friends_table$nfriends),
y = 0, shape=18,
size = 4
)
plot_grid(p1,p2)
```
### Long-tailed distributions
Historically, statistics has focused heavily on data that are normally distributed, but there are many data types that look nothing like the normal distribution. In particular, many real-world distributions are "long-tailed", meaning that the right tail extends far beyond the most typical members of the distribution; that is, they are extremely skewed. One of the most interesting types of data where long-tailed distributions occur arises from the analysis of social networks. For an example, let's look at the Facebook friend data from the [Stanford Large Network Database](https://snap.stanford.edu/data/egonets-Facebook.html) and plot the histogram of number of friends across the 3,663 people in the database (see right panel of Figure \@ref(fig:SFOWaitTimes)). As we can see, this distribution has a very long right tail -- the average person has `r I(mean(friends_table$nfriends))` friends, while the person with the most friends (denoted by the blue dot) has `r I(max(friends_table$nfriends))`!
Long-tailed distributions are increasingly being recognized in the real world. In particular, many features of complex systems are characterized by these distributions, from the frequency of words in text, to the number of flights in and out of different airports, to the connectivity of brain networks. There are a number of different ways that long-tailed distributions can come about, but a common one occurs in cases of the so-called "Matthew effect" from the Christian Bible:
> For to every one who has will more be given, and he will have abundance; but from him who has not, even what he has will be taken away. — Matthew 25:29, Revised Standard Version
This is often paraphrased as "the rich get richer". In these situations, advantages compound, such that those with more friends have access to even more new friends, and those with more money have the ability to do things that increase their riches even more.
As the course progresses we will see several examples of long-tailed distributions, and we should keep in mind that many of the tools in statistics can fail when faced with long-tailed data. As Nassim Nicholas Taleb pointed out in his book "The Black Swan", such long-tailed distributions played a critical role in the 2008 financial crisis, because many of the financial models used by traders assumed that financial systems would follow the normal distribution, which they clearly did not.
## Learning objectives
Having read this chapter, you should be able to:
* Compute absolute, relative, and cumulative frequency distributions for a given dataset
* Generate a graphical representation of a frequency distribution
* Describe the difference between a normal and a long-tailed distribution, and describe the situations that commonly give rise to each
## Suggested readings
- *The Black Swan: The Impact of the Highly Improbable*, by Nassim Nicholas Taleb