This course provides practical materials and background information on the analysis of multivariate data.

Learning Objectives

After this course participants should be able to:

perform Principal Component Analysis (PCA)
perform K-means clustering
perform basic hierarchical clustering

Target Audience
+
Participants are generally those who need to analyse data that contain many different variables. They can be graduate or postgraduate students, or researchers who find themselves having to do this.
+
+
+
Prerequisites
+
Participants should have some background knowledge of statistics. Attending the Core statistics course is recommended.
+
More importantly, participants should have a working knowledge of either R or Python, since the topics that are covered require data wrangling and visualisation.
+
+
+
+
Exercises
+
Exercises in these materials are labelled according to their level of difficulty:

Level 1: Exercises in level 1 are simpler and designed to get you familiar with the concepts and syntax covered in the course.

Level 2: Exercises in level 2 combine different concepts and apply them to a given task.

Level 3: Exercises in level 3 require going beyond the concepts and syntax introduced to solve new problems.

Authors

About the authors:

Martin van Rongen
Affiliation: Bioinformatics Training Facility, University of Cambridge
Roles: writing - review & editing; conceptualisation; coding

4.1 Principal component analysis

This is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance: the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.
+
+
+
4.2 K-means clustering
+
This is a method for grouping observations into clusters. It groups data based on similarity and is an often-used unsupervised machine learning algorithm.
+
It groups the data into a fixed number of clusters (\(k\)) and the ultimate aim is to discover patterns in the data.
+
+
+
4.3 Hierarchical clustering
+
A clustering method that tries to create a hierarchy across data, often displayed as a dendrogram. This visualises the way the clusters are arranged.
6.2 Purpose and aim

This is a method for grouping observations into clusters. It groups data based on similarity and is an often-used unsupervised machine learning algorithm.

It groups the data into a fixed number of clusters (\(k\)) and the ultimate aim is to discover patterns in the data.

K-means clustering is an iterative process. It follows these steps (a short illustration follows the list):

Select the number of clusters to identify (e.g. K = 3)
Create centroids
Place centroids randomly in your data
Assign each data point to the closest centroid
Calculate the centroid of each new cluster
Repeat steps 4-5 until the clusters do not change
+
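To make this a little more concrete, here is a minimal illustration in R using some made-up 2-D values (not part of the course data); all function names are base R:

# illustration only: two made-up numeric variables
set.seed(123)
toy <- data.frame(x = rnorm(30), y = rnorm(30))

# steps 2-6 of the list above all happen inside kmeans()
kmeans(toy, centers = 3)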
+
+
+
6.3 Data
+
Once again we’ll be using the data from data/penguins.csv. These data are from the palmerpenguins package (for more information, see the GitHub page).
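In R, reading in the data might look like this (a single read_csv() call, assuming the file sits in a data/ folder relative to your working directory); it produces the column specification shown below:

penguins <- read_csv("data/penguins.csv")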
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <chr>, year <dbl>
+
+
+
+
+
+
# load the data
penguins_py = pd.read_csv("data/penguins.csv")

# inspect the data
penguins_py.head()


  species     island  bill_length_mm  ...  body_mass_g     sex  year
0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
2  Adelie  Torgersen            40.3  ...       3250.0  female  2007
3  Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4  Adelie  Torgersen            36.7  ...       3450.0  female  2007

[5 rows x 8 columns]
+
+
+
+
+
+
There are a variety of variables in this data set. The following example focuses on the two variables bill_length_mm and bill_depth_mm across the various species recorded.
We can already see that the data appear to cluster quite closely by species. A great example to illustrate K-means clustering (you’d almost think I chose the example on purpose)!
+
+
+
6.5 Perform K-means clustering
+
To do the clustering, we’ll need to do a bit of data wrangling, since we can only do the clustering on numerical data.
+
As we did with the PCA, we also need to scale the data. Although it is not required in this case, because both variables have the same unit (millimetres), it is good practice. In other scenarios it could be that there are different units that are being compared, which could affect the clustering.
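The exact wrangling used for this isn't shown here, but a minimal sketch could look like the following; the object names (penguins_scaled, kclust, tidy_clust) and the scaled column names match those used in the plotting code below, while the use of drop_na() and transmute() is an assumption:

penguins_scaled <- penguins %>% 
  drop_na() %>%                 # remove missing values
  transmute(                    # keep (scaled versions of) the two variables
    bill_depth_scaled  = as.numeric(scale(bill_depth_mm)),
    bill_length_scaled = as.numeric(scale(bill_length_mm)))

# cluster into k = 3 groups and get a per-cluster summary
kclust <- kmeans(penguins_scaled, centers = 3)
tidy_clust <- tidy(kclust)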
The initial centroids get randomly placed in the data. This, combined with the iterative nature of the process, means that the values that you will see are going to be slightly different from the values here. That’s normal!
Next, we want to visualise which data points belong to which cluster. We can do that as follows:
+
+
kclust %>%                               # take clustering data
  augment(penguins_scaled) %>%           # combine with original data
  ggplot(aes(x = bill_depth_scaled,      # plot the scaled data
             y = bill_length_scaled)) +
  geom_point(aes(colour = .cluster)) +   # colour by classification
  geom_point(data = tidy_clust,
             size = 7, shape = 'x')      # add the cluster centers
# remove missing values
penguins_py = penguins_py.dropna()

# add an ID column
penguins_py['id'] = range(1, len(penguins_py) + 1)

# add an ID column to the scaled data
# so we can match the observations
penguins_scaled_py['id'] = range(1, len(penguins_scaled_py) + 1)

# merge the data by ID
penguins_augment_py = (pd.merge(penguins_py.dropna(),
                                penguins_scaled_py,
                                on = 'id'))

# add the cluster designation
penguins_augment_py['cluster'] = kmeans.fit_predict(penguins_scaled_py)

# and convert it into a factor
penguins_augment_py['cluster'] = (penguins_augment_py['cluster']
                                  .astype('category'))
In the example we set the number of clusters to 3. This made sense, because the data already visually separated into roughly three groups - one for each species.
+
However, it might be that the cluster number to choose is a lot less obvious. In that case it would be helpful to explore clustering your data into a range of clusters.
+
In short, we determine which values of \(k\) we want to explore and then loop through these values, repeating the workflow we looked at previously.
Iterating over a range of \(k\) values is reasonably straightforward using the tidyverse. Although we could write our own function to loop through these \(k\) values, the tidyverse has a series of map() functions that can do this for you. More information on them here.
+
In short, the map() function spits out a list which contains the output. When we do this on our data, we can create a table that contains lists with all of the information that we need.
+
Here we calculate the following:
+
+
the kclust column contains a list with all the kmeans() output, for each value of \(k\)
+
the tidied column contains the information on a per-cluster basis
+
the glanced column contains a single-row summary for each \(k\) - we'll use the tot.withinss values a little bit later on
+
the augmented column contains the original data, augmented with the classification that was calculated by the kmeans() function
+
+
+
kclusts <- 
  # check for k = 1 to 6
  tibble(k = 1:6) %>%
  mutate(
    # perform clustering for each k
    kclust = map(k, ~ kmeans(penguins_scaled, .x)),
    # summary at per-cluster level
    tidied = map(kclust, tidy),
    # get single-row summary
    glanced = map(kclust, glance),
    # add classification to data set
    augmented = map(kclust, augment, penguins_scaled))

kclusts
Lists can sometimes be a bit tricky to get your head around, so it’s worthwhile exploring the output. RStudio is particularly useful for this, since you can just left-click on the object in your Environment panel and look.
+
The way I see lists in this context is as containers. We have one huge table kclusts that contains all of the information that we need. Each ‘cell’ in this table has a container with the relevant data. The kclust column is a list with kmeans objects (the output of our kmeans() for each of the \(k\) values), whereas the other columns are lists of tibbles (because the tidy(), glance() and augment() functions output a tibble with the information for each value of \(k\)).
+
For us to use the data in the lists, it makes sense to extract them on a column-by-column basis. We’re ignoring the kclust column, because we don’t need the actual kmeans() output any more.
+
To extract the data from the lists we use the unnest() function.
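A sketch of that step; the object names clusters, assignments and clusterings are the ones used below, while the exact cols arguments are an assumption:

clusters <- kclusts %>% 
  unnest(cols = c(tidied))      # per-cluster information

assignments <- kclusts %>% 
  unnest(cols = c(augmented))   # original data plus cluster assignment

clusterings <- kclusts %>% 
  unnest(cols = c(glanced))     # single-row summary per k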
Next, we can visualise some of the data. We’ll start by plotting the scaled data and colouring the data points based on the final cluster it has been assigned to by the kmeans() function.
+
The (augmented) data are in assignments. Have a look at the structure of the table.
+
We facet the data by \(k\), so we get a single panel for each value of \(k\).
+
We also add the calculated cluster centres, which are stored in clusters.
+
+
ggplot(assignments,
       aes(x = bill_depth_scaled,     # plot data
           y = bill_length_scaled)) +
  geom_point(aes(color = .cluster),   # colour by cluster
             alpha = 0.8) +           # add transparency
  facet_wrap(vars(k)) +               # facet for each k
  geom_point(data = clusters,         # add centers
             size = 7,
             shape = "x")
Looking at this plot shows what we already knew (if only things were this easy all the time!): three clusters is a pretty good choice for these data. Remember that you’re looking for clusters that are distinct, i.e. are separated from one another. For example, using k = 4 gives you four nice groups, but two of them are directly adjacent, suggesting that they would do equally well in a single cluster.
+
+
6.6.1 Elbow plot
+
Visualising the data like this can be helpful but at the same time it can also be a bit subjective (hoorah!). To find another subjective way of interpreting these clusters (remember, statistics isn’t this YES/NO magic mushroom and we should be comfortable wandering around in the murky grey areas of statistics by now), we can plot the total within-cluster variation for each value of k.
+
Intuitively, if you keep adding clusters then the total amount of variation that can be explained by these clusters will increase. The most extreme case would be where each data point is its own cluster and we can then explain all of the variation in the data.
+
Of course that is not a very sensible approach - hence us balancing the number of clusters against how much variation they can capture.
+
A practical approach to this is creating an “elbow” plot, where the total within-cluster variation is plotted against the number of clusters.
The output of the kmeans() function includes tot.withinss - this is the total within-cluster sum of squares.
+
+
ggplot(clusterings,
       aes(x = k,                  # for each k plot...
           y = tot.withinss)) +    # total within variance
  geom_line() +
  geom_point() +
  scale_x_continuous(
    breaks = seq(1, 6, 1))         # set the x-axis breaks
We can see that the total within-cluster sum of squares decreases as the number of clusters increases. We can also see that from k = 3 onwards the slope of the line becomes much shallower. This “elbow” or bending point is a useful gauge to find the optimum number of clusters.
+
From the exploration we can see that three clusters are optimal in this scenario.
+
+
+
+
6.7 Exercises
+
+
6.7.1 Finch beaks

Exercise

Level:
+
For this exercise we’ll be using the data from data/finch_beaks.csv.
+
The finches data have been adapted from the website accompanying the book 40 years of evolution. Darwin's finches on Daphne Major Island by Peter R. Grant and Rosemary B. Grant.
+
Practice running through the clustering workflow using the finches dataset. Try doing the following:
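Before the plotting code below will run, the finches data need to be read in, scaled and clustered. A minimal sketch of that preparation; the raw column names blength and bdepth are an assumption (inferred from the scaled names used in the plots), and two clusters are used to match the two species:

finches <- read_csv("data/finch_beaks.csv")

finches_scaled <- finches %>% 
  drop_na() %>% 
  transmute(bdepth_scaled  = as.numeric(scale(bdepth)),
            blength_scaled = as.numeric(scale(blength)))

kclust <- kmeans(finches_scaled, centers = 2)
tidy_clust <- tidy(kclust)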
kclust %>%                               # take clustering data
  augment(finches_scaled) %>%            # combine with original data
  ggplot(aes(x = bdepth_scaled,          # plot the original data
             y = blength_scaled)) +
  geom_point(aes(colour = .cluster)) +   # colour by classification
  geom_point(data = tidy_clust,
             size = 7, shape = 'x')      # add the cluster centers
Optimise clusters
+
It looks like two clusters is a reasonable choice. But let’s explore this a bit more.
kclusts <- 
  # check for k = 1 to 6
  tibble(k = 1:6) %>%
  mutate(
    # perform clustering for each k
    kclust = map(k, ~ kmeans(finches_scaled, .x)),
    # summary at per-cluster level
    tidied = map(kclust, tidy),
    # get single-row summary
    glanced = map(kclust, glance),
    # add classification to data set
    augmented = map(kclust, augment, finches_scaled)
  )
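As before, the list columns can then be pulled out with unnest() (a sketch, mirroring the penguins example):

clusters <- kclusts %>% unnest(cols = c(tidied))
assignments <- kclusts %>% unnest(cols = c(augmented))
clusterings <- kclusts %>% unnest(cols = c(glanced))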
ggplot(assignments,
       aes(x = bdepth_scaled,          # plot data
           y = blength_scaled)) +
  geom_point(aes(color = .cluster),    # colour by cluster
             alpha = 0.8) +            # add transparency
  facet_wrap(~ k) +                    # facet for each k
  geom_point(data = clusters,          # add centers
             size = 7,
             shape = 'x')
Create an elbow plot to have a closer look.
+
+
ggplot(clusterings,
       aes(x = k,                  # for each k plot...
           y = tot.withinss)) +    # total within variance
  geom_line() +
  geom_point() +
  scale_x_continuous(
    breaks = seq(1, 6, 1))         # set the x-axis breaks
Conclusions
+
Our initial clustering was done using two clusters, basically capturing the two different finch species.
+
Redoing the analysis with different numbers of clusters seems to reasonably support that decision. The elbow plot suggests that k = 3 would not be such a terrible idea either.
+
In the example above we used data that were collected at two different time points: 1975 and 2012.
+
In the analysis we've kept these data together. However, the original premise of these data was to see if there is any indication of evolution going on in these species of finches. Think about how you would approach this question!
6.8 Summary

Key points

k-means clustering partitions data into clusters
+
the k defines the number of clusters
+
cluster centers or centroids get assigned randomly
+
each data point gets assigned to the closest centroid
+
the centroid of each new cluster gets calculated and the process of assignment and recalculation repeats until the clusters no longer change
+
the optimal number of clusters can be determined with an ‘elbow’ plot

5.2 Purpose and aim

This is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance: the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.
+
+
+
5.3 Data
+
The example in this section uses the following data set:
+
data/finches_hybridisation.csv
+
These data come from an analysis of gene flow across two finch species (Grant and Grant 2020). They are slightly adapted here for illustrative purposes (the changes are documented in materials/data_preparation.R if you are interested).
+
Long story short: these data are part of a hybridisation study on finches at the Galapagos Islands. Here, we’ve only included the non-hybrid observations, but feel free to rerun the analysis with all the hybrids!
+
From many years of studies, going back to Darwin, we now know that the beak shape of the finches is important to their chances of survival. Changes in beak shape have led to new species and this study explored how movement of genes from a rare immigrant species (Geospiza fuliginosa) occurred through two other species (G. fortis and G. scandens). The researchers recorded various morphological traits.

We have 7 columns. We're not going to recreate the published analysis exactly, but for the purpose of this section we will be looking at whether differences in the measured morphological traits can be attributed to specific categories.
We have 6 continuous response variables, which we’re not going to visualise independently! However, you could look at some of the potential relationships within the data.
+
For example, looking at body weight (weight) against beak length (blength):
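In R that plot might look something like this, assuming the data were read into finches_hybrid:

ggplot(finches_hybrid, aes(x = blength, y = weight,
                           colour = category)) +
  geom_point()

or just weight across the different categories: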
(ggplot(finches_hybrid_py,
        aes(x = "category", y = "weight")) +
     geom_boxplot())
The number of combinations is rather large, given the number of variables we have. Hence it's a good idea to see if we can “condense” some of the variables into new ones.
+
What we're doing with a PCA is trying to capture as much of the variance that exists in the data as possible using a Principal Component (PC). The PC therefore explains some of the variance for each of the individual variables.
+
I like to compare the PCs to a smoothie: a smoothie might consist of 4 parts apple, 3 parts strawberry, 2 parts blueberry and 1 part grape. Along the lines of that delicious metaphor, one of our Principal components may consist of 4 parts blength, 3 parts weight, 2 parts bdepth and 1 part wing. We don’t know this yet, so let’s go find out.
+
+
+
5.5 Performing the PCA
+
To perform a PCA, we need to keep a few things in mind:
+
+
We can only calculate principal components for numeric data
+
The numeric data need to be on the same scale
+
+
This last point of scaling is very important. Measurements can take place at different scales. You shouldn't compare millimetres and kilogrammes directly, for example. Or seconds and height. That simply does not make sense.
+
This issue is even more prevalent when comparing gene expression data, for example. Genes can be active at varying levels, but some genes only need to (and indeed do) change a tiny amount to elicit an effect, whereas other genes might have to change several-fold before something happens.
+
To compensate for this, we bring all of our data onto the same scale.
In R we can scale our data with the scale() function. We perform the PCA using the prcomp() function. Here we store the output into an object called pca_fit, because we’ll be working with the output later on.
+
+
pca_fit <- finches_hybrid %>% 
  # keep only the numeric columns
  select(where(is.numeric)) %>%
  # scale the data
  scale() %>%
  # perform the PCA
  prcomp()
+
+
This is a bit of a daunting-looking output, but not to worry. We'll unpack things along the way!
In Python we can scale our data with the StandardScaler() function from sklearn.preprocessing. We can only scale numerical data, so we need to get those first.
+
+
from sklearn.preprocessing import StandardScaler

# select the numeric values
X = finches_hybrid_py.select_dtypes(include = ['float64', 'int64'])

# define the scaler
std_scaler = StandardScaler()

# scale the numeric values
finches_scaled = std_scaler.fit_transform(X)
+
+
Now that we have the scaled values, we can perform the PCA. We do this using the PCA() function from sklearn.decomposition.
+
We need to tell it how many principal components we want it to return. We set it to 6 here, which is the number of numerical variables.
+
+
from sklearn.decomposition import PCA

# define the number of principal components
n_components = 6

# set the number of principal components
pca = PCA(n_components = n_components)

# perform the PCA
principal_components = pca.fit_transform(finches_scaled)

# create a data frame containing the information
# changing the column names based on the PC
pca_fit_py = pd.DataFrame(data = principal_components,
                          columns = [f'PC{i}' for i in range(1, n_components + 1)])
+
+
+
+
+
+
5.5.1 Visualising the principal components
+
We can figure out how much each principal component is contributing to the amount of variance that is being explained. This is called a screeplot.
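In R, one way to build it is with broom's tidy() method for prcomp objects; matrix = "eigenvalues" returns, among other things, a percent column with the variance explained per PC:

pca_fit %>% 
  tidy(matrix = "eigenvalues") %>% 
  ggplot(aes(x = PC,
             y = percent)) +
  geom_point() +
  geom_line()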
This means that PC1 is able to explain around 65% of the variance in the data. PC2 is able to explain around 21% of the variance in the data, and so forth.
+
+
+
5.5.2 Loadings
+
Let's think back to our smoothie metaphor. Remember how the smoothie was made up of various fruits - just like our PCs are made up of parts of our original variables.
+
Let’s, for the sake of illustrating this, assume the following for PC1:

parts    variable
4        blength
1        tarsus

Each PC has something called an eigenvector, which in simplest terms is a line with a certain direction and length.
+
If we want to calculate the length of the eigenvector for PC1, we can employ Pythagoras (well, not directly, just his legacy). This gives:
+
\(eigenvector \, PC1 = \sqrt{4^2 + 1^2} = 4.12\)
+
The loading scores for PC1 are the “parts” scaled for this length, i.e.:

scaled parts       variable
4 / 4.12 = 0.97    blength
1 / 4.12 = 0.24    tarsus

What we can do with these values is plot the loadings for each of the original variables.
We’ll have to reformat this a little, so that we have the values in separate columns. First, we rename some of the columns / values to make things more consistent and clearer:
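Something along these lines, extracting the loadings with broom, renaming the columns and widening to one column per PC:

pca_loadings <- pca_fit %>% 
  tidy(matrix = "loadings") %>% 
  # rename to friendlier column names
  rename(terms = column,
         component = PC) %>% 
  mutate(component = paste0("PC", component)) %>% 
  # one column per principal component
  pivot_wider(names_from = "component",
              values_from = "value")

head(pca_loadings)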
We can then plot this. This is a little bit involved, unfortunately. And not something I’d recommend remembering the code by heart, but we’re doing the following:
+
+
Take the PCA output and add the original data
Plot this
Create a line from the origin (x = 0, y = 0) for each loading
Make it an arrow
Add the variable name as a label (see the sketch below)
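A sketch of that plot, reusing the augment() approach from earlier and the wide pca_loadings table from above:

# define arrow style
arrow_style <- arrow(length = unit(2, "mm"),
                     type = "closed")

pca_fit %>%
  # add the original data
  augment(finches_hybrid) %>%
  ggplot() + 
  geom_point(aes(.fittedPC1, .fittedPC2, colour = category), size = 1.5) +
  # one line (arrow) from the origin per loading
  geom_segment(data = pca_loadings,
               aes(xend = PC1, yend = PC2),
               x = 0, y = 0,
               arrow = arrow_style) +
  # label each arrow with the variable name
  geom_text(data = pca_loadings,
            aes(x = PC1, y = PC2, label = terms),
            hjust = 0, vjust = 1, size = 5)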
After all that we end up with a rather unclear plot. The 6 variables that contribute to each principal component have very overlapping contributions in PC1. As such, it’s difficult to see which variable contributes what!
+
The reason why I’m still showing it here is that this kind of plot is very often used in PCA, so at least you can recognise it. If the variables have well-separated contributions across the two principal components, then it can be quite informative.
+
A better way would be to plot the individual contributions to each principal component as an ordered bar plot. This does require some rejigging of the data again (sorry!).
First, we convert our loadings data back to a “long” format. We also add an extra column, direction, which indicates if the contribution to the principal component is positive or negative.
+
+
pca_loadings <- pca_loadings %>% 
  pivot_longer(cols = -terms,
               names_to = "component",
               values_to = "value") %>% 
  # add a column that indicates direction
  # (positive or negative)
  mutate(direction = ifelse(value < 0, "positive", "negative"))

# have a look at the data
head(pca_loadings)
We can now visualise this. Here we are using some additional functionality offered by the tidytext library. Make sure to install it, if needed. Then load it.
+
+
# we need this library
library(tidytext)

pca_loadings %>% 
  mutate(terms = tidytext::reorder_within(terms,
                                          abs(value),
                                          component)) %>%
  ggplot(aes(x = abs(value), y = terms, fill = direction)) +
  geom_col() +
  facet_wrap(vars(component), scales = "free_y") +
  tidytext::scale_y_reordered()

Important

It is important to keep the amount of variance explained by each principal component in mind. For example, PC3 only explains around 4.7% of the variance. So although several variables contribute substantially to PC3, the contribution of PC3 itself remains small.
+
+
+
+
+
+
5.6 Exercises
+
+
5.6.1 Penguins

Exercise

Level:
+
For this exercise we’ll be using the data from data/penguins.csv. These data are from the palmerpenguins package (for more information, see the GitHub page).
+
I would like you to do the following:
+
+
Load and visualise the data.
+
Create a screeplot and see how many PCs would be best.
+
Calculate the loadings for PC1 and PC2 and visualise them.

Any conclusions you might draw from the analysis.
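The answer below starts by loading and visualising the data. Reading the data in R produces the column specification shown next:

penguins <- read_csv("data/penguins.csv")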
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <chr>, year <dbl>
+
+
+
We can see that there are different kinds of variables, both categorical and numerical. Also, there appear to be some missing data in the data set, so we probably have to deal with that.
+
Lastly, we should be careful with the year column: it is recognised as a numerical column (because it contains numbers), but we should view it as a factor, since the years have an ordered, categorical meaning.
+
To get a better sense of our data we could plot all the numerical variables against each other, to see if there is any possible correlation between them. Doing that one-by-one in ggplot is tedious, so I’m using the pairs() function here. Pretty? No. Effective? Yes.
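A compact way to do that (keeping only the numeric columns and handing them to base R's pairs()):

penguins %>% 
  select(where(is.numeric)) %>% 
  pairs()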
So we see that there is some possible grouping going on and possibly some linear relationships, too. However, there are multiple groups and closely related measurements, so it is not very surprising that there are possible relationships within the data.
+
+
+
Perform the PCA
+
First, we need to do a little bit of data tidying. We convert year to a factor and deal with the missing values. We’re not dealing with them in a particularly subtle way here, removing all the rows that contain at least one missing value.
+
In your own research you may want to be more careful and only remove missing values from the variables that you are using in the PCA (here we include everything).
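A sketch of that tidying step plus the PCA itself, mirroring the workflow from the main example:

penguins <- penguins %>% 
  mutate(year = factor(year)) %>% 
  drop_na()             # remove all rows with missing values

pca_fit <- penguins %>% 
  # keep only the numeric columns
  select(where(is.numeric)) %>%
  # scale the data
  scale() %>%
  # perform the PCA
  prcomp()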
It looks like using the first two principal components is probably capturing the majority of the variance in the data. Combined they account for 90% of the variance.
+
+
+
Loadings
+
Next, we get the loadings: how much is each original variable contributing to the individual principal components?
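As in the main example, the loadings can be extracted with broom and widened to one column per PC:

pca_loadings <- pca_fit %>% 
  tidy(matrix = "loadings") %>% 
  rename(terms = column,
         component = PC) %>% 
  mutate(component = paste0("PC", component)) %>% 
  pivot_wider(names_from = "component",
              values_from = "value")

head(pca_loadings)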
We can also visualise these contributions using a bar plot. We need the data in a slightly different format before we can do this:
+
+
pca_loadings <- pca_loadings %>% 
  pivot_longer(cols = -terms,
               names_to = "component",
               values_to = "value") %>% 
  # add a column that indicates direction
  # (positive or negative)
  mutate(direction = ifelse(value < 0, "positive", "negative"))

# have a look at the data
head(pca_loadings)
# we need this library
library(tidytext)

pca_loadings %>% 
  mutate(terms = tidytext::reorder_within(terms,
                                          abs(value),
                                          component)) %>%
  ggplot(aes(x = abs(value), y = terms, fill = direction)) +
  geom_col() +
  facet_wrap(vars(component), scales = "free_y") +
  tidytext::scale_y_reordered()
Conclusions
+
+
Load the data.
+
Create a screeplot and see how many PCs would be best.
+
Calculate the loadings for PC1 and PC2 and visualise them.
+
Any conclusions you might draw from the analysis.
+
+
Taken together, we can conclude/comment on a few things:
+
+
Endlessly looking at pairwise comparisons between continuous variables probably becomes a bit tedious. An alternative would be to calculate correlations between the variables to get some insight into your data. In the end it depends on how many variables you have and how much you (want to) know about them.
+
In this case, I’d say that the first two principal components capture most of the variance in the data.
+
The largest contributing variables mostly differ between the first two principal components. The variables that make up PC1 are very similar in terms of contribution, whereas two variables more or less make up PC2 entirely.
+
+
The variables flipper_length_mm and body_mass_g contribute pretty much only to PC1 (they are horizontal in the loadings plot), whereas bill_length_mm is contributing a reasonable amount to both PC1 and PC2.
+
From the data itself we can see that there is clear separation between the three species. The Gentoo penguins separate further once again from the other two species.
5.7 Summary

Key points

PCA allows you to reduce a large number of variables into fewer principal components.
+
Each PC is made up of a combination of the original variables and captures as much of the variance within the data as possible.
+
The loadings tell you how much each original variable contributes to each PC.
+
A screeplot is a graphical representation of the amount of variance explained by each PC.

References

Grant, Peter, and B. Grant. 2020. “Triad Hybridization via a Conduit Species.” Proceedings of the National Academy of Sciences 117 (March): 202000388. https://doi.org/10.1073/pnas.2000388117.
diff --git a/search.json b/search.json
new file mode 100644
index 0000000..7d03b96
--- /dev/null
+++ b/search.json
@@ -0,0 +1,288 @@
+[
+ {
+ "objectID": "index.html",
+ "href": "index.html",
+ "title": "Multivariate analysis",
+ "section": "",
+ "text": "Overview\n** UNDER DEVELOPMENT **\nThis course provides practical materials and background information on the analysis of multivariate data.",
+ "crumbs": [
+ "Welcome",
+ "1Overview"
+ ]
+ },
+ {
+ "objectID": "index.html#overview",
+ "href": "index.html#overview",
+ "title": "Multivariate analysis",
+ "section": "",
+ "text": "Learning Objectives\n\n\n\nAfter this course participants should be able to:\n\nperform Principal Component Analysis (PCA)\nperform K-means clustering\nperform basic hierarchical clustering\n\n\n\n\nTarget Audience\nParticipants are generally those who need to analyse data that contain many different variables. They can be graduate or postgraduate students, or researchers who find themselves having to do this.\n\n\nPrerequisites\nParticipants should have some knowledge regarding statistics. Attending the Core statistics course is recommended.\nMore importantly, participants should have a working knowledge of either R or Python, since the topics that are covered require data wrangling and visualisation.\n\n\n\nExercises\nExercises in these materials are labelled according to their level of difficulty:\n\n\n\n\n\n\n\nLevel\nDescription\n\n\n\n\n \nExercises in level 1 are simpler and designed to get you familiar with the concepts and syntax covered in the course.\n\n\n \nExercises in level 2 combine different concepts together and apply it to a given task.\n\n\n \nExercises in level 3 require going beyond the concepts and syntax introduced to solve new problems.",
+ "crumbs": [
+ "Welcome",
+ "1Overview"
+ ]
+ },
+ {
+ "objectID": "index.html#authors",
+ "href": "index.html#authors",
+ "title": "Multivariate analysis",
+ "section": "Authors",
+ "text": "Authors\n\nAbout the authors:\n\nMartin van Rongen \nAffiliation: Bioinformatics Training Facility, University of Cambridge\nRoles: writing - review & editing; conceptualisation; coding\n\n\n\n\n\n\n“Mastodon.” n.d. https://genomic.social/@BioInfoCambs.",
+ "crumbs": [
+ "Welcome",
+ "1Overview"
+ ]
+ },
+ {
+ "objectID": "setup.html",
+ "href": "setup.html",
+ "title": "Data & Setup",
+ "section": "",
+ "text": "Data\nThe data used in these materials is provided as a zip file. Download and unzip the folder to your Desktop to follow along with the materials.\nDownload",
+ "crumbs": [
+ "Welcome",
+ "2Data & Setup"
+ ]
+ },
+ {
+ "objectID": "setup.html#software",
+ "href": "setup.html#software",
+ "title": "Data & Setup",
+ "section": "Software",
+ "text": "Software\n\nR and RStudio\n\n\nWindows\nDownload and install all these using default options:\n\nR\nRTools\nRStudio\n\n\n\nMac OS\nDownload and install all these using default options:\n\nR\nRStudio\n\n\n\nLinux\n\nGo to the R installation folder and look at the instructions for your distribution.\nDownload the RStudio installer for your distribution and install it using your package manager.\n\n\n\n\n\nPython\nFor this course we’ll be using Visual Studio Code. This provides support for various programming languages (including Python and R). It works on Windows, MacOS and Linux. It’s also open-source and free.\nPlease refer to the installation instructions and make sure that you verify that Python code will run.\nA brief sequence of events:\n\nInstall Visual Studio Code\nInstall the VS Code Python extension\nInstall a Python interpreter\n\nWindows: install from Python.org or use the Microsoft Store\nMacOS: install the Homebrew package manager, then use this to install Python\nLinux: comes with Python 3, but needs pip to install additional packages\n\n\n\n\n\n\n\n\n\n\n“Mastodon.” n.d. https://genomic.social/@BioInfoCambs.",
+ "crumbs": [
+ "Welcome",
+ "2Data & Setup"
+ ]
+ },
+ {
+ "objectID": "references.html",
+ "href": "references.html",
+ "title": "3 References",
+ "section": "",
+ "text": "Grant, Peter, and B. Grant. 2020. “Triad Hybridization via a\nConduit Species.” Proceedings of the National Academy of\nSciences 117 (March): 202000388. https://doi.org/10.1073/pnas.2000388117.",
+ "crumbs": [
+ "Welcome",
+ "3References"
+ ]
+ },
+ {
+ "objectID": "materials/mva-intro.html",
+ "href": "materials/mva-intro.html",
+ "title": "4 Background",
+ "section": "",
+ "text": "4.1 Principal component analysis\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.",
+ "crumbs": [
+ "Multivariate analysis",
+ "4Background"
+ ]
+ },
+ {
+ "objectID": "materials/mva-intro.html#k-means-clustering",
+ "href": "materials/mva-intro.html#k-means-clustering",
+ "title": "4 Background",
+ "section": "4.2 K-means clustering",
+ "text": "4.2 K-means clustering\nThis is a method for grouping observations into clusters. It groups data based on similarity and is an often-used unsupervised machine learning algorithm.\nIt groups the data into a fixed number of clusters (\\(k\\)) and the ultimate aim is to discover patterns in the data.",
+ "crumbs": [
+ "Multivariate analysis",
+ "4Background"
+ ]
+ },
+ {
+ "objectID": "materials/mva-intro.html#hierarchical-clustering",
+ "href": "materials/mva-intro.html#hierarchical-clustering",
+ "title": "4 Background",
+ "section": "4.3 Hierarchical clustering",
+ "text": "4.3 Hierarchical clustering\nA clustering method that tries to create a hierarchy across data, often displayed as a dendogram. This visualises the way the clusters are arranged.\n\n\n\n\n“Mastodon.” n.d. https://genomic.social/@BioInfoCambs.",
+ "crumbs": [
+ "Multivariate analysis",
+ "4Background"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html",
+ "href": "materials/mva-pca.html",
+ "title": "5 Principal component analysis",
+ "section": "",
+ "text": "5.1 Libraries and functions",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html#libraries-and-functions",
+ "href": "materials/mva-pca.html#libraries-and-functions",
+ "title": "5 Principal component analysis",
+ "section": "",
+ "text": "Click to expand\n\n\n\n\n\n\nRPython\n\n\n\n5.1.1 Libraries\n\n\n5.1.2 Functions\n\n\n\n\n5.1.3 Libraries\n\n\n5.1.4 Functions",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html#purpose-and-aim",
+ "href": "materials/mva-pca.html#purpose-and-aim",
+ "title": "5 Principal component analysis",
+ "section": "5.2 Purpose and aim",
+ "text": "5.2 Purpose and aim\nThis is a statistical technique for reducing the dimensionality of a data set. The technique aims to find a new set of variables for describing the data. These new variables are made from a weighted sum of the old variables. The weighting is chosen so that the new variables can be ranked in terms of importance in that the first new variable is chosen to account for as much variation in the data as possible. Then the second new variable is chosen to account for as much of the remaining variation in the data as possible, and so on until there are as many new variables as old variables.",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html#data",
+ "href": "materials/mva-pca.html#data",
+ "title": "5 Principal component analysis",
+ "section": "5.3 Data",
+ "text": "5.3 Data\nThe example in this section uses the following data set:\ndata/finches_hybridisation.csv\nThese data come from an analysis of gene flow across two finch species (Grant and Grant 2020). They are slightly adapted here for illustrative purposes (the changes are documented in materials/data_preparation.R if you are interested).\nLong story short: these data are part of a hybridisation study on finches at the Galapagos Islands. Here, we’ve only included the non-hybrid observations, but feel free to rerun the analysis with all the hybrids!\nFrom many years of studies, going back to Darwin, we now know that the beak shape of the finches are important in their chances for survival. Changes in beak shape have led to new species and this study explored how movement of genes from a rare immigrant species (Geospiza fuliginosa) occurred through two other species (G. fortis and G. scandens). The researchers recorded various morphological traits.",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html#load-and-visualise-the-data",
+ "href": "materials/mva-pca.html#load-and-visualise-the-data",
+ "title": "5 Principal component analysis",
+ "section": "5.4 Load and visualise the data",
+ "text": "5.4 Load and visualise the data\nFirst, we load the data:\n\nRPython\n\n\n\nfinches_hybrid <- read_csv(\"data/finches_hybridisation.csv\")\n\n\n\n\nfinches_hybrid_py = pd.read_csv(\"data/finches_hybridisation.csv\")\n\n\n\n\nNext, we visualise the data:\n\nRPython\n\n\n\nhead(finches_hybrid)\n\n# A tibble: 6 × 7\n category weight wing tarsus blength bdepth bwidth\n <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 fortis 15.8 67.1 19.6 10.3 8.95 8.32\n2 fortis 16 67 17.7 10.4 9.3 8.6 \n3 fortis 19.2 72.4 19.4 12.1 10.8 10.1 \n4 fortis 16.1 68 19.1 11.3 10 8.2 \n5 fortis 18.6 67.5 20.0 11.2 9.5 8.9 \n6 fortis 17.3 71.1 19.6 11.8 10.3 9.59\n\n\n\n\n\nfinches_hybrid_py.head()\n\n category weight wing tarsus blength bdepth bwidth\n0 fortis 15.84 67.13 19.55 10.30 8.95 8.32\n1 fortis 16.00 67.00 17.70 10.40 9.30 8.60\n2 fortis 19.25 72.38 19.38 12.10 10.85 10.13\n3 fortis 16.10 68.00 19.10 11.30 10.00 8.20\n4 fortis 18.60 67.50 19.95 11.15 9.50 8.90\n\n\n\n\n\nWe have 7 columns. We’re not going to recreate the published analysis exactly, but for the purpose of this section we will be looking if differences in the measured morphological traits can be attributed to specific categories.\nWe have 3 different category values:\n\nRPython\n\n\n\nfinches_hybrid %>% distinct(category)\n\n# A tibble: 3 × 1\n category \n <chr> \n1 fortis \n2 fuliginosa\n3 scandens \n\n\n\n\n\nfinches_hybrid_py['category'].unique()\n\narray(['fortis', 'fuliginosa', 'scandens'], dtype=object)\n\n\n\n\n\nWe have 6 continuous response variables, which we’re not going to visualise independently! However, you could look at some of the potential relationships within the data.\nFor example, looking at body weight (weight) against beak length (blength):\n\nRPython\n\n\n\nggplot(finches_hybrid, aes(x = blength, y = weight,\n colour = category)) +\n geom_point()\n\n\n\n\n\n\n\n\nor just weight across the different categories:\n\nggplot(finches_hybrid, aes(x = category, y = weight)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\n\n\n(ggplot(finches_hybrid_py,\n aes(x = \"blength\", y = \"weight\",\n colour = \"category\")) +\n geom_point())\n\n\n\n\n\n\n\n\nor just weight across the different categories:\n\n(ggplot(finches_hybrid_py,\n aes(x = \"category\", y = \"weight\")) +\n geom_boxplot())\n\n\n\n\n\n\n\n\n\n\n\nThe number of combinations are rather large, given the number of variables we have. Hence it’s a good idea to see if we can “condense” some of variables into new ones.\nWhat we’re doing with a PCA is trying to capture as much variance that exists in the data using a Principal Component (PC). The PC therefore explains some of the variance for each of the individual variables.\nI like to compare the PCs to a smoothie: a smoothie might consist of 4 parts apple, 3 parts strawberry, 2 parts blueberry and 1 part grape. Along the lines of that delicious metaphor, one of our Principal components may consist of 4 parts blength, 3 parts weight, 2 parts bdepth and 1 part wing. We don’t know this yet, so let’s go find out.",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html#performing-the-pca",
+ "href": "materials/mva-pca.html#performing-the-pca",
+ "title": "5 Principal component analysis",
+ "section": "5.5 Performing the PCA",
+ "text": "5.5 Performing the PCA\nTo perform a PCA, we need to keep a few things in mind:\n\nWe can only calculate principal components for numeric data\nThe numeric data need to be on the same scale\n\nThis last point of scaling is very important. Measurements can take place at different scales. You shouldn’t compare milimetres and kilogrammes directly, for example. Or seconds and height. That simply does not make sense.\nThis issue is even more prevalent when comparing gene expression data, for example. Genes can be active at varying levels, but some genes only need (an indeed do) change a tiny amount to elicit an effect, whereas other genes might have to change several fold-changes before something happens.\nTo compensate for this, we bring all of our data onto the same scale.\n\nRPython\n\n\nIn R we can scale our data with the scale() function. We perform the PCA using the prcomp() function. Here we store the output into an object called pca_fit, because we’ll be working with the output later on.\n\npca_fit <- finches_hybrid %>% \n # keep only the numeric columns\n select(where(is.numeric)) %>%\n # scale the data\n scale() %>%\n # perform the PCA\n prcomp()\n\nThis is a bit of daunting looking output, but not to worry. We’ll unpack things along the way!\n\npca_fit\n\nStandard deviations (1, .., p=6):\n[1] 1.9815712 1.1261455 0.5329781 0.4847379 0.4170304 0.3349943\n\nRotation (n x k) = (6 x 6):\n PC1 PC2 PC3 PC4 PC5 PC6\nweight 0.4585415 0.1513150 0.04230031 -0.44900961 0.74497043 -0.09199756\nwing 0.4333116 0.1946944 -0.84096840 0.20843413 -0.12117514 0.09475803\ntarsus 0.4174676 0.3293946 0.46092626 0.70300481 0.08141532 0.06263195\nblength 0.4338157 0.2986440 0.26067134 -0.49214612 -0.64184633 -0.02217340\nbdepth 0.3370530 -0.6256540 0.10181858 -0.03331599 -0.02041369 0.69502424\nbwidth 0.3548249 -0.5916637 0.01460313 0.13195474 -0.10641317 -0.70362223\n\n\n\n\nIn Python we can scale our data with the StandardScaler() function from sklearn.preprocessing. We can only scale numerical data, so we need to get those first.\n\nfrom sklearn.preprocessing import StandardScaler\n\n# select the numeric values\nX = finches_hybrid_py.select_dtypes(include = ['float64', 'int64'])\n\n# define the scaler\nstd_scaler = StandardScaler()\n\n# scale the numeric values\nfinches_scaled = std_scaler.fit_transform(X)\n\nNow that we have the scaled values, we can perform the PCA. We do this using the PCA() function from sklearn.decomposition.\nWe need to tell it how many principal components we want it to return. We set it to 6 here, which is the number of numerical variables.\n\nfrom sklearn.decomposition import PCA\n\n# define the number of principal components\nn_components = 6\n\n# set the number of principal components\npca = PCA(n_components = n_components)\n\n# perform the PCA\nprincipal_components = pca.fit_transform(finches_scaled)\n\n# create a data frame containing the information\n# changing the column names based on the PC\npca_fit_py = pd.DataFrame(data = principal_components, columns=[f'PC{i}' for i in range(1, n_components + 1)])\n\n\n\n\n\n5.5.1 Visualising the principal components\nWe can figure out how much each principal component is contributing to the amount of variance that is being explained. This is called a screeplot.\n\nRPython\n\n\n\npca_fit %>% \n tidy(matrix = \"eigenvalues\") %>% \n ggplot(aes(x = PC,\n y = percent)) +\n geom_point() +\n geom_line()\n\n\n\n\n\n\n\n\n\n\nFirst, we extract the amount of variance explained by each principal component. 
Next, we convert this to a DataFrame:\n\nexplained_variance_pc = pca.explained_variance_ratio_\n\npercent = (\n pd.DataFrame({'variance_explained':\n (explained_variance_pc * 100),\n 'PC': [f'PC{i+1}' for i in range(n_components)]})\n )\n \npercent.head()\n\n variance_explained PC\n0 65.443738 PC1\n1 21.136728 PC2\n2 4.734428 PC3\n3 3.916180 PC4\n4 2.898573 PC5\n\n\nNext, we can plot this:\n\n(ggplot(percent,\n aes(x = \"PC\", y = \"variance_explained\")) +\n geom_point() +\n geom_line(group = 1))\n\n\n\n\n\n\n\n\n\n\n\nThis means that PC1 is able to explain around 65% of the variance in the data. PC2 is able to explain around 21% of the variance in the data, and so forth.\n\n\n5.5.2 Loadings\nLet’s think back to our smoothy metaphor. Remember how the smoothy was made up of various fruits - just like our PCs are made up of parts of our original variables.\nLet’s, for the sake of illustrating this, assume the following for PC1:\n\n\n\nparts\nvariable\n\n\n\n\n4\nblength\n\n\n1\ntarsus\n\n\n\nEach PC has something called an eigenvector, which in simplest terms is a line with a certain direction and length.\nIf we want to calculate the length of the eigenvector for PC1, we can employ Pythagoras (well, not directly, just his legacy). This gives:\n\\(eigenvector \\, PC1 = \\sqrt{4^2 + 1^2} = 4.12\\)\nThe loading scores for PC1 are the “parts” scaled for this length, i.e.:\n\n\n\nscaled parts\nvariable\n\n\n\n\n4 / 4.12 = 0.97\nblength\n\n\n1 / 4.12 = 0.24\ntarsus\n\n\n\nWhat we can do with these values is plot the loadings for each of the original variables.\n\nRPython\n\n\nIt might be helpful to visualise this in context of the original data. We can easily add the original data to the fitted PCs as follows (and plot it):\n\npca_fit %>%\n # add the original data\n augment(finches_hybrid) %>%\n ggplot(aes(.fittedPC1, .fittedPC2, colour = category)) + \n geom_point(size = 1.5)\n\n\n\n\n\n\n\n\nThis gives us the individual contributions to PC1 and PC2 for each observation.\nIf we wanted to know how much each variable is contributing to PC1 and PC2 we would have to use the loadings.\nWe can extract all the loadings as follows:\n\npca_fit %>% \n tidy(matrix = \"loadings\")\n\n# A tibble: 36 × 3\n column PC value\n <chr> <dbl> <dbl>\n 1 weight 1 0.459 \n 2 weight 2 0.151 \n 3 weight 3 0.0423\n 4 weight 4 -0.449 \n 5 weight 5 0.745 \n 6 weight 6 -0.0920\n 7 wing 1 0.433 \n 8 wing 2 0.195 \n 9 wing 3 -0.841 \n10 wing 4 0.208 \n# ℹ 26 more rows\n\n\nWe’ll have to reformat this a little, so that we have the values in separate columns. 
First, we rename some of the columns / values to make things more consistent and clearer:\n\npca_loadings <- pca_fit %>% \n tidy(matrix = \"loadings\") %>% \n rename(terms = column,\n component = PC) %>% \n mutate(component = paste0(\"PC\", component))\n\nhead(pca_loadings)\n\n# A tibble: 6 × 3\n terms component value\n <chr> <chr> <dbl>\n1 weight PC1 0.459 \n2 weight PC2 0.151 \n3 weight PC3 0.0423\n4 weight PC4 -0.449 \n5 weight PC5 0.745 \n6 weight PC6 -0.0920\n\n\nNow we can reformat the data:\n\npca_loadings <- pca_loadings %>% \n pivot_wider(names_from = \"component\",\n values_from = \"value\")\n\npca_loadings\n\n# A tibble: 6 × 7\n terms PC1 PC2 PC3 PC4 PC5 PC6\n <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 weight 0.459 0.151 0.0423 -0.449 0.745 -0.0920\n2 wing 0.433 0.195 -0.841 0.208 -0.121 0.0948\n3 tarsus 0.417 0.329 0.461 0.703 0.0814 0.0626\n4 blength 0.434 0.299 0.261 -0.492 -0.642 -0.0222\n5 bdepth 0.337 -0.626 0.102 -0.0333 -0.0204 0.695 \n6 bwidth 0.355 -0.592 0.0146 0.132 -0.106 -0.704 \n\n\nWe can then plot this. This is a little bit involved, unfortunately. And not something I’d recommend remembering the code by heart, but we’re doing the following:\n\nTake the PCA output and add the original data\nPlot this\nCreate a line from the origin (x = 0, y = 0) for each loading\nMake it an arrow\nAdd the variable name as a label\n\nWe define the arrow as follows:\n\n# define arrow style\narrow_style <- arrow(length = unit(2, \"mm\"),\n type = \"closed\")\n\n\npca_fit %>%\n # add the original data\n augment(finches_hybrid) %>%\n ggplot() + \n geom_point(aes(.fittedPC1, .fittedPC2, colour = category), size = 1.5) +\n geom_segment(data = pca_loadings,\n aes(xend = PC1, yend = PC2),\n x = 0, y = 0,\n arrow = arrow_style) +\n geom_text(data = pca_loadings,\n aes(x = PC1, y = PC2, label = terms), \n hjust = 0, \n vjust = 1,\n size = 5) \n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAfter all that we end up with a rather unclear plot. The 6 variables that contribute to each principal component have very overlapping contributions in PC1. As such, it’s difficult to see which variable contributes what!\nThe reason why I’m still showing it here is that this kind of plot is very often used in PCA, so at least you can recognise it. If the variables have well-separated contributions across the two principal components, then it can be quite informative.\nA better way would be to plot the individual contributions to each principal component as an ordered bar plot. This does require some rejigging of the data again (sorry!).\n\nRPython\n\n\nFirst, we convert our loadings data back to a “long” format. We also add an extra column, direction, which indicates if the contribution to the principal component is positive or negative.\n\npca_loadings <- pca_loadings %>% \n pivot_longer(cols = -terms,\n names_to = \"component\",\n values_to = \"value\") %>% \n # add a column that indicates direction\n # (positive or negative)\n mutate(direction = ifelse(value < 0, \"positive\", \"negative\"))\n\n# have a look at the data\nhead(pca_loadings)\n\n# A tibble: 6 × 4\n terms component value direction\n <chr> <chr> <dbl> <chr> \n1 weight PC1 0.459 negative \n2 weight PC2 0.151 negative \n3 weight PC3 0.0423 negative \n4 weight PC4 -0.449 positive \n5 weight PC5 0.745 negative \n6 weight PC6 -0.0920 positive \n\n\nWe can now visualise this. Here we are using some additional functionality offered by the tidytext library. Make sure to install it, if needed. 
Then load it.\n\n# we need this library\nlibrary(tidytext)\n\npca_loadings %>% \n mutate(terms = tidytext::reorder_within(terms,\n abs(value),\n component)) %>%\n ggplot(aes(x = abs(value), y = terms, fill = direction)) +\n geom_col() +\n facet_wrap(vars(component), scales = \"free_y\") +\n tidytext::scale_y_reordered()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nImportant\n\n\n\nIt is important to keep the amount of variance explained by each principal component in mind. For example, PC3 only explains around 4.7 of the variance. So although several variables contribute substantially to PC3, the contribution of PC3 itself remains small.",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
+ {
+ "objectID": "materials/mva-pca.html#exercises",
+ "href": "materials/mva-pca.html#exercises",
+ "title": "5 Principal component analysis",
+ "section": "5.6 Exercises",
+ "text": "5.6 Exercises\n\n5.6.1 Penguins\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/penguins.csv. These data are from the palmerpenguins package (for more information, see the GitHub page).\nI would like you to do the following:\n\nLoad and visualise the data.\nCreate a screeplot and see how many PCs would be best.\nCalculate the loadings for PC1 and PC2 and visualise them.\nAny conclusions you might draw from the analysis.\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\n\nLoad and visualise the data\n\nRPython\n\n\n\npenguins <- read_csv(\"data/penguins.csv\")\n\nRows: 344 Columns: 8\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (3): species, island, sex\ndbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\n\n\nhead(penguins)\n\n# A tibble: 6 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n1 Adelie Torgersen 39.1 18.7 181 3750\n2 Adelie Torgersen 39.5 17.4 186 3800\n3 Adelie Torgersen 40.3 18 195 3250\n4 Adelie Torgersen NA NA NA NA\n5 Adelie Torgersen 36.7 19.3 193 3450\n6 Adelie Torgersen 39.3 20.6 190 3650\n# ℹ 2 more variables: sex <chr>, year <dbl>\n\n\nWe can see that there are different kinds of variables, both categorical and numerical. Also, there appear to be some missing data in the data set, so we probably have to deal with that.\nLastly, we should be careful with the year column: it is recognised as a numerical column (because it contains numbers), but we should view it as a factor, since the years have an ordered, categorical meaning.\nTo get a better sense of our data we could plot all the numerical variables against each other, to see if there is any possible correlation between them. Doing that one-by-one in ggplot is tedious, so I’m using the pairs() function here. Pretty? No. Effective? Yes.\n\npenguins %>% \n select(where(is.numeric)) %>% \n pairs()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSo we see that there is some possible grouping going on and possibly some linear relationships, too. However, there are multiple groups and closely related measurements, so it is not very surprising that there are possible relationships within the data.\n\n\nPerform the PCA\nFirst, we need to do a little bit of data tidying. We convert year to a factor and deal with the missing values. We’re not dealing with them in a particularly subtle way here, removing all the rows that contain at least one missing value.\nIn your own research you may want to be more careful and only remove missing values from variables that you using in the PCA (here we include everything).\n\nRPython\n\n\n\npenguins <- penguins %>% \n mutate(year = factor(year)) %>% \n drop_na() # remove all rows with missing values\n\n\npca_fit <- penguins %>% \n # keep only the numeric columns\n select(where(is.numeric)) %>%\n # scale the data\n scale() %>%\n # perform the PCA\n prcomp()\n\n\n\n\n\n\n\n\n\nVisualise the PCs\n\nRPython\n\n\nWe can visualise the principal components:\n\npca_fit %>% \n tidy(matrix = \"eigenvalues\") %>% \n ggplot(aes(x = PC,\n y = percent)) +\n geom_point() +\n geom_line()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIt looks like using the first two principal components is probably capturing the majority of the variance in the data. 
Combined they account for 90% of the variance.\n\n\nLoadings\nNext, we get the loadings: how much is each original variable contributing to the individual principal components?\n\nRPython\n\n\nWe start with some data tidying and wrangling, since we need the data in a “wide” format for the next plot to work:\n\npca_loadings <- pca_fit %>% \n tidy(matrix = \"loadings\") %>% \n rename(terms = column,\n component = PC) %>% \n mutate(component = paste0(\"PC\", component)) %>% \n pivot_wider(names_from = \"component\",\n values_from = \"value\")\n\nhead(pca_loadings)\n\n# A tibble: 4 × 5\n terms PC1 PC2 PC3 PC4\n <chr> <dbl> <dbl> <dbl> <dbl>\n1 bill_length_mm 0.454 -0.600 -0.642 0.145\n2 bill_depth_mm -0.399 -0.796 0.426 -0.160\n3 flipper_length_mm 0.577 -0.00579 0.236 -0.782\n4 body_mass_g 0.550 -0.0765 0.592 0.585\n\n\nArrow style:\n\n# define arrow style\narrow_style <- arrow(length = unit(2, \"mm\"),\n type = \"closed\")\n\n\npca_fit %>%\n # add the original data\n augment(penguins) %>%\n ggplot() + \n geom_point(aes(.fittedPC1, .fittedPC2, colour = species), size = 1.5) +\n geom_segment(data = pca_loadings,\n aes(xend = PC1, yend = PC2),\n x = 0, y = 0,\n arrow = arrow_style) +\n geom_text(data = pca_loadings,\n aes(x = PC1, y = PC2, label = terms), \n hjust = 0, \n vjust = 1,\n size = 5) \n\n\n\n\n\n\n\n\nWe can also visualise these contributions using a bar plot. We need the data in a slightly different format before we can do this:\n\npca_loadings <- pca_loadings %>% \n pivot_longer(cols = -terms,\n names_to = \"component\",\n values_to = \"value\") %>% \n # add a column that indicates direction\n # (positive or negative)\n mutate(direction = ifelse(value < 0, \"positive\", \"negative\"))\n\n# have a look at the data\nhead(pca_loadings)\n\n# A tibble: 6 × 4\n terms component value direction\n <chr> <chr> <dbl> <chr> \n1 bill_length_mm PC1 0.454 negative \n2 bill_length_mm PC2 -0.600 positive \n3 bill_length_mm PC3 -0.642 positive \n4 bill_length_mm PC4 0.145 negative \n5 bill_depth_mm PC1 -0.399 positive \n6 bill_depth_mm PC2 -0.796 positive \n\n\nBut after that, we can visualise it as follows:\n\n# we need this library\nlibrary(tidytext)\n\npca_loadings %>% \n mutate(terms = tidytext::reorder_within(terms,\n abs(value),\n component)) %>%\n ggplot(aes(x = abs(value), y = terms, fill = direction)) +\n geom_col() +\n facet_wrap(vars(component), scales = \"free_y\") +\n tidytext::scale_y_reordered()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nConclusions\n\nLoad the data.\nCreate a screeplot and see how many PCs would be best.\nCalculate the loadings for PC1 and PC2 and visualise them.\nAny conclusions you might draw from the analysis.\n\nTaken together, we can conclude/comment on a few things:\n\nEndlessly looking at pairwise comparisons between continuous variables probably becomes a bit tedious. An alternative would be to calculate correlations between the variables to get some insight into your data. In the end it depends on how many variables you have and how much you (want to) know about them.\nIn this case, I’d say that the first two principal components capture most of the variance in the data.\nThe largest contributing variables mostly differ between the first two principal components. 
The variables that make up PC1 are very similar in terms of contribution, whereas two variables more or less make up PC2 entirely.\n\nThe variables flipper_length_mm and body_mass_g contribute pretty much only to PC1 (they are horizontal in the loadings plot), whereas bill_length_mm contributes a reasonable amount to both PC1 and PC2.\nFrom the data itself we can see that there is clear separation between the three species. The Gentoo penguins separate even further from the other two species.",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
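The answer above only shows the R workflow; the Python tab did not survive in this extract. As a rough counterpart, here is a minimal Python sketch of the same steps (drop the missing values, scale, fit the PCA, inspect the explained variance and the loadings). The use of scikit-learn and the variable names are illustrative assumptions rather than the course's official solution, and the signs of the loadings may be flipped relative to prcomp().

# Minimal Python sketch of the PCA workflow above (illustrative, not the official solution)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

penguins_py = pd.read_csv("data/penguins.csv")

# keep the numeric measurement columns and drop rows with missing values
num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
penguins_num = penguins_py[num_cols].dropna()

# scale the data, then fit the PCA
penguins_scaled = StandardScaler().fit_transform(penguins_num)
pca = PCA()
scores = pca.fit_transform(penguins_scaled)

# proportion of variance per PC: these are the values you would put in a screeplot
print(pca.explained_variance_ratio_)

# loadings: one row per PC, one column per original variable
loadings = pd.DataFrame(pca.components_, columns = num_cols,
                        index = [f"PC{i + 1}" for i in range(len(num_cols))])
print(loadings.round(3))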
+ {
+ "objectID": "materials/mva-pca.html#summary",
+ "href": "materials/mva-pca.html#summary",
+ "title": "5 Principal component analysis",
+ "section": "5.7 Summary",
+ "text": "5.7 Summary\n\n\n\n\n\n\nKey points\n\n\n\n\nPCA allows you to reduce a large number of variables into fewer principal components.\nEach PC is made up of a combination of the original variables and captures as much of the variance within the data as possible.\nThe loadings tell you how much each original variable contributes to each PC.\nA screeplot is a graphical representation of the amount of variance explained by each PC.\n\n\n\n\n\n\n\nGrant, Peter, and B. Grant. 2020. “Triad Hybridization via a Conduit Species.” Proceedings of the National Academy of Sciences 117 (March): 202000388. https://doi.org/10.1073/pnas.2000388117.\n\n\n“Mastodon.” n.d. https://genomic.social/@BioInfoCambs.",
+ "crumbs": [
+ "Multivariate analysis",
+ "5Principal component analysis"
+ ]
+ },
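The "combination of the original variables" mentioned in the key points can be written out explicitly. Using the PC1 loadings from the penguins example above (rounded to three decimals), the score of an observation on the first principal component is

\[
\mathrm{PC1} = 0.454\, z_{\text{bill length}} - 0.399\, z_{\text{bill depth}} + 0.577\, z_{\text{flipper length}} + 0.550\, z_{\text{body mass}}
\]

where each \(z\) is the scaled (centred and standardised) version of the original variable.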
+ {
+ "objectID": "materials/mva-kmeans.html",
+ "href": "materials/mva-kmeans.html",
+ "title": "6 K-means clustering",
+ "section": "",
+ "text": "6.1 Libraries and functions",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
+ {
+ "objectID": "materials/mva-kmeans.html#libraries-and-functions",
+ "href": "materials/mva-kmeans.html#libraries-and-functions",
+ "title": "6 K-means clustering",
+ "section": "",
+ "text": "Click to expand\n\n\n\n\n\n\nRPython\n\n\n\n6.1.1 Libraries\n\n\n6.1.2 Functions\n\n\n\n\n6.1.3 Libraries\n\n\n6.1.4 Functions",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
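The Libraries and Functions listing for this chapter is collapsed in the extract above. Judging from the code used later in the chapter (which relies on tidyverse and broom on the R side), a plausible Python set-up would be the imports below; the exact list on the rendered course page may differ.

# Plausible Python imports for this chapter, inferred from the code that follows
import pandas as pd                                 # data wrangling
from plotnine import ggplot, aes, geom_point        # ggplot-style plotting
from sklearn.preprocessing import StandardScaler    # scaling the data
from sklearn.cluster import KMeans                  # k-means clustering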
+ {
+ "objectID": "materials/mva-kmeans.html#purpose-and-aim",
+ "href": "materials/mva-kmeans.html#purpose-and-aim",
+ "title": "6 K-means clustering",
+ "section": "6.2 Purpose and aim",
+ "text": "6.2 Purpose and aim\nThis is a method for grouping observations into clusters. It groups data based on similarity and is an often-used unsupervised machine learning algorithm.\nIt groups the data into a fixed number of clusters (\\(k\\)) and the ultimate aim is to discover patterns in the data.\nK-means clustering is an iterative process. It follows the following steps:\n\nSelect the number of clusters to identify (e.g. K = 3)\nCreate centroids\nPlace centroids randomly in your data\nAssign each data point to the closest centroid\nCalculate the centroid of each new cluster\nRepeat steps 4-5 until the clusters do not change",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
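To make the steps listed in section 6.2 concrete, here is a toy sketch of the assign-and-update loop itself, written in Python with made-up data. It is purely illustrative: in the rest of the chapter we use the ready-made kmeans() (R) and KMeans() (Python) implementations instead.

# Toy sketch of the k-means loop described above (illustrative only)
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size = (300, 2))   # made-up data: 300 points, 2 variables
k = 3                             # step 1: number of clusters to identify

# steps 2-3: create k centroids by placing them at randomly chosen data points
centroids = X[rng.choice(len(X), size = k, replace = False)]

for _ in range(100):
    # step 4: assign each data point to the closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
    labels = dists.argmin(axis = 1)
    # step 5: recalculate each centroid as the mean of its new cluster
    new_centroids = np.array([
        X[labels == j].mean(axis = 0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    # step 6: stop when the centroids (and therefore the clusters) no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)   # final centroid coordinates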
+ {
+ "objectID": "materials/mva-kmeans.html#data",
+ "href": "materials/mva-kmeans.html#data",
+ "title": "6 K-means clustering",
+ "section": "6.3 Data",
+ "text": "6.3 Data\nOnce again we’ll be using the data from data/penguins.csv. These data are from the palmerpenguins package (for more information, see the GitHub page).",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
+ {
+ "objectID": "materials/mva-kmeans.html#load-and-visualise-the-data",
+ "href": "materials/mva-kmeans.html#load-and-visualise-the-data",
+ "title": "6 K-means clustering",
+ "section": "6.4 Load and visualise the data",
+ "text": "6.4 Load and visualise the data\nIf you haven’t done so already, load the data.\n\nRPython\n\n\n\npenguins <- read_csv(\"data/penguins.csv\")\n\nRows: 344 Columns: 8\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (3): species, island, sex\ndbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\nhead(penguins)\n\n# A tibble: 6 × 8\n species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n1 Adelie Torgersen 39.1 18.7 181 3750\n2 Adelie Torgersen 39.5 17.4 186 3800\n3 Adelie Torgersen 40.3 18 195 3250\n4 Adelie Torgersen NA NA NA NA\n5 Adelie Torgersen 36.7 19.3 193 3450\n6 Adelie Torgersen 39.3 20.6 190 3650\n# ℹ 2 more variables: sex <chr>, year <dbl>\n\n\n\n\n\n# load the data\npenguins_py = pd.read_csv(\"data/penguins.csv\")\n\n# inspect the data\npenguins_py.head()\n\n species island bill_length_mm ... body_mass_g sex year\n0 Adelie Torgersen 39.1 ... 3750.0 male 2007\n1 Adelie Torgersen 39.5 ... 3800.0 female 2007\n2 Adelie Torgersen 40.3 ... 3250.0 female 2007\n3 Adelie Torgersen NaN ... NaN NaN 2007\n4 Adelie Torgersen 36.7 ... 3450.0 female 2007\n\n[5 rows x 8 columns]\n\n\n\n\n\nThere are a variety of variables in this data set. The following example focuses on the two variables bill_length_mm and bill_depth_mm across the various species recorded.,\nThese variables are distributed as follows:\n\nRPython\n\n\n\nggplot(penguins, aes(x = bill_depth_mm,\n y = bill_length_mm,\n colour = species)) +\n geom_point()\n\nWarning: Removed 2 rows containing missing values (`geom_point()`).\n\n\n\n\n\n\n\n\n\n\n\n\n(ggplot(penguins_py,\n aes(x = \"bill_depth_mm\",\n y = \"bill_length_mm\",\n colour = \"species\")) +\n geom_point())\n\n\n\n\n\n\n\n\n\n\n\nWe can already see that the data appear to cluster quite closely by species. A great example to illustrate K-means clustering (you’d almost think I chose the example on purpose)!",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
+ {
+ "objectID": "materials/mva-kmeans.html#perform-k-means-clustering",
+ "href": "materials/mva-kmeans.html#perform-k-means-clustering",
+ "title": "6 K-means clustering",
+ "section": "6.5 Perform K-means clustering",
+ "text": "6.5 Perform K-means clustering\nTo do the clustering, we’ll need to do a bit of data wrangling, since we can only do the clustering on numerical data.\nAs we did with the PCA, we also need to scale the data. Although it is not required in this case, because both variables have the same unit (millimetres), it is good practice. In other scenarios it could be that there are different units that are being compared, which could affect the clustering.\n\nRPython\n\n\nWe’re using the kmeans() function.\n\npenguins_scaled <- penguins %>% \n select(bill_depth_mm, # select data\n bill_length_mm) %>% \n drop_na() %>% # remove missing values\n scale() %>% # scale the data\n as_tibble() %>% \n rename(bill_depth_scaled = bill_depth_mm,\n bill_length_scaled = bill_length_mm)\n\nNext, we can perform the clustering:\n\nkclust <- kmeans(penguins_scaled, # perform k-means clustering\n centers = 3) # using 3 centers\n\nsummary(kclust) # summarise output\n\n Length Class Mode \ncluster 342 -none- numeric\ncenters 6 -none- numeric\ntotss 1 -none- numeric\nwithinss 3 -none- numeric\ntot.withinss 1 -none- numeric\nbetweenss 1 -none- numeric\nsize 3 -none- numeric\niter 1 -none- numeric\nifault 1 -none- numeric\n\n\nThe output is a list of vectors, with differing lengths. That’s because they contain different types of information:\n\ncluster contains information about each point\ncenters, withinss, and size contain information about each cluster\ntotss, tot.withinss, betweenss, and iter contain information about the full clustering\n\n\n\nTo do the clustering, we’ll be using the KMeans() function.\n\nfrom sklearn.cluster import KMeans\nfrom sklearn.preprocessing import StandardScaler\n\nstd_scaler = StandardScaler()\n\n# remove missing values\npenguins_scaled_py = penguins_py.dropna()\n# select relevant columns\npenguins_scaled_py = penguins_scaled_py[['bill_depth_mm', 'bill_length_mm']]\n\npenguins_scaled_py = std_scaler.fit_transform(penguins_scaled_py)\n\nkmeans = KMeans(\ninit = 'random',\nn_clusters = 3,\nn_init = 10,\nmax_iter = 300,\nrandom_state = 42\n)\n\nkmeans.fit(penguins_scaled_py)\n\nKMeans(init='random', n_clusters=3, random_state=42)\n\n\n\n\n\n\n6.5.1 Visualise clusters\nWhen we performed the clustering, the centers were calculated. These values give the (x, y) coordinates of the centroids.\n\nRPython\n\n\n\ntidy_clust <- tidy(kclust) # get centroid coordinates\n\ntidy_clust\n\n# A tibble: 3 × 5\n bill_depth_scaled bill_length_scaled size withinss cluster\n <dbl> <dbl> <int> <dbl> <fct> \n1 0.560 -0.943 153 88.0 1 \n2 -1.09 0.590 125 59.4 2 \n3 0.799 1.10 64 39.0 3 \n\n\n\n\n\n# calculate the cluster centers\nkclusts_py = kmeans.cluster_centers_\nkclusts_py = pd.DataFrame(kclusts_py, columns = ['0', '1'])\n\n# convert to Pandas DataFrame and rename columns\nkclusts_py = pd.DataFrame(kclusts_py)\n\nkclusts_py = (kclusts_py\n .rename(columns = {\"0\": \"bdepth_scaled\",\n \"1\": \"blength_scaled\"}))\n\n# and show the coordinates\nkclusts_py\n\n bdepth_scaled blength_scaled\n0 -1.098055 0.586444\n1 0.795036 1.088684\n2 0.553935 -0.950240\n\n\n\n\n\n\n\n\n\n\n\nInitial centroid placement\n\n\n\nThe initial centroids get randomly placed in the data. This, combined with the iterative nature of the process, means that the values that you will see are going to be slightly different from the values here. That’s normal!\n\n\n\nRPython\n\n\nNext, we want to visualise to which data points belong to which cluster. 
We can do that as follows:\n\nkclust %>% # take clustering data\n augment(penguins_scaled) %>% # combine with original data\n ggplot(aes(x = bill_depth_scaled, # plot the scaled data\n y = bill_length_scaled)) +\n geom_point(aes(colour = .cluster)) + # colour by classification\n geom_point(data = tidy_clust,\n size = 7, shape = 'x') # add the cluster centers\n\n\n\n\n\n\n\n\n\n\nWe reformat and rename the scaled data:\n\n# convert NumPy arrays to Pandas DataFrame\npenguins_scaled_py = pd.DataFrame(penguins_scaled_py)\n\npenguins_scaled_py = (penguins_scaled_py\n .rename(columns = {0: 'bdepth_scaled',\n 1: 'blength_scaled'}))\n\nand merge this with the original data:\n\n# remove missing values\npenguins_py = penguins_py.dropna()\n# add an ID column\npenguins_py['id'] = range(1, len(penguins_py) + 1)\n\n# add an ID column to the scaled data\n# so we can match the observations\npenguins_scaled_py['id'] = range(1, len(penguins_scaled_py) + 1)\n\n# merge the data by ID\npenguins_augment_py = (pd.merge(penguins_py.dropna(),\n penguins_scaled_py,\n on = 'id'))\n\n# add the cluster designation\n# (take the labels from the k-means fit on the scaled data above;\n# re-fitting on penguins_scaled_py here would wrongly include the id column)\npenguins_augment_py['cluster'] = kmeans.labels_\n\n# and convert it into a factor\npenguins_augment_py['cluster'] = (penguins_augment_py['cluster']\n .astype('category'))\n\nWe can then (finally!) plot this:\n\n(ggplot(penguins_augment_py,\n aes(x = 'bdepth_scaled',\n y = 'blength_scaled',\n colour = 'cluster')) +\n geom_point() +\n geom_point(data = kclusts_py, colour = \"black\",\n shape = \"x\", size = 7))",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
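The R tab above describes what the different components of the kmeans() output contain. The fitted scikit-learn KMeans object used in the Python tab exposes roughly equivalent attributes; the sketch below shows the mapping on a tiny made-up data set. Note that there is no direct attribute for the per-cluster withinss, or for totss and betweenss - you would compute those yourself if you needed them.

# Rough scikit-learn equivalents of the kmeans() output components (illustrative)
import numpy as np
from sklearn.cluster import KMeans

# tiny made-up example, standing in for the scaled penguin data
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1],
              [2.9, 3.2], [-3.0, 2.0], [-3.1, 2.2]])

kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 42).fit(X)

print(kmeans.labels_)            # like cluster: the cluster assigned to each point
print(kmeans.cluster_centers_)   # like centers: the centroid coordinates
print(kmeans.inertia_)           # like tot.withinss: total within-cluster sum of squares
print(kmeans.n_iter_)            # like iter: number of iterations used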
+ {
+ "objectID": "materials/mva-kmeans.html#optimising-cluster-number",
+ "href": "materials/mva-kmeans.html#optimising-cluster-number",
+ "title": "6 K-means clustering",
+ "section": "6.6 Optimising cluster number",
+ "text": "6.6 Optimising cluster number\nIn the example we set the number of clusters to 3. This made sense, because the data already visually separated in roughly three groups - one for each species.\nHowever, it might be that the cluster number to choose is a lot less obvious. In that case it would be helpful to explore clustering your data into a range of clusters.\nIn short, we determine which values of \\(k\\) we want to explore and then loop through these values, repeating the workflow we looked at previously.\n\nRPython\n\n\nReiterating over a range of \\(k\\) values is reasonably straightforward using tidyverse. Although we could write our own function to loop through these \\(k\\) values, tidyverse has a series of map() functions that can do this for you. More information on them here.\nIn short, the map() function spits out a list which contains the output. When we do this on our data, we can create a table that contains lists with all of the information that we need.\nHere we calculate the following:\n\nthe kclust column contains a list with all the kmeans() output, for each value of \\(k\\)\nthe tidied column contains the information on a per-cluster basis\nthe glanced column contains single-row summary for each \\(k\\) - we’ll use the tot.withinss values a little bit later on\nthe augmented column contains the original data, augmented with the classification that was calculated by the kmeans() function\n\n\nkclusts <- \n # check for k = 1 to 6\n tibble(k = 1:6) %>%\n mutate(\n # perform clustering for each k\n kclust = map(k, ~ kmeans(penguins_scaled, .x)),\n # summary at per-cluster level\n tidied = map(kclust, tidy),\n # get single-row summary\n glanced = map(kclust, glance),\n # add classification to data set\n augmented = map(kclust, augment, penguins_scaled))\n\nkclusts\n\n# A tibble: 6 × 5\n k kclust tidied glanced augmented \n <int> <list> <list> <list> <list> \n1 1 <kmeans> <tibble [1 × 5]> <tibble [1 × 4]> <tibble [342 × 3]>\n2 2 <kmeans> <tibble [2 × 5]> <tibble [1 × 4]> <tibble [342 × 3]>\n3 3 <kmeans> <tibble [3 × 5]> <tibble [1 × 4]> <tibble [342 × 3]>\n4 4 <kmeans> <tibble [4 × 5]> <tibble [1 × 4]> <tibble [342 × 3]>\n5 5 <kmeans> <tibble [5 × 5]> <tibble [1 × 4]> <tibble [342 × 3]>\n6 6 <kmeans> <tibble [6 × 5]> <tibble [1 × 4]> <tibble [342 × 3]>\n\n\nLists can sometimes be a bit tricky to get your head around, so it’s worthwhile exploring the output. RStudio is particularly useful for this, since you can just left-click on the object in your Environment panel and look.\nThe way I see lists in this context is as containers. We have one huge table kclusts that contains all of the information that we need. Each ‘cell’ in this table has a container with the relevant data. The kclust column is a list with kmeans objects (the output of our kmeans() for each of the \\(k\\) values), whereas the other columns are lists of tibbles (because the tidy(), glance() and augment() functions output a tibble with the information for each value of \\(k\\)).\nFor us to use the data in the lists, it makes sense to extract them on a column-by-column basis. We’re ignoring the kclust column, because we don’t need the actual kmeans() output any more.\nTo extract the data from the lists we use the unnest() function.\n\nclusters <- \n kclusts %>%\n unnest(cols = c(tidied))\n\nassignments <- \n kclusts %>% \n unnest(cols = c(augmented))\n\nclusterings <- \n kclusts %>%\n unnest(cols = c(glanced))\n\nNext, we can visualise some of the data. 
We’ll start by plotting the scaled data and colouring the data points based on the final cluster it has been assigned to by the kmeans() function.\nThe (augmented) data are in assignments. Have a look at the structure of the table.\nWe facet the data by \\(k\\), so we get a single panel for each value of \\(k\\).\nWe also add the calculated cluster centres, which are stored in clusters.\n\nggplot(assignments,\n aes(x = bill_depth_scaled, # plot data\n y = bill_length_scaled)) + \n geom_point(aes(color = .cluster), # colour by cluster\n alpha = 0.8) + # add transparency\n facet_wrap(vars(k)) + # facet for each k\n geom_point(data = clusters, # add centers\n size = 7,\n shape = \"x\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLooking at this plot shows what we already knew (if only things were this easy all the time!): three clusters is a pretty good choice for these data. Remember that you’re looking for clusters that are distinct, i.e. are separated from one another. For example, using k = 4 gives you four nice groups, but two of them are directly adjacent, suggesting that they would do equally well in a single cluster.\n\n6.6.1 Elbow plot\nVisualising the data like this can be helpful but at the same time it can also be a bit subjective (hoorah!). To find another subjective way of interpreting these clusters (remember, statistics isn’t this YES/NO magic mushroom and we should be comfortable wandering around in the murky grey areas of statistics by now), we can plot the total within-cluster variation for each value of k.\nIntuitively, if you keep adding clusters then the total amount of variation that can be explained by these clusters will increase. The most extreme case would be where each data point is its own cluster and we can then explain all of the variation in the data.\nOf course that is not a very sensible approach - hence us balancing the number of clusters against how much variation they can capture.\nA practical approach to this is creating an “elbow” plot where the cumulative amount of variation explained is plotted against the number of clusters.\n\nRPython\n\n\nThe output of the kmeans() function includes tot.withinss - this is the total within-cluster sum of squares.\n\nggplot(clusterings,\n aes(x = k, # for each k plot...\n y = tot.withinss)) + # total within variance\n geom_line() +\n geom_point() +\n scale_x_continuous(\n breaks = seq(1, 6, 1)) # set the x-axis breaks\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe can see that the total within-cluster sum of squares decreases as the number of clusters increases. We can also see that from k = 3 onwards the slope of the line becomes much shallower. This “elbow” or bending point is a useful gauge to find the optimum number of clusters.\nFrom the exploration we can see that three clusters are optimal in this scenario.",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
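The elbow-plot workflow above is only shown for R. A minimal Python sketch of the same idea follows: refit KMeans for a range of k values, collect the total within-cluster sum of squares (inertia_) and plot it against k. It assumes the penguins_scaled_py data frame created earlier (restricted to the two scaled columns), and uses matplotlib for brevity rather than the plotnine syntax used elsewhere in these materials.

# Python sketch of an elbow plot (assumes penguins_scaled_py from the earlier Python tab)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = penguins_scaled_py[["bdepth_scaled", "blength_scaled"]]

ks = range(1, 7)
inertias = [KMeans(n_clusters = k, n_init = 10, random_state = 42).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker = "o")
plt.xlabel("k (number of clusters)")
plt.ylabel("total within-cluster sum of squares")
plt.show()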
+ {
+ "objectID": "materials/mva-kmeans.html#exercises",
+ "href": "materials/mva-kmeans.html#exercises",
+ "title": "6 K-means clustering",
+ "section": "6.7 Exercises",
+ "text": "6.7 Exercises\n\n6.7.1 Finch beaks\n\n\n\n\n\n\nExercise\n\n\n\n\n\n\n\nLevel: \nFor this exercise we’ll be using the data from data/finch_beaks.csv.\nThe finches data has been adapted from the accompanying website to 40 years of evolution. Darwin’s finches on Daphne Major Island by Peter R. Grant and Rosemary B. Grant.\nPractice running through the clustering workflow using the finches dataset. Try doing the following:\n\nRead in the data\nExplore and visualise the data\nPerform the clustering with k = 2\nFind out if using k = 2 is a reasonable choice\nTry and draw some conclusions\n\n\n\n\n\n\n\nAnswer\n\n\n\n\n\n\n\n\nLoad and visualise\n\nRPython\n\n\n\nfinches <- read_csv(\"data/finch_beaks.csv\")\n\nhead(finches)\n\n# A tibble: 6 × 5\n band species beak_length_mm beak_depth_mm year\n <dbl> <chr> <dbl> <dbl> <dbl>\n1 2 fortis 9.4 8 1975\n2 9 fortis 9.2 8.3 1975\n3 12 fortis 9.5 7.5 1975\n4 15 fortis 9.5 8 1975\n5 305 fortis 11.5 9.9 1975\n6 307 fortis 11.1 8.6 1975\n\n\n\nggplot(finches, aes(x = beak_depth_mm,\n y = beak_length_mm,\n colour = species)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nClustering\nNext, we perform the clustering. We first clean and scale the data.\n\nRPython\n\n\n\nfinches_scaled <- finches %>% \n select(beak_depth_mm, # select data\n beak_length_mm) %>% \n drop_na() %>% # remove missing values\n scale() %>% # scale the data\n as_tibble() %>% \n rename(bdepth_scaled = beak_depth_mm,\n blength_scaled = beak_length_mm)\n\n\nkclust <-\n kmeans(finches_scaled, # perform k-means clustering\n centers = 2) # using 2 centers\n\nsummary(kclust) # summarise output\n\n Length Class Mode \ncluster 651 -none- numeric\ncenters 4 -none- numeric\ntotss 1 -none- numeric\nwithinss 2 -none- numeric\ntot.withinss 1 -none- numeric\nbetweenss 1 -none- numeric\nsize 2 -none- numeric\niter 1 -none- numeric\nifault 1 -none- numeric\n\n\n\ntidy_clust <- tidy(kclust) # get centroid coordinates\n\ntidy_clust\n\n# A tibble: 2 × 5\n bdepth_scaled blength_scaled size withinss cluster\n <dbl> <dbl> <int> <dbl> <fct> \n1 0.618 0.676 339 486. 1 \n2 -0.672 -0.734 312 220. 2 \n\n\n\n\n\n\n\n\n\n\nVisualise the clusters\nWe can visualise the clusters as follows:\n\nRPython\n\n\n\nkclust %>% # take clustering data\n augment(finches_scaled) %>% # combine with original data\n ggplot(aes(x = bdepth_scaled, # plot the original data\n y = blength_scaled)) +\n geom_point(aes(colour = .cluster)) + # colour by classification\n geom_point(data = tidy_clust,\n size = 7, shape = 'x') # add the cluster centers\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nOptimise clusters\nIt looks like two clusters is a reasonable choice. 
But let’s explore this a bit more.\n\nRPython\n\n\n\nkclusts <- \n # check for k = 1 to 6\n tibble(k = 1:6) %>%\n mutate(\n # perform clustering for each k\n kclust = map(k, ~ kmeans(finches_scaled, .x)),\n # summary at per-cluster level\n tidied = map(kclust, tidy),\n # get single-row summary\n glanced = map(kclust, glance),\n # add classification to data set\n augmented = map(kclust, augment, finches_scaled)\n )\n\nExtract the relevant data.\n\nclusters <- \n kclusts %>%\n unnest(cols = c(tidied))\n\nassignments <- \n kclusts %>% \n unnest(cols = c(augmented))\n\nclusterings <- \n kclusts %>%\n unnest(cols = c(glanced))\n\nVisualise the result.\n\nggplot(assignments,\n aes(x = bdepth_scaled, # plot data\n y = blength_scaled)) + \n geom_point(aes(color = .cluster), # colour by cluster\n alpha = 0.8) + # add transparency\n facet_wrap(~ k) + # facet for each k\n geom_point(data = clusters, # add centers\n size = 7,\n shape = 'x')\n\n\n\n\n\n\n\n\nCreate an elbow plot to have a closer look.\n\nggplot(clusterings,\n aes(x = k, # for each k plot...\n y = tot.withinss)) + # total within variance\n geom_line() +\n geom_point() +\n scale_x_continuous(\n breaks = seq(1, 6, 1)) # set the x-axis breaks\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nConclusions\nOur initial clustering was done using two clusters, basically capturing the two different finch species.\nRedoing the analysis with different numbers of clusters seems to reasonably support that decision. The elbow plot suggests that k = 3 would not be such a terrible idea either.\nIn the example above we used data that were collected at two different time points: 1975 and 2012.\nIn the analysis we’ve kept these data together. However, the original premise of these data was to see if there is any indication of evolution going on in these species of finches. Think about how you would approach this question!",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ },
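As with the worked example, the exercise answer above is R-only. A rough Python counterpart for the k = 2 clustering is sketched below; the file path and column names come from the exercise itself, while the use of scikit-learn and the variable names are illustrative.

# Python sketch of the finches clustering (illustrative)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

finches = pd.read_csv("data/finch_beaks.csv")

# select and scale the two beak measurements
beaks = finches[["beak_depth_mm", "beak_length_mm"]].dropna()
beaks_scaled = StandardScaler().fit_transform(beaks)

# cluster with k = 2
kmeans = KMeans(n_clusters = 2, n_init = 10, random_state = 42).fit(beaks_scaled)

print(kmeans.cluster_centers_)                      # centroid coordinates (scaled units)
print(pd.Series(kmeans.labels_).value_counts())     # cluster sizes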
+ {
+ "objectID": "materials/mva-kmeans.html#summary",
+ "href": "materials/mva-kmeans.html#summary",
+ "title": "6 K-means clustering",
+ "section": "6.8 Summary",
+ "text": "6.8 Summary\n\n\n\n\n\n\nKey points\n\n\n\n\nk-means clustering partitions data into clusters\nthe k defines the number of clusters\ncluster centers or centroids get assigned randomly\neach data point gets assigned to the closest centroid\nthe centroid of the new clusters gets calculated and the process of assignment and recalculation repeats until the cluster do no longer change\nthe optimal number of clusters can be determined with an ‘elbow’ plot\n\n\n\n\n\n\n\n“Mastodon.” n.d. https://genomic.social/@BioInfoCambs.",
+ "crumbs": [
+ "Multivariate analysis",
+ "6K-means clustering"
+ ]
+ }
+]
\ No newline at end of file
diff --git a/setup.html b/setup.html
new file mode 100644
index 0000000..4c6885e
--- /dev/null
+++ b/setup.html
@@ -0,0 +1,868 @@
+Data & Setup
Go to the R installation folder and look at the instructions for your distribution.
+
Download the RStudio installer for your distribution and install it using your package manager.
+
+
+
+
+
+
Python
+
For this course we’ll be using Visual Studio Code. This provides support for various programming languages (including Python and R). It works on Windows, MacOS and Linux. It’s also open-source and free.
+
Please refer to the installation instructions and verify that Python code will run.
+
A brief sequence of events:
+
+
Install Visual Studio Code
+
Install the VS Code Python extension
+
Install a Python interpreter
+
+
Windows: install from Python.org or use the Microsoft Store
+
MacOS: install the Homebrew package manager, then use this to install Python
+
Linux: comes with Python 3, but needs pip to install additional packages