forked from hemberg-lab/scRNA.seq.course
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-L1-process-raw.Rmd
410 lines (322 loc) · 17.7 KB
/
05-L1-process-raw.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
---
output: html_document
---
## File formats
### FastQ
FastQ is the most raw form of scRNASeq data you will encounter. All scRNASeq
protocols are sequenced with paired-end sequencing. Barcode sequences may occur
in one or both reads depending on the protocol employed. However, protocols
using unique molecular identifiers (UMIs) will generally contain one read with
the cell and UMI barcodes plus adapters but without any transcript sequence.
Thus reads will be mapped as if they are single-end sequenced despite actually
being paired end.
FastQ files have the format:
```{eval=FALSE}
>ReadID
READ SEQUENCE
+
SEQUENCING QUALITY SCORES
```
### BAM
BAM file format stores mapped reads in a standard and efficient manner. The
human-readable version is called a SAM file, while the BAM file is the highly
compressed version. BAM/SAM files contain a header which typically includes
information on the sample preparation, sequencing and mapping; and a tab-separated row for each individual alignment of each read.
Alignment rows employ a standard format with the following columns:
(1) QNAME : read name (generally will include UMI barcode if applicable)
(2) FLAG : number tag indicating the "type" of alignment, [link](https://broadinstitute.github.io/picard/explain-flags.html) to explanation of all possible "types"
(3) RNAME : reference sequence name (i.e. chromosome read is mapped to).
(4) POS : leftmost mapping position
(5) MAPQ : Mapping quality
(6) CIGAR : string indicating the matching/mismatching parts of the read (may include soft-clipping).
(7) RNEXT : reference name of the mate/next read
(8) PNEXT : POS for mate/next read
(9) TLEN : Template length (length of reference region the read is mapped to)
(10) SEQ : read sequence
(11) QUAL : read quality
BAM/SAM files can be converted to the other format using 'samtools':
```{bash, eval=FALSE}
samtools view -S -b file.sam > file.bam
samtools view -h file.bam > file.sam
```
Some sequencing facilities will automatically map your reads to the a standard
genome and deliver either BAM or CRAM formatted files. Generally they will not
have included ERCC sequences in the genome thus no ERCC reads will be mapped in
the BAM/CRAM file. To quantify ERCCs (or any other genetic alterations) or if
you just want to use a different alignment algorithm than whatever is in the
generic pipeline (often outdated), then you will need to convert the BAM/CRAM
files back to FastQs:
BAM files can be converted to FastQ using bedtools. To ensure a single copy for
multi-mapping reads first sort by read name and remove secondary alignments
using samtools. [Picard](https://broadinstitute.github.io/picard/index.html)
also contains a method for converting BAM to FastQ files.
```{bash, eval=FALSE}
# sort reads by name
samtools sort -n original.bam -o sorted_by_name.bam
# remove secondary alignments
samtools view -b -F 256 sorted_by_name.bam -o primary_alignment_only.bam
# convert to fastq
bedtools bamtofastq -i primary_alignment_only.bam -fq read1.fq -fq2 read2.fq
```
### CRAM
[CRAM](https://www.ebi.ac.uk/ena/software/cram-usage) files are similar to BAM files only they contain information in the header
to the reference genome used in the mapping in the header. This allow the bases
in each read that are identical to the reference to be further compressed. CRAM
also supports some lossy data compression approaches to further optimize storage
compared to BAMs. CRAMs are mainly used by the Sanger/EBI sequencing facility.
CRAM and BAM files can be interchanged using the lastest version of samtools (>=v1.0).
However, this conversion may require downloading the reference genome into cache.
Alternatively, you may pre-download the correct reference either from metadata in the header
of the CRAM file, or from talking to whomever generated the CRAM and specify that file using '-T'
Thus we recommend setting a specific cache location prior to doing this:
```{bash, eval=FALSE}
export REF_CACHE=/path_to/cache_directory_for_reference_genome
samtools view -b -h -T reference_genome.fasta file.cram -o file.bam
samtools view -C -h -T reference_genome.fasta file.bam -o file.cram
```
### Mannually Inspecting files
At times it may be useful to mannual inspect files for example to check the metadata in headers that the files
are from the correct sample. 'less' and 'more' can be used to inspect any text files from the command line.
By "pipe-ing" the output of samtools view into these commands using '|' we check each of these file types without having to save
multiple copies of each file.
```{bash, eval=FALSE}
less file.txt
more file.txt
# counts the number of lines in file.txt
wc -l file.txt
samtools view -h file.[cram/bam] | more
# counts the number of lines in the samtools output
samtools view -h file.[cram/bam] | wc -l
```
__Exercises__
You have been provided with a small cram file: EXAMPLE.cram
Task 1: How was this file aligned? What software was used? What was used as the genome? (Hint: check the header)
Task 2: How many reads are unmapped/mapped? How total reads are there? How many secondary alignments are present? (Hint: use the FLAG)
Task 3: Convert the CRAM into two Fastq files. Did you get exactly one copy of each read? (name these files "10cells_read1.fastq" "10cells_read2.fastq")
If you get stuck help information for each piece of software can be displayed
by entering running the command "naked" - e.g. 'samtools view', 'bedtools'
__Answer__
```{bash, eval=TRUE, include=FALSE}
samtools view -T 2000_reference.transcripts.fa -H EXAMPLE.cram | more
samtools view -T 2000_reference.transcripts.fa -f 4 EXAMPLE.cram | wc -l # unmapped
samtools view -T 2000_reference.transcripts.fa -F 260 EXAMPLE.cram | wc -l # mapped
samtools view -T 2000_reference.transcripts.fa -F 256 EXAMPLE.cram | wc -l # total
samtools view -T 2000_reference.transcripts.fa -f 256 EXAMPLE.cram | wc -l # secondary alignments
samtools view -b -h -T 2000_reference.transcripts.fa EXAMPLE.cram -o EXAMPLE.bam
samtools sort -n EXAMPLE.bam -o sorted_EXAMPLE.bam
samtools view -b -F 256 sorted_EXAMPLE.bam -o primary_EXAMPLE.bam
# convert to fastq
bedtools bamtofastq -i primary_EXAMPLE.bam -fq 10cells_read1.fq -fq2 10cells_read2.fq
```
### Genome (FASTA, GTF)
To map your reads you will also need the reference genome and in many cases
the genome annotation file (in either GTF or GFF format). These can be
downloaded for model organisms from any of the main genomics databases:
[Ensembl](http://www.ensembl.org/info/data/ftp/index.html),
[NCBI](ftp://ftp.ncbi.nih.gov/genomes/), or [UCSC Genome Browser](http://hgdownload.soe.ucsc.edu/downloads.html).
GTF files contain annotations of genes, transcripts, and exons. They must contain:
(1) seqname : chromosome/scaffold
(2) source : where this annotation came from
(3) feature : what kind of feature is this? (e.g. gene, transcript, exon)
(4) start : start position (bp)
(5) end : end position (bp)
(6) score : a number
(7) strand : + (forward) or - (reverse)
(8) frame : if CDS indicates which base is the first base of the first codon (0 = first base, 1 = second base, etc..)
(9) attribute : semicolon-separated list of tag-value pairs of extra information (e.g. names/IDs, biotype)
Empty fields are marked with "."
In our experience Ensembl is the easiest of these to use, and has the largest
set of annotations. NCBI tends to be more strict in including only high
confidence gene annotations. Whereas UCSC contains multiple geneset annotations
that use different criteria.
If you experimental system includes non-standard sequences these must be added
to both the genome fasta and gtf to quantify their expression. Most commonly
this is done for the ERCC spike-ins, although the same must be done for CRISPR-
related sequences or other overexpression/reporter constructs.
For maximum utility/flexibility we recommend creating complete and detailed entries
for any non-standard sequences added.
There is no standardized way to do this. So below is our custom perl script for creating a gtf and fasta file for ERCCs which can be appended to the genome. You may also need to alter a gtf file to deal with repetitive elements in introns when/if you want to quantify intronic reads. Any scripting language or even 'awk' and/or some text editors can be used to do this relatively efficiently, but they are beyond the scope of this course.
```{bash, eval=FALSE}
# Converts the Annotation file from
# https://www.thermofisher.com/order/catalog/product/4456740 to
# gtf and fasta files that can be added to existing genome fasta & gtf files.
my @FASTAlines = ();
my @GTFlines = ();
open (my $ifh, "ERCC_Controls_Annotation.txt") or die $!;
<$ifh>; #header
while (<$ifh>) {
# Do all the important stuff
chomp;
my @record = split(/\t/);
my $sequence = $record[4];
$sequence =~ s/\s+//g; # get rid of any preceeding/tailing white space
$sequence = $sequence."NNNN";
my $name = $record[0];
my $genbank = $record[1];
push(@FASTAlines, ">$name\n$sequence\n");
# is GTF 1 indexed or 0 indexed? -> it is 1 indexed
# + or - strand?
push(@GTFlines, "$name\tERCC\tgene\t1\t".(length($sequence)-2)."\t.\t+\t.\tgene_id \"$name-$genbank\"; transcript_id \"$name-$genbank\"; exon_number \"1\"; gene_name \"ERCC $name-$genbank\"\n");
push(@GTFlines, "$name\tERCC\ttranscript\t1\t".(length($sequence)-2)."\t.\t+\t.\tgene_id \"$name-$genbank\"; transcript_id \"$name-$genbank\"; exon_number \"1\"; gene_name \"ERCC $name-$genbank\"\n");
push(@GTFlines, "$name\tERCC\texon\t1\t".(length($sequence)-2)."\t.\t+\t.\tgene_id \"$name-$genbank\"; transcript_id \"$name-$genbank\"; exon_number \"1\"; gene_name \"ERCC $name-$genbank\"\n");
} close($ifh);
# Write output
open(my $ofh, ">", "ERCC_Controls.fa") or die $!;
foreach my $line (@FASTAlines) {
print $ofh $line;
} close ($ofh);
open($ofh, ">", "ERCC_Controls.gtf") or die $!;
foreach my $line (@GTFlines) {
print $ofh $line;
} close ($ofh);
```
## Demultiplexing
Demultiplexing is done differently depending on the protocol used and the particular pipeline you are using a full pipeline. The most
flexible demultiplexing pipeline we are aware of is [zUMIs](https://github.com/sdparekh/zUMIs/wiki/Usage) which can be used to demultiplex and
map most UMI-based protocols. For Smartseq2 or other paired-end full transcript protocols the data will usually already be demultiplexed.
Public repositories such as GEO or ArrayExpress require data small-scale/plate-based scRNASeq data to be demultiplexed prior to upload, and many
sequencing facilities will automatically demultiplex data before returning it to you. If you aren't using a published pipeline and the data was
not demultiplexed by the sequencing facility you will have to demultiplex it yourself. This usually requires writing a custom script since barcodes
may be of different lengths and different locations in the reads depending on the protocols used.
For all data-type "demultiplexing" involves identifying and removing the cell-barcode sequence from one or both reads. If the expected
cell-barcodes are known ahead of time, i.e. the data is from a PCR-plate-based protocol, all that is necessarily is to compare each cell-barcode to
the expected barcodes and assign the associated reads to the closest cell-barcode (with maximum mismatches of 1 or 2 depending on the design of the cell-barcodes). These data are generally demultiplexed prior to mapping as an easy way of parallelizing the mapping step.
We have [publicly available](https://github.com/tallulandrews/scRNASeqPipeline) perl scripts capable of demultiplexing any scRNASeq data with a single cell-barcode with or without UMIs for plate-based protocols. These can be used as so:
```{bash, eval=TRUE}
perl 1_Flexible_UMI_Demultiplexing.pl 10cells_read1.fq 10cells_read2.fq "C12U8" 10cells_barcodes.txt 2 Ex
```
```{bash, eval=TRUE}
perl 1_Flexible_FullTranscript_Demultiplexing.pl 10cells_read1.fq 10cells_read2.fq "start" 12 10cells_barcodes.txt 2 Ex
```
For UMI containing data, demultiplexing includes attaching the UMI code to the read name of the gene-body containing read. If the data are from a
droplet-based protocol or SeqWell where the number of expected barcodes is much higher than the expected number of cell, then usually the cell-barcode will also be attached to the read name to avoid generating a very large number of files. In these cases, demultiplexing will happen during the
quantification step to facilitate the identification of cell-barcodes which correspond to intact cells rather than background noise.
### Identifying cell-containing droplets/microwells
For droplet based methods only a fraction of droplets contain both beads
and an intact cell. However, biology experiments are messy and some RNA will
leak out of dead/damaged cells. So droplets without an intact cell are likely to capture a small amount of the ambient RNA which will end up in the sequencing
library and contribute a reads to the final sequencing output. The variation in
droplet size, amplification efficiency, and sequencing will lead both
"background" and real cells to have a wide range of library sizes. Various
approaches have been used to try to distinguish those cell barcodes which
correspond to real cells.
Most methods use the total molecules (could be applied to total reads) per
barcode and try to find a "break point" between bigger libraries which are
cells + some background and smaller libraries assumed to be purely background.
Let's load some example simulated data which contain both large and small cells:
```{r}
umi_per_barcode <- read.table("droplet_id_example_per_barcode.txt.gz")
truth <- read.delim("droplet_id_example_truth.gz", sep=",")
```
__Exercise__
How many unique barcodes were detected?
How many true cells are present in the data?
To simplify calculations for this section exclude all barcodes
with fewer than 10 total molecules.
__Answer__
```{r, include=FALSE}
dim(umi_per_barcode)[1]
dim(truth)[1]
umi_per_barcode <- umi_per_barcode[umi_per_barcode[,2] > 10,]
```
One approach is to look for the inflection point where the total molecules per
barcode suddenly drops:
```{r}
barcode_rank <- rank(-umi_per_barcode[,2])
plot(barcode_rank, umi_per_barcode[,2], xlim=c(1,8000))
```
Here we can see an roughly exponential curve of library sizes, so to make
things simpler lets log-transform them.
```{r}
log_lib_size <- log10(umi_per_barcode[,2])
plot(barcode_rank, log_lib_size, xlim=c(1,8000))
```
That's better, the "knee" in the distribution is much more pronounced. We
could manually estimate where the "knee" is but it much more reproducible to
algorithmically identify this point.
```{r}
# inflection point
o <- order(barcode_rank)
log_lib_size <- log_lib_size[o]
barcode_rank <- barcode_rank[o]
rawdiff <- diff(log_lib_size)/diff(barcode_rank)
inflection <- which(rawdiff == min(rawdiff[100:length(rawdiff)], na.rm=TRUE))
plot(barcode_rank, log_lib_size, xlim=c(1,8000))
abline(v=inflection, col="red", lwd=2)
threshold <- 10^log_lib_size[inflection]
cells <- umi_per_barcode[umi_per_barcode[,2] > threshold,1]
TPR <- sum(cells %in% truth[,1])/length(cells)
Recall <- sum(cells %in% truth[,1])/length(truth[,1])
c(TPR, Recall)
```
Another is to fix a mixture model and find where the higher and lower distributions intersect. However, data may not fit the assumed distributions very well:
```{r}
set.seed(-92497)
# mixture model
require("mixtools")
mix <- normalmixEM(log_lib_size)
plot(mix, which=2, xlab2="log(mol per cell)")
p1 <- dnorm(log_lib_size, mean=mix$mu[1], sd=mix$sigma[1])
p2 <- dnorm(log_lib_size, mean=mix$mu[2], sd=mix$sigma[2])
if (mix$mu[1] < mix$mu[2]) {
split <- min(log_lib_size[p2 > p1])
} else {
split <- min(log_lib_size[p1 > p2])
}
```
__Exercise__
Identify cells using this split point and calculate the TPR and Recall.
__Answer__
```{r, include=FALSE}
cells <- umi_per_barcode[umi_per_barcode[,2] > 10^split,1]
TPR <- sum(cells %in% truth[,1])/length(cells)
Recall <- sum(cells %in% truth[,1])/length(truth[,1])
c(TPR, Recall)
```
A third, used by CellRanger, assumes a ~10-fold range of library sizes for real
cells and estimates this range using the expected number of cells.
```{r}
n_cells <- length(truth[,1])
# CellRanger
totals <- umi_per_barcode[,2]
totals <- sort(totals, decreasing = TRUE)
# 99th percentile of top n_cells divided by 10
thresh = totals[round(0.01*n_cells)]/10
plot(totals, xlim=c(1,8000))
abline(h=thresh, col="red", lwd=2)
```
__Exercise__
Identify cells using this threshodl and calculate the TPR and Recall.
__Answer__
```{r, include=FALSE}
cells <- umi_per_barcode[umi_per_barcode[,2] > thresh,1]
TPR <- sum(cells %in% truth[,1])/length(cells)
Recall <- sum(cells %in% truth[,1])/length(truth[,1])
c(TPR, Recall)
```
Finally (EmptyDrops)[https://github.com/MarioniLab/DropletUtils], which is currently in beta testing, uses the full genes x
cells molecule count matrix for all droplets and estimates the profile of
"background" RNA from those droplets with extremely low counts, then looks for
cells with gene-expression profiles which differ from the background. This is
combined with an inflection point method since background RNA often looks very
similar to the expression profile of the largests cells in a population. As
such EmptyDrops is the only method able to identify barcodes for very small
cells in highly diverse samples.
Below we have provided code for how this method is currently run:
(We will update this page when the method is officially released)
```{r, eval=FALSE}
require("Matrix")
raw.counts <- readRDS("droplet_id_example.rds")
require("DropletUtils")
# emptyDrops
set.seed(100)
e.out <- emptyDrops(my.counts)
is.cell <- e.out$FDR <= 0.01
sum(is.cell, na.rm=TRUE)
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
xlab="Total UMI count", ylab="-Log Probability")
cells <- colnames(raw.counts)[is.cell]
TPR <- sum(cells %in% truth[,1])/length(cells)
Recall <- sum(cells %in% truth[,1])/length(truth[,1])
c(TPR, Recall)
```