Question

Bulk RNA-Seq pipeline suggestion incorporating UMIs

3

4.9 years ago

steveh ▴ 70

Hi,

I have 149 bulk RNA-Seq samples (100 bp, paired-end, Illumina) which have come from sequencing in the form of fastq triplets, i.e. pairs of reads plus a third fastq which contains only UMIs.

My first question is - do I need to use the UMIs at all or just ignore them?

So far I've ignored them, and used this workflow (on just 10 samples to begin with):

FastQC on raw reads
Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
Produce counts using featureCounts
MultiQC on results produced so far
Analyse using DESeq2

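This works, but ignores the UMIs completely. Results from MultiQC show, from STAR:

[image not available: MultiQC plot of STAR results]

and from featureCounts:

[image not available: MultiQC plot of featureCounts results]

(note the fairly large percentages of unassigned multimapping reads there).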

Alternatively, I've tried to incorporate the UMIs with this changed workflow:

FastQC on raw reads
Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
Add the UMIs from the fastq files to the BAMs produced by STAR, using fgbio’s AnnotateBamWithUmis

but I'm getting lost down a rabbit hole now, adding more and more steps to this pipeline just to satisfy various errors I'm getting from downstream tools, e.g.

fgbio SortBam
fgbio SetMateInformation
fgbio GroupReadsByUmi
fgbio CallMolecularConsensusReads
samtools reheader, to add SM tag to BAMs
fgbio FilterConsensusReads (results in vastly reduced BAM file sizes)

For the moment I've stopped here. Maybe I can use these BAM files, but this workflow is starting to feel over-complicated and I'm not confident it's the correct way to go.

So to summarise:

Do I need to incorporate the UMIs at all?
If so, could anybody suggest a workflow?

Many thanks, Steve

RNA-Seq bulk UMI • 3.7k views

ADD COMMENT • link updated 4.9 years ago by swbarnes2 14k • written 4.9 years ago by steveh ▴ 70

0

Cannot see the images.

ADD REPLY • link 4.9 years ago by MatthewP ★ 1.4k

0

apologies, corrected now

ADD REPLY • link 4.9 years ago by steveh ▴ 70

0

Have you tried to de-duplicate reads with umi_tools, using the UMIs alone or in combination with read alignment starts?

ADD REPLY • link 4.9 years ago by GenoMax 147k

score 5 · Answer 1 · 2019-12-24

5

4.9 years ago

i.sudbery 20k

Here is what I would recommend with umi-tools.

Extract the UMIs from the fastqs before mapping. You'll need to do this once for each of the non-UMI reads.

umi_tools extract --bc-pattern=NNNNNNNNNN -I umi_reads.fastq.gz --read2-in=reads_R1.fastq.gz --read2-stdout | gzip > reads_R1.extracted.fastq.gz
umi_tools extract --bc-pattern=NNNNNNNNNN -I umi_reads.fastq.gz --read2-in=reads_R2.fastq.gz --read2-stdout | gzip > reads_R2.extracted.fastq.gz

where the number of Ns in the bc-pattern matches the number of bases in the UMI.
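If the UMI length is not known in advance, it can be read off the UMI FASTQ itself. The sketch below is one way to build the pattern; the helper name `umi_pattern` is made up for illustration, and it assumes every UMI read has the same length as the first one.

```shell
# Hypothetical helper: print a --bc-pattern of N's as long as the first UMI read.
# Assumes a gzipped FASTQ in which every UMI has the same length.
umi_pattern() {
    gzip -cd "$1" | awk 'NR == 2 {
        pattern = ""
        for (i = 0; i < length($0); i++) pattern = pattern "N"
        print pattern
        exit
    }'
}

# Usage (file names as in the commands above):
#   umi_tools extract --bc-pattern="$(umi_pattern umi_reads.fastq.gz)" \
#       -I umi_reads.fastq.gz --read2-in=reads_R1.fastq.gz --read2-stdout \
#       | gzip > reads_R1.extracted.fastq.gz
```

It is still worth eyeballing a few UMI reads by hand, since a pattern built this way says nothing about whether the reads really are UMIs only.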

You can then proceed to map these reads using STAR as before.
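For reference, a STAR call on the extracted reads might look like the sketch below. The index directory, thread count, output prefix, and the `map_sample` wrapper are placeholders of mine, not values from this thread.

```shell
# Sketch: align the UMI-extracted read pairs with STAR, writing a coordinate-sorted BAM.
# star_index, the thread count, and the wrapper name are placeholder choices.
map_sample() {
    STAR --runThreadN 8 \
         --genomeDir star_index \
         --readFilesIn "$1" "$2" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix "$3"
}

# Usage:
#   map_sample reads_R1.extracted.fastq.gz reads_R2.extracted.fastq.gz sample_
# STAR then writes sample_Aligned.sortedByCoord.out.bam
```

The UMI that extract appended to each read name is carried through into the BAM, which is what the dedup step relies on.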

Once the reads are mapped, sorted and indexed, deduplicate the BAMs with umi_tools dedup:

umi_tools dedup -I mapped_reads.bam -S deduplicated_reads.bam --paired
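Since dedup needs an indexed BAM, the two steps can be wrapped together. A minimal sketch, with a made-up function name, assuming samtools is available:

```shell
# Sketch: index the coordinate-sorted BAM, then deduplicate on UMI plus mapping position.
# $1 = input BAM (coordinate-sorted), $2 = deduplicated output BAM
dedup_sample() {
    samtools index "$1" &&
    umi_tools dedup -I "$1" -S "$2" --paired
}

# Usage:
#   dedup_sample mapped_reads.bam deduplicated_reads.bam
```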

Now you can proceed to quantify with featureCounts and analyse with DESeq2 as before.
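A counting step over the deduplicated BAMs could be sketched as below. The annotation and output file names are placeholders. Note that recent Subread releases also need `--countReadPairs` alongside `-p` to count fragments rather than individual reads, so check the featureCounts help for your version.

```shell
# Sketch: count reads per gene across all deduplicated BAMs passed as arguments.
# -p marks the input as paired-end; annotation.gtf and counts.txt are placeholder names.
count_samples() {
    featureCounts -p -a annotation.gtf -o counts.txt "$@"
}

# Usage:
#   count_samples sample1.dedup.bam sample2.dedup.bam
```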

ADD COMMENT • link 4.9 years ago by i.sudbery 20k

0

Thanks Ian - would that be sorted by coordinate? (asking because the fgbio workflow seems to require re-sorting by Queryname)

ADD REPLY • link 4.9 years ago by steveh ▴ 70

0

Yes, sorted by coordinate.

ADD REPLY • link 4.9 years ago by i.sudbery 20k

0

thanks - and for the dedup step, do I need the --paired option or is that assumed?

ADD REPLY • link 4.9 years ago by steveh ▴ 70

0

Ooops. Yes, you will need the paired option, I'll edit the post.

ADD REPLY • link 4.9 years ago by i.sudbery 20k

0

great, thanks so much for taking the time to answer at this time of year!

ADD REPLY • link 4.9 years ago by steveh ▴ 70

0

Just to update after lots of testing - this is the method I settled on, although adding the UMIs to the already-aligned BAMs and then using umi_tools dedup also works fine.
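For anyone taking the second route, a rough sketch follows. It assumes fgbio's AnnotateBamWithUmis writes the UMI to the RX tag (its default, as far as I know) and that umi_tools dedup is told to read the UMI from that tag. The function and argument names are illustrative, so check both tools' help output before relying on it.

```shell
# Sketch: copy UMIs from the UMI FASTQ into a tag on an already-aligned BAM,
# then deduplicate using that tag instead of the read name.
# Assumption: fgbio stores the UMI in the RX tag (believed to be its default).
# $1 = aligned BAM, $2 = UMI FASTQ, $3 = annotated BAM, $4 = deduplicated BAM
annotate_and_dedup() {
    fgbio AnnotateBamWithUmis -i "$1" -f "$2" -o "$3" &&
    samtools index "$3" &&
    umi_tools dedup -I "$3" -S "$4" --paired \
        --extract-umi-method=tag --umi-tag=RX
}
```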

I don't recommend the method mentioned in my original post, using fgbio.

ADD REPLY • link 4.9 years ago by steveh ▴ 70

0

Not to be pedantic but "this" meaning the method/answer suggested by @i.sudbery above? If so I can move that comment to an answer, which you can then accept to provide closure to this thread.

ADD REPLY • link 4.9 years ago by GenoMax 147k

0

yes that's correct, the @i.sudbery answer. The general pointer to umi_tools is also useful, but Ian's answer is very specific.

ADD REPLY • link 4.9 years ago by steveh ▴ 70

0

You are able to accept more than one answer. Ian's comment has been moved to an answer now.

ADD REPLY • link 4.9 years ago by GenoMax 147k

score 1 · Answer 2 · 2019-12-23

1

4.9 years ago

swbarnes2 14k

Have you looked at umi-tools?

https://github.com/CGATOxford/UMI-tools

ADD COMMENT • link 4.9 years ago by swbarnes2 14k

score 0 · Answer 3 · 2019-12-23

0

4.9 years ago

padwalmk ▴ 140

Hi, check the number of reads in the UMI. If it's less than 1 or 2% of total reads, then you do not have to worry at all. But if it's more than 10%, then you have to do something about it.

ADD COMMENT • link 4.9 years ago by padwalmk ▴ 140