Question

Post alignment QC of RNA-seq data

4

7.2 years ago

KVC_bioinfo ▴ 600

Hello,

I have aligned my RNA-seq data to the human genome, and I used FastQC for QC of the raw reads before alignment. Can I also use FastQC for post-alignment QC, or is there a better way of doing it?

Thank you in advance.

qc fastqc • 9.8k views

ADD COMMENT • link updated 7.2 years ago by Brian Bushnell 20k • written 7.2 years ago by KVC_bioinfo ▴ 600

1

7.2 years ago

Kevin Blighe 88k

FastQC is primarily for pre-alignment QC and is usually run on FASTQ files. After you perform your alignment, you should have a SAM or BAM file, which is not what FastQC is typically used on.

The most common post-alignment quality control metric is the proportion of your reads that aligned to the reference genome. The command samtools flagstat reports this.
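As a rough sketch of how that number can be pulled out programmatically, the Python below parses the text printed by `samtools flagstat` and computes the mapped percentage. The function name is made up for illustration, and the parsing assumes the usual `N + M label` line layout. Keep in mind that flagstat counts alignment records, so secondary alignments of multi-mapping reads are included in the counts.

```python
import re


def mapped_percent(flagstat_text):
    """Return QC-passed mapped records as a percentage of all QC-passed records.

    Hypothetical helper: expects the plain-text output of `samtools flagstat`.
    """
    total = None
    mapped = None
    for line in flagstat_text.splitlines():
        # e.g. "1000 + 0 in total (QC-passed reads + QC-failed reads)"
        m = re.match(r"^(\d+) \+ \d+ in total\b", line)
        if m:
            total = int(m.group(1))
            continue
        # e.g. "950 + 0 mapped (95.00% : N/A)"; "primary mapped" lines do not match
        m = re.match(r"^(\d+) \+ \d+ mapped \(", line)
        if m:
            mapped = int(m.group(1))
    if total is None or mapped is None:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    if total == 0:
        raise ValueError("flagstat reports zero records")
    return 100.0 * mapped / total
```

You would save the flagstat output to a text file, read it in, and pass the contents to `mapped_percent`.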

ADD COMMENT • link 6.7 years ago by Kevin Blighe 88k

1

I read in the FastQC manual that it accepts SAM and BAM input.

ADD REPLY • link 7.2 years ago by KVC_bioinfo ▴ 600

0

Yes, because some platforms produce these as unaligned files.

ADD REPLY • link 7.2 years ago by Kevin Blighe 88k

0

I am using the STAR aligner.

ADD REPLY • link 7.2 years ago by KVC_bioinfo ▴ 600

0

7.2 years ago

Ron ★ 1.2k

Post-alignment, you can use RSeQC: http://rseqc.sourceforge.net/

It has very good modules for working with BAM files.

Also, if you use STAR for alignment, have a look at the Log.final.out file for QC metrics.
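If you have many samples, reading those metrics by hand gets tedious. Here is a minimal Python sketch that turns the contents of a STAR Log.final.out into a dictionary. It assumes the usual `name | value` layout of that file; the function names are hypothetical, not part of STAR.

```python
def parse_star_final_log(text):
    """Parse the text of a STAR Log.final.out into {metric name: value string}."""
    metrics = {}
    for line in text.splitlines():
        # Section headers such as "UNIQUE READS:" have no "|" and are skipped.
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics


def percent_value(metrics, key):
    """Convert a value such as '85.00%' to the float 85.0."""
    return float(metrics[key].rstrip("%"))
```

For example, `percent_value(parse_star_final_log(text), "Uniquely mapped reads %")` gives the unique mapping rate as a number that you can collect across samples.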

ADD COMMENT • link 7.2 years ago by Ron ★ 1.2k

0

Yes, I have used the STAR aligner. It gave all the statistics in the Log.final.out file. What is the next QC step I should perform?

ADD REPLY • link 7.2 years ago by KVC_bioinfo ▴ 600

0

look at the STAR statistics
run QoRTs or RSeQC

ADD REPLY • link 7.2 years ago by Friederike 9.0k

score 8 · Accepted Answer · 2017-09-19

8

7.2 years ago

Friederike 9.0k

RSeQC is used a lot, but I find QoRTs a bit more user-friendly, especially if you have numerous samples.

In addition, feeding the results of a) FastQC, b) STAR (or whatever aligner you've used), and c) featureCounts to MultiQC is already quite useful.

The typical things you want to look out for:

at least 80% alignment rate
not too many intronic/intergenic reads
even gene body coverage
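To make the first two checks concrete, here is a small Python sketch that flags a sample from numbers you have already extracted. Only the 80% alignment cutoff comes from the list above. The limit for intronic/intergenic reads is deliberately left for the caller to choose, and the function name is hypothetical. Gene body coverage is not covered here because it needs a coverage profile rather than a single number.

```python
def qc_warnings(alignment_pct, non_exonic_pct, max_non_exonic_pct,
                min_alignment_pct=80.0):
    """Return a list of warning strings for one sample (empty if nothing is flagged).

    alignment_pct      -- percentage of reads that aligned
    non_exonic_pct     -- percentage of intronic plus intergenic reads
    max_non_exonic_pct -- caller-chosen limit (an assumption, experiment-specific)
    min_alignment_pct  -- defaults to the 80% rule of thumb
    """
    warnings = []
    if alignment_pct < min_alignment_pct:
        warnings.append(
            f"alignment rate {alignment_pct:.1f}% is below {min_alignment_pct:.1f}%"
        )
    if non_exonic_pct > max_non_exonic_pct:
        warnings.append(
            f"intronic/intergenic fraction {non_exonic_pct:.1f}% "
            f"is above {max_non_exonic_pct:.1f}%"
        )
    return warnings
```

Treat the output as a prompt to look more closely at a sample, not as a pass/fail verdict.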

ADD COMMENT • link 7.2 years ago by Friederike 9.0k

0

Could you suggest a publication where such parameters are described, or where the QC of RNA-seq data (pre- and post-alignment) is described?

ADD REPLY • link 7.2 years ago by KVC_bioinfo ▴ 600

3

I would expect that the papers of QoRTs and RSeQC may contain a discussion of that.

The classic resource for basic RNA-seq measures is the ENCODE recommendation, although it's a bit dated by now. There may be more up-to-date guidelines on their website; I haven't checked in a while.

An interesting read regarding the importance of which annotation you choose is Zhao & Zhang (2015) BMC Genomics. Regarding gene body coverage, I believe Lahens et al. (2014) Genome Biology 15:R86 discussed that nicely.

If you really want to learn about all the details of RNA-seq pre-processing, you may want to have a look at the notes that I compiled for a class I teach. It's more than 80 pages and fairly detailed especially in regard to raw and aligned read handling. The github repo is this: https://github.com/friedue/course_RNA-seq2017

ADD REPLY • link 7.2 years ago by Friederike 9.0k

0

Hello,

Thank you very much. The introduction to RNA-seq on your GitHub page is really very helpful.

ADD REPLY • link 7.2 years ago by KVC_bioinfo ▴ 600

1

I should add that it's not completely straightforward to give definite thresholds. Whether an experiment failed or has acceptable results will depend on the specific circumstances of your experiment and the biological question that you want to address. For example, if you specifically expect many unannotated transcripts to be present in your sample, then increased numbers of intergenic and/or intronic reads may not be worrisome, but expected.

ADD REPLY • link 7.2 years ago by Friederike 9.0k

0

Is it usual for human samples prepared with rRNA depletion to have higher intergenic/intronic content? I was working with publicly available data and observed this. It seems logical because these samples will contain ncRNA, but I am not totally sure.

ADD REPLY • link 5.3 years ago by Arindam Ghosh ▴ 530

1

You will also get more immature/unspliced transcripts, which will still contain introns. The intergenic content shouldn't be dramatically elevated IMO, but usually you'll see considerably more introns than what you'd get with Poly(A) enrichment.

ADD REPLY • link 5.3 years ago by Friederike 9.0k

0

Yes. Using Picard, I calculated the percentage of bases aligned to intronic and intergenic regions and found <15% for the poly(A) samples and ~25% for the rRNA-depleted samples, so I believe this should not be worrisome. One more query in addition to this: can the presence of ncRNA in rRNA-depleted samples produce multiple peaks in the FastQC GC content curve?
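For anyone wanting to reproduce that kind of number, below is a minimal Python sketch. It assumes the metrics came from Picard CollectRnaSeqMetrics, whose output has a tab-separated header row and value row after the `## METRICS CLASS` line and reports the `PCT_` columns as fractions between 0 and 1. The function names and the trimmed example text are made up for illustration.

```python
def parse_picard_metrics(text):
    """Return {column name: value string} from the first metrics row."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            # The header row and the single value row follow directly.
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no '## METRICS CLASS' section found")


def non_exonic_percent(metrics):
    """Sum intronic and intergenic fractions and convert to a percentage."""
    intronic = float(metrics["PCT_INTRONIC_BASES"])
    intergenic = float(metrics["PCT_INTERGENIC_BASES"])
    return 100.0 * (intronic + intergenic)


# Trimmed, invented example; a real file has many more columns.
EXAMPLE = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# CollectRnaSeqMetrics (command line elided)",
    "## METRICS CLASS\tpicard.analysis.RnaSeqMetrics",
    "PF_BASES\tPCT_INTRONIC_BASES\tPCT_INTERGENIC_BASES",
    "1000000\t0.18\t0.07",
    "",
    "## HISTOGRAM\tjava.lang.Integer",
])
```

Calling `non_exonic_percent(parse_picard_metrics(EXAMPLE))` on the invented example gives the combined intronic and intergenic percentage for that one sample.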

ADD REPLY • link 5.3 years ago by Arindam Ghosh ▴ 530