Hello,
I have aligned my RNA-seq data to the human genome, and I used FastQC on the raw reads during pre-processing. Can I also use FastQC for post-alignment QC, or is there a better way of doing it?
Thank you in advance.
RSeQC is used a lot, but I find QoRTs a bit more user-friendly, especially if you have numerous samples.
In addition, feeding the results of a) FastQC, b) STAR (or whatever aligner you've used), and c) featureCounts to MultiQC is already quite useful.
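As a rough sketch of that last suggestion: MultiQC takes one or more directories to scan for tool outputs, plus `-o` for the report directory. The directory names below are hypothetical placeholders for wherever your FastQC, STAR and featureCounts outputs live, and the helper only builds the command unless you explicitly ask it to run.

```python
import subprocess

def multiqc_command(input_dirs, outdir):
    """Build a MultiQC call: scan the given directories, write the report to outdir."""
    return ["multiqc", *input_dirs, "-o", outdir]

def run_multiqc(input_dirs, outdir, dry_run=True):
    """Return the command; only execute it when dry_run is False."""
    cmd = multiqc_command(input_dirs, outdir)
    if not dry_run:
        # Requires MultiQC to be installed and on PATH.
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    # Hypothetical directory names; replace with your own output folders.
    dirs = ["fastqc_out", "star_out", "featurecounts_out"]
    print(" ".join(run_multiqc(dirs, "multiqc_report", dry_run=True)))
```

Call `run_multiqc(dirs, "multiqc_report", dry_run=False)` once the paths point at real output.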
The typical things you want to look out for:
FastQC is primarily a pre-alignment tool that is normally run on FASTQ files. After alignment you will have a SAM or BAM file; FastQC can read these too, but it only reports read-level properties (base quality, GC content, adapter content), not how well the reads aligned.
The most common post-alignment quality control metric is the fraction of your reads that aligned to the reference genome. The command samtools flagstat reports this.
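A minimal sketch of pulling the mapped fraction out of the text that `samtools flagstat` prints, assuming you have saved that output to a file or string. The example text uses made-up numbers, and `mapped_fraction` is a hypothetical helper name, not part of samtools.

```python
import re

# Made-up example of `samtools flagstat` text output (numbers are illustrative only).
EXAMPLE_FLAGSTAT = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""

def mapped_fraction(flagstat_text):
    """Return QC-passed mapped records divided by QC-passed total records."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if match is None:
            continue
        qc_passed = int(match.group(1))
        label = match.group(3)
        if label.startswith("in total"):
            total = qc_passed
        elif label.startswith("mapped ("):
            # startswith() skips the "primary mapped" line of newer samtools versions.
            mapped = qc_passed
    if not total or mapped is None:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    return mapped / total

if __name__ == "__main__":
    print(f"{mapped_fraction(EXAMPLE_FLAGSTAT):.1%} of records mapped")
```

Keep in mind that flagstat counts alignment records, so secondary and supplementary alignments are included in both numbers; with many multi-mapping reads the value is not strictly a per-read fraction.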
Post alignment, you can use RSeQC: http://rseqc.sourceforge.net/
It has very good modules for working with BAM files.
Also, if you use STAR for alignment, check the Log.final.out file for QC metrics.
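To compare those STAR metrics across samples, something like the following sketch may help. It relies on Log.final.out using `key | value` lines; the example text is an abbreviated, made-up excerpt, and both function names are hypothetical.

```python
# Abbreviated, made-up excerpt in the "key |<tab>value" layout of STAR's Log.final.out.
EXAMPLE_LOG = """\
                          Number of input reads |\t1000000
                      Average input read length |\t100
                                    UNIQUE READS:
                   Uniquely mapped reads number |\t850000
                        Uniquely mapped reads % |\t85.00%
             % of reads mapped to multiple loci |\t5.00%
                 % of reads unmapped: too short |\t8.00%
"""

def parse_star_log(text):
    """Collect every 'key | value' line into a dict of stripped strings."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics

def percent(value):
    """Convert a string like '85.00%' to the float 85.0."""
    return float(value.rstrip("%"))

if __name__ == "__main__":
    m = parse_star_log(EXAMPLE_LOG)
    print("uniquely mapped:", percent(m["Uniquely mapped reads %"]))
    print("multi-mapped:", percent(m["% of reads mapped to multiple loci"]))
```

For a real run you would pass the contents of each sample's Log.final.out, for example `parse_star_log(open(path).read())`.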
Could you suggest a publication where such parameters are described, or where the QC of RNA-seq (pre- and post-analysis) is described?
I would expect that the papers of QoRTs and RSeQC may contain a discussion of that.
The classic resource for basic RNA-seq measures is the ENCODE recommendation, although it's a bit dated by now. There may be more up-to-date guidelines on their website; I haven't checked in a while.
An interesting read regarding the importance of which annotation you choose is Zhao & Zhang (2015) BMC Genomics. Regarding gene body coverage, I believe Lahens et al. (2014) Genome Biology 15:R86 discussed that nicely.
If you really want to learn about all the details of RNA-seq pre-processing, you may want to have a look at the notes that I compiled for a class I teach. They run to more than 80 pages and are fairly detailed, especially with regard to handling raw and aligned reads. The GitHub repo is this: https://github.com/friedue/course_RNA-seq2017
Hello,
Thank you very much. The introduction to RNA-seq on your GitHub page is really very helpful.
I should add that it's not completely straightforward to ask for definite thresholds. Whether an experiment failed or has acceptable results will depend on the specific circumstances of your experiment and the biological question that you want to address. For example, if you specifically expect many unannotated transcripts to be present in your sample, then increased numbers of intergenic and/or intronic reads may not be worrisome, but expected.
Is it usual for rRNA-depleted samples to have higher intergenic/intronic content for the human genome? I was working with publicly available data and observed this. It seems logical because these samples will contain ncRNA, but I am not totally sure.
You will also get more immature/unspliced transcripts, which will still contain introns. The intergenic content shouldn't be dramatically elevated IMO, but usually you'll see considerably more introns than what you'd get with Poly(A) enrichment.
Yes. Using Picard, I calculated the percentage of bases aligned to intronic and intergenic regions and found <15% for poly(A) samples and ~25% for rRNA-depleted samples, so I believe this should not be worrisome. One more query in addition to this: can the presence of ncRNA in rRNA-depleted samples produce multiple peaks in the FastQC GC content curve?
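For anyone wanting to reproduce the intronic/intergenic numbers above, here is a small sketch. It assumes the values come from Picard's CollectRnaSeqMetrics (the post does not say which Picard tool was used), whose output has a tab-separated header line and value line after a `## METRICS CLASS` line. The example content is made up and trimmed to a few columns, and the function names are hypothetical.

```python
# Made-up, trimmed example of a Picard metrics file (real files have many more columns).
EXAMPLE_METRICS = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# CollectRnaSeqMetrics (command line omitted)",
    "",
    "## METRICS CLASS\tpicard.analysis.RnaSeqMetrics",
    "\t".join(["PF_BASES", "PCT_CODING_BASES", "PCT_INTRONIC_BASES", "PCT_INTERGENIC_BASES"]),
    "\t".join(["100000000", "0.55", "0.18", "0.07"]),
    "",
])

def read_picard_metrics(text):
    """Return the first metrics row as a dict mapping column name to string value."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no '## METRICS CLASS' section found")

def intronic_plus_intergenic(metrics):
    """Sum the two fractions; Picard writes PCT_* columns as fractions between 0 and 1."""
    return float(metrics["PCT_INTRONIC_BASES"]) + float(metrics["PCT_INTERGENIC_BASES"])

if __name__ == "__main__":
    row = read_picard_metrics(EXAMPLE_METRICS)
    print(f"intronic + intergenic: {intronic_plus_intergenic(row):.0%}")
```

Running this per sample and comparing poly(A) against rRNA-depleted libraries gives the kind of side-by-side view described in the post.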