hi,
Assuming that this is human poly+ve RNA-seq (Illumina), first up see the number of reads you have. 40-50million reads is min. number if you are aiming to do differential expression.
2) Look at the box plot of qual. distrib. in FastQC. A good data would have the majority of plots well above phred score 30 (green region). Depending on read length, boxes for few bases at the end falling below 30 is normal.
3) Then check the plot for "Sequence dupli. level". Don't worry if its shown to be failed as per FastQC. RNA-seq is bound to have duplicates because mRNAs are present in multiple copies. A peak/ hump for x-axis value ">10" or more is good. It means you have sampled the transcriptome well.
4) Check the "Per seq. GC content". Ideally it should be unimodal: a single hump in the centre with the theoretical and observed more or less overlapping. I would be worried if I see two humps.
5) Check "Adapter content" plot
If you have adapter content warning, remove them using any suitable tool (Cutadapt, Trimmomatic etc.). Use a base quality-based filter to clip reads when quality falls below a level. Like "SLIDINGWINDOW" in Trimmomatic.
After quality filtering if you still have good number of reads left, box plots look OK for most of the read length, no mean looking 2nd hump in GC content (a shoulder is OK) and majority of read lengths are towards the higher end (like if you have trimmed 5 bases, then the peak in Seq. len. distrib plot is over 95/96), then your data is good to go (for algn.)
Post algn. a >70% algn. rate is decent. Though I routinely get >85-90% using STAR.
All the above is theory, ultimately it would depend how much you value a particular sample. If you have many and there is one low quality sample you could remove it. If its clinical sample on other hand, you try to salvage what you have.
Check the sample with 5% algn. rate for its FastQC. How many reads it had to start with? Use bedtools to overlap the BAM with exons to find where reads aligned.
Also if you want to dig deep for that sample, check the diagnostic metrics of the sequencing machine.
Thank you very much, Amitm! It's a great tutorial!
I am little confused about the mapped/aligned rate (same thing?). I looked into the .bam file with low assigned rate, using samtools flagstat, the output is shown below:
It seems that 100% reads were mapped, but less than 5% were counted/assigned to genes, how this happened?
BTW, have you ever used fastx_toolkit to trim and clip the fastq before analysis?
hi,
depending on what aligner you used, all the reads provided might not end up in the BAM i.e. if only aligned reads are written into the BAM. Hence samtools flagstat has no way of knowing how many reads you started with, other than just looking at the BAM. More informative would be to look at the stats generated by the aligner where you can get the actual align. rate. Like TopHat & STAR, both give useful stats of how much aligned (unique or multi-mapping).
Also, open the BAM in IGV and check how the alignment looks. If you don't have idea of how a RNA-seq aligned BAM might look then you can import Bodymap/ ENCODE datasets through IGV menu and then you can make out if the alignment looks OK.
I am hoping that you used a splice-aware aligner (TopHat, STAR etc.).
One way to find where the alignments are is to convert your BAM to BED and then overlap the BED with BED of exons (from UCSC or Ensembl).
I have used Fastx-toolkit. Last time I used it, mult-threading wasn't available. If speed is not an issue, you can use that. BTW, any good (well used in community) trimmer would be OK.
Yes, I am using STAR. Because I am doing large scale analysis, I just realised, it turns out that those low algnd rate datasets are miRNA/ncRNA ......
Thank you very much.