Hi all,
I am working with RNA-seq data of different drug treated human cell lines sequenced on Illumina (2X150 bp chemistry). Data generated is 11-14 Gb.
I am using HISAT2
for alignment on human genome hg38 build downloaded from Ensembl database using default parameters. To my surprise, the alignment percentage is very low (~2 %
) for all the samples.
I have the following observations regarding the data quality :
- Data quality (phred score) is good i.e. in the range of
32-40
! - Illumina adapters were already removed using trimmomatic. No other contaminants.
- The fastqc metric 'Per base sequence content' fails for all samples. More specifically, A,T,G,C % is deviating considerably at the first 10 base positions.
Questions
- What could possibly be wrong? Any other QC stuff I need to re-consider?
- More importantly, this is cell-line data, so shall I consider mapping on specific cell line reference OR is it okay if I have just the hg38 standard human reference?
Hi Vijay,
Can you
BLAST
few reads manually and confirm that there was no issue with de-multiplexing or sample labelling? Specific reference should not be an issue. You can also try soft trimming first 10 bases during alignment. Its not unusual.Best
Thanks Satya! I am shuffling the reads and will blast a few thousands to check for the same.
With 2% alignment you don't need a few thousand. Take 10 and that should be plenty to diagnose the issue with BLAST.
It appears to be a clear case of contamination rather confusion.
Evidence1: Blasting the reads againt NCBI nt database suggest it's hamster. May be chinese or golden hamster.
Evidence2: Mapping against hamster genome gave ~80 % mapping for all samples with HISAT2.
Last thing, under investigation is blasting the scaffolds of the assembly from one of the samples. If that also hits hamster, which I am 90% sure would be, I have to traceback the history! :)
Ouch! Maybe time to start running FastQ Screen routinely.. :)
Or BBSketch, which only takes a couple seconds :)
Yikes! So the cell line is actually hamster then?
Looks like CHO cell line contamination.
Another thing I wanted to add is the following information which I found in the STAR aligner documentation:
I have infact kept the sequences of only chr1-22, X, Y and mito DNA in my reference genome file. Could that possibly be the reason. Meantime, I am trying to align the reads of the original reference file and see how it goes!
I would be very surprised if this has much effect on your alignments. These extra scaffolds are recommended to "soak up" reads which would otherwise misalign, but they shouldn't account for more than a few of % of your library. That can be a problem for downstream analysis, but won't result in a global 2% alignment rate. Even aligning Human data against the mouse reference genome can give more alignments than that ;)
HAHAHAHA :D
Adding another twist here. What I have used here is the toplevel file and NOT the primary assembly file. I am looking at this blog and what I found is :
So, I am trying the mapping with the primary assembly file as well. Let's see!
I don't think toplevel file is an issue either. I have been using toplevel file for one model organism and never suffered from low alignment.
It's probably worth trying the 5' trimming approach from my answer below first.. ;) I'm pretty sure that'll fix your problems and it's not to do with your reference.
+1 Phil. Indeed, I just confirmed that it does not make much difference. Prima facie, it looks like that there is indeed some issue with the data itself; probable contamination which I am going to confirm further.
A short answer to my 2nd question is that, Yes, I can use the human reference itself. I found a good paper based on a similar study here.
While I know that BowTie2 is not a suitable aligner for RNA-seq data, I am getting ~25% mapping with human reference in this case. How can we explain this?
RNA-seq data has reads that span exon-exon boundaries. BowTie will not align these reads, so you'll only get back alignments that lie entirely within an exon. Tophat / STAR etc are "splice aware" aligners, so work with RNA-seq data.
bowtie v.1 does ungapped alignments as opposed to bowtie v.2 which does gapped alignments.
With that 25%, it is possible that you are only seeing alignments to regions that are homologous between human and hamster. You could use
bbsplit.sh
from BBMap with human/hamster genomes to bin your reads (discarding those that map to both genomes). This may be more of a diagnostic exercise.If this data was supposed to be for a human cell line then this experiment is a lost cause. You (or someone else) is going to have to start backtracking (sequencing --> library prep --> sample submission --> RNA prep --> cell lines) to identify where the problem may lie. It could be at any one of these steps unless the original cell line can be quickly identified to be the root cause.