Hello all,
I have a 3kb gene sequence for which I am aligning my 150 bp RNASeq reads with bowtie2. There are no known introns in the gene. The first 90% of the sequence has a relatively consistent coverage of 50-200 reads, but the last 300 bp or so has 10x the coverage as any of the rest of the gene.
My first suspicion was that this sequence may be duplicated elsewhere in the genome and thus reads from another genomic region are spuriously aligning to the gene I'm looking at. However, BLASTing this 300 bp sequence to the genome or to NCBI's full database results in only matches to my gene of interest.
The full gene has 41% GC, while the last 300 has 36% GC. This doesn't seem too terribly different to cause such an effect...
What are other likely explanations when you see this kind of heterogeneity in coverage?
Looking forward to learning from you all.
What library kit was used, and do you see this bias for every gene?
It is poly-A selection. And the RIN value was only 4. So is this degradation?
That RIN value is really low, so I suspect that the RNA is fairly degraded. If it is, and you are poly-A selecting, that means you may be losing the original 5' half of many RNA molecules.
A good way to check this would be to align your reads to the entire genome as GenoMax suggested (which is good practice anyway), and then check whether this pattern persists for other genes.
Are you aligning using a reduced reference (i.e. just this gene or some smaller fraction genes)? If the data is from whole genome then you should align to full genome/transcriptome.
GenoMax, yes I am aligning to just a few genes. I will try with the whole genome. So that I can learn from this, would you mind helping me understand the problem with this strategy?
NGS aligners are designed to be greedy. They will try to align reads to their best ability. If your data is from whole genome and you use a reduced reference then there is always a possibility that reads may be aligned to regions where they may not have originated, simply because of sequence homology (e.g. think of a motif that is common).
This makes good sense and thank for taking the time to help me understand the best practices and why they are valuable. This is why I had BLASTed portions of this gene to the whole genome to see if perhaps there was an area of high sequence homology. Since nothing matched well enough by BLASTing with default parameters, is it still likely to think that the NGS aligners would pick up something from elsewhere in the genome?
Could you please describe your problem a bit more detailed:
What organism are you dealing with - is it an eukaryotic one? - Does your gene have shorter isoforms?
What kind of library prep was used to generate the data - FFPE-based, targeted, 3'Seq, whole transcriptome?