I am analyzing some RNA-seq data generated from a treatment/control experiment on the NG108-15 neuroblastoma-glioma hybrid cell line. The sequencing was done using Illumina SE 100 bp reads.
When aligning the reads back against the reference (I used both HISAT2 and TopHat against mm10 and GRCm38), approximately 20-30% of the reads align to multiple loci.
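For anyone wanting to reproduce that multi-mapping estimate, here is a minimal sketch that tallies it from SAM records. It assumes the aligner writes the standard `NH:i:` tag (number of reported alignments for the read), which HISAT2 and TopHat do; the function name and the toy records are made up for illustration and are not from my data.

```python
def multimapping_fraction(sam_lines):
    """Return (multi, total): aligned reads with NH > 1, and all aligned reads."""
    hits_per_read = {}
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        if int(fields[1]) & 0x4:          # FLAG bit 0x4: read is unmapped
            continue
        nh = 1                            # assumption: a missing NH tag means one hit
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        name = fields[0]
        hits_per_read[name] = max(nh, hits_per_read.get(name, 0))
    total = len(hits_per_read)
    multi = sum(1 for n in hits_per_read.values() if n > 1)
    return multi, total


# Toy records (hypothetical): r1 maps once, r2 maps to three loci, r3 is unmapped.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr2\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r2\t256\tchr5\t900\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
multi, total = multimapping_fraction(toy_sam)
print(multi, total)
```

On real data the lines could come from a SAM file or from `samtools view` output piped into the script.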
Upon further inspection, it looks like a large fraction of the multiply aligning reads hit repeats of a "targeted KO-first, conditional ready, lacZ-tagged mutant allele". I've also subsampled 1M reads from my samples and aligned them using BLAST against accession JN958699.1. Counting only perfect BLAST matches, about 12% of the 1M subsampled reads align perfectly to JN958699.1.
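To make the "perfect match" count concrete, here is a rough sketch of one way to compute it from BLAST tabular output (`-outfmt 6`, default column order). It assumes "perfect" means 100% identity over the full 100 bp read with no mismatches or gap openings; that definition, the function name, and the toy hits are illustrative assumptions.

```python
def count_perfect_hits(blast_lines, read_length=100):
    """Count distinct reads with at least one full-length, 100%-identity hit."""
    perfect_reads = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Default outfmt 6 columns: qseqid sseqid pident length mismatch gapopen ...
        qseqid = f[0]
        pident = float(f[2])
        length = int(f[3])
        mismatch = int(f[4])
        gapopen = int(f[5])
        if (pident == 100.0 and length == read_length
                and mismatch == 0 and gapopen == 0):
            perfect_reads.add(qseqid)     # a read counts once, however many hits
    return len(perfect_reads)


# Toy hits (hypothetical values): read1 is perfect at two places, read2 has one
# mismatch, read3 is 100% identical but only over 50 bp.
toy_blast = [
    "read1\tJN958699.1\t100.000\t100\t0\t0\t1\t100\t501\t600\t1e-50\t185",
    "read1\tJN958699.1\t100.000\t100\t0\t0\t1\t100\t9001\t9100\t1e-50\t185",
    "read2\tJN958699.1\t99.000\t100\t1\t0\t1\t100\t701\t800\t1e-45\t170",
    "read3\tJN958699.1\t100.000\t50\t0\t0\t1\t50\t1201\t1250\t1e-20\t93",
]
n_perfect = count_perfect_hits(toy_blast)
print(n_perfect)
```

Dividing the result by the number of subsampled reads (1M here) gives the percentage.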
Both the mm10 and GRCm38 references seem to contain hundreds of paralogs of that lacZ-tagged mutant allele.
Does anyone know what the repeat "targeted KO-first, conditional ready, lacZ-tagged mutant allele" is involved in, and what would lead to its enrichment in an RNA-seq experiment? Note that the multiply aligning reads are as abundant in the control as in the treatment.
Thank you so much for any suggestions or hints you might be able to offer.
The question becomes where in that construct they align. My guess would be the poly-adenylation site, which would be understandable for RNAseq data.
Always actually look at your data in something like IGV.
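As a rough complement to browsing in IGV, the hit positions can also be binned along the construct to see whether they pile up in one region or spread out. This is only a sketch: the 35000 bp length is a round-number assumption for the roughly 35 kbp construct, and the function name and example positions are made up.

```python
def bin_hits(starts, construct_length, bin_size=1000):
    """Count 1-based hit start positions in fixed-width bins along the construct."""
    n_bins = (construct_length + bin_size - 1) // bin_size   # ceiling division
    counts = [0] * n_bins
    for s in starts:
        if 1 <= s <= construct_length:                        # ignore out-of-range values
            counts[(s - 1) // bin_size] += 1
    return counts


# Hypothetical start positions, e.g. the sstart column of BLAST tabular output.
example_starts = [1, 1000, 1001, 34999]
counts = bin_hits(example_starts, construct_length=35000, bin_size=1000)
for i, c in enumerate(counts):
    if c:
        print(f"{i * 1000 + 1}-{(i + 1) * 1000}: {c}")
```

A single dominant bin would point to one element of the construct, while a flat profile would suggest that the whole sequence is represented.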
Thanks for your answer, Devon. The hits from the 1M test dataset I ran are distributed across the complete 35 kbp construct sequence. Why would it be understandable in RNA-seq for such a large portion of the data to hit the polyadenylation site of this particular repeat family? Could you please elaborate?
Typical RNAseq library preparation includes polyA enrichment, so one generally expects to see a bunch of random polyA signal due to that.
Yes, of course... in which case, it shouldn't be specific to this particular repeat family. Thanks again, Devon.