Hello,
I'm trying to figure out if my dataset was collected from mRNA, or tRNA. My pipeline is TopHat2 for alignment and htseq-count for read counting/generation of DESeq2 input. Previously, I had aligned and counted the data using the genes.gtf file from the GRCh37 iGenomes build: http://support.illumina.com/sequencing/sequencing_software/igenome.html. As I understood it, this should represent all mRNA-encoding regions of the genome.
However, I'm seeing something strange - although alignment/TopHat2 output is fine (90%+ alignment), 95% of those aligned reads are listed as 'no feature' by HTSeq, and not included in the final read tallies. As a result, the total number of reads in HTSeq's output for a sample is close to 1 million, not 30 million as I would expect.
Right now, I'm trying to figure out if the company that did the sequencing accidentally sequenced total RNA, rather than mRNA - so of course most of the reads would have 'no feature' when I count reads using the genes.gtf file, because maybe most of those RNAs are ribosomal RNA or something.
So, I'd like to re-try counting the aligned reads using a .gtf file that represents all RNA-encoding regions of the genome, not just the mRNA-encoding regions. Is this the WholeGenomeFasta.gff that comes with the GRCh37 iGenome package? Or is there another place to get such a file?
Thank you for your help!
Make sure you are using the same version of genome and annotations. One more thing you could try is to just align the data to transcriptome and see the mapping statistics.
Thank you! All the files I'm using (genome and annotations) are from the same iGenome GRCh37 files, so I think they should be the same. Could you explain what you mean by aligning the data to the transcriptome -i.e., how is that different from what TopHat is doing? Do you mean making the data into a BigWig file and looking at that in a genome browser?
The annotation and fasta reference should not only be from the same genome version, but also from the same place - you should not use UCSC reference with an Ensembl annotation, even if they are from the same genome version.
I mean to say align the reads to transcriptome using bowtie2 or bwa, instead of genome. This helps in finding if the reads arises from mRNA. You could get the transcriptome from utilities like gffread. Tophat also builds a transcriptome.fa while aligning the reads, which can be saved.