Recently I was doing RNA-seq for a dataset of 30 samples (15 Fwd, 15 Rev). The pipeline was STAR alignment, then featureCounts for the gene level quantification.
Case 1: So, for that, I initially used an annotation file from the UCSC Genome browser website. The file name is hg38.refGene.gtf.gz
(This one I usually use and I am sure a lot of people do the same).
Case 2: Now, I just wanted to try a different annotation. So, I ran the STAR alignment and featureCounts again for the same set of samples. This time, the gtf file was from NCBI RefSeq. The file name is GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
. It says "The annotation is NCBI Homo sapiens Updated Annotation Release 110 from 25 February 2022."
While I got nearly identical alignment values from STAR after both runs using different annotation files, the featureCounts values were not the same. Assignment rates are 2-3% higher in case 2 for almost all samples, also in the gene counts the fragment numbers are sometimes different. Surprisingly, in case 2, the count for some genes has values (sometimes quite high), while in case 1, it has 0 counts.
It is a bit of a confusion for me. So, can any of you experts explain why is this anomaly? Do you think this Case 2 annotation file is more recent and has more annotations, that is the sole reason? If yes, should I use this GTF file for the analysis of my RNA-seq? Or I am totally wrong and using the wrong GTF for a normal RNA-seq pipeline with the goal to study DGE?
Some details:
- Samples are from Humans and are paired-end reverse-stranded reads
- In featureCounts, fragments were read instead of reads and exon-level countings were chosen
Thanks a lot in advance.
You need to make sure that your genomes and annotations match each other. Don't get a genome from one place and an annotation for somewhere else. But you used different inputs and got different outputs. Why is that a surprise?