Hey everybody!
I am trying to annotate my RNA-seq files (paired-end) with featureCounts. However, I keep having a quite low annotation rate, ~35-36% for all the files.
My command line is the following:
featureCounts -T 6 -p -s 2 -a annotation_file.gtf -o output_file.txt input_files.bam
I am using the parameter -s 2
since my paired-end files seem to be reversely stranded (I have assessed that by running infer_experiment.py
from the RSeQC package.)
Alignment has been performed with hisat2
and .bam files sorted by coordinates before performing annotation.
I have checked rRNA contamination in .bam files using split_bam.py
from RSeQC package and the .bed file for hg38 rRNAs provided by the package. None of my files show rRNA contamination.
In order to try solving the problem, I first checked that the chromosome names were consistent between my .bam files and the provided .gtf file. They are the same.
Then, I tried using .gtf files from different sources (Gencode, Ensembl, UCSC,..) but the results did not change.
Lastly, I tried running featureCounts
using also the -O
parameter, which includes multi-overlapping reads, just to check if most of the reads were not annotated due to multi-overlapping reasons. The results did not change even in this case.
I would really appreciate any suggestion for trying to figure out what causes poor annotation of my files. I have checked as much as I could, accordingly to my experience with RNA-seq data, but I could not find a solution.
Thanks in advance to everybody who will try helping me! :)
Update
I tried setting up
-t gene
rather than the default-t exon
and it increased the annotation rate above 50% (~52-53%). Still not satisfying, but for sure it made a good improvement.