Question

Which GTF file to use for FeatureCounts and STAR: low percent of assigned reads with primary assembly but higher with main annotation file

0

Entering edit mode

4.9 years ago

science_lizard ▴ 20

I've been having some issues with the percentage of successfully aligned reads from a mouse RNA-seq experiment. My main question is what the difference between the primary GTF file vs. the main annotation GTF, and why they result in wildly different percent of assigned features in FeatureCounts?

After playing around with the data a little, I found that using Trimmomatic to trim the FASTQs resulted in less adapter contamination and ultimately better alignment. After trimming and sorting the paired reads, I aligned to the most recent Gencode mouse release, using the "primary" assembly GTF as per the recommendations of the STAR manual (gencode.vM25.primary_assembly.annotation.gtf). I generated the genome directory also with the most recent release (GRCm38.primary_assembly.genome.fa). I get about 72% uniquely mapped reads and ~23% counted as too short (not amazing, but this was 150bp paired end and I think I got a fair amount of adapter contamination that even Trimmomatic isn't perfectly handling). I then used bedtools intersect to remove mitochondrial and rRNA from the files.

This is the weird part: I then use FeatureCounts to assign reads to genes using the same GTF file I used for alignment, and I only get about 20% successfully assigned. After trying about a million different things, I realized that if I use the "main annotation" GTF from the same Gencode release, I then get almost 80% successfully assigned reads. I clearly don't understand the difference between these two files, but it was my assumption that the primary GTF has MORE information, so I would think that MORE reads should be counted using this file instead of the subsetted "main annotation" file. Though, as I write this post, I'm realizing that the primary file is smaller than the "main..."

Can someone please explain the difference, and if it's acceptable to map to the "primary" but count against the "main," or if everything should be done with either the primary or the main.

For reference, here is my code:

STAR:

genomeDir=STAR
REF=GRCm38.primary_assembly.genome.fa
GTF=gencode.vM25.primary_assembly.annotation.gtf

FASTQ1='fastq/*SORTED.PAIREDOUT.R1.fastq.gz'
FASTQ2='fastq/*SORTED.PAIREDOUT.R2.fastq.gz'

STAR --runThreadN 8 --genomeDir $genomeDir --sjdbGTFfile $GTF --sjdbOverhang 149 --bamRemoveDuplicatesType UniqueIdentical --readFilesIn $FASTQ1 $FASTQ2 --twopassMode Basic --outSAMtype BAM SortedByCoordinate Unsorted --quantMode TranscriptomeSAM GeneCounts --readFilesCommand zcat

FeatureCounts:

IN=Sample_L1 
OUTF=Sample_L1
BAM1=$IN/L1_aligned.out.MTrm.rRNArm.bam

#GTF=gencode.vM25.primary_assembly.annotation.gtf
GTF=gencode.vM25.annotation.gtf

featureCounts -T 8 -t gene -s 2 -g gene_name -F GTF -a $GTF -o $OUTF/total_counts.txt $BAM1 -p

RNA-Seq alignment assembly • 2.2k views

ADD COMMENT • link 4.9 years ago by science_lizard ▴ 20