I am seeking a way to correct the formats for inputs to FeatureCounts.
I have sourced data from a resource for sweet potato genome annotations, and aligned my RNA-seq data to the available reference genome. However, it seems FeatureCounts cannot report on the SAM files I input.
The reference genome FASTA file does not have chromosomes clearly in the headers. I believe this to be the reason for FeatureCounts not working, however, I want to be sure before I go scripting in different directions to correct the files I have.
To be clear, I am providing the genome annotation file along with each of my aligned samples.
Example reference genome sequence header:
>itf11g12620.t1
The above gene identifier shows that it is on chromosome 11.
Example GFF:
##gff-version 3
Chr11 gt4sp_v3 mRNA 1467 2621 . + . ID=itf01g00010.t1;Name=itf01g00010.t1;Parent=itf01g00010
I am not getting errors from FeatureCounts, but it is possible that Chr11 should be in the FASTA header. Also, I do convert the GFF3 file to GTF so that FeatureCounts has a GTF as input.
That is correct. Chromosome identifiers i.e.
Chr11
need to match in your reference and the annotation.I have corrected my headers, yet still I am not getting features to be counted.
sequence header:
>itb13g17640.t1|chr13
GTF:
chr13 gt4sp_v3 CDS 26079446 26082424 . - 0 transcript_id "itb13g19070.t1"; gene_id "itb13g19070";
Do you have another suggestion for recourse?
This entire string
>itb13g17640.t1|chr13
needs to match first column in your GTF. If possible get rid of>itb13g17640.t1
and leavechr13
behind.Thank you! Seems counter-intuitive that the sequences do not need gene ids.
Just to make sure, is your reference sequence really the genome ? Or the transcriptome/cDNA ? Having headers like
>itf11g12620.t1
makes it looks like it is a transcriptome sequence, for which renaming the header aschr13
does not make sense indeed.In addition, if you aligned on the transcriptome, you do not need to use featureCounts because reads are already assigned to transcripts. Instead, you can simply use
samtools idxstats
to obtain the number of reads by sequence (after perhaps filtering against unwanted – multimappers, chimeric, etc... – reads).Gene ID's are provided by
ID=itf01g00010.t1
in last field in your GTF. Be sure to use-g ID
withfeatureCounts
.