Hi all,
I'm using featureCounts to generate read counts from paired end alignments (generated w/ subread align). To corroborate gene-level expression estimates and to also look at differential exon usage, I'd like to also generate read counts for individual exons without merging at the gene level. To this end, I've tried omitting -g gene_id as a parameter, hoping it would default to exon-level counts, but it does not. I've also tried supplying -g exon, but it doesn't recognize exon as a valid field in my gtf (which makes sense looking at my gtf; example provided below).
My questions is thus: Is there a way to generate exon-level read counts without creating a new "exon_id" field in my gtf? I assume that if featureCounts is counting aligned fragments or reads on specific exons, there must be some way to see those counts unadulterated in output. I assume I can create an exon_id field to go along with gene_id, but this would require some gtf wrangling that I'm not able to do at present (suggestions also welcome here). If we use the GTF snippet below, I'd like to get counts for each of the 6 exons instead of 1 count for the entire gene.
Thanks for reading.
featureCounts -s 2 -t exon -g gene_id -a GCF_019175285.1.gtf -o putput.txt $FILE_PATH/file1.bam \
$FILE_PATH/file2.bam \
$FILE_PATH/file3.bam \
$FILE_PATH/file4.bam \
$FILE_PATH/file5.bam \
$FILE_PATH/file6.bam \
$FILE_PATH/file7.bam \
$FILE_PATH/file8.bam \
$FILE_PATH/file9.bam \
$FILE_PATH/file10.bam \
NC_056522.1 Gnomon transcript 16228 22254 . + . gene_id "LOC121915633"
NC_056522.1 Gnomon exon 16228 16392 . + . gene_id "LOC121915633";
NC_056522.1 Gnomon exon 16637 16863 . + . gene_id "LOC121915633";
NC_056522.1 Gnomon exon 17562 17700 . + . gene_id "LOC121915633";
NC_056522.1 Gnomon exon 18138 18295 . + . gene_id "LOC121915633";
NC_056522.1 Gnomon exon 18446 18498 . + . gene_id "LOC121915633";
NC_056522.1 Gnomon exon 20400 22254 . + . gene_id "LOC121915633";
NC_056522.1 Gnomon CDS 16306 16392 . + 0 gene_id "LOC121915633";
NC_056522.1 Gnomon CDS 16637 16863 . + 0 gene_id "LOC121915633";
NC_056522.1 Gnomon CDS 17562 17700 . + 1 gene_id "LOC121915633";
NC_056522.1 Gnomon CDS 18138 18295 . + 0 gene_id "LOC121915633";
NC_056522.1 Gnomon CDS 18446 18498 . + 1 gene_id "LOC121915633";
NC_056522.1 Gnomon CDS 20400 21178 . + 2 gene_id "LOC121915633";
I'm afraid you need the
exon_id
field. You can try to just add a dummyexon_id
to each exon of the GTF snippet you posted in your question, and runfeaturecounts
using -g exon_id
to check if the output is what you expect. Afterwards, you can think of a way of adding an exon_id to all the exons of the full GTF. Probably it can be done with some simpleawk
, such as:Code not tested.
Thanks, iraun. Your code worked, as well. Can you think of a way to affix gene_id to each exon_id? Example below
Many thanks!
looks like you GFF file is not following the specs completely. For each feature there should have been a unique ID
You might try to run it trough AGAT, it might give you some more info (or even fix the file) ?
Hi lieven. It's a truncated gtf for sure, but I've had issues getting featureCounts to play nicely with both the unmodified GFF from NCBI as well as a "standard" GTF (commented on a related thread regarding this ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.).
featureCounts seems to fail to recognize gene_id in the 9th column of GTFs if there are other entries in that column. To get around this, I dropped anything from the GTF that wasn't "gene_id". I could attempt running featureCounts again with this remaining 9th column info retained, but I'm worried it will fail for the same reasons as outlined in the thread I linked.
One solution would be to create a file in simple annotation formation using the data from your GTF file and use that. SAF format is described in featureCounts/subread manual.