Exon-level counts with featureCounts
0
0
Entering edit mode
2.7 years ago
rependo ▴ 40

Hi all,

I'm using featureCounts to generate read counts from paired end alignments (generated w/ subread align). To corroborate gene-level expression estimates and to also look at differential exon usage, I'd like to also generate read counts for individual exons without merging at the gene level. To this end, I've tried omitting -g gene_id as a parameter, hoping it would default to exon-level counts, but it does not. I've also tried supplying -g exon, but it doesn't recognize exon as a valid field in my gtf (which makes sense looking at my gtf; example provided below).

My questions is thus: Is there a way to generate exon-level read counts without creating a new "exon_id" field in my gtf? I assume that if featureCounts is counting aligned fragments or reads on specific exons, there must be some way to see those counts unadulterated in output. I assume I can create an exon_id field to go along with gene_id, but this would require some gtf wrangling that I'm not able to do at present (suggestions also welcome here). If we use the GTF snippet below, I'd like to get counts for each of the 6 exons instead of 1 count for the entire gene.

Thanks for reading.

featureCounts -s 2 -t exon -g gene_id -a GCF_019175285.1.gtf -o putput.txt $FILE_PATH/file1.bam  \
 $FILE_PATH/file2.bam   \
 $FILE_PATH/file3.bam   \
 $FILE_PATH/file4.bam   \
 $FILE_PATH/file5.bam   \
 $FILE_PATH/file6.bam   \
 $FILE_PATH/file7.bam   \
 $FILE_PATH/file8.bam   \
 $FILE_PATH/file9.bam   \
 $FILE_PATH/file10.bam   \ 

NC_056522.1     Gnomon  transcript      16228   22254   .       +       .       gene_id "LOC121915633"
NC_056522.1     Gnomon  exon    16228   16392   .       +       .       gene_id "LOC121915633";
NC_056522.1     Gnomon  exon    16637   16863   .       +       .       gene_id "LOC121915633";
NC_056522.1     Gnomon  exon    17562   17700   .       +       .       gene_id "LOC121915633";
NC_056522.1     Gnomon  exon    18138   18295   .       +       .       gene_id "LOC121915633";
NC_056522.1     Gnomon  exon    18446   18498   .       +       .       gene_id "LOC121915633";
NC_056522.1     Gnomon  exon    20400   22254   .       +       .       gene_id "LOC121915633";
NC_056522.1     Gnomon  CDS     16306   16392   .       +       0       gene_id "LOC121915633";
NC_056522.1     Gnomon  CDS     16637   16863   .       +       0       gene_id "LOC121915633";
NC_056522.1     Gnomon  CDS     17562   17700   .       +       1       gene_id "LOC121915633";
NC_056522.1     Gnomon  CDS     18138   18295   .       +       0       gene_id "LOC121915633";
NC_056522.1     Gnomon  CDS     18446   18498   .       +       1       gene_id "LOC121915633";
NC_056522.1     Gnomon  CDS     20400   21178   .       +       2       gene_id "LOC121915633";
featureCounts subread • 1.9k views
ADD COMMENT
1
Entering edit mode

I'm afraid you need the exon_id field. You can try to just add a dummy exon_id to each exon of the GTF snippet you posted in your question, and run featurecounts using -g exon_id to check if the output is what you expect. Afterwards, you can think of a way of adding an exon_id to all the exons of the full GTF. Probably it can be done with some simple awk, such as:

awk -F'\t' 'BEGIN{counter=1}{if ($3=="exon"){print $0"; exon_id_"counter; counter++}else{print}}' input.gtf  > output.gtf

Code not tested.

ADD REPLY
0
Entering edit mode

Thanks, iraun. Your code worked, as well. Can you think of a way to affix gene_id to each exon_id? Example below

NC_056522.1 Gnomon  transcript  16228   22254   .   +   .   gene_id "LOC121915633"
NC_056522.1 Gnomon  exon    16228   16392   .   +   .   gene_id "LOC121915633";; LOC121915633 _exon_id_1
ADD REPLY
1
Entering edit mode
$ awk -F'\t| |"' 'BEGIN{counter=1}{if ($3=="exon"){print $0 " "$11"_exon_id_"counter; counter++}else{print}}' test.txt

NC_056522.1 Gnomon  transcript  16228   22254   .   +   .   gene_id "LOC121915633"
NC_056522.1 Gnomon  exon    16228   16392   .   +   .   gene_id "LOC121915633"; LOC121915633_exon_id_1
NC_056522.1 Gnomon  exon    16637   16863   .   +   .   gene_id "LOC121915633"; LOC121915633_exon_id_2
NC_056522.1 Gnomon  exon    17562   17700   .   +   .   gene_id "LOC121915633"; LOC121915633_exon_id_3
NC_056522.1 Gnomon  exon    18138   18295   .   +   .   gene_id "LOC121915633"; LOC121915633_exon_id_4
NC_056522.1 Gnomon  exon    18446   18498   .   +   .   gene_id "LOC121915633"; LOC121915633_exon_id_5
NC_056522.1 Gnomon  exon    20400   22254   .   +   .   gene_id "LOC121915633"; LOC121915633_exon_id_6
ADD REPLY
0
Entering edit mode

Many thanks!

ADD REPLY
0
Entering edit mode

looks like you GFF file is not following the specs completely. For each feature there should have been a unique ID

You might try to run it trough AGAT, it might give you some more info (or even fix the file) ?

ADD REPLY
0
Entering edit mode

Hi lieven. It's a truncated gtf for sure, but I've had issues getting featureCounts to play nicely with both the unmodified GFF from NCBI as well as a "standard" GTF (commented on a related thread regarding this ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.).

featureCounts seems to fail to recognize gene_id in the 9th column of GTFs if there are other entries in that column. To get around this, I dropped anything from the GTF that wasn't "gene_id". I could attempt running featureCounts again with this remaining 9th column info retained, but I'm worried it will fail for the same reasons as outlined in the thread I linked.

ADD REPLY
1
Entering edit mode

One solution would be to create a file in simple annotation formation using the data from your GTF file and use that. SAF format is described in featureCounts/subread manual.

ADD REPLY

Login before adding your answer.

Traffic: 2862 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6