Extracting GeneID from Dbxref section in GFF file while using featureCounts
3.1 years ago
Shraddha ▴ 90

Hi all,

I'm trying to generate feature count files for the DESeq2 pipeline, but I've run into an issue while using featureCounts (see image).

The gene IDs that I need aren't in the same format as the rest of the attributes; they sit inside the Dbxref section instead. How can I extract just the gene ID so that featureCounts will produce an output?

Thanks and kind regards

featurecounts gff • 1.2k views
3.1 years ago
vkkodali_ncbi ★ 3.8k

One solution is to use the gene attribute with featureCounts. Separately, you can generate a GeneID to gene name map from the GFF3 file using something like this:

zgrep 'GeneID' GCF_900626175.2_cs10_genomic.gff.gz \
  | cut -f9 | perl -pe 's/ID.*(GeneID:\d+).*gene=([^;]*).*/\1\t\2/g' \
  | sort -u > genes.txt

Finally, join the featureCounts output table to the genes.txt file on the gene name column.
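
For the join step, here is a minimal sketch using awk. It assumes genes.txt has the GeneID in column 1 and the gene name in column 2 (as produced by the command above), and that the counts table has the usual featureCounts layout: a leading comment line starting with #, a header row whose first field is Geneid, and the gene name in the first column of every data row. The function name add_geneid and the new column name Dbxref_GeneID are made up for illustration.

```shell
# add_geneid GENES_TXT COUNTS_TXT
# Prepends a GeneID column to a featureCounts table by looking up each
# gene name (column 1 of the counts table) in genes.txt (GeneID<TAB>name).
# Note: assumes genes.txt is non-empty (NR == FNR marks the first file).
add_geneid() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR      { id[$2] = $1; next }                 # genes.txt: name -> GeneID
    /^#/           { print; next }                       # keep the comment line as-is
    $1 == "Geneid" { print "Dbxref_GeneID", $0; next }   # header row
    {
      gid = ($1 in id) ? id[$1] : "NA"                   # NA when the name is unmapped
      print gid, $0
    }
  ' "$1" "$2"
}
```

Usage would look like add_geneid genes.txt counts.txt > counts_with_geneid.txt, where both file names are placeholders. Rows whose gene name has no entry in genes.txt get NA, so they are easy to spot afterwards.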

Thanks for your response! I tried using 'gene' with the -g flag, but it gave me unsatisfactory results (no features were found for any of my samples). I would hypothesize that the gene ID should be just the number, without the LOC. I was doing a long-winded series of awk commands to execute your second alternative, but this is far neater. Thanks again!

"I would hypothesize that the gene ID should be just the number, without the LOC."

Yes, if you come across any LOC style identifiers you can be sure that the suffix numeral is the GeneID.
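
If you want the bare number, something along these lines should do it. This is only a sketch: the function name loc_to_geneid is made up, and it assumes one gene name per line on stdin. Names that are not of the form LOC followed by digits are passed through unchanged.

```shell
# loc_to_geneid: read gene names on stdin, one per line.
# LOC<digits> becomes <digits>; every other name is printed unchanged.
loc_to_geneid() {
  sed -E 's/^LOC([0-9]+)$/\1/'
}
```

For example, cut -f2 genes.txt | loc_to_geneid would print the numeric GeneID for every LOC-style name in the map and leave named genes alone.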
