Question

Using RefSeq gene annotation file to determine gene lengths

0

Entering edit mode

4.1 years ago

williamtmills ▴ 20

I am trying to calculate Fragments Per Kilobase of transcript per Million mapped reads (FPKM) for an RNA-seq experiment. The gene annotation file I used to count genes with htseq was UCSC's NCBI RefSeq file (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz). This file has a not so straightforward structure without column headings:

head mm10.ncbiRefSeq.gtf
chr1    ncbiRefSeq  transcript  3199733 3672278 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3";  gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3199733 3207317 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "1"; exon_id "XM_006495550.3.1"; gene_name "Xkr4";
chr1    ncbiRefSeq  3UTR    3199733 3207317 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "1"; exon_id "XM_006495550.3.1"; gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3213439 3216968 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1    ncbiRefSeq  3UTR    3213439 3216021 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1    ncbiRefSeq  CDS 3216025 3216968 .   -   2   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3421702 3421901 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "3"; exon_id "XM_006495550.3.3"; gene_name "Xkr4";
chr1    ncbiRefSeq  CDS 3421702 3421901 .   -   1   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "3"; exon_id "XM_006495550.3.3"; gene_name "Xkr4";
chr1    ncbiRefSeq  exon    3670552 3672278 .   -   .   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "4"; exon_id "XM_006495550.3.4"; gene_name "Xkr4";
chr1    ncbiRefSeq  CDS 3670552 3671348 .   -   0   gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "4"; exon_id "XM_006495550.3.4"; gene_name "Xkr4";

How does one use this file to calculate feature lengths that can then be used to calculate FPKM?

I can see that columns 4 and 5 are the start and stop of each feature so you could get length from that, but how do you condense all of the other information so you just have gene_id and length?

I am sure this has been done before I just don't know how you parse through all the gene regions to just get one length per gene.

Thanks in advance!

refseq rstudio htseq featurelength gtf fpkm geneid • 1.9k views

ADD COMMENT • link updated 4.1 years ago by Juke34 9.3k • written 4.1 years ago by williamtmills ▴ 20

0

Entering edit mode

that file is a typical GFF (or GTF, very similar) format. You will quickly find what the columns are when searching for it.

Not 100% sure of it but I think the AGAT package might have some sub-programs to calculate lengths of genes etc.

AGAT - Another Gff/Gtf Analysis Toolkit

Additional note: FPKM is a somewhat deprecated measure (except for very specific use-cases) , nowadays people rely more on TPM or CPM values.

ADD REPLY • link 4.1 years ago by lieven.sterck 15k

0

Entering edit mode

Not such feature in AGAT :/ but might be a good idea to implement something in AGAT to perform this type of task…

ADD REPLY • link 4.1 years ago by Juke34 9.3k

score 0 · Answer 1 · 2021-06-30

0

Entering edit mode

4.1 years ago

swbarnes2 15k

Don't try to do this yourself. Gene length is not enough, because genes can have multiple transcripts of differing length.

Use either RSEM, or something like Salmon. Those are smart enough to calculate this properly, and they will return FPKM.