I am trying to calculate Fragments Per Kilobase of transcript per Million mapped reads (FPKM) for an RNA-seq experiment. The gene annotation file I used to count genes with htseq was UCSC's NCBI RefSeq file (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz). This file has a not so straightforward structure without column headings:
head mm10.ncbiRefSeq.gtf
chr1 ncbiRefSeq transcript 3199733 3672278 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; gene_name "Xkr4";
chr1 ncbiRefSeq exon 3199733 3207317 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "1"; exon_id "XM_006495550.3.1"; gene_name "Xkr4";
chr1 ncbiRefSeq 3UTR 3199733 3207317 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "1"; exon_id "XM_006495550.3.1"; gene_name "Xkr4";
chr1 ncbiRefSeq exon 3213439 3216968 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1 ncbiRefSeq 3UTR 3213439 3216021 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1 ncbiRefSeq CDS 3216025 3216968 . - 2 gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "2"; exon_id "XM_006495550.3.2"; gene_name "Xkr4";
chr1 ncbiRefSeq exon 3421702 3421901 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "3"; exon_id "XM_006495550.3.3"; gene_name "Xkr4";
chr1 ncbiRefSeq CDS 3421702 3421901 . - 1 gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "3"; exon_id "XM_006495550.3.3"; gene_name "Xkr4";
chr1 ncbiRefSeq exon 3670552 3672278 . - . gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "4"; exon_id "XM_006495550.3.4"; gene_name "Xkr4";
chr1 ncbiRefSeq CDS 3670552 3671348 . - 0 gene_id "Xkr4"; transcript_id "XM_006495550.3"; exon_number "4"; exon_id "XM_006495550.3.4"; gene_name "Xkr4";
How does one use this file to calculate feature lengths that can then be used to calculate FPKM?
I can see that columns 4 and 5 are the start and stop of each feature so you could get length from that, but how do you condense all of the other information so you just have gene_id and length?
I am sure this has been done before I just don't know how you parse through all the gene regions to just get one length per gene.
Thanks in advance!
that file is a typical GFF (or GTF, very similar) format. You will quickly find what the columns are when searching for it.
Not 100% sure of it but I think the AGAT package might have some sub-programs to calculate lengths of genes etc.
AGAT - Another Gff/Gtf Analysis Toolkit
Additional note: FPKM is a somewhat deprecated measure (except for very specific use-cases) , nowadays people rely more on TPM or CPM values.
Not such feature in AGAT :/ but might be a good idea to implement something in AGAT to perform this type of taskā¦