Question

generating 'maximal' cds sequence from multiple isoforms of a gene

5.1 years ago

mforthman ▴ 50

For a given gene, I would like to take the exon or CDS coordinates of all isoforms and 'merge' them to create a reference transcript that represents the 'maximal' number of coding regions. I have seen some previous posts that address this issue, but they only apply to Swiss-Prot, Entrez, Ensembl, or UCSC genomes, none of which have any references close to my group of study (only NCBI does). I have a .gff file from NCBI, but I am not sure how to use it to accomplish my goal. As an example,

Input (3 isoforms for a single gene):

NW_020110170.1  Gnomon  gene    410812  420085  .   -   .   ID=gene-LOC106680947;Dbxref=GeneID:106680947;Name=LOC106680947;gbkey=Gene;gene=LOC106680947;gene_biotype=protein_coding
NW_020110170.1  Gnomon  mRNA    410812  420085  .   -   .   ID=rna-XM_024358664.1;Parent=gene-LOC106680947;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;Name=XM_024358664.1;gbkey=mRNA;gene=LOC106680947;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    419892  420085  .   -   .   ID=exon-XM_024358664.1-1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    419680  419780  .   -   .   ID=exon-XM_024358664.1-2;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    418565  418722  .   -   .   ID=exon-XM_024358664.1-3;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    415504  415763  .   -   .   ID=exon-XM_024358664.1-4;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    414418  414585  .   -   .   ID=exon-XM_024358664.1-5;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    413749  413822  .   -   .   ID=exon-XM_024358664.1-6;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  exon    410812  411149  .   -   .   ID=exon-XM_024358664.1-7;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1  Gnomon  CDS 419680  419719  .   -   0   ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1  Gnomon  CDS 418565  418722  .   -   2   ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1  Gnomon  CDS 415504  415763  .   -   0   ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1  Gnomon  CDS 414418  414585  .   -   1   ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1  Gnomon  CDS 413749  413822  .   -   1   ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1  Gnomon  CDS 410860  411149  .   -   2   ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1  Gnomon  mRNA    410812  420085  .   -   .   ID=rna-XM_024358308.1;Parent=gene-LOC106680947;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;Name=XM_024358308.1;gbkey=mRNA;gene=LOC106680947;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  exon    419680  420085  .   -   .   ID=exon-XM_024358308.1-1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  exon    418565  418722  .   -   .   ID=exon-XM_024358308.1-2;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  exon    415504  415763  .   -   .   ID=exon-XM_024358308.1-3;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  exon    414418  414585  .   -   .   ID=exon-XM_024358308.1-4;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  exon    413749  413822  .   -   .   ID=exon-XM_024358308.1-5;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  exon    410812  411149  .   -   .   ID=exon-XM_024358308.1-6;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1  Gnomon  CDS 419680  419794  .   -   0   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 418565  418722  .   -   2   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 415504  415763  .   -   0   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 414418  414585  .   -   1   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 413749  413822  .   -   1   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 410860  411149  .   -   2   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  mRNA    410813  419240  .   -   .   ID=rna-XM_024358505.1;Parent=gene-LOC106680947;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;Name=XM_024358505.1;gbkey=mRNA;gene=LOC106680947;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  exon    418922  419240  .   -   .   ID=exon-XM_024358505.1-1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  exon    418565  418722  .   -   .   ID=exon-XM_024358505.1-2;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  exon    415504  415763  .   -   .   ID=exon-XM_024358505.1-3;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  exon    414418  414585  .   -   .   ID=exon-XM_024358505.1-4;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  exon    413749  413822  .   -   .   ID=exon-XM_024358505.1-5;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  exon    410813  411149  .   -   .   ID=exon-XM_024358505.1-6;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1  Gnomon  CDS 418922  419015  .   -   0   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1  Gnomon  CDS 418565  418722  .   -   2   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1  Gnomon  CDS 415504  415763  .   -   0   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1  Gnomon  CDS 414418  414585  .   -   1   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1  Gnomon  CDS 413749  413822  .   -   1   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1  Gnomon  CDS 410860  411149  .   -   2   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1

Ideally, an approach would be able to assess which exons/CDSs of one isoform overlap with exons/CDSs from another isoform, e.g., the final set of coordinates would be:

NW_020110170.1  Gnomon  CDS 410860  411149  .   -   2   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 413749  413822  .   -   1   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 414418  414585  .   -   1   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 415504  415763  .   -   0   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 418565  418722  .   -   2   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1  Gnomon  CDS 418922  419015  .   -   0   ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1  Gnomon  CDS 419680  419794  .   -   0   ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1

I wonder if there is a way to modify any of the recommended approaches in this post to serve my purpose.

gene gff CDS • 1.3k views

updated 5.1 years ago by Brice Sarver ★ 3.8k • written 5.1 years ago by mforthman ▴ 50

score 3 · Answer 1 · 2019-10-16

A couple of things:

Not all exons are CDSs (exons can include untranslated sequence). Which feature type you select could matter for downstream analyses, so choose carefully.
If your GFF has annotated or predicted genes from multiple annotation sources, it may be worthwhile to select one.
GFFs carry more information than just the coordinates. After merging, the codon phase (column 8) may or may not be preserved, and other information may or may not remain relevant (e.g., the transcript ID associated with a given exon).

If you just want a list of coordinates, you can take all the coordinates of one feature type (e.g., all CDSs) for a gene of interest and merge the overlapping intervals into single blocks using, for example, bedtools merge; see here. Make sure you know whether the coordinates from your source use a zero- or one-based indexing scheme: GFF is one-based with inclusive ends, whereas BED is zero-based with half-open intervals, so convert accordingly if you switch formats. This post by Obi Griffith is referenced pretty regularly.
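
If you would rather not go through bedtools, the same merge can be sketched in a few lines of Python. This is only a minimal illustration, not a tested pipeline: the function names are made up for this example, it assumes a tab-delimited GFF3 whose CDS rows carry an NCBI-style `gene=` attribute (as in the question), and it returns coordinates only, so phase and per-transcript attributes are dropped.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute string (key=value;key=value) into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def merged_cds_by_gene(gff_lines, feature="CDS"):
    """Group features by (seqid, strand, gene) and merge overlapping intervals.

    Coordinates stay 1-based and inclusive, as in GFF3.
    """
    intervals = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        gene = parse_attributes(cols[8]).get("gene")  # assumption: NCBI "gene=" tag
        if gene is None:
            continue
        intervals[(cols[0], cols[6], gene)].append((int(cols[3]), int(cols[4])))

    merged = {}
    for key, spans in intervals.items():
        spans.sort()
        blocks = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= blocks[-1][1]:  # overlaps the previous block
                blocks[-1][1] = max(blocks[-1][1], end)
            else:
                blocks.append([start, end])
        merged[key] = [tuple(block) for block in blocks]
    return merged


def to_bed(seqid, spans, name, strand):
    """Convert 1-based inclusive spans to 0-based half-open BED6 lines."""
    return [f"{seqid}\t{start - 1}\t{end}\t{name}\t0\t{strand}" for start, end in spans]
```

Run on the CDS rows of the three isoforms in the question, this should give the seven blocks listed there, with 419680-419719 (X3) absorbed into 419680-419794 (X1). Only truly overlapping intervals are joined; change the test to `start <= blocks[-1][1] + 1` if directly adjacent blocks should be fused too. Keep in mind that a union of CDS blocks from different isoforms is not guaranteed to translate in frame.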