For a given gene, I would like to take the exon or CDS coordinates for all isoforms and 'merge' them to create a reference transcript that represents the 'maximal' number of coding regions. I have seen some previous posts that address this issue, but they only apply to Swiss-Prot, Entrez, Ensembl, or UCSC genomes, none of which has a reference close to my study group (only NCBI does). I have a .gff file from NCBI, but I am not sure how to use it to accomplish my goal. For example:
Input (3 isoforms for a single gene):
NW_020110170.1 Gnomon gene 410812 420085 . - . ID=gene-LOC106680947;Dbxref=GeneID:106680947;Name=LOC106680947;gbkey=Gene;gene=LOC106680947;gene_biotype=protein_coding
NW_020110170.1 Gnomon mRNA 410812 420085 . - . ID=rna-XM_024358664.1;Parent=gene-LOC106680947;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;Name=XM_024358664.1;gbkey=mRNA;gene=LOC106680947;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 419892 420085 . - . ID=exon-XM_024358664.1-1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 419680 419780 . - . ID=exon-XM_024358664.1-2;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 418565 418722 . - . ID=exon-XM_024358664.1-3;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 415504 415763 . - . ID=exon-XM_024358664.1-4;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 414418 414585 . - . ID=exon-XM_024358664.1-5;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 413749 413822 . - . ID=exon-XM_024358664.1-6;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon exon 410812 411149 . - . ID=exon-XM_024358664.1-7;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XM_024358664.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X3;transcript_id=XM_024358664.1
NW_020110170.1 Gnomon CDS 419680 419719 . - 0 ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1 Gnomon CDS 418565 418722 . - 2 ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1 Gnomon CDS 415504 415763 . - 0 ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1 Gnomon CDS 414418 414585 . - 1 ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1 Gnomon CDS 413749 413822 . - 1 ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1 Gnomon CDS 410860 411149 . - 2 ID=cds-XP_024214432.1;Parent=rna-XM_024358664.1;Dbxref=GeneID:106680947,Genbank:XP_024214432.1;Name=XP_024214432.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X3;protein_id=XP_024214432.1
NW_020110170.1 Gnomon mRNA 410812 420085 . - . ID=rna-XM_024358308.1;Parent=gene-LOC106680947;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;Name=XM_024358308.1;gbkey=mRNA;gene=LOC106680947;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 samples with support for all annotated introns;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon exon 419680 420085 . - . ID=exon-XM_024358308.1-1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon exon 418565 418722 . - . ID=exon-XM_024358308.1-2;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon exon 415504 415763 . - . ID=exon-XM_024358308.1-3;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon exon 414418 414585 . - . ID=exon-XM_024358308.1-4;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon exon 413749 413822 . - . ID=exon-XM_024358308.1-5;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon exon 410812 411149 . - . ID=exon-XM_024358308.1-6;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XM_024358308.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X1;transcript_id=XM_024358308.1
NW_020110170.1 Gnomon CDS 419680 419794 . - 0 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 418565 418722 . - 2 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 415504 415763 . - 0 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 414418 414585 . - 1 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 413749 413822 . - 1 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 410860 411149 . - 2 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon mRNA 410813 419240 . - . ID=rna-XM_024358505.1;Parent=gene-LOC106680947;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;Name=XM_024358505.1;gbkey=mRNA;gene=LOC106680947;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 6 samples with support for all annotated introns;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon exon 418922 419240 . - . ID=exon-XM_024358505.1-1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon exon 418565 418722 . - . ID=exon-XM_024358505.1-2;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon exon 415504 415763 . - . ID=exon-XM_024358505.1-3;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon exon 414418 414585 . - . ID=exon-XM_024358505.1-4;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon exon 413749 413822 . - . ID=exon-XM_024358505.1-5;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon exon 410813 411149 . - . ID=exon-XM_024358505.1-6;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XM_024358505.1;gbkey=mRNA;gene=LOC106680947;product=uncharacterized LOC106680947%2C transcript variant X2;transcript_id=XM_024358505.1
NW_020110170.1 Gnomon CDS 418922 419015 . - 0 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1 Gnomon CDS 418565 418722 . - 2 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1 Gnomon CDS 415504 415763 . - 0 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1 Gnomon CDS 414418 414585 . - 1 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1 Gnomon CDS 413749 413822 . - 1 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1 Gnomon CDS 410860 411149 . - 2 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
Ideally, an approach would be able to assess which exons/CDSs of one isoform overlap with exons/CDSs from another isoform and merge them, e.g., the final set of coordinates would be:
NW_020110170.1 Gnomon CDS 410860 411149 . - 2 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 413749 413822 . - 1 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 414418 414585 . - 1 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 415504 415763 . - 0 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 418565 418722 . - 2 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
NW_020110170.1 Gnomon CDS 418922 419015 . - 0 ID=cds-XP_024214273.1;Parent=rna-XM_024358505.1;Dbxref=GeneID:106680947,Genbank:XP_024214273.1;Name=XP_024214273.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X2;protein_id=XP_024214273.1
NW_020110170.1 Gnomon CDS 419680 419794 . - 0 ID=cds-XP_024214076.1;Parent=rna-XM_024358308.1;Dbxref=GeneID:106680947,Genbank:XP_024214076.1;Name=XP_024214076.1;gbkey=CDS;gene=LOC106680947;product=uncharacterized protein LOC106680947 isoform X1;protein_id=XP_024214076.1
I wonder if there is a way to modify any of the recommended approaches in this post to serve my purpose.
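For reference, the merge described above can be sketched in plain Python. This is only a minimal illustration, not one of the approaches recommended in the linked post: it assumes a tab-separated GFF3 file with a `gene=` attribute (as in the NCBI records shown), and the function names `parse_attributes` and `merge_features` are made up for this example. Only overlapping intervals are merged; adjacent ones are kept separate.

```python
from collections import defaultdict

def parse_attributes(field):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)

def merge_features(gff_lines, feature_type="CDS"):
    """Return {(seqid, strand, gene): [(start, end), ...]} with overlaps merged."""
    by_gene = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        gene = parse_attributes(cols[8]).get("gene")
        by_gene[(cols[0], cols[6], gene)].append((int(cols[3]), int(cols[4])))
    merged = {}
    for key, intervals in by_gene.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:      # overlaps the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[key] = [tuple(iv) for iv in out]
    return merged

# Unique CDS coordinates from the three isoforms above, rebuilt as
# tab-separated lines (the ID attribute here is a placeholder).
cds = [(419680, 419719), (418565, 418722), (410860, 411149),
       (419680, 419794), (418922, 419015), (415504, 415763),
       (414418, 414585), (413749, 413822)]
demo = ["\t".join(["NW_020110170.1", "Gnomon", "CDS", str(s), str(e), ".", "-", "0",
                   "ID=cds-demo;gene=LOC106680947"]) for s, e in cds]
result = merge_features(demo)
for (seqid, strand, gene), ivs in result.items():
    for s, e in ivs:
        print(seqid, s, e, gene, strand, sep="\t")
```

On the example gene this yields seven intervals, with 419680-419719 absorbed into 419680-419794. Note that the phase column of the original records is not meaningful after merging, so it is dropped here.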
This was great! After some time trying to sort the file, I think I got bedtools to do exactly what I wanted with the .gff file. Next step is to figure out how I can use the bedtools output to extract the "new" exons (what I'm more interested in for downstream analyses) from the corresponding NCBI genome. Thanks!
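For the extraction step, one possible sketch in plain Python follows. It assumes the merged intervals are 1-based and inclusive (as in GFF) and that the genome is an ordinary FASTA file; `read_fasta` and `extract` are hypothetical helper names, and the toy sequence is invented for the demonstration. For a whole genome, `bedtools getfasta` with `-s` is probably the more practical route.

```python
def read_fasta(lines):
    """Read FASTA lines into {sequence_id: sequence}; the id is the first word of the header."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

# Translation table for complementing DNA (A/C/G/T/N only).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def extract(seqs, seqid, start, end, strand):
    """Return the sequence for a 1-based inclusive interval, reverse-complemented on '-'."""
    sub = seqs[seqid][start - 1:end]
    return sub.translate(COMPLEMENT)[::-1] if strand == "-" else sub

# Toy genome (invented data): one 16-bp sequence split over two lines.
toy = [">chrToy toy sequence", "AACCGGTT", "ACGTACGT"]
genome = read_fasta(toy)
plus = extract(genome, "chrToy", 3, 6, "+")
minus = extract(genome, "chrToy", 1, 4, "-")
print(plus, minus)
```

Since the gene in the question is on the minus strand, the merged intervals would be extracted with `strand="-"` and concatenated from the highest coordinate to the lowest to follow the direction of transcription.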