7.2 years ago
theoharis
Suppose we have a text file in the RefSeqGene (GenBank flat file) format, such as the example record below. What is a suitable awk program (and regex) to create a new file with four columns (gene, synonym, note, summary), like this:
gene="AP3B2" gene_synonym="EIEE48; NAPTB" note="adaptor related protein complex 3 beta 2 subunit" Summary= "Adaptor protein complex 3 (AP-3 complex) is a heterotrimeric protein complex involved in the formation of clathrin-coated synaptic vesicles. The protein encoded by this gene represents the beta subunit of the neuron-specific AP-3 complex and was first identified as the target antigen in human paraneoplastic neurologic disorders. The encoded subunit binds clathrin and is phosphorylated by a casein kinase-like protein, which mediates synaptic vesicle coat assembly. Defects in this gene are a cause of early-onset epileptic encephalopathy. [provided by RefSeq, Feb 2017]."
> LOCUS NG_052957 57628 bp DNA linear PRI
> 02-MAR-2017 DEFINITION Homo sapiens adaptor related protein complex 3
> beta 2 subunit
> (AP3B2), RefSeqGene on chromosome 15. ACCESSION NG_052957 VERSION NG_052957.1 KEYWORDS RefSeq; RefSeqGene.
> SOURCE Homo sapiens (human) ORGANISM Homo sapiens
> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> Catarrhini; Hominidae; Homo. COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
> reference sequence was derived from AC105339.9 and FJ695193.1.
> This sequence is a reference standard in the RefSeqGene project.
>
> Summary: Adaptor protein complex 3 (AP-3 complex) is a
> heterotrimeric protein complex involved in the formation of
> clathrin-coated synaptic vesicles. The protein encoded by this gene
> represents the beta subunit of the neuron-specific AP-3 complex and
> was first identified as the target antigen in human paraneoplastic
> neurologic disorders. The encoded subunit binds clathrin and is
> phosphorylated by a casein kinase-like protein, which mediates
> synaptic vesicle coat assembly. Defects in this gene are a cause of
> early-onset epileptic encephalopathy. [provided by RefSeq, Feb
> 2017]. PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP
> 1-35060 AC105339.9 88079-123138
> 35061-35259 FJ695193.1 1-199 c
> 35260-57628 AC105339.9 123337-145705 FEATURES Location/Qualifiers
> source 1..57628
> /organism="Homo sapiens"
> /mol_type="genomic DNA"
> /db_xref="taxon:9606"
> /chromosome="15"
> /map="15q25.2"
> gene 916..4438
> /gene="LOC338963"
> /note="epididymal protein pseudogene"
> /pseudo
> /db_xref="GeneID:338963"
> misc_RNA join(916..1179,2602..3348,3477..3722,4334..4438)
> /gene="LOC338963"
> /product="epididymal protein pseudogene"
> /exception="mismatches in transcription"
> /pseudo
> /transcript_id="NR_034139.1"
> /db_xref="GeneID:338963"
> gene 5001..55628
> /gene="AP3B2"
> /gene_synonym="EIEE48; NAPTB"
> /note="adaptor related protein complex 3 beta 2 subunit"
> /db_xref="GeneID:8120"
> /db_xref="HGNC:HGNC:567"
> /db_xref="MIM:602166"
> mRNA join(5001..5315,25456..25531,25677..25751,26078..26173,
> 33329..33489,33731..33797,33890..34072,34154..34437,
> 34680..34734,35109..35180,36742..36804,37106..37238,
> 37526..37635,38272..38448,47976..48162,49334..49452,
> 49606..49662,49966..50074,50419..50542,50934..51108,
> 51289..51349,51676..51782,51987..52215,52657..52741,
> 52987..53084,54926..55064,55199..55628)
> /gene="AP3B2"
> /gene_synonym="EIEE48; NAPTB"
> /product="adaptor related protein complex 3 beta 2
> subunit, transcript variant 1"
> /transcript_id="NM_001278512.1"
> /db_xref="GeneID:8120"
> /db_xref="HGNC:HGNC:567"
> /db_xref="MIM:602166"
> exon 5001..5315
> /gene="AP3B2"
> /gene_synonym="EIEE48; NAPTB"
> /inference="alignment:Splign:2.0.8"
> /number=1
> CDS join(5203..5315,25456..25531,25677..25751,26078..26173,
> 33329..33489,33731..33797,33890..34072,34154..34437,
> 34680..34734,35109..35180,36742..36804,37106..37238,
> 37526..37635,38272..38448,47976..48162,49334..49452,
> 49606..49662,49966..50074,50419..50542,50934..51108,
> 51289..51349,51676..51782,51987..52215,52657..52741,
> 52987..53084,54926..55064,55199..55349)
> /gene="AP3B2"
> /gene_synonym="EIEE48; NAPTB"
> /note="isoform 1 is encoded by transcript variant 1;
> Neuronal adaptin-like protein, beta-subunit; AP-3 complex
> subunit beta-2; beta-3B-adaptin; adaptor protein complex
> AP-3 subunit beta-2; neuron-specific vesicle coat protein
> beta-NAP; clathrin assembly protein complex 3 beta-2 large
> chain; adaptor-related protein complex 3 subunit beta-2"
> /codon_start=1
> /product="AP-3 complex subunit beta-2 isoform 1"
> /protein_id="NP_001265441.1"
> /db_xref="CCDS:CCDS61737.1"
> /db_xref="GeneID:8120"
> /db_xref="HGNC:HGNC:567"
> /db_xref="MIM:602166"
> /translation="MSAAPAYSEDKGGSAGPGEPEYGHDPASGGIFSSDYKRHDDLKE
> MLDTNKDSLKLEAMKRIVAMIARGKNASDLFPAVVKNVACKNIEVKKLVYVYLVRYAE
> EQQDLALLSISTFQRGLKDPNQLIRASALRVLSSIRVPIIVPIMMLAIKEAASDMSPY
> VRKTAAHAIPKLYSLDSDQKDQLIEVIEKLLADKTTLVAGSVVMAFEEVCPERIDLIH
> KNYRKLCNLLIDVEEWGQVVIISMLTRYARTQFLSPTQNESLLEENAEKAFYGSEEDE
> AKGAGSEETAAAAAPSRKPYVMDPDHRLLLRNTKPLLQSRSAAVVMAVAQLYFHLAPK
> AEVGVIAKALVRLLRSHSEVQYVVLQNVATMSIKRRGMFEPYLKSFYIRSTDPTQIKI
> LKLEVLTNLANETNIPTVLREFQTYIRSMDKDFVAATIQAIGRCATNIGRVRDTCLNG
> LVQLLSNRDELVVAESVVVIKKLLQMQPAQHGEIIKHLAKLTDNIQVPMARASILWLI
> GEYCEHVPRIAPDVLRKMAKSFTAEEDIVKLQVINLAAKLYLTNSKQTKLLTQYVLSL
> AKYDQNYDIRDRARFTRQLIVPSEQGGALSRHAKKLFLAPKPAPVLESSFKDRDHFQL
> GSLSHLLNAKATGYQELPDWPEEAPDPSVRNVEEEDLSLIETHVGLLGEYTEVPEWTK
> CSNREKRKEKEKPFYSDSEGESGPTESADSDPESESESDSKSSSESGSGESSSESDNE
> DQDEDEEKGRGSESEQSEEDGKRKTKKKVPERKGEASSSDEGSDSSSSSSESEMTSES
> EEEQLEPASWSRKTPPSSKSAPATKEISLLDLEDFTPPSVQPVSPPAIVSTSLAADLE
> GLTLTDSTLVPSLLSPVSGVGRQELLHRVAGEGLAVDYTFSRQPFSGDPHMVSVHIHF
> SNSSDTPIKGLHVGTPKLPAGISIQEFPEIESLAPGESATAVMGINFCDSTQAANFQL
> CTQTRQFYVSIQPPVGELMAPVFMSENEFKKEQGKLMGMNEITEKLMLPDTCRSDHIV
> VQKVTATANLGRVPCGTSDEYRFAGRTLTGGSLVLLTLDARPAGAAQLTVNSEKMVIG
> TMLVKDVIQALTQ"
> gene complement(22089..>57628)
> /gene="CPEB1-AS1"
> /note="CPEB1 antisense RNA 1"
> /db_xref="GeneID:283692"
> /db_xref="HGNC:HGNC:27523"
> ncRNA complement(22089..>22898)
> /ncRNA_class="lncRNA"
> /gene="CPEB1-AS1"
> /product="CPEB1 antisense RNA 1"
> /inference="similar to RNA sequence (same
> species):RefSeq:NR_046096.1"
> /exception="annotated by transcript or proteomic data"
> /transcript_id="NR_046096.1"
> /db_xref="GeneID:283692"
> /db_xref="HGNC:HGNC:27523"
> gene 22457..23383
> /gene="LOC100421235"
> /note="serine and arginine rich splicing factor 9
> pseudogene"
> /pseudo
> /db_xref="GeneID:100421235"
> exon 25456..25531
> /gene="AP3B2"
> /gene_synonym="EIEE48; NAPTB"
> /inference="alignment:Splign:2.0.8"
> /number=2 ORIGIN
> 1 gtccccatgg ggtgggtggc atgatcaggc caggtgcccc aggagtggga gtctctgttc
> 61 cctgggctct tacagctcca gggccttgcc cccttttctt tcttacaaag aaaacggtgg
> 121 cttgactcag caaaaactaa gaagggtagc tgtttctcca ggtcaggaag gatacggggg
> 181 tcagcacttc ctggcagttg agtctgggga agggggacct cacatgccag cagcgtgaga
> 241 aagatgatac tgtacagtgg tgaaggacac gggcactgga gccagaccac ttggcctgaa
> 301 tactggttgt gccgcttacc agcttgtaac ctctccaagc ctcagtttcc ccatctgtaa
> 361 aatgggaagt ataacatcat ctacttcaag tcattattgt tagggctaaa tgatgcttta //
Take a look at parsing GenBank files with Biopython.
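If installing Biopython is not an option, here is a rough sketch of the same idea using only Python's standard `re` module. It is not awk and not Biopython's parser, just one possible starting point. It assumes the file keeps the usual GenBank layout (feature keys indented by five spaces, qualifiers indented further, the summary inside COMMENT ending at a blank line or the next top-level keyword) and that qualifier values contain no embedded double quotes. The names `gene_row`, `squeeze` and `SAMPLE` are made up for this illustration, and `SAMPLE` is a shortened excerpt of the record in the question.

```python
import re

# A feature key (gene, mRNA, CDS, ...) sits after exactly five spaces;
# qualifier lines are indented further, so they do not match.
FEATURE_START = re.compile(r"^ {5}(\S+)\s+\S")
# Quoted qualifiers such as /gene="AP3B2"; values may wrap across lines.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def squeeze(text):
    """Collapse line breaks and runs of spaces into single spaces."""
    return " ".join(text.split())


def gene_row(record, gene_name):
    """Return (gene, synonym, note, summary) for gene_name, or None."""
    # Summary: from "Summary:" up to a blank line or the next top-level keyword.
    m = re.search(r"Summary:(.*?)(?=\n\s*\n|\n\S|\Z)", record, re.S)
    summary = squeeze(m.group(1)) if m else ""

    features = []          # list of [feature_key, joined qualifier text]
    in_features = False
    for line in record.splitlines():
        if not in_features:
            in_features = line.startswith("FEATURES")
            continue
        if line and not line[0].isspace():
            break          # next top-level section, e.g. ORIGIN
        start = FEATURE_START.match(line)
        if start:
            features.append([start.group(1), ""])
        elif features:
            features[-1][1] += " " + line.strip()

    for key, body in features:
        if key != "gene":
            continue
        quals = {k: squeeze(v) for k, v in QUALIFIER.findall(body)}
        if quals.get("gene") == gene_name:
            return (quals["gene"], quals.get("gene_synonym", ""),
                    quals.get("note", ""), summary)
    return None


# Shortened excerpt of the record from the question (summary trimmed).
SAMPLE = """\
LOCUS       NG_052957              57628 bp    DNA     linear   PRI 02-MAR-2017
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.

            Summary: Adaptor protein complex 3 (AP-3 complex) is a
            heterotrimeric protein complex involved in the formation of
            clathrin-coated synaptic vesicles. [provided by RefSeq, Feb
            2017].
FEATURES             Location/Qualifiers
     gene            916..4438
                     /gene="LOC338963"
                     /note="epididymal protein pseudogene"
                     /pseudo
                     /db_xref="GeneID:338963"
     gene            5001..55628
                     /gene="AP3B2"
                     /gene_synonym="EIEE48; NAPTB"
                     /note="adaptor related protein complex 3 beta 2
                     subunit"
                     /db_xref="GeneID:8120"
ORIGIN
//
"""

# One tab-separated line: gene, synonym, note, summary.
print("\t".join(gene_row(SAMPLE, "AP3B2")))
```

The gene name is passed in explicitly because a RefSeqGene record usually lists several `gene` features (here LOC338963 as well as AP3B2), while the summary in COMMENT describes only the record's main gene. For a real file, read its contents into a string and pass that instead of `SAMPLE`.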
The question was about using awk with the RefSeqGene file format :)
Have you tried an awk command yourself? If so, post it along with any errors or output you are getting; then someone may be able to help. As it stands, you are asking for a service, i.e. for someone to provide a usable command without any input from you as the OP, and that is not the purpose of the site. Given your file type, I offered a suggestion to get you started on writing your own code to parse the file the way you want.