Hi, I have a large GenBank (GB) file containing thousands of records, which I retrieved with the NCBI E-utilities. I now need the translated sequence of one particular gene (pol), together with the accession number and definition of each record, in a single FASTA file. Here is an example to make this clearer:
INPUT:
LOCUS LC570899 893 bp RNA linear VRL 08-AUG-2020
DEFINITION Human immunodeficiency virus 1 msaha01609 pol gene for pol protein,
partial cds.
ACCESSION LC570899
VERSION LC570899.1
KEYWORDS .
SOURCE Human immunodeficiency virus 1 (HIV-1)
ORGANISM Human immunodeficiency virus 1
Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;
Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.
REFERENCE 1
AUTHORS Bhatta,M., Nandi,S. and Saha,M.K.
TITLE HIV Drug Related
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 893)
AUTHORS Bhatta,M., Nandi,S. and Saha,M.K.
TITLE Direct Submission
JOURNAL Submitted (20-JUL-2020) Contact:Mihir Bhatta ICMR-National
Institute of Cholera and Enteric Disease, Virology; P-33, CIT Road,
Scheme-XM, Beliaghata, Kolkata, Kolkata, West Bengal 700010, India
FEATURES Location/Qualifiers
source 1..893
/organism="Human immunodeficiency virus 1"
/mol_type="genomic RNA"
/isolate="msaha01609"
/host="Homo sapiens"
/db_xref="taxon:11676"
/country="India"
/collection_date="2017-07-19"
/collected_by="Srijita Nandi"
/identified_by="Mihir Bhatta"
gene <1..>893
/gene="pol"
CDS <1..>893
/gene="pol"
/note="protease and reverse transcriptase"
/codon_start=3
/product="pol protein"
/protein_id="BCK50781.1"
/translation="QRPLVPIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
GGFIKVRQYDQIPIEICGXKAIGTVLVGPTPVNIIGRNLLTQLGCTLNFPISPIETVP
VKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGKISKIGPENPYNTPIFAIKKKD
STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
KYTAFTIPSLNNETPGIRYQYNVLPQGWKGSPXIFQASMTKILEPFRAQNPEIVIYQY
MDDLYVGSDLEIGQHRAKIEE"
ORIGIN
1 ggcagcgacc ccttgtccca ataaaagtag ggggtcagac aaaagaggct ctcttagaca
61 caggagcaga tgatacagta ttagaagaaa taaatttgcc aggaaaatgg aaaccaaaaa
121 tgataggagg aattggaggt ttyatcaaag tgagacaata tgatcaaata cctatagaaa
181 tttgtggama aaaggctata ggtacagtat tagtgggacc tacacctgtc aacataattg
241 gaagaaatct gttgactcag cttggatgca cactaaattt tccaattagt cccattgaaa
301 ctgtaccagt aaaattaaaa ccaggaatgg atggcccaaa ggttaaacaa tggccattga
361 cagaagagaa aataaaagca ttaacagcaa tttgtgagga aatggagaag gaaggaaaaa
421 tttcaaaaat tgggcctgaa aatccatata acactccaat atttgccata aaaaagaagg
481 acagtactaa gtggagaaaa ttagtagatt tcagggaact caataaaaga actcaagatt
541 tttgggaagt ccaattagga ataccacacc cagcagggtt aaaaaagaaa aaatcagtga
601 cagtactgga tgtgggggat gcatattttt cagttccttt atatgaagay ttcaggaaat
661 atactgcatt caccatacct agtttaaaca atgaaacacc agggattaga tatcaatata
721 atgtgcttcc acagggatgg aaaggatcac cakcaatatt ccaggcyagc atgacaaaaa
781 tcttagagcc ctttagggca caaaatccag aaatagtcat ctatcaatat atggatgact
841 tgtatgtagg atctgactta gaaatagggc aacatagagc aaaaatagaa gaa
//
LOCUS EU158868 1703 bp DNA linear VRL 26-JUL-2016
DEFINITION HIV-1 isolate G8-AFMC-5 from India gag protein (gag) and pol
protein (pol) genes, partial cds.
ACCESSION EU158868
VERSION EU158868.1
KEYWORDS .
SOURCE Human immunodeficiency virus 1 (HIV-1)
ORGANISM Human immunodeficiency virus 1
Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;
Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.
REFERENCE 1 (bases 1 to 1703)
AUTHORS Lall,M., Gupta,R.M., Sen,S., Kapila,K., Tripathy,S.P. and
Paranjape,R.S.
TITLE Profile of primary resistance in HIV-1-infected treatment-naive
individuals from Western India
JOURNAL AIDS Res. Hum. Retroviruses 24 (7), 987-990 (2008)
PUBMED 18593351
REFERENCE 2 (bases 1 to 1703)
AUTHORS Lall,M., Gupta,R.M., Sen,S., Kapila,K., Tripathy,S.P. and
Paranjpe,R.S.
TITLE Direct Submission
JOURNAL Submitted (17-SEP-2007) Dept. of Microbiology, Armed Forces Medical
College, Sholapur Road, Pune, Maharashtra 411040, India
FEATURES Location/Qualifiers
source 1..1703
/organism="Human immunodeficiency virus 1"
/proviral
/mol_type="genomic DNA"
/isolate="G8-AFMC-5"
/db_xref="taxon:11676"
/country="India"
/note="antiretroviral naive individual;
subtype: C"
gene <1..241
/gene="gag"
CDS <1..241
/gene="gag"
/codon_start=2
/product="gag protein"
/protein_id="ACB20420.1"
/translation="REGPIMRDCTERQANFLGKIWPSLKGRPGNFLQSRPEPTAPPAE
SFRFEEPTPAPKQEPKDREPLTALRSLFGSDPLSQ"
gene <46..>1703
/gene="pol"
CDS <46..>1703
/gene="pol"
/codon_start=1
/product="pol protein"
/protein_id="ACB20421.1"
/translation="FFRENLAFPQGEAREFPPEQTRANSPTSRELQVRGANPSSEAGA
ERQGALNCPQITLWQRPLVSIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
GGFIKVRQYDQIPIEICGKXAIGTVLVGPTPVNIIGRNMLTQLGCTLNFPISPIETVP
VKLKPGMDGPKVKQWPLTEEKIKALTEICDEMEKEGKITKIGPENPYNTPIFAIKKKD
STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
KYTAFTIPSTNNETPGIRYQYNVLPQGWKGSPAIFQASMTKILEPFREQNPEIVIYQY
MDDLYVGSDLEIGQHRAKIEELREHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTV
QPIQLPEKDSWTVNDIQKLVGKLNWASQIYPGIKIRQLCKLLRGAKALTEIVPLTKEA
ELELAENREILKEPVHGAYYDPSKDLIAEIQKQGRDQWTYQIYQEPFKNLKTGKYAKM
RTAHTNDVKQLTEAVQKIAMESIVIWGKTPKFRLPIQKETWETW"
ORIGIN
1 aagggaagga cccataatga gagactgtac tgaaagacag gctaattttt tagggaaaat
61 ttggccttcc ctcaagggga ggccagggaa tttcctccag agcagaccag agccaacagc
121 cccaccagca gagagcttca ggttcgagga gccaacccca gctccgaagc aggagccgaa
181 agacagggag cccttaactg ccctcagatc actctttggc agcgacccct tgtctcaata
241 aaagtagggg gccaaacaaa agaggctctc ttagacacag gagcagatga tacagtatta
301 gaagaaataa atttgccagg gaaatggaaa ccaaaaatga taggaggaat tggaggtttt
361 atcaaagtaa gacaatatga tcaaatacct atagaaattt gtggaaaaar ggctataggt
421 acagtattag taggacccac acctgtcaac ataattggaa gaaatatgtt gactcagctt
481 ggatgcacac taaattttcc aatcagtcct attgaaactg taccagtaaa attaaagcca
541 ggaatggatg gcccaaaggt taaacaatgg ccattgacag aagagaaaat aaaagcatta
601 acagaaatct gtgatgaaat ggagaaggaa ggaaaaatta caaaaattgg gcctgaaaat
661 ccatataaca ctccaatatt ygccataaaa aagaaggaca gtactaagtg gagaaaatta
721 gtagatttca gggaactcaa taaaagaact caagattttt gggaagtcca attaggaata
781 ccacacccag cagggttaaa raagaaaaaa tcagtgacag tactagatgt gggggatgca
841 tatttttcag tacctttata tgaagacttc aggaagtata ctgcattcac catacctagt
901 acaaacaatg aaacaccagg gattaggtat caatataatg tgcttccaca gggatggaaa
961 ggatcaccag caatattcca ggctagcatg acaaaaatct tagagccctt tagggaacaa
1021 aatccagaaa tagtcatcta tcaatatatg gatgacttgt atgtaggatc tgacttagaa
1081 atagggcaac atagagctaa aatagaggag ttaagagaac atctgttaaa gtggggattt
1141 accacaccag ataagaagca tcagaaagaa cccccatttc tttggatggg gtatgaactc
1201 catcctgaca aatggacagt acagcctata cagctgccag aaaaggatag ctggactgtc
1261 aatgatatac agaagttagt gggaaaatta aactgggcaa gtcagattta cccaggaatt
1321 aaaataaggc aactttgtaa actccttagg ggggccaaag cactaacaga aatagtacca
1381 ctaactaaag aagcagaatt agaattggca gaaaacaggg aaattctaaa agaaccagta
1441 catggagcat attatgaccc atcaaaagac ttaatagctg aaatccagaa acaggggcgg
1501 gaccagtgga catatcaaat ttaccaggaa ccattcaaaa atctgaaaac agggaagtat
1561 gcaaaaatga ggactgccca cactaatgat gtaaaacagt taacagaggc tgtgcagaaa
1621 atagccatgg aaagcatagt aatatgggga aagactccta aatttagatt acccatccaa
1681 aaggaaacat gggagacatg gtg
//
The output I want:
LC570899 Human immunodeficiency virus 1 msaha01609 pol gene for pol protein, partial cds
QRPLVPIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
GGFIKVRQYDQIPIEICGXKAIGTVLVGPTPVNIIGRNLLTQLGCTLNFPISPIETVP
VKLKPGMDGPKVKQWPLTEEKIKALTAICEEMEKEGKISKIGPENPYNTPIFAIKKKD
STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
KYTAFTIPSLNNETPGIRYQYNVLPQGWKGSPXIFQASMTKILEPFRAQNPEIVIYQY
MDDLYVGSDLEIGQHRAKIEE
EU158868 HIV-1 isolate G8-AFMC-5 from India gag protein (gag) and pol protein (pol) genes, partial cds.
FFRENLAFPQGEAREFPPEQTRANSPTSRELQVRGANPSSEAGA
ERQGALNCPQITLWQRPLVSIKVGGQTKEALLDTGADDTVLEEINLPGKWKPKMIGGI
GGFIKVRQYDQIPIEICGKXAIGTVLVGPTPVNIIGRNMLTQLGCTLNFPISPIETVP
VKLKPGMDGPKVKQWPLTEEKIKALTEICDEMEKEGKITKIGPENPYNTPIFAIKKKD
STKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLYEDFR
KYTAFTIPSTNNETPGIRYQYNVLPQGWKGSPAIFQASMTKILEPFREQNPEIVIYQY
MDDLYVGSDLEIGQHRAKIEELREHLLKWGFTTPDKKHQKEPPFLWMGYELHPDKWTV
QPIQLPEKDSWTVNDIQKLVGKLNWASQIYPGIKIRQLCKLLRGAKALTEIVPLTKEA
ELELAENREILKEPVHGAYYDPSKDLIAEIQKQGRDQWTYQIYQEPFKNLKTGKYAKM
RTAHTNDVKQLTEAVQKIAMESIVIWGKTPKFRLPIQKETWETW
P.S. As you can see in example no. 2, there are two genes, gag and pol. I want the translated sequence of the pol gene only.
It may be possible to change the original query to just get the pol protein sequence. That would avoid any parsing. Which specific viruses did you look at?
HIV-1 virus
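To illustrate the suggestion above, here is a minimal sketch. It assumes the accessions are at hand and that efetch's `fasta_cds_aa` return type for `nuccore` gives one protein FASTA entry per annotated CDS, with a `[gene=...]` tag in each header. The function names `cds_protein_url` and `keep_gene` are made up for this example, and the code only builds the URL and filters text; it does not contact NCBI itself.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def cds_protein_url(accessions):
    """Build an efetch URL requesting CDS protein FASTA for nucleotide accessions."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta_cds_aa",  # translated CDS features instead of full GenBank
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def keep_gene(fasta_text, gene="pol"):
    """Keep only FASTA entries whose header carries [gene=<gene>].

    Assumption: headers look like '>... [gene=pol] [protein=...] ...'.
    Headers can contain '>' inside the location tag, so the text is
    scanned line by line rather than split on '>'.
    """
    tag = "[gene=%s]" % gene
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            keep = tag in line
        if keep and line:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Note that headers returned this way would not contain the DEFINITION line, so if the definition is required in the FASTA header, the GenBank records still have to be parsed.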
I need a solution urgently. If you can help me out, please do.
You're going to have to share a couple of additional details with us if what you want is a "solution":
GenBank record you're interested in?
Also, reposting the same question isn't doing you any favors.
Hi,
The two records above were just examples. In fact, I have a GenBank file containing 9000 records, which I retrieved using an NCBI E-utility tool. And yes, I do have a list of identifiers, but I need to retrieve the protein sequences from the GB file only. The actual problem is this: I need to extract the protein sequences of the pol gene from this particular GB file into a FASTA file.
If you have any further questions, let me know.
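For what it's worth, the extraction described above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not a tested pipeline: it assumes the file keeps the standard GenBank flat-file indentation (feature keys indented by 5 spaces, records ending in `//`), that pol is annotated with a `/gene="pol"` qualifier on the CDS feature, and that the whole file fits in memory. The names `pol_proteins`, `to_fasta`, `records.gb` and `pol.fasta` are hypothetical.

```python
import re


def pol_proteins(gb_text, gene="pol"):
    """Yield (accession, definition, protein) for every CDS tagged /gene="<gene>"."""
    for record in gb_text.split("\n//"):
        acc = re.search(r"^ACCESSION\s+(\S+)", record, re.M)
        # DEFINITION may wrap onto indented continuation lines; stop at the
        # next line that starts in column 0.
        defn = re.search(r"^DEFINITION\s+(.*?)(?=^\S)", record, re.M | re.S)
        if not acc or not defn:
            continue
        definition = " ".join(defn.group(1).split())
        # Ignore the nucleotide sequence, then split the rest into blocks
        # that begin with a feature key indented by exactly 5 spaces.
        table = record.split("\nORIGIN")[0]
        for feat in re.split(r"^ {5}(?=\S)", table, flags=re.M)[1:]:
            if feat.split(None, 1)[0] != "CDS":
                continue
            if '/gene="%s"' % gene not in feat:
                continue
            tr = re.search(r'/translation="([^"]+)"', feat)
            if tr:
                # Remove the line breaks and indentation inside the qualifier.
                yield acc.group(1), definition, "".join(tr.group(1).split())


def to_fasta(entries, width=60):
    """Format (accession, definition, protein) tuples as FASTA text."""
    lines = []
    for acc, definition, protein in entries:
        lines.append(">%s %s" % (acc, definition))
        lines.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(lines) + "\n"


# Hypothetical usage (file names are placeholders):
# with open("records.gb") as src, open("pol.fasta", "w") as out:
#     out.write(to_fasta(pol_proteins(src.read())))
```

Records that describe pol only through `/product="pol protein"`, without a `/gene` qualifier, would be skipped by this sketch, so it is worth comparing the number of FASTA entries against the number of records. If installing a library is an option, Biopython's `SeqIO.parse(handle, "genbank")` exposes the same qualifiers and is likely more robust than hand-written regular expressions.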