Extract anticodon and other information from multiple genbank files
1
0
Entering edit mode
6.0 years ago
jg • 0

I have around 1000 gbff (genbank) files and I want to extract from each:

  1. Accession number e.g. LOCUS NZ_CP011636
  2. tax ID e.g. taxon:571
  3. all the tRNA anticodons (zero or multiple per file) e.g. /anticodon=pos:complement(1141190..1141192),aa:Met,seq:cat) (I actually just want the aa:Met,seq:cat part)

(note that the latter sometimes spills across two lines).

and compile output into a table containing rows for each file.

Example of file: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP011636

grep -e "LOCUS" -e "taxon:" -e "anticodon=" Klebsiella_oxytoca.gbff > output.txt

This sort of works but grep doesn't work when #3 spills onto two lines and also I need to store all the output from multiple files. Thanks for any help!

Genbank • 1.5k views
ADD COMMENT
4
Entering edit mode
6.0 years ago
vkkodali_ncbi ★ 3.8k

This is probably best done using BioPython for a more 'readable' code but if you wish, you can do this using Entrez Direct on the command line using this command

## <filename.txt> should have nucleotide accessions; one per line
epost -db nuccore -input <filename.txt> \
    | efetch -db nuccore -format gb -mode xml -style withparts \
    | xtract -pattern GBSeq -tab '\n' -element GBSeq_accession-version \
        -group GBFeature -if GBFeature_key -equals 'source' \
            -block GBQualifier -if GBQualifier_name -equals 'db_xref' \
            -tab '\n' -element GBQualifier_value \
        -group GBFeature -if GBFeature_key -equals 'tRNA' \
            -block GBQualifier -if GBQualifier_name -equals 'anticodon' \
            -tab '\n' -element GBQualifier_value 

## sample output
NZ_CP011636.1
taxon:571
(pos:complement(60036..60038),aa:Glu,seq:ttc)
(pos:complement(118496..118498),aa:Thr,seq:ggt)
ADD COMMENT
0
Entering edit mode

This is really helpful thanks! Just one issue: this genome has 86 tRNAs but this only spits out 8 lines. Any ideas why? Also, is there a way to give it a list of accession numbers to process and then save the output? Thanks again!

ADD REPLY
1
Entering edit mode

To keep the output size manageable I piped it tohead. Remove that and you will see all 86 features

ADD REPLY
0
Entering edit mode

Thanks! Is there a way to give it a list of accession numbers to process and then save all the output? Thanks again!

ADD REPLY
1
Entering edit mode

I have updated the command above to clean it up a little bit, remove the final head command and add an epost step to read nucleotide accessions from a file. Replace <filename.txt> with your file. Good luck!

ADD REPLY
0
Entering edit mode

Awesome! Sorry to ask another question but -group command is not found and it doesn't appear to be in the /edirect directory. Any ideas? Thanks!

ADD REPLY
1
Entering edit mode

This sounds like a formatting issue. See what happens if you paste the entire command as a single line without the backslashes.

ADD REPLY

Login before adding your answer.

Traffic: 2621 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6