6.0 years ago
jg
I have around 1000 gbff (GenBank) files and I want to extract from each:
- Accession number, e.g. LOCUS NZ_CP011636
- Tax ID, e.g. taxon:571
- All the tRNA anticodons (zero or multiple per file), e.g. /anticodon=(pos:complement(1141190..1141192),aa:Met,seq:cat). I actually just want the aa:Met,seq:cat part. Note that this qualifier sometimes spills across two lines.
I then want to compile the output into a table with one row per file.
Example of file: https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP011636
grep -e "LOCUS" -e "taxon:" -e "anticodon=" Klebsiella_oxytoca.gbff > output.txt
This sort of works, but grep doesn't work when #3 spills onto two lines, and I also need to store all the output from multiple files. Thanks for any help!
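For local files, one way to cope with the wrapped anticodon qualifier is a small awk filter that glues continuation lines back together before matching. This is only a sketch, not the EDirect command discussed in the replies below. The function name extract_gbff, the tab-separated columns, and the semicolon-joined anticodon list are my own choices, and it assumes one row per file using the first LOCUS line and the first taxon ID it finds.

```shell
# Print one tab-separated row per GenBank file: accession, taxon ID, anticodons.
# extract_gbff is a hypothetical helper name, not part of any existing tool.
extract_gbff() {
  awk '
    function flush_row() {
      # Emit the row collected for the file that has just been read.
      if (seen) print acc "\t" tax "\t" anti
    }
    # A new input file starts: write out the previous row and reset state.
    FNR == 1 { flush_row(); acc = ""; tax = ""; anti = ""; seen = 1 }
    # Keep only the first accession and the first taxon ID in each file.
    $1 == "LOCUS" && acc == "" { acc = $2 }
    tax == "" && match($0, /taxon:[0-9]+/) {
      tax = substr($0, RSTART + 6, RLENGTH - 6)
    }
    /\/anticodon=/ {
      buf = $0
      # The qualifier may wrap: append following lines until seq:xxx) appears.
      while (buf !~ /seq:[a-z]+\)/ && (getline nxt) > 0) {
        sub(/^[ \t]+/, "", nxt)
        buf = buf nxt
      }
      if (match(buf, /aa:[A-Za-z]+,seq:[a-z]+/)) {
        anti = anti (anti == "" ? "" : ";") substr(buf, RSTART, RLENGTH)
      }
    }
    END { flush_row() }
  ' "$@"
}
```

Calling it as extract_gbff *.gbff > anticodons.tsv (the output name is just an example) would write one row per file. Files with several LOCUS records, such as a chromosome plus plasmids, are collapsed into a single row here, so adjust the reset rule if you need one row per record.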
This is really helpful thanks! Just one issue: this genome has 86 tRNAs but this only spits out 8 lines. Any ideas why? Also, is there a way to give it a list of accession numbers to process and then save the output? Thanks again!
To keep the output size manageable I piped it to head. Remove that and you will see all 86 features.
Thanks! Is there a way to give it a list of accession numbers to process and then save all the output? Thanks again!
I have updated the command above to clean it up a little bit, remove the final head command and add an epost step to read nucleotide accessions from a file. Replace <filename.txt> with your file. Good luck!
Awesome! Sorry to ask another question, but the -group command is not found and it doesn't appear to be in the /edirect directory. Any ideas? Thanks!
This sounds like a formatting issue. See what happens if you paste the entire command as a single line without the backslashes.
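As a small illustration of why pasting can cause this kind of error: a backslash continues a command only when it is the very last character on its line. If a space or a stray line break follows it, the shell ends the command early and then tries to run the next line, so an option such as -group gets treated as a command name. The pipeline below is a made-up toy, not the EDirect command from this thread.

```shell
# Each backslash must be immediately followed by the newline.
# Delete one of them (or add a trailing space after it) and the shell
# will treat the next line as a separate, broken command.
result=$(printf '%s\n' b a c \
  | sort -r \
  | head -n 1)
echo "$result"
```

Joining everything onto one line and dropping the backslashes, as suggested above, sidesteps the problem entirely.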