Question

accession to taxid for fasta files downloaded from refseq

0

Entering edit mode

8.4 years ago

badribio ▴ 290

I am trying to build a database for metagenomic analysis have all the genomic.fna.gz files for bacteria and virus from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/ release 77 which has 42080 species for bacteria and 5654 for viral

header for viral sequence looks like this

>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence

And for bacterial

>gi|759427590|ref|NZ_CDDW01000001.1| Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036, contig F321_contig22, whole genome shotgun sequence

Is there a way i can modify all the fasta files (n =659 for bacterial and n=2 for viral) in such a way that my header look like this for both viral and bacterial :

>NZ_CDDW01000001.1|some_text|taxid Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036 contig F321_contig22, whole genome shotgun sequence

Thanks

Badri

sequence bacteria viral refseq fasta • 3.9k views

ADD COMMENT • link updated 8.4 years ago by Brian Bushnell 20k • written 8.4 years ago by badribio ▴ 290

score 0 · Answer 1 · 2016-08-11

0

Entering edit mode

8.4 years ago

shenwei356 8.7k

can you write scripts, if yes. you can parse the gi_taxid_nucl.dmp.gz file from ncbi taxonomy ftp and replace GIs in fasta header.

ADD COMMENT • link 8.4 years ago by shenwei356 8.7k

0

Entering edit mode

Yes I can write scripts, but the problem is i have concatenated all the fna to one huge fasta file which is 235Gb. And my awk script is not helping to slow.

ADD REPLY • link 8.4 years ago by badribio ▴ 290

GenoMax · Answer 2 · 2016-08-11

0

Entering edit mode

8.4 years ago

Brian Bushnell 20k

You can do this with the BBMap package's gi2taxid.sh tool, by following the guide in /bbmap/docs/guides/TaxonomyGuide.txt - specifically, the step labeled "Renaming gi numbers to NCBI Tax ID numbers".

ADD COMMENT • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

that's cool. should be fast

ADD REPLY • link 8.4 years ago by shenwei356 8.7k

0

Entering edit mode

Thats awesome!!

Second question, is there a way in bbmap which i can use to extract out reads from a fasta file based on a one column text file with accession id for each sequence.

Thanks

ADD REPLY • link 8.4 years ago by badribio ▴ 290

0

Entering edit mode

it has, visit the homepage

ADD REPLY • link 8.4 years ago by shenwei356 8.7k

0

Entering edit mode

is it filterbyname.sh ? if so it not no working for me..

ADD REPLY • link 8.4 years ago by badribio ▴ 290

0

Entering edit mode

filterbyname.sh is the correct program. The names need to be reasonably similar. Post examples of your ID's of interest and some fasta sequence headers.

ADD REPLY • link 8.4 years ago by GenoMax 148k

0

Entering edit mode

My multifasta file has (have turncated reads to fit in here) bash-4.1$ head viral.2.1.genomic.fna

>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence
CTCGAATAGTTTAATGGTATTGGGTTGCGGCCCAATAAATTAAAAGTCCACTAGATAAGCGAATATTGACTGGTCAACTC
>gi|433660773|ref|NC_019946.1| Tomato yellow mottle virus segment DNA-A, complete sequence
TTGGTAAAGTCATGTGCATCTCAGACATCACACGAGGTGGTGGTATTACCCACCGTGTAGGTAAACGTTT

My text file bash-4.1$ head trial.txt

NC_019944.1
NC_019946.1
NC_019947.1
NC_019704.1
NC_019519.1
NC_019521.1
NC_019945.1
NC_020072.1
NC_020074.1
NC_020083.1

command used:

./filterbyname.sh -Xmx10g in=viral.2.1.genomic.fna names=trial.txt include=t out=viral_trial.fasta

ADD REPLY • link updated 8.4 years ago by GenoMax 148k • written 8.4 years ago by badribio ▴ 290

0

Entering edit mode

Do you get an error or just an empty file?

ADD REPLY • link 8.4 years ago by GenoMax 148k

0

Entering edit mode

Just an empty file, see log

Input is being processed as unpaired Time: 0.257 seconds.

Reads Processed: 469 1.83k reads/sec

Bases Processed: 23151316 90.17m bases/sec

Reads Out: 0

Bases Out: 0

ADD REPLY • link 8.4 years ago by badribio ▴ 290

0

Entering edit mode

I think this time the "|" (pipe) characters may be causing trouble. I will take a look at this later but perhaps @Brian will be along before that.

ADD REPLY • link 8.4 years ago by GenoMax 148k

0

Entering edit mode

I think i fixed it changed the headers using awk '$1 ~ />/ { split($0,a,"|"); print ">"a[4],"\t",a[5];next } { print $0 }'

my new headers

NC_019944.1 Okra enation leaf curl alphasatellite, complete sequence

NC_019946.1 Tomato yellow mottle virus segment DNA-A, complete sequence

NC_019947.1 Tomato yellow mottle virus segment DNA-B, complete sequence

NC_019704.1 Enterobacteria phage mEp237, complete genome

NC_019519.1 Agrobacterium phage 7-7-1, complete genome

NC_019521.1 Sphingomonas phage PAU, complete genome

NC_019945.1 Thysanoplusia orichalcea NPV isolate p2, complete genome

NC_020072.1 Colombian datura virus, complete genome

NC_020074.1 Bovine adenovirus 6 strain 671130, complete genome

NC_020083.1 Serratia phage phiMAM1, complete genome

This did the job.

Thanks

ADD REPLY • link 8.4 years ago by badribio ▴ 290

0

Entering edit mode

I tried using it on my original fasta file (235Gb and trial.txt file (117 Gb , with 6515957 accession id) and the step failed just like before no output or error just a blank file, note I used 60Gb of memory for this step.

ADD REPLY • link 8.4 years ago by badribio ▴ 290