Question

How to extract headers from MultiFasta file to generate taxonomy

4.7 years ago

m.radz ▴ 10

I need to extract the headers from a multifasta file being used as a database for metagenomic analysis. The structure of the fasta file is as follows:

>NW_002197112.1 Penicillium marneffei ATCC 18224 scf_1107713384177, whole genome shotgun sequence
GCCTTAAAATGCCGCTTCCCAGATCTGCGCCGAAGAGCAATCCATCTCCTCTCCAGCCCCAATGCAGCAACTGCTAACGG
CAGTGCGACGTGCGGGGTGAATTTCAGCGGTTGCTATCGACTTGTGCCATCGCAGCGTTTTCGCGTCCACGGTCGCCGCC
GCATGCTCCATGCACGATATGGCTGGTCGGATGCTAGTTGTGCTC

>NW_002197111.1 Penicillium marneffei ATCC 18224 scf_1107713383857, whole genome shotgun sequence
TACTGCTTTGTGGAACATCGCCCTTGTGGAGATCTCCCTCACGCTGGATGTTGAAAGACGCAGAACAGTTGGCACAGCCA
ATTTAGAATGCCTGATCAAGACGCATCGCCACATCCAGGCAGGTGCGATTCCTCTCTTATAAATAAATATTTTCAACGGC
ATCTGGAGAACTCATCAACTTGCAGTTGCTCATCATTATCTCGGTCAT

What I need to do is extract only the identifiers and the taxonomy to create a taxonomy.txt file with the same structure as shown below, with taxonomy separated by level and the taxon identifier in the final column:

Saccharomycetaceae;Kluyveromyces;lactis;CR382121.1

Saccharomycetaceae;Kluyveromyces;lactis;CR382122.1

grep Fasta sequencing next-gen • 2.6k views

What have you tried so far? Did you attempt a solution with awk / perl / python / whatever? Did you try searching BioStars? Manipulating fasta headers is a recurring theme, searching the site should give you enough material to at least get started.

ADD REPLY • link 4.7 years ago by h.mon 35k

I tried a couple of methods using grep, but I was extracting everything to a text file instead of just the taxonomy and ID number.

ADD REPLY • link 4.7 years ago by m.radz ▴ 10

Do you have a file for mapping accession IDs to taxonomy? If so, you can use this code:

grep '^>' input.fasta | sed 's/^>//; s/ .*//' | while read -r id ; do grep -F -- "$id" mappingfile ; done

and then you can add more commands to the last grep to modify the output.
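
For a large database, running one grep per header can be slow. A minimal sketch of the same lookup in a single awk pass follows. It assumes a hypothetical two-column, tab-separated mappingfile (accession, then a semicolon-separated lineage); that layout, the toy headers, and the temporary file names are illustrative assumptions, not something given in this thread.

```shell
# Toy inputs in a scratch directory (assumed layouts, for illustration only).
workdir=$(mktemp -d)
printf '%s\n' \
  '>CR382121.1 toy description' \
  'ACGTACGT' \
  '>CR382122.1 toy description' \
  'TTGACCA' > "$workdir/input.fasta"
# Assumed mapping layout: accession<TAB>lineage
printf 'CR382121.1\tSaccharomycetaceae;Kluyveromyces;lactis\n'  > "$workdir/mappingfile"
printf 'CR382122.1\tSaccharomycetaceae;Kluyveromyces;lactis\n' >> "$workdir/mappingfile"

# Pass 1 loads the mapping into memory; pass 2 looks up each FASTA header ID.
awk -F'\t' '
  NR == FNR { tax[$1] = $2; next }      # mapping file: accession -> lineage
  /^>/ {
    id = substr($1, 2)                  # drop the leading ">"
    sub(/ .*/, "", id)                  # drop the description after the ID
    if (id in tax) print tax[id] ";" id # lineage first, identifier last
  }
' "$workdir/mappingfile" "$workdir/input.fasta" > "$workdir/taxonomy.txt"

cat "$workdir/taxonomy.txt"
```

Headers whose ID is missing from the mapping are skipped silently here; printing them to stderr instead would make gaps in the mapping easier to spot.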

ADD REPLY • link 4.7 years ago by Fatima ▴ 1000

4.7 years ago

Mensur Dlakic ★ 28k

Don't know if there is an automated way to do what you want, but I will point you to several files that might help.

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

This file maps sequence IDs to taxonomic IDs. Looks like this:

accession       accession.version       taxid   gi
A0A023GPI8      A0A023GPI8.1    232300  1027923628
A0A023GPJ0      A0A023GPJ0.2    716541  765680613

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

Inside this archive there is a file, names.dmp, that connects the taxids in the previous file to taxon names. For example, taxid 232300 means Canavalia boliviana and 716541 means Enterobacter cloacae subsp. cloacae ATCC 13047. You should be able to extract sequence IDs from your file, add the taxids from the prot file to them, and replace the taxids with species names from the taxdump file.
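
The three steps could be sketched roughly as below, using tiny stand-ins for the real files. The toy FASTA headers and the file names acc2taxid and input.fasta are assumptions for illustration. The real downloads are compressed and would need to be unpacked first. The names.dmp layout used here (fields separated by tab, pipe, tab, with a "scientific name" class in the fourth field) is the standard taxdump format.

```shell
workdir=$(mktemp -d)
# Toy FASTA with hypothetical headers that reuse the accessions shown above.
printf '>A0A023GPI8.1 toy header\nMKV\n>A0A023GPJ0.2 toy header\nMAL\n' > "$workdir/input.fasta"
# Excerpt in the accession2taxid layout shown above.
printf 'accession\taccession.version\ttaxid\tgi\n'           > "$workdir/acc2taxid"
printf 'A0A023GPI8\tA0A023GPI8.1\t232300\t1027923628\n'     >> "$workdir/acc2taxid"
printf 'A0A023GPJ0\tA0A023GPJ0.2\t716541\t765680613\n'      >> "$workdir/acc2taxid"
# names.dmp rows: taxid, name, unique name, name class, each followed by <TAB>|
printf '232300\t|\tCanavalia boliviana\t|\t\t|\tscientific name\t|\n' > "$workdir/names.dmp"
printf '716541\t|\tEnterobacter cloacae subsp. cloacae ATCC 13047\t|\t\t|\tscientific name\t|\n' >> "$workdir/names.dmp"

# Read the FASTA first so that only the accessions actually needed are kept
# while streaming through the much larger mapping files.
awk -F'\t' '
  FNR == 1 { file++ }                    # count which input file we are in
  file == 1 {                            # FASTA: collect header IDs in order
    if (/^>/) { id = substr($1, 2); sub(/ .*/, "", id); want[id] = 1; order[++n] = id }
    next
  }
  file == 2 {                            # accession2taxid: keep wanted rows
    if ($2 in want) { taxid[$2] = $3; need[$3] = 1 }
    next
  }
  file == 3 {                            # names.dmp: split on tabs, name is $3, class is $7
    if (($1 in need) && $7 == "scientific name") name[$1] = $3
  }
  END {
    for (i = 1; i <= n; i++) {
      id = order[i]
      if (id in taxid) print id "\t" taxid[id] "\t" name[taxid[id]]
    }
  }
' "$workdir/input.fasta" "$workdir/acc2taxid" "$workdir/names.dmp" > "$workdir/id_taxid_name.tsv"

cat "$workdir/id_taxid_name.tsv"
```

Two caveats on this sketch. The accessions in the question (NW_...) are nucleotide records, so one of the nucleotide accession2taxid files in the same FTP directory would presumably be needed rather than the protein one. Also, names.dmp gives only the name attached to each taxid; building a family;genus;species string would additionally require walking parent links and ranks in nodes.dmp from the same archive, which is not shown here.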

score 1 · Accepted Answer · 2020-03-16

4.7 years ago

m.radz ▴ 10

Hi All,

I actually found the answer I needed in this post:

Extraction Of Header Of Sequences In Fasta File

With all due respect, that is not the answer to your question. You asked for

with taxonomy separated by level and the taxon identifier in the final column

None of the answers on the page you linked to are about taxonomy. Maybe you changed your mind as to what you need, but your question above is quite specific and the accepted answer does not address it.

ADD REPLY • link 4.7 years ago by Mensur Dlakic ★ 28k