Question

How to reformat (i.e. clean) NCBI .fasta archives into a single-line .fasta with only the unique identifier before each sequence?


6.1 years ago

johnnytam100 ▴ 110

Hi, I have just downloaded the NCBI nr protein sequences from here. Opening the unzipped file, it looks like this:

>S18 [Lactococcus lactis subsp. lactis]^AATZ02303.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APLW60021.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^AAUS70574.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APPA66113.1 30S ribosomal protein S18 [Lactococcus lactis]^ABBC75095.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAWN66876.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^ASPS10927.1 30S ribosomal protein S18 [Lactococcus lactis]^ARDG21709.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAXN66482.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]^ARHJ25897.1 30S ribosomal protein S18 [Lactococcus lactis]^ARJK90210.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^AP54670.1 RecName: Full=Calfumirin-1; Short=CAF-1^ABAA06266.1 calfumirin-1 [Dictyostelium discoideum AX2]^AEAL68086.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]^AEAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
IEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>WP_000184067.1 MULTISPECIES: MbtH family protein [Bacillus]^ANP_844755.1 hypothetical protein BA_2373 [Bacillus anthracis str. Ames]^AYP_028470.1 hypothetical protein BAS2209 [Bacillus anthracis str. Sterne]^AYP_036475.1 balhimycin biosynthetic protein MbtH [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AAAP26241.1 mbtH-like protein [Bacillus anthracis str. Ames]^AAAT31492.1 mbtH-like protein [Bacillus anthracis str. 'Ames Ancestor']^AAAT54521.1 mbtH-like protein [Bacillus anthracis str. Sterne]^AAAT62162.1 MbtH protein [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AABK85418.1 mbtH-like protein [Bacillus thuringiensis str. Al Hakam]^AEDR19165.1 mbtH-like protein [Bacillus anthracis str. A0488]^AEDR87721.1 mbtH-like protein [Bacillus anthracis str. A0193]^AEDR94244.1 mbtH-like protein [Bacillus anthracis str. A0442]^AEDS97287.1 mbtH-like protein [Bacillus anthracis str. A0389]^AEDT19705.1 mbtH-like protein [Bacillus anthracis str. A0465]^AEDT69654.1 mbtH-like protein [Bacillus anthracis str. A0174]^AEDV17672.1

How could I reformat the file into a single-line .fasta (removing the ^A separators etc.) with only the unique identifier (i.e. without any additional information such as the species name) before each sequence?

>identifier_1
seq1
>identifier_2
seq2
>identifier_3
seq3

Thanks in advance!!!

linux bash ncbi fasta • 1.6k views

ADD COMMENT • link updated 6.1 years ago by Chirag Parsania ★ 2.0k • written 6.1 years ago by johnnytam100 ▴ 110

score 2 · Answer 1 · 2018-10-15


6.1 years ago

finswimmer 16k

An awk solution:

$ awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 != "" {seq = ""; split($1, name, " "); for(i=2;i<=NF;i++) {seq = seq$i}; print ">"name[1], seq}' input.fa > output.fa

fin swimmer



Thank you so much!!!

ADD REPLY • link 6.1 years ago by johnnytam100 ▴ 110

score 1 · Answer 2 · 2018-10-15


6.1 years ago

Anima Mundi ★ 2.9k

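A Python solution (runs under Python 2.7 and Python 3):

import sys

seq_parts = []        # sequence lines of the current record
have_record = False   # becomes True once the first header has been seen

with open(sys.argv[1]) as handle:
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Print the previous record's sequence on a single line.
            if have_record:
                print(''.join(seq_parts))
            seq_parts = []
            have_record = True
            # Keep only the identifier: the header up to the first space.
            print(line.split(' ', 1)[0])
        elif line:
            seq_parts.append(line)

# Print the sequence of the last record.
if have_record:
    print(''.join(seq_parts))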



Thank you so much!!!

ADD REPLY • link 6.1 years ago by johnnytam100 ▴ 110

score 1 · Answer 3 · 2018-10-19


6.1 years ago

Jung Soh ▴ 10

A solution using the seqtk toolkit:

seqtk seq -Cl0 in.fasta > out.fasta

The -C option drops the comment (everything that follows the ID on the header line), and the -l option sets the number of residues per sequence line; 0 stands for a maximum of 2^32-1, which in practice means no wrapping.
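
For readers without seqtk, roughly the same two effects (dropping the header comment and unwrapping the sequence) can be sketched in Python. This is only an illustration, not how seqtk is implemented: the function names `open_fasta`, `strip_and_unwrap` and `write_single_line` are made up for this example, and cutting the header at Ctrl-A as well as at whitespace is an assumption based on the nr headers shown in the question.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text; a .gz suffix is assumed to mean gzip."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def strip_and_unwrap(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines.

    The identifier is the header text up to the first whitespace or
    Ctrl-A (byte 0x01); wrapped sequence lines are joined into one string.
    """
    ident, parts = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if ident is not None:
                yield ident, "".join(parts)
            # Treat Ctrl-A like a space so merged nr headers are cut as well.
            fields = line[1:].replace("\x01", " ").split()
            ident = fields[0] if fields else ""
            parts = []
        elif line and ident is not None:
            parts.append(line)
    # Emit the last record.
    if ident is not None:
        yield ident, "".join(parts)


def write_single_line(records, out):
    """Write each record as a header line followed by one sequence line."""
    for ident, seq in records:
        out.write(">%s\n%s\n" % (ident, seq))


# Example use (file names are placeholders):
# with open_fasta("in.fasta") as src, open("out.fasta", "w") as dst:
#     write_single_line(strip_and_unwrap(src), dst)
```

Because the generator holds only one record in memory at a time, it should cope with a file as large as nr, although a compiled tool such as seqtk will be considerably faster.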


score 1 · Answer 4 · 2018-10-19

An R solution using Biostrings:

library(Biostrings)
aa_fasta_file <- Biostrings::readAAStringSet(filepath = "~/Downloads/ff.fasta")

## remove everything after first space in header 
names(aa_fasta_file) <- gsub("\\s.*" , "" , names(aa_fasta_file)) 

aa_fasta_file
## Console output:
##   A AAStringSet instance of length 3
##     width seq                                                                                                                                names               
## [1]    81 MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN                                                  S18
## [2]   169 MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITI...KDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ XP_642131.1
## [3]   217 MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGW...YFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM XP_642837.1

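## writeXStringSet() wraps sequence lines (80 letters per line by default);
## pass a larger `width` if strictly single-line output is needed.
Biostrings::writeXStringSet(aa_fasta_file , filepath = "path/to/save/filename.fasta")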
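
Whichever approach in this thread is used, the result can be checked against the layout requested in the question (one identifier-only header line, then exactly one sequence line). The Python sketch below is a hypothetical helper, not part of any tool mentioned here; the name `check_single_line_fasta` and the wording of the messages are invented for illustration.

```python
def check_single_line_fasta(lines):
    """Return a list of problems; an empty list means the layout is as requested."""
    problems = []
    expect_header = True
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if expect_header:
            if not line.startswith(">"):
                # Typically a wrapped sequence that spilled onto a second line.
                problems.append("line %d: expected a header line" % number)
                continue
            ident = line[1:]
            # The header must hold a bare identifier: no spaces, tabs or Ctrl-A.
            if not ident or any(c.isspace() or c == "\x01" for c in ident):
                problems.append("line %d: header is not a bare identifier" % number)
            expect_header = False
        else:
            if line.startswith(">"):
                # Previous record had no sequence; this line becomes the current header.
                problems.append("line %d: header follows a header" % number)
                continue
            if not line:
                problems.append("line %d: blank line instead of a sequence" % number)
            expect_header = True
    if not expect_header:
        problems.append("file ends with a header that has no sequence")
    return problems


# Example use (file name is a placeholder):
# with open("out.fasta") as handle:
#     for problem in check_single_line_fasta(handle):
#         print(problem)
```

A wrapped sequence shows up as an "expected a header line" message, which makes it easy to spot output from a tool that re-wraps sequence lines on writing.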