How to reformat (i.e. clean) NCBI .fasta archives into a single-line .fasta with only the unique identifier before each sequence?
6.2 years ago
johnnytam100 ▴ 110

Hi, I have just downloaded the NCBI nr protein sequences from here. Opening the unzipped file, it looks like this:

>S18 [Lactococcus lactis subsp. lactis]^AATZ02303.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APLW60021.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^AAUS70574.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APPA66113.1 30S ribosomal protein S18 [Lactococcus lactis]^ABBC75095.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAWN66876.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^ASPS10927.1 30S ribosomal protein S18 [Lactococcus lactis]^ARDG21709.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAXN66482.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]^ARHJ25897.1 30S ribosomal protein S18 [Lactococcus lactis]^ARJK90210.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^AP54670.1 RecName: Full=Calfumirin-1; Short=CAF-1^ABAA06266.1 calfumirin-1 [Dictyostelium discoideum AX2]^AEAL68086.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]^AEAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
IEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>WP_000184067.1 MULTISPECIES: MbtH family protein [Bacillus]^ANP_844755.1 hypothetical protein BA_2373 [Bacillus anthracis str. Ames]^AYP_028470.1 hypothetical protein BAS2209 [Bacillus anthracis str. Sterne]^AYP_036475.1 balhimycin biosynthetic protein MbtH [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AAAP26241.1 mbtH-like protein [Bacillus anthracis str. Ames]^AAAT31492.1 mbtH-like protein [Bacillus anthracis str. 'Ames Ancestor']^AAAT54521.1 mbtH-like protein [Bacillus anthracis str. Sterne]^AAAT62162.1 MbtH protein [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AABK85418.1 mbtH-like protein [Bacillus thuringiensis str. Al Hakam]^AEDR19165.1 mbtH-like protein [Bacillus anthracis str. A0488]^AEDR87721.1 mbtH-like protein [Bacillus anthracis str. A0193]^AEDR94244.1 mbtH-like protein [Bacillus anthracis str. A0442]^AEDS97287.1 mbtH-like protein [Bacillus anthracis str. A0389]^AEDT19705.1 mbtH-like protein [Bacillus anthracis str. A0465]^AEDT69654.1 mbtH-like protein [Bacillus anthracis str. A0174]^AEDV17672.1

How could I reformat the file into a single-line .fasta (removing the ^A separators etc.) with only the unique identifier (i.e. without any additional information such as the species name) before each sequence?

>identifier_1
seq1
>identifier_2
seq2
>identifier_3
seq3

Thanks in advance!!!

linux bash ncbi fasta • 1.7k views
6.2 years ago

An awk solution:

$ awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 != "" {seq = ""; split($1, name, " "); for(i=2;i<=NF;i++) {seq = seq$i}; print ">"name[1], seq}' input.fa > output.fa

fin swimmer
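
A side note on the ^A characters: in the nr file, each header packs several entries for the same sequence, joined by the Ctrl-A character (byte 0x01, displayed as ^A in the question). Keeping only the first whitespace-delimited token of the header therefore keeps only the first accession. If the remaining accessions are also of interest, they could be pulled out roughly as below. This is a minimal Python sketch, not part of the answer above; the function name `accessions_in_header` is hypothetical.

```python
def accessions_in_header(header):
    """Return the first token of every Ctrl-A separated entry in one header line."""
    # Strip the leading '>' and the trailing newline, then split on Ctrl-A (0x01).
    entries = header.lstrip(">").rstrip("\r\n").split("\x01")
    # Each entry looks like "ACCESSION description [organism]"; keep the accession.
    return [entry.split(None, 1)[0] for entry in entries if entry.strip()]


# Header taken from the question, with ^A written as the real control character.
header = (
    ">XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]"
    "\x01EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]\n"
)
print(accessions_in_header(header))
```

The first element of the returned list is the identifier that the one-liner keeps.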

Thank you so much!!!

6.2 years ago
Anima Mundi ★ 2.9k

A Python 2.7 solution:

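from __future__ import print_function  # also runs unchanged on Python 3
import sys

seq = []  # sequence lines of the current record

with open(sys.argv[1]) as handle:
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # flush the previous record's sequence, if there is one
            if seq:
                print(''.join(seq))
                seq = []
            # keep only the identifier: everything before the first space
            print(line.split(' ', 1)[0])
        elif line:
            seq.append(line)

# flush the sequence of the last record
if seq:
    print(''.join(seq))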

Thank you so much!!!

6.2 years ago
Jung Soh ▴ 10

A solution using the seqtk toolkit:

seqtk seq -Cl0 in.fasta > out.fasta

The -C option drops the comment (everything that follows the ID on the header line). The -l option sets the number of residues per output line; 0 stands for a maximum of 2^32-1, so in practice each sequence ends up on a single line.
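
For readers without seqtk installed, the combined effect of the two options (keep only the ID, put each sequence on one line) can be approximated in Python. This is a rough sketch of the behaviour, not seqtk's actual implementation; the function names `unwrap_fasta` and `format_records` and the example records are made up for illustration.

```python
def unwrap_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Drop the comment: keep only the text before the first whitespace.
            fields = line[1:].split(None, 1)
            identifier = fields[0] if fields else ""
            chunks = []
        elif line and identifier is not None:
            # Collect wrapped sequence lines so they can be joined into one.
            chunks.append(line.strip())
    if identifier is not None:
        yield identifier, "".join(chunks)


def format_records(records):
    """Render (identifier, sequence) pairs as single-line FASTA text."""
    return "".join(">%s\n%s\n" % (ident, seq) for ident, seq in records)


# Hypothetical two-record input with wrapped sequence lines.
example = [">id_1 some description\n", "MAQQ\n", "RRGG\n", ">id_2\n", "MKTK\n"]
print(format_records(unwrap_fasta(example)), end="")
```

To run it on a real file, pass an open file handle instead of the list, e.g. `unwrap_fasta(open("in.fasta"))`; the generator reads line by line, so the whole nr file never has to fit in memory.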

6.2 years ago
Chirag Parsania ★ 2.0k

R solution

library(Biostrings)
aa_fasta_file <- Biostrings::readAAStringSet(filepath = "~/Downloads/ff.fasta")

## remove everything after first space in header 
names(aa_fasta_file) <- gsub("\\s.*" , "" , names(aa_fasta_file)) 

aa_fasta_file
## > aa_fasta_file
##   A AAStringSet instance of length 3
##     width seq                                                                                                                                names               
## [1]    81 MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN                                                  S18
## [2]   169 MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITI...KDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ XP_642131.1
## [3]   217 MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGW...YFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM XP_642837.1

Biostrings::writeXStringSet(aa_fasta_file , filepath = "path/to/save/filename.fasta")