6.2 years ago
johnnytam100
Hi, I have just downloaded the NCBI nr protein sequences from here. Opening the unzipped file, it looks like this:
>S18 [Lactococcus lactis subsp. lactis]^AATZ02303.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APLW60021.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^AAUS70574.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APPA66113.1 30S ribosomal protein S18 [Lactococcus lactis]^ABBC75095.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAWN66876.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^ASPS10927.1 30S ribosomal protein S18 [Lactococcus lactis]^ARDG21709.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAXN66482.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]^ARHJ25897.1 30S ribosomal protein S18 [Lactococcus lactis]^ARJK90210.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^AP54670.1 RecName: Full=Calfumirin-1; Short=CAF-1^ABAA06266.1 calfumirin-1 [Dictyostelium discoideum AX2]^AEAL68086.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]^AEAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
IEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>WP_000184067.1 MULTISPECIES: MbtH family protein [Bacillus]^ANP_844755.1 hypothetical protein BA_2373 [Bacillus anthracis str. Ames]^AYP_028470.1 hypothetical protein BAS2209 [Bacillus anthracis str. Sterne]^AYP_036475.1 balhimycin biosynthetic protein MbtH [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AAAP26241.1 mbtH-like protein [Bacillus anthracis str. Ames]^AAAT31492.1 mbtH-like protein [Bacillus anthracis str. 'Ames Ancestor']^AAAT54521.1 mbtH-like protein [Bacillus anthracis str. Sterne]^AAAT62162.1 MbtH protein [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AABK85418.1 mbtH-like protein [Bacillus thuringiensis str. Al Hakam]^AEDR19165.1 mbtH-like protein [Bacillus anthracis str. A0488]^AEDR87721.1 mbtH-like protein [Bacillus anthracis str. A0193]^AEDR94244.1 mbtH-like protein [Bacillus anthracis str. A0442]^AEDS97287.1 mbtH-like protein [Bacillus anthracis str. A0389]^AEDT19705.1 mbtH-like protein [Bacillus anthracis str. A0465]^AEDT69654.1 mbtH-like protein [Bacillus anthracis str. A0174]^AEDV17672.1
How could I reformat the file into a single-line .fasta (removing the ^A separators etc.) with only the unique identifier (i.e. without any additional information such as the species name) before each sequence?
>identifier_1
seq1
>identifier_2
seq2
>identifier_3
seq3
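For reference, here is a minimal Python sketch of the kind of reformatting I mean. It rests on some assumptions: that the ^A shown above is the Ctrl-A control character (0x01) that joins the merged deflines, that the first whitespace-delimited token of the first defline is the identifier to keep, and that the remaining identifiers of each merged record can be discarded. The function names are made up for illustration.

```python
import io

CTRL_A = "\x01"  # assumed separator between merged deflines (displayed as ^A)


def simplify_fasta(lines):
    """Yield (identifier, sequence) pairs, with each sequence joined into one string."""
    identifier, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the first defline, then only its first token.
            first_defline = line[1:].split(CTRL_A, 1)[0]
            fields = first_defline.split(None, 1)
            identifier = fields[0] if fields else ""
            chunks = []
        elif line.strip():
            # Sequence line: collect it so wrapped lines end up on one line.
            chunks.append(line.strip())
    if identifier is not None:
        yield identifier, "".join(chunks)


def write_fasta(records, out):
    """Write records as two-line FASTA: header line, then the whole sequence."""
    for identifier, seq in records:
        out.write(f">{identifier}\n{seq}\n")


# Small in-memory demonstration with two made-up records.
sample = (
    ">XP_1.1 hypothetical protein [X]\x01EAL1.1 other [X]\n"
    "MAST\n"
    "QNIV\n"
    ">WP_2.1 MbtH family protein [Y]\n"
    "MKT\n"
)
demo_out = io.StringIO()
write_fasta(simplify_fasta(io.StringIO(sample)), demo_out)
print(demo_out.getvalue(), end="")
```

Because the records are produced one at a time, the same two functions could be pointed at the real file by passing open file handles instead of the `io.StringIO` objects, without loading the whole of nr into memory.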
Thanks in advance!