Question

Fasta header, search and replace...?

9.0 years ago

Buffo ★ 2.4k

Hi everybody. I ran a blastp (about 7,000 sequences) and now I want to build a summary table of the hits. Some sequences have the same hit (they belong to the same gene family), so I wrote a Python script to count duplicates. Its output looks like this:

Hypothetical protein                  400
Hypothetical protein, putative        200
hypothetical Protein                   40
Hypotetycal protein                     2
etc., with different gene names and different errors

In my results I get different counts for the same target because the names are not identical. So I checked these errors almost one by one and built a table, which looks like this:

Variants                                                    Rename
Hypothetical protein                                  Hypothetical protein
Hypothetical protein, putative                        Hypothetical protein  
hypothetical Protein                                  Hypothetical protein
trans-sialidase                                        trans-sialidase
trans-sialidase, putative,                             trans-sialidase
mucin-associated surface protein (MASP), putative      mucin-associated surface protein (MASP)
mucin-associated surface protein (MASP)                mucin-associated surface protein (MASP)
etc 
etc

I have to run another blastp with other sequences against the same database, so if I don't do anything I will get the same problem. What I want to do is rename the sequences whose product matches a variant name (Variants column) to a single name that my script can count (Rename column). The FASTA headers look like this:

>TcC.509233.70 | organism=TCCBR_STRA_6789 | product=Spastin, putative | location=TcChr15-S:273711-276341(-) | length=876 | sequence_SO=chromosome | SO=protein_codingT
TKAANNNSTRVTHGNSLLQRVRQSSYCKGIPEETCLAVLQQVVDRACPVSFSGISGLEVC
>TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding
MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG

I have tried to write a Python script for this but could not get it to work. I tried splitting on '|', but I can't replace the name :( Has anybody done something like this? Any advice on how to do it? Help!

sequence database fasta genome • 3.9k views

updated 9.0 years ago by shenwei356 8.7k • written 9.0 years ago by Buffo ★ 2.4k

score 3 · Answer 1 · 2016-08-03

Try seqkit replace (see the seqkit download page and the usage of the replace subcommand):

$ seqkit replace    --pattern '(product=[^,]+),?[^\|]* \|'    --replacement '$1 |'    seq.fa
>TcC.509233.70 | organism=TCCBR_STRA_6789 | product=Spastin | location=TcChr15-S:273711-276341(-) | length=876 | sequence_SO=chromosome | SO=protein_codingT
TKAANNNSTRVTHGNSLLQRVRQSSYCKGIPEETCLAVLQQVVDRACPVSFSGISGLEVC
>TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding
MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG
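
For readers who prefer to stay in Python, the same regular expression can be applied with the standard `re` module. This is only a sketch of what the seqkit pattern above does to a single header line; the function name `strip_product_suffix` is made up for illustration, and Python writes the back-reference as `\1` where seqkit uses `$1`.

```python
import re

# Same pattern as in the seqkit command: capture "product=" up to the first
# comma, then drop the rest of that field up to the next " |".
PRODUCT_RE = re.compile(r"(product=[^,]+),?[^|]* \|")


def strip_product_suffix(header: str) -> str:
    """Remove ", putative"-style suffixes from the product field of a header."""
    return PRODUCT_RE.sub(r"\1 |", header, count=1)
```

Headers whose product has no comma come back unchanged, as in the seqkit output above. Like the original pattern, this relies on the comma inside the product field, so it does not merge case or spelling variants.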

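You can also edit the table file with csvtk (see the csvtk download page and the usage of the mutate subcommand). The pattern keeps everything before the first comma, so case variants such as "hypothetical Protein" are not merged: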

$ head -n 3 table.tsv 
Variants
Hypothetical protein
Hypothetical protein, putative

$ csvtk -t mutate --fields Variants --pattern '(^[^,]+),?' --name Rename table.tsv > renamed_table.tsv

$ csvtk -t pretty renamed_table.tsv 
Variants                                            Rename
Hypothetical protein                                Hypothetical protein
Hypothetical protein, putative                      Hypothetical protein
hypothetical Protein                                hypothetical Protein
trans-sialidase                                     trans-sialidase
trans-sialidase, putative,                          trans-sialidase
mucin-associated surface protein (MASP), putative   mucin-associated surface protein (MASP)
mucin-associated surface protein (MASP)             mucin-associated surface protein (MASP)

score 0 · Answer 2 · 2016-08-03

You can split the header on "|", then iterate over the resulting elements and split each one on "=". If the first part is 'product', look up the second part in your dictionary and replace the string if needed. Then join the pair back with "=", join the entire list with "|", and write this string as the FASTA header to another file.
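
The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes the rename table is tab-separated with the column names `Variants` and `Rename` from the question, and the function names (`load_rename_table`, `rename_product`, `rename_headers`) are made up for illustration.

```python
import csv


def load_rename_table(path):
    """Read a tab-separated table with 'Variants' and 'Rename' columns into a dict.

    Assumption: the table is a TSV file with exactly these column names.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Variants"].strip(): row["Rename"].strip() for row in reader}


def rename_product(header, rename):
    """Replace the value of the 'product=' field if it appears in the dict."""
    fields = header.split("|")
    for i, field in enumerate(fields):
        key, sep, value = field.partition("=")
        if sep and key.strip() == "product":
            product = value.strip()
            new_name = rename.get(product, product)  # unknown names stay as-is
            # Replace only the product text so the spaces around it are kept.
            fields[i] = key + "=" + value.replace(product, new_name, 1)
    return "|".join(fields)


def rename_headers(lines, rename):
    """Yield FASTA lines, rewriting header lines and passing sequences through."""
    for line in lines:
        if line.startswith(">"):
            yield rename_product(line, rename)
        else:
            yield line
```

To use it, build the dictionary with `load_rename_table(...)`, open the input and output FASTA files, and pass the result of `rename_headers(infile, rename)` to `outfile.writelines(...)`. Because the lookup is an exact match, every variant (including case and spelling variants) has to be listed in the table.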