Hi everybody, I ran a blastp (about 7,000 sequences) and now I want to make a summary table of the hits. Some sequences have the same hit (from a gene family), so I wrote a Python script to count duplicates. The result from that script looks like:
Hypothetical protein 400
Hypothetical protein, putative 200
hypothetical Protein 40
Hypotetycal protein 2
etc., with different gene names and different errors
In my result I get different counts for the same target because the names are not identical. So what I did was check those errors almost one by one and build a table, which looks like:
Variants                                             Rename
Hypothetical protein                                 Hypothetical protein
Hypothetical protein, putative                       Hypothetical protein
hypothetical Protein                                 Hypothetical protein
trans-sialidase                                      trans-sialidase
trans-sialidase, putative                            trans-sialidase
mucin-associated surface protein (MASP), putative    mucin-associated surface protein (MASP)
mucin-associated surface protein (MASP)              mucin-associated surface protein (MASP)
etc.
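For what it's worth, a table like this can also be applied directly to the counts. The following is only a rough sketch, not the script used above: the `RENAME` dictionary and the `merge_counts` function are made-up names for illustration, and the counts are the ones from the example output. Any name that is not in the table (such as the misspelled one) is kept as it is.

```python
from collections import Counter

# Hypothetical in-memory copy of the Variants -> Rename table (assumption:
# in practice it would be read from a two-column file).
RENAME = {
    "Hypothetical protein": "Hypothetical protein",
    "Hypothetical protein, putative": "Hypothetical protein",
    "hypothetical Protein": "Hypothetical protein",
    "trans-sialidase, putative": "trans-sialidase",
}

def merge_counts(raw_counts, rename):
    """Add up counts of all variants that map to the same name."""
    merged = Counter()
    for name, n in raw_counts.items():
        # Names missing from the table are left unchanged.
        merged[rename.get(name, name)] += n
    return merged

# Counts taken from the example output above.
raw = {
    "Hypothetical protein": 400,
    "Hypothetical protein, putative": 200,
    "hypothetical Protein": 40,
    "Hypotetycal protein": 2,
}

merged = merge_counts(raw, RENAME)
for name, n in merged.most_common():
    print(name, n)
```

The misspelled "Hypotetycal protein" stays separate here because it has no row in the table; adding a row for it would fold it into the main count.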
I have to run another blastp with other sequences but against the same database, so if I don't do anything I will get the same problem. What I want to do is rename the sequences that carry a variant name (Variants column) to a single name that my script can count (Rename column). The FASTA headers look like:
>TcC.509233.70 | organism=TCCBR_STRA_6789 | product=Spastin, putative | location=TcChr15-S:273711-276341(-) | length=876 | sequence_SO=chromosome | SO=protein_coding
TKAANNNSTRVTHGNSLLQRVRQSSYCKGIPEETCLAVLQQVVDRACPVSFSGISGLEVC
>TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding
MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG
I have tried to write a Python script to do this but I can't get it to work. I tried splitting on '|', but I can't replace the name :( Has anybody done something like this? Does anybody have any advice on how to do it? Help!
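The split-on-'|' idea can be made to work roughly like this. It is only a sketch under some assumptions: the header fields are separated by " | " exactly as in the two examples above, the product is always stored as `product=...`, and `RENAME`, `rename_header` and `rename_fasta` are made-up names for illustration. The rename table here is a small hypothetical one; in practice it would be read from the Variants/Rename table.

```python
# Hypothetical rename table (variant product name -> name to count).
RENAME = {
    "hypothetical protein": "Hypothetical protein",
    "Hypothetical protein, putative": "Hypothetical protein",
    "trans-sialidase, putative": "trans-sialidase",
}

def rename_header(header, rename):
    """Return a FASTA header with its product= field replaced via the table."""
    fields = header.rstrip("\n").split(" | ")
    for i, field in enumerate(fields):
        if field.startswith("product="):
            product = field[len("product="):]
            # Products that are not in the table are left unchanged.
            fields[i] = "product=" + rename.get(product, product)
    return " | ".join(fields)

def rename_fasta(lines, rename):
    """Yield FASTA lines, rewriting header lines and passing sequences through."""
    for line in lines:
        if line.startswith(">"):
            yield rename_header(line, rename)
        else:
            yield line.rstrip("\n")

# Second record from the example above.
example = [
    ">TcCLB.506503.53 | organism=TCCBR_STRA_6789 | product=hypothetical protein | location=TcChr40-S:277103-278386(-) | length=427 | sequence_SO=chromosome | SO=protein_coding",
    "MSFEHNASLGLRGSGGKHFSRCPPYMHSGRGASPPKRLPSRRATSHGETGPKVPAHRAYG",
]

renamed = list(rename_fasta(example, RENAME))
for line in renamed:
    print(line)
```

For a real file, the same generator can be fed an open file handle instead of the `example` list, writing each yielded line to a new FASTA file.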
I had never used those commands, but it works!! Thank you so much for sharing your knowledge and for your time.
Thanks for using my seqkit and csvtk. There are many more functions that you can explore on the websites. Both are open source on GitHub: seqkit and csvtk.