hi i'am trying to edit the header of my fasta files using seqkit and i have been able to do it but i'm not able to save it! the command i am using to edit multiple fasta files with respect to their filename and doing it with refseq-
for i in $(find -name \genomid); do
seqkit replace -p "^(.+?) (.+?)$" --replacement '{kv}' -k proid_unique *.faa;
done
The directory having all my fasta files is like this-
PATH:
~/PANGENOMICS/DATA1/test
FILES in the directory:
GCF_000016305.1_ASM1630v1_protein.faa
GCF_000220485.1_ASM22048v1_protein.faa
GCF_900635735.1_32875_B01_protein.faa
proid_unique
genomid
i am finding filenames using a csv file list- genomid
GCF_900635735.1_32875_B01_protein.faa:WP_151362402.1:WP_151362402.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362403.1:WP_151362403.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362404.1:WP_151362404.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362405.1:WP_151362405.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362406.1:WP_151362406.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362407.1:WP_151362407.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362408.1:WP_151362408.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362409.1:WP_151362409.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362410.1:WP_151362410.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362411.1:WP_151362411.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362412.1:WP_151362412.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362413.1:WP_151362413.1#0940
GCF_900635735.1_32875_B01_protein.faa:WP_151362414.1:WP_151362414.1#0940
the file (proid_unique) i used as key-value file to edit the fasta headers look like this-
WP_151362399.1 WP_151362399.1#0940
WP_151362400.1 WP_151362400.1#0940
WP_151362401.1 WP_151362401.1#0940
WP_151362402.1 WP_151362402.1#0940
WP_151362409.1 WP_151362409.1#0940
WP_151362410.1 WP_151362410.1#0940
WP_151362411.1 WP_151362411.1#0940
WP_151362412.1 WP_151362412.1#0940
WP_151362413.1 WP_151362413.1#0940
WP_151362414.1 WP_151362414.1#0940
WP_094096600.1 WP_094096600.1#0945
WP_016530940.1 WP_016530940.1#0950
WP_000940121.1 WP_000940121.1#0951
WP_012540940.1 WP_012540940.1#0951
example of input-
>WP_151362411.1 YoaH family protein [Klebsiella pneumoniae]
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1 gamma-glutamylcyclotransferase [Klebsiella pneumoniae]
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1 acid resistance repetitive basic protein Asr [Klebsiella pneumoniae]
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1 ABC transporter permease [Klebsiella pneumoniae]
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK
example of output- expected and correct with above command
>WP_151362411.1#0940
MYAPQCSRSKRCFAGLPSLSHEQQQQAVERIHELMAQGISSGQAIALVAEELRATHTGEQ
IVARFEDEDEDE
>WP_151362412.1#0940
MLEAIGGEWRPGYVTGTFYARGWGAAADFPGIVLDAHGPRVNGYLFLSDRLARTGPCWTT
LRRGYDRVPVEVTTDDGQQISAWIYQLQPRG
>WP_151362413.1#0940
MKKVLALVVAAAMGLSSVAFAADAASTTPSAAASHTTVHHKKHHKAAAKPAAEQKAQAAK
KHHKTAAKTGSRAESAGCKETS
>WP_151362414.1#0940
MKRAPWYLRLATWGGVIFLHFPLLIIAIYAFNTEDAAFSFPPQGLTLRWFSEAAGRSDIL
QAVTLSLKIAALSTAIALVLGTLAAGALWRSAFFGKNAVSLLLLLPIALPGIITGLALLT
AFKAVGLEPGLLTIVVGHATFCVVVVFNNVIARFRRTSWSMVEASMDLGATGWQTFRYVV
LPNLGSALLAGGMLAFALSFDEIIVTTFTAGHERTLPLWLLNQLGRPRDVPVTNVVALLV
MLVTTIPILGAWWLTRDGDSDAGNGK
i am getting the required/expected result but this editing is not saving with this command, can someone help me figure out that how to save those editing in the original files bcz when i open those files again they were same as before with no edited header?
Python alternative of the above command used would also be helpful
do seqkit replace -p "^(.+?) (.+?)$" --replacement '{kv}' -k accesskey_idsvalue1 *.faa
is incorrect part of your code. You are not using variablei
in loop. Post output fromfor i in $(find -name \cusid_genomid); do echo $i;done
(fist few lines are helpful). First run seqkit command with single input and output (without using>
) and then within loop, replace input with variable and output with a appropriate second variable name.i don't get it , i need seqkit replace command to do the editing i cannot remove it. it is not just saving the editing in original file that i neede to solve otherwise this whole command is working
Your post is not helpful in addressing the issue.
sorry for ambiguity, now i have edited again, i hope it's more understable now.
i did the editing in header of fasta files but the editing is not saved in original files, whenever i open my fasta files there is not edited header that i did with the alternate awk command-
for i in $(find -name \genomid); do echo $i|\ awk ' FNR==NR{ a[$1]=$2 next } ($2 in a) && /^>/{ print ">"a[$2] next } 1 ' proid_unique FS="[> ]" *.faa; done
Loop code is incorrect. Variables are not used, Inputs are incorrect and there are no outputs in loop. It seems to me that your seqkit replacement regex is complex. Try this:
Assumptions:
.faa
extension.New files will have '.fa' as extension and end with "_out.fa".