I have code that generates a script to rename the sequence titles in a FASTA file.
This is the text file I'm using:
#Assembly Genome Center name RefSeq Accession.version GenBank Accession.version NCBI name
GeoFor_1.0 scaffold40 NW_005054297 JH739887 GPS_002009865
GeoFor_1.0 scaffold112 NW_005054298 JH739888 GPS_002009866
GeoFor_1.0 scaffold41 NW_005054299 JH739889 GPS_002009867
GeoFor_1.0 scaffold130 NW_005054300 JH739890 GPS_002009868
GeoFor_1.0 scaffold54 NW_005054301 JH739891 GPS_002009869
GeoFor_1.0 scaffold16 NW_005054302 JH739892 GPS_002009870
This is the FASTA file I'm using, whose sequence names I want to change. As you can see, I want to replace each scaffold name with the JH###### accession that matches it.
>Scaffold410 275
TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATTTTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCATAAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTACCATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGTTCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
>Scaffold430 374
CTTTGTTAACTGAAAGAGCCTCTAAGTAGATGACCAGTGCTCAGTTAGTACAGTATGAATTTTGTTTAATGGAACAGGAAGATTTAGTATTGAGAAGCGGTTAAGGGTTTAACCCAGCCTCCTGTCTGAATGGACCTGAAGAGGGGGGCCGGGAAGAAACCCATGACTGCATTAAAGTGATAGATCTCCAGACATGGGCTAGGGAAGATTTACAAGACACTCCCTGGCCTGAGGGAGAAAATATGTTTATTGATGAGTCTTCAAGGGTGGCAGAAGGGAAGCGATTTACAGGATACACAATCATTAATGGAAGGAAATTAAAGGAAGGGGGGAGATTGTCACCCACCTGGTCAGTTCAGACAGCAGAGCTGTAT
>Scaffold1010 597
GGAACACACCTGGGCACACCTGGATGGAGCAGGAACACACCTGGATGGGGTTAGGACACATCTGGATGGCGTTGGGACACACCTGGATGCGCTCAGGGTACACCTG...
This is the command I use to create a script to change the names:
tail -n +2 scaffold_names_2.txt | while read assemb gcenter refseq genbank ncbi; do echo -ne "sed 's/[[:<:]]$gcenter[[:>:]]/$genbank/g' | " >>script.sh; done
The problem is that I'm not able to save the fasta file with the new names.
This is the last line of my script:
... sed 's/[[:<:]]scaffold4469[[:>:]]/JH767125/g' name.fasta.fa
The script is running without error, but it's doing nothing.
Do you know why? How can I change all the titles and save the result as a new FASTA file with another name?
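For what it's worth, the generated script joins every sed with a pipe but only passes name.fasta.fa to the last one, so only that final substitution reads the file, and its output goes to the terminal rather than into a file. The table also spells the names scaffold40 while the headers use Scaffold410, so case may matter too. A minimal sketch of a one-pass alternative with awk follows. It assumes the table is whitespace-separated as shown, that the scaffold name is the first word of each header, and that matching should ignore case (an assumption). The demo records and the output name renamed.fasta.fa are hypothetical.

```shell
# Hypothetical demo files in a scratch directory; swap in the real table and FASTA.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  '#Assembly Genome Center name RefSeq Accession.version GenBank Accession.version NCBI name' \
  'GeoFor_1.0 scaffold40 NW_005054297 JH739887 GPS_002009865' \
  'GeoFor_1.0 scaffold112 NW_005054298 JH739888 GPS_002009866' > scaffold_names_2.txt

printf '%s\n' \
  '>Scaffold40 8' 'ACGTACGT' \
  '>Scaffold999 4' 'TTTT' \
  '>Scaffold112 6' 'GGCCAA' > name.fasta.fa

awk '
  # First file: map lower-cased scaffold name (column 2) to GenBank ID (column 4).
  NR == FNR { if ($0 !~ /^#/) map[tolower($2)] = $4; next }
  # Second file: on header lines, swap the first word when it is in the map.
  /^>/ {
    key = tolower(substr($1, 2))
    if (key in map) $1 = ">" map[key]
  }
  # Print every line; sequence lines and unmatched headers pass through unchanged.
  { print }
' scaffold_names_2.txt name.fasta.fa > renamed.fasta.fa

cat renamed.fasta.fa
```

Because the whole first word is looked up, scaffold40 cannot accidentally match Scaffold410, which is what the word-boundary markers in the sed version were for. Redirecting to renamed.fasta.fa leaves the original file untouched.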
The only problem here is that you are not getting the same format as the original fasta file.
will give you something like this (see the line number):
Whereas the original file is organized like this:
The solution to this is:
--line-width 0 will not wrap the text!
--keep-key will keep a string that wasn't matched.
See seqkit replace -h
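Putting the two flags together, the full command might look like the sketch below. It assumes seqkit is installed, that the key is the first word of each header, and that key.tsv is a tab-separated file of old name and new name. The file names and demo records are hypothetical. -p, -r '{kv}' and -k are the pattern, replacement and key-value-file options of seqkit replace.

```shell
# Hypothetical demo files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tab-separated key/value file: old name, then new name.
printf 'Scaffold40\tJH739887\n' > key.tsv
printf '%s\n' '>Scaffold40 8' 'ACGTACGT' '>Scaffold999 4' 'TTTT' > name.fasta.fa

if command -v seqkit >/dev/null 2>&1; then
  # Capture the first word of each header and replace it with its value from key.tsv.
  # --keep-key leaves names without an entry as they are; --line-width 0 disables wrapping.
  seqkit replace -p '^(\S+)' -r '{kv}' -k key.tsv --keep-key --line-width 0 \
    name.fasta.fa > renamed.fasta.fa
else
  echo "seqkit not found; skipping" >&2
fi
```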
glad you found it!!!
Hi all! I'm having more or less the same problem, but in my case the keys must only partially match the names in the FASTA file: the names also include numbers, and I need to match by the species code.
Example fasta
Example key
I tried several things, from including special characters (* and .*) in the key to playing around with the "^(\w+)" expression, in which I suspect the problem lies, to no avail. I'm sure the solution is close, but I haven't got it yet. Any clues? P.S. It correctly loads the kv, yet it cannot match anything in the FASTA file. Thank you!
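One thing that may be worth checking: \w matches letters, digits and underscores, so "^(\w+)" captures the species code together with the numbers that follow it, and that combined string is what gets looked up in the key file. A letters-only group such as ^([A-Za-z]+) would stop at the first digit. The sketch below shows the difference with sed on made-up headers. The IDs ABC123 and XYZ7 are hypothetical, since the real example is not shown here, and it assumes the species code is a leading run of letters.

```shell
# Hypothetical headers: species code followed directly by numbers.
headers='ABC123 some description
XYZ7 another one'

# Equivalent of ^(\w+): letters, digits and underscore are all captured.
word_keys=$(printf '%s\n' "$headers" | sed -E 's/^([[:alnum:]_]+).*/\1/')

# Letters-only capture: stops at the first digit, leaving just the species code.
code_keys=$(printf '%s\n' "$headers" | sed -E 's/^([A-Za-z]+).*/\1/')

# One line per captured key (the unquoted variables are split on purpose).
printf 'word capture:    %s\n' $word_keys
printf 'letters capture: %s\n' $code_keys
```

If this is the cause, passing the letters-only group to -p in seqkit replace should make only the species code the lookup key, with the numbers after it left in place.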
Also, the info lines
won't be printed if you decide to save the files! Super handy trick!