I have two files (see below for the actual format): a fasta file with more than 7000 sequences and a .txt file consisting of two columns. The first column in the .txt file corresponds to the name in the fasta file (minus the tail ';size=') and the second column gives the total number of sequences corresponding to that name. I would like to append this size to the header of the matching sequence in the fasta file. In other words: I would like the number '6047', which corresponds to Zotu1, to end up in the fasta file as '>Zotu1;size=6047'. The Zotus in the text file are not sorted.
I have no clue how to go about this so any pointing in the right direction would be extremely appreciated!
Thanks!
The files:
1) the fasta file looks like this:
>Zotu1;size=
AGCTCCAAAAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCTGTTCAGGTTCATTTCGACTCGTC
GAGTGAAACTGGACATACGTTTGCAAACTAAAATCGGCCTTCACTGGTTCGTCTTAGGGAGTAAACATTTTACTGTGAAA
AAATTAGAGTGTTCCAGGCAGGTTTTAGCCCGAATACATTAGCATGGAATAATGGAATAGGACTAAGTCCATTTTATTGG
TTCTTGGATTTGGTAATGATTAATAGGGGCAGTTGGGGGCATTAGTATTTAATAGTCAGAGGTGAAATTCTTGGATTTAT
TAAGGACTAACTAATGCGAAAGCATTTGCCAAAGATGTTTTCA
>Zotu2;size=
AGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGTCGGGGGCAGCGGTCCGCCCC
TTGTGGGTGTGCACTGGTCCACCCGGCCTTACTGCCGGGGACGCGCTCCTGGCCTTCGCTGGTCGGGACGCGGAGTTGGC
GATGTTACTTTGAAAAAATTAGAGTGCTCAAAGCAAGCCTATGCTCTGAATACATTAGCATGGAATAACGTGATAGGACT
...
2) the .txt file looks like this:
Zotu1 604
Zotu566 1023
Zotu6785 31
Zotu6 111453
Zotu69 10380
Zotu223 3706
Zotu215 2559
Zotu2697 109
Zotu3 211288
Zotu742 697
...
So you have 6047 sequences named Zotu1 in your fasta file?
Btw, your fasta file is not really a fasta file: you do not have '>' before each header. Does your fasta file really look the way it is displayed here?
1) No, I have one sequence with the name Zotu1 in this fasta file. However, based on my reference mapping (-usearch_global command), I know that 6047 sequences in my total dataset have been mapped to the reference 'Zotu1'. These sequences might be, but do not necessarily have to be, identical to Zotu1. I now want to get that size information into the file containing the reference Zotus.
2) No, there is indeed a '>' in front of the sequences, but that one somehow disappeared in the message.
You can find answers in these links, I guess:
replace fasta headers with another name in a text file
Renaming fasta headers according to a matching name list
Thanks for the tips!
I have been playing around with the Python script I found in 'replace fasta headers with another name in a text file' and managed to get it to work in a rather archaic, yet effective, way by 1) getting rid of the 'Zotu' notation in the .txt file, 2) sorting the .txt file using the 'sort' command in Linux, 3) adding 'Zotu' and 'size;' to the .txt file in their respective places using Excel (yes, I know), 4) getting rid of the tabs, and 5) applying the Python script to the new .txt file and the original fasta file. I thought this might be interesting for anyone out there who is also totally new to the field but needs to move along 'fast' with some new data...
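For anyone who would rather skip the sorting and Excel steps: a dictionary lookup does not need the .txt file to be sorted at all. Below is a minimal Python sketch of that idea. It is not the script from the linked thread. The function names are made up for illustration, and it assumes the .txt file has two whitespace- or tab-separated columns and that every fasta header looks like '>name;size='. Headers with no entry in the .txt file are left unchanged.

```python
def load_sizes(lines):
    """Map each name in the two-column table to its count (kept as text)."""
    sizes = {}
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) >= 2:   # skip blank or incomplete lines
            sizes[fields[0]] = fields[1]
    return sizes


def add_sizes(fasta_lines, sizes):
    """Yield fasta lines, filling in ';size=N' on headers found in sizes."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Name is everything between '>' and the first ';'
            name = line[1:].split(";", 1)[0]
            if name in sizes:
                line = f">{name};size={sizes[name]}"
        yield line


def annotate_files(fasta_path, table_path, out_path):
    """Read both input files and write the annotated fasta to out_path."""
    with open(table_path) as table:
        sizes = load_sizes(table)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in add_sizes(fasta, sizes):
            out.write(line + "\n")
```

With hypothetical file names, it would be called as annotate_files("zotus.fasta", "zotu_sizes.txt", "zotus_sized.fasta"). Sequence lines pass through untouched, so the line wrapping of the original fasta file is preserved.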
Did you solve your issue with this Python code?
Yes! Thanks again for your tip.