I have a question about extracting sequences from a fasta file (>7000 sequences) using a reference .txt file that lists sequence headers. I have searched extensively for a solution, but surprisingly, nothing really matches what I want to do. I have two files:
1) a fasta file which looks like this:
>Zotu1
ACTGACAAAGCA
TGCACGTCATTTT
>Zotu2
ATGCATCAGCATA
TGACCCCCGTTTA
>Zotu10
CGTCGAAAAATTT
CGATACACCCTAT
>Zotu22
CGTACGTCCCCTT
CGATATAATATATA
2) a .txt file with a list of sequence names:
Zotu1
Zotu2
Now, I want to use the .txt file to select sequences from the .fasta file. I have two semi-solutions that do part of the job.
OPTION 1:
cat list.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - sequences.fas > newfile.fas
Problem: this command returns the full sequences, but it extracts too many of them, because every header that partially matches a string in the .txt file is selected. Here, that means Zotu10 and Zotu22 are selected as well.
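The over-matching comes from the generated pattern: `^>Zotu1` followed by `.*?` puts no boundary after the name, so `>Zotu10` satisfies it too. The sketch below illustrates this with plain `grep -E` on the four example headers (not with the full pcregrep pipeline). It assumes that a sequence ID ends at whitespace or at the end of the line, and the variable names are illustrative only.

```shell
# The four headers from the example fasta file.
headers=$(printf '>Zotu1\n>Zotu2\n>Zotu10\n>Zotu22\n')

# No boundary after the name: the prefix "Zotu1" also matches ">Zotu10".
unanchored=$(printf '%s\n' "$headers" | grep -E '^>Zotu1')

# Require whitespace or end of line right after the name (an assumption
# about where the ID ends): only ">Zotu1" matches.
anchored=$(printf '%s\n' "$headers" | grep -E '^>Zotu1([[:space:]]|$)')

printf 'unanchored:\n%s\nanchored:\n%s\n' "$unanchored" "$anchored"
```

A similar boundary would presumably be needed in the pattern that awk builds in OPTION 1; that variant is not tested here.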
OPTION 2:
grep -A 1 -wFf list.txt sequences.fas > newfile2.fas
Problem: this command correctly selects only the sequences whose headers completely match the strings in the .txt file, but it returns only the first line of each sequence instead of the full fasta record. The output thus looks like this:
>Zotu1
ACTGACAAAGCA
>Zotu2
ATGCATCAGCATA
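The truncation follows from `-A 1`, which prints exactly one line after each matching header, while every sequence here is wrapped over two lines. One possible workaround, sketched below, is to join the sequence lines of each record into a single line first, so that `grep -A 1` returns the whole sequence. The function names `unwrap_fasta` and `extract_by_grep` are hypothetical. The sketch assumes that list.txt contains no empty lines and that the IDs in the headers are delimited by non-word characters, as `-w` requires.

```shell
# unwrap_fasta FASTA: print each record as a header line plus ONE sequence line.
unwrap_fasta() {
  awk '/^>/ { if (NR > 1) print ""; print; next }  # end previous sequence, print header
       { printf "%s", $0 }                         # append sequence line, no newline
       END { if (NR > 0) print "" }' "$1"          # terminate the last sequence
}

# extract_by_grep LIST FASTA: keep records whose header contains a listed ID
# as a whole word. The last grep drops the "--" separators that grep -A
# prints between non-adjacent groups of matches.
extract_by_grep() {
  unwrap_fasta "$2" | grep -A 1 -wFf "$1" | grep -v '^--$'
}
```

It could be called as `extract_by_grep list.txt sequences.fas > newfile2.fas`. Note that the extracted sequences come out on a single line each, not in their original wrapping.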
I tried combining both solutions, but that did not work. I would appreciate an elegant solution to this problem, preferably one that builds on the commands I already have.
Many thanks!
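One way to get both exact matching and complete multi-line records in a single pass is sketched below with awk. It assumes that the sequence ID is the first whitespace-delimited token after `>`, that list.txt holds one ID per line, and that both files use Unix line endings. The function name `extract_fasta` is hypothetical.

```shell
# extract_fasta LIST FASTA: print every record whose ID appears in LIST.
extract_fasta() {
  awk '
    # First file (the list): remember each non-empty ID.
    NR == FNR { if ($1 != "") wanted[$1] = 1; next }
    # Second file (the fasta): on a header, strip ">" and look the ID up exactly.
    /^>/ { id = substr($1, 2); keep = (id in wanted) }
    # Print the header and all following sequence lines while keep is true.
    keep
  ' "$1" "$2"
}
```

It could be called as `extract_fasta list.txt sequences.fas > newfile.fas`. Because the lookup is an exact array-key match, Zotu1 does not select Zotu10, and the original line wrapping of each sequence is preserved.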
Please format your fasta sequences appropriately, using the formatting button. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.
I just did that for the OP. But yes, next time he or she should do it. Thanks, WouterDeCoster.
Thanks! I adjusted it a bit. It should be fine now.
Did you try to google it or try any existing solution?
I copy-pasted your title, and the first link I got is this: Extract fasta sequences from a large file using a list of names
Yes, I have been entirely through that thread (and several others) before posting here.