Hi all,
I have a .fasta file resulting from vsearch clustering. The sequences in the .fasta file look like:
>centroid=211650b5-4541-47e4-a7a4-3659962f9818;seqs=2236
GAGATGATGATGATATAATT
The "seqs" parameter in the sequence header reflects the number of reads from the original input file that belong to that cluster consensus.
I now want to remove sequences whose "seqs" value is below a certain threshold (for example, 10). I want to use a conditional statement for this, but I cannot find software that supports it. I checked SeqKit and Seqtk, but these only allow filtering by regular expression. I also find it hard to use bash/awk, because the file is in .fasta format.
I'd need something like (in pseudocode):
for sequence in fasta:
    if seqs < value:
        remove sequence
How could I filter based on a conditional statement? Thanks!
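For reference, the pseudocode above can be sketched in plain Python without any extra software. This is a minimal sketch, not a tested tool: the function names filter_fasta_by_seqs and filter_fasta_file are made up for illustration, and it assumes every header carries the count as "seqs=<integer>" separated by ";" as in the example above. Records without a "seqs" tag are dropped, which is an assumption you may want to change.

```python
import re

# Matches "seqs=<digits>" at the start of the header or after a ";" delimiter.
SEQS_TAG = re.compile(r"(?:^>|;)seqs=(\d+)")


def filter_fasta_by_seqs(lines, min_seqs):
    """Yield the lines of every FASTA record whose seqs value is >= min_seqs."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            match = SEQS_TAG.search(line)
            # Assumption: headers without a seqs tag are treated as failing.
            keep = match is not None and int(match.group(1)) >= min_seqs
        if keep:
            # Header and all following sequence lines of a kept record.
            yield line


def filter_fasta_file(in_path, out_path, min_seqs):
    """Stream in_path to out_path, keeping only records that pass the threshold."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_fasta_by_seqs(src, min_seqs))
```

Calling filter_fasta_file("input.fasta", "output.fasta", 10) would then keep only the clusters with at least 10 reads. Multi-line sequences are handled because the keep/drop decision is made once per header and applied to every line until the next header.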
Thank you very much @Brian Bushnell for this addition to BBTools! Based on all replies, I am quite surprised that such a tool did not exist yet.
However, I get an error when I run the following line:
reformat.sh in=input.fasta out=output.fasta tags=seqs= delimiter=; minvalue=10
The entries in the input fasta file look like:
>centroid=40352020-a4fc-4473-9c18-ef20ed476c2d;seqs=2501
AAGAAATTTAATAATTTTGAAAATGGATTTTTTTTTTTGTTTTGGCAAGAGCATGAGAGCTTTTACTGGGCAAGAAGACAAGAGATGGAGAGTCCAGCCGGGCCTGCGCTTAAGTGCGCGGTCTTGCTAGGCTTGTAAGTTTCTTTCTTGCTATTCCAA
I downloaded BBMap_39.06.tar.gz
Looks like we each made a mistake here. First off, the flag is "tag", not "tags". Second, "delimiter=;" doesn't work because ";" is a control character in the shell. So you can do either of these, which I just tested:
or
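To illustrate the control-character point: in a POSIX-style shell an unquoted ";" ends the command, so "minvalue=10" never reaches reformat.sh as an argument. Below is a rough sketch using Python's standard shlex module, which only approximates how a real shell tokenizes a line. The helper name shell_tokens is made up for illustration, and the quoted form is a general shell idiom shown for comparison, not necessarily either of the commands referred to above.

```python
import shlex


def shell_tokens(command):
    """Split a command line roughly the way a POSIX shell would tokenize it."""
    # punctuation_chars=True makes shlex return ; & | ( ) < > as separate tokens.
    return list(shlex.shlex(command, posix=True, punctuation_chars=True))


# Unquoted: the ";" becomes its own token, i.e. a command separator.
unquoted = shell_tokens(
    "reformat.sh in=input.fasta out=output.fasta tag=seqs= delimiter=; minvalue=10"
)

# Quoted: the ";" stays inside a single argument.
quoted = shell_tokens(
    "reformat.sh in=input.fasta out=output.fasta tag=seqs= 'delimiter=;' minvalue=10"
)

print(unquoted)
print(quoted)
```

In the unquoted case the token list contains a lone ";" between "delimiter=" and "minvalue=10", which is where the shell would split the line into two separate commands.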
First of all, sorry for the late reply Brian.
I still get the same error when I run the commands above:
my java version =
I have now made a test_fasta.fasta file whose headers look like those in the image, and which contains only the 3 sequences shown in the image:
Is there something I'm missing?
You're using BBTools v39.01. I just added the new flag in 39.06 :)
Oh wow that was a bit stupid.. I downloaded 39.06 but I accidentally used 39.01... anyway, thank you very much for your help!! I now used the correct version and it works like a charm :-)