Hello, I am trying to filter a FASTA sequence database using bash.
>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT
>49689252 Fumaria officinalis
ACGCACCGAGTCGCCCCCACCCGCCCCCCAAGAGGTGCCGCGGGAGGGAGCGGAGAATGGCCCCCCGTGCCCCAGCGCGCGGCCGGCCCAAACACAGGCCCCGGGAGGCCGGCGTCACGAT
...
It's a plant database and I want to filter it with a list of plants:
Abies alba
Acer campestre
Achillea millefolium subsp. sudetica
...
This is the result I need:
>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT
I have already tried:
grep -Ff list.txt database.txt > filtered.txt
This gave me a list of the ID lines from the database that match the list of plants. With the following command I then pulled out those ID lines together with the sequence line after each of them:
grep -x -F -A 1 -f 'filtered.txt' 'database.fasta' > filtered_database.fasta
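One thing to watch with -A 1: whenever two selected records are not next to each other in the database, grep prints a -- separator line between them, and that line ends up inside the FASTA output. A minimal sketch of the effect and one way to drop it, using a shortened stand-in database. The file names demo_db.fasta and demo_ids.txt are made up for the example, and it assumes each sequence sits on a single line, as -A 1 does:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Shortened stand-in for the database (assumption: one sequence line per record).
cat > demo_db.fasta <<'EOF'
>122051856 Abies alba
ACAACATACG
>638110106 Acer campestre
ATCGTTGCCC
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGC
EOF

# Header lines to keep; the Acer record between them is skipped on purpose.
cat > demo_ids.txt <<'EOF'
>122051856 Abies alba
>49823319 Achillea millefolium subsp. sudetica
EOF

# Same options as the command above: a "--" line separates the two non-adjacent hits.
grep -x -F -A 1 -f demo_ids.txt demo_db.fasta > raw.fasta

# Portable cleanup: drop lines that consist of exactly "--".
# (GNU grep can also suppress them directly with --no-group-separator.)
grep -v -x -F -e '--' raw.fasta > clean.fasta

cat clean.fasta
```

If sequences can span several lines, -A 1 would cut them off after the first line, so this approach only fits single-line records.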
As the database I want to filter is very large and some plants occur multiple times because of their different tax IDs (e.g. Acer campestre), I am not sure whether this is the right way and whether I got all the sequences for the plants in the list...
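One possible way to check this is to compare the names in the list against the names in the filtered headers. A sketch, assuming every header has the form ">ID name"; the file names are made up, and "Quercus robur" is a hypothetical list entry added only to show what a miss looks like:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the filtered result (sequences shortened).
cat > demo_filtered.fasta <<'EOF'
>122051856 Abies alba
ACAACATACG
>638110106 Acer campestre
ATCGTTGCCC
>638110119 Acer campestre
ATCGTTGCCCT
EOF

# Stand-in plant list; "Quercus robur" is a hypothetical entry with no record.
cat > demo_list.txt <<'EOF'
Abies alba
Acer campestre
Quercus robur
EOF

# Names found in the headers: drop ">ID " and remove duplicates.
grep '^>' demo_filtered.fasta | cut -d ' ' -f 2- | sort -u > found_names.txt

# Names from the list that never appear as a header (exact whole-line comparison).
# "|| true" keeps the script going when nothing is missing and grep returns 1.
grep -v -x -F -f found_names.txt demo_list.txt > missing_names.txt || true

# Number of records per plant, to see which names occur with several IDs.
grep '^>' demo_filtered.fasta | cut -d ' ' -f 2- | sort | uniq -c > counts.txt

cat missing_names.txt
cat counts.txt
```

An empty missing_names.txt would mean every plant in the list has at least one record in the filtered file.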
Are there any other ways to filter this FASTA database with a list of binomial (scientific) names using bash?
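One alternative that avoids the intermediate ID list is a single awk pass: read the plant list into memory, then keep every record whose header name matches a list entry exactly. This is only a sketch under stated assumptions: headers look like ">ID name", the list has one name per line with no trailing spaces or Windows line endings, and the file names are made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Shortened stand-in for the real database.
cat > demo_db.fasta <<'EOF'
>122051856 Abies alba
ACAACATACG
>638110106 Acer campestre
ATCGTTGCCC
>638110119 Acer campestre
ATCGTTGCCCT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGC
>49689252 Fumaria officinalis
ACGCACCGAG
EOF

cat > demo_list.txt <<'EOF'
Abies alba
Acer campestre
Achillea millefolium subsp. sudetica
EOF

awk '
    NR == FNR { wanted[$0] = 1; next }   # first file: remember every plant name
    /^>/ {
        name = $0
        sub(/^>[^ ]+ /, "", name)        # drop the leading ">ID " to leave only the name
        keep = (name in wanted)          # decide once per record
    }
    keep                                 # print header and all sequence lines of kept records
' demo_list.txt demo_db.fasta > filtered_demo.fasta

cat filtered_demo.fasta
```

Because the decision is made per header and then applied to every following line, this also works when a sequence spans several lines, no -- separators are produced, and all records sharing a name (such as the two Acer campestre entries) are kept.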
Thank you very much!
Greetings, Lisa
Thank you very much! This helps a lot. And yes, the -- lines appeared before, so thank you for your advice :-)