Hi, I have two fasta files with differetn headers. Sequences in file 2 partially match sequences in file1
I want to have a one-liner that will compare and find the partial match between the two files and report only sequences in file1.
Example of the two files and desired output.
thank you so much for your help in adavance
file1.fa
>chr5:111261537-111261740
AGAGTGCTAACATGCTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTCCGTGAAACCGTAGAGATTCTTTAGAAAATAGTCTAATAGAGGAGGTGGGATTGTACTTAGCCGTAGATGGTGGATCCGATGGTCAGGATTTGATGTTAGCA$
>chr5:121061210-121061338
GCTAGTTTCGGAGCCACAAAACCGGAGGGGATTAGAGGGGCTAAAATCCTCTCCTTATTTAAAATATTGAATAAGGAGGGGATTTTAGCCCCTCCAATCCCCTCTGGTATTTGGACGCCCAAACTAGC
>chr5:132306562-132306879
TGCACTTTCGGCCCTTAAACTTGTCTGGCTGTGTCAACCCGGTCCATAAACTTTTAAAGTATGTCTTTTAAGAACCTAATCTTTGTTCGGGTCTCTATTACGTCCTAACGGTTTGCTGTGGCGCACATGTCAATTGTCATGGTCTATTGGACCGCGTCCACCCTTCTCCCTCCTTCCTCGACGACTAACA$
>chr1:13506448-13506642
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTCCATTTTAGTTCTTAAATTACTAATTTCTGTATTTGACAGTTTAATTACTAAAATAGAATAAAATTGATGGATTAAAAATTAGTCCCTAGAAACAAAACACCGAGAACTGACAATAATCTAAGATACA$
>chr1:130644204-130644503
CAGTCTACAGGACTTGTCCGGTGCACCACCGGACAGCCAGGCGGGCCCACACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACCGGACTGTCCGGTGCACCATGCGACAGACAGCCTCCACCAAACAGCTAGTTTGGTGGAGGCTGTCTGTCGCATGGTGCACCGGACA$
>chr1:13560286-13560440
GCAGTTGAGGGGAATGTTGTCTGGTTGGAGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGCCCGCCTGCATCTATCCATGCATGATGGTGGAAGGTTTCGGACCAGGCTTCATTCCCCCCAACTCAACACTCAGCTGC
>chr5:160574908-160575134
GAGAGAGGGAGAGGGAGAAACTATGATATTTCATCAAGGCTAAAGGGAGAAGAGAGAGAGTACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTATTAGTGTCCAGTGATGATGAAGGCTGTACCCTCTCTCTTCTTCTCTTAGCTTCTTGATGCAA$
>chr5:164651674-164651746
CTCCATTTCATTTTAGTTATCGCTGGATAGGGTAAAATTAAACTATCGAACGATAACTAAAAAGAAACGGAG
>chr5:182040454-182040600
AGTGAGAGCTGGATGCAGAGGTTTATCGACCGATTCCCGCGTTGTTTGCGAGCTGCTGCAACAATAAAACGCTTGAACCACCACTCCCGGCCGGACCAGCACAAGGCAAGGAAGAGATCGATAAACCACTGCGCCCAGACACCACC
file2.fa
>seq1
CTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTC
>seq2
CTCCAATCCCCTCTGGTATTTGGACGCCCAAAC
>seq3
TGCACTTTCGGCCCTTAAACTTGTCTGGC
>seq4
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTC
>seq5
CACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACC
>seq6
AGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGC
>seq7
ACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTA
>seq9
AGCTGATAGCTA
output file
>chr5:111261537-111261740
AGAGTGCTAACATGCTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTCCGTGAAACCGTAGAGATTCTTTAGAAAATAGTCTAATAGAGGAGGTGGGATTGTACTTAGCCGTAGATGGTGGATCCGATGGTCAGGATTTGATGTTAGCA$
>chr5:121061210-121061338
GCTAGTTTCGGAGCCACAAAACCGGAGGGGATTAGAGGGGCTAAAATCCTCTCCTTATTTAAAATATTGAATAAGGAGGGGATTTTAGCCCCTCCAATCCCCTCTGGTATTTGGACGCCCAAACTAGC
>chr5:132306562-132306879
TGCACTTTCGGCCCTTAAACTTGTCTGGCTGTGTCAACCCGGTCCATAAACTTTTAAAGTATGTCTTTTAAGAACCTAATCTTTGTTCGGGTCTCTATTACGTCCTAACGGTTTGCTGTGGCGCACATGTCAATTGTCATGGTCTATTGGACCGCGTCCACCCTTCTCCCTCCTTCCTCGACGACTAACA$
>chr1:13506448-13506642
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTCCATTTTAGTTCTTAAATTACTAATTTCTGTATTTGACAGTTTAATTACTAAAATAGAATAAAATTGATGGATTAAAAATTAGTCCCTAGAAACAAAACACCGAGAACTGACAATAATCTAAGATACA$
>chr1:130644204-130644503
CAGTCTACAGGACTTGTCCGGTGCACCACCGGACAGCCAGGCGGGCCCACACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACCGGACTGTCCGGTGCACCATGCGACAGACAGCCTCCACCAAACAGCTAGTTTGGTGGAGGCTGTCTGTCGCATGGTGCACCGGACA$
>chr1:13560286-13560440
GCAGTTGAGGGGAATGTTGTCTGGTTGGAGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGCCCGCCTGCATCTATCCATGCATGATGGTGGAAGGTTTCGGACCAGGCTTCATTCCCCCCAACTCAACACTCAGCTGC
>chr5:160574908-160575134
GAGAGAGGGAGAGGGAGAAACTATGATATTTCATCAAGGCTAAAGGGAGAAGAGAGAGAGTACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTATTAGTGTCCAGTGATGATGAAGGCTGTACCCTCTCTCTTCTTCTCTTAGCTTCTTGATGCAA$
what do you mean by partial match? If the sequence from 2 files match, print the one from file 1 along with header?
yes exactly. So in the example I give e.g. seq1 is part of the first fasta sequence of file one. So I would like to have printed that fasta sequence from file1 in output
Can you elaborate on your requirement of partial matching; from your example, it seems like you want some kind of partial substring matching (a substring that partially matches the longer superstring will report true on a match). Likely, you won't find a one-liner for this.
In order to compare sequences of differing lengths with tolerance for mismatches, you'll have to perform some sort of alignment or string comparison. We can suggest good algorithms for this, but it would be helpful to know what level of tolerance in partial matching that you desire (fixed # of mismatches, some other distance metric, etc.).
I would say 1 or 2 mismatches in the string