compare fasta files by sequence
1
0
Entering edit mode
8.4 years ago
Chris ▴ 30

Hi, I have two fasta files with differetn headers. Sequences in file 2 partially match sequences in file1

I want to have a one-liner that will compare and find the partial match between the two files and report only sequences in file1.

Example of the two files and desired output.

thank you so much for your help in adavance

file1.fa

>chr5:111261537-111261740
AGAGTGCTAACATGCTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTCCGTGAAACCGTAGAGATTCTTTAGAAAATAGTCTAATAGAGGAGGTGGGATTGTACTTAGCCGTAGATGGTGGATCCGATGGTCAGGATTTGATGTTAGCA$
>chr5:121061210-121061338
GCTAGTTTCGGAGCCACAAAACCGGAGGGGATTAGAGGGGCTAAAATCCTCTCCTTATTTAAAATATTGAATAAGGAGGGGATTTTAGCCCCTCCAATCCCCTCTGGTATTTGGACGCCCAAACTAGC
>chr5:132306562-132306879
TGCACTTTCGGCCCTTAAACTTGTCTGGCTGTGTCAACCCGGTCCATAAACTTTTAAAGTATGTCTTTTAAGAACCTAATCTTTGTTCGGGTCTCTATTACGTCCTAACGGTTTGCTGTGGCGCACATGTCAATTGTCATGGTCTATTGGACCGCGTCCACCCTTCTCCCTCCTTCCTCGACGACTAACA$
>chr1:13506448-13506642
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTCCATTTTAGTTCTTAAATTACTAATTTCTGTATTTGACAGTTTAATTACTAAAATAGAATAAAATTGATGGATTAAAAATTAGTCCCTAGAAACAAAACACCGAGAACTGACAATAATCTAAGATACA$
>chr1:130644204-130644503
CAGTCTACAGGACTTGTCCGGTGCACCACCGGACAGCCAGGCGGGCCCACACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACCGGACTGTCCGGTGCACCATGCGACAGACAGCCTCCACCAAACAGCTAGTTTGGTGGAGGCTGTCTGTCGCATGGTGCACCGGACA$
>chr1:13560286-13560440
GCAGTTGAGGGGAATGTTGTCTGGTTGGAGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGCCCGCCTGCATCTATCCATGCATGATGGTGGAAGGTTTCGGACCAGGCTTCATTCCCCCCAACTCAACACTCAGCTGC
>chr5:160574908-160575134
GAGAGAGGGAGAGGGAGAAACTATGATATTTCATCAAGGCTAAAGGGAGAAGAGAGAGAGTACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTATTAGTGTCCAGTGATGATGAAGGCTGTACCCTCTCTCTTCTTCTCTTAGCTTCTTGATGCAA$
>chr5:164651674-164651746
CTCCATTTCATTTTAGTTATCGCTGGATAGGGTAAAATTAAACTATCGAACGATAACTAAAAAGAAACGGAG
>chr5:182040454-182040600
AGTGAGAGCTGGATGCAGAGGTTTATCGACCGATTCCCGCGTTGTTTGCGAGCTGCTGCAACAATAAAACGCTTGAACCACCACTCCCGGCCGGACCAGCACAAGGCAAGGAAGAGATCGATAAACCACTGCGCCCAGACACCACC

file2.fa

>seq1
CTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTC
>seq2
CTCCAATCCCCTCTGGTATTTGGACGCCCAAAC
>seq3
TGCACTTTCGGCCCTTAAACTTGTCTGGC
>seq4
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTC
>seq5
CACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACC
>seq6
AGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGC
>seq7
ACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTA
>seq9
AGCTGATAGCTA

output file

>chr5:111261537-111261740
AGAGTGCTAACATGCTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTCCGTGAAACCGTAGAGATTCTTTAGAAAATAGTCTAATAGAGGAGGTGGGATTGTACTTAGCCGTAGATGGTGGATCCGATGGTCAGGATTTGATGTTAGCA$
>chr5:121061210-121061338
GCTAGTTTCGGAGCCACAAAACCGGAGGGGATTAGAGGGGCTAAAATCCTCTCCTTATTTAAAATATTGAATAAGGAGGGGATTTTAGCCCCTCCAATCCCCTCTGGTATTTGGACGCCCAAACTAGC
>chr5:132306562-132306879
TGCACTTTCGGCCCTTAAACTTGTCTGGCTGTGTCAACCCGGTCCATAAACTTTTAAAGTATGTCTTTTAAGAACCTAATCTTTGTTCGGGTCTCTATTACGTCCTAACGGTTTGCTGTGGCGCACATGTCAATTGTCATGGTCTATTGGACCGCGTCCACCCTTCTCCCTCCTTCCTCGACGACTAACA$
>chr1:13506448-13506642
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTCCATTTTAGTTCTTAAATTACTAATTTCTGTATTTGACAGTTTAATTACTAAAATAGAATAAAATTGATGGATTAAAAATTAGTCCCTAGAAACAAAACACCGAGAACTGACAATAATCTAAGATACA$
>chr1:130644204-130644503
CAGTCTACAGGACTTGTCCGGTGCACCACCGGACAGCCAGGCGGGCCCACACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACCGGACTGTCCGGTGCACCATGCGACAGACAGCCTCCACCAAACAGCTAGTTTGGTGGAGGCTGTCTGTCGCATGGTGCACCGGACA$
>chr1:13560286-13560440
GCAGTTGAGGGGAATGTTGTCTGGTTGGAGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGCCCGCCTGCATCTATCCATGCATGATGGTGGAAGGTTTCGGACCAGGCTTCATTCCCCCCAACTCAACACTCAGCTGC
>chr5:160574908-160575134
GAGAGAGGGAGAGGGAGAAACTATGATATTTCATCAAGGCTAAAGGGAGAAGAGAGAGAGTACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTATTAGTGTCCAGTGATGATGAAGGCTGTACCCTCTCTCTTCTTCTCTTAGCTTCTTGATGCAA$
sequence • 6.4k views
ADD COMMENT
0
Entering edit mode

what do you mean by partial match? If the sequence from 2 files match, print the one from file 1 along with header?

ADD REPLY
0
Entering edit mode

yes exactly. So in the example I give e.g. seq1 is part of the first fasta sequence of file one. So I would like to have printed that fasta sequence from file1 in output

ADD REPLY
0
Entering edit mode

Can you elaborate on your requirement of partial matching; from your example, it seems like you want some kind of partial substring matching (a substring that partially matches the longer superstring will report true on a match). Likely, you won't find a one-liner for this.

In order to compare sequences of differing lengths with tolerance for mismatches, you'll have to perform some sort of alignment or string comparison. We can suggest good algorithms for this, but it would be helpful to know what level of tolerance in partial matching that you desire (fixed # of mismatches, some other distance metric, etc.).

ADD REPLY
0
Entering edit mode

I would say 1 or 2 mismatches in the string

ADD REPLY
2
Entering edit mode
8.4 years ago
gangireddy ▴ 160

I would suggest blasting file2 against file1 and extract sequences sequences with hits from file1. But, it won't be simple one liner.

 makeblastdb -in file1.fasta -parse_seqids -dbtype nucl && blastn -query file2.fasta -db file1.fasta -outfmt 6 |awk '{print $2}' | sort | uniq | xargs samtools faidx file1.fasta > output.fasta
ADD COMMENT
0
Entering edit mode

I get the following error with this: xargs: samtools: No such file or directory. How can I overcome this? Thanks.

ADD REPLY
0
Entering edit mode

samtools is not installed. try downloading it from here and install.

ADD REPLY

Login before adding your answer.

Traffic: 1701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6