Say we have two FASTA files that contain similar sequences but with different IDs.

$ cat f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTAGTAGTC
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTAGTAGTC
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCGAGTAGTC
>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTTGTAGTC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTC

$ cat f2.fasta
>C1_F4
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTTGTAGTC
>C2_F5
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTC
>C3
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTAGTAGTC
>C4
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCGGGTAGTC
>C5
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTGGTAGTC

The following one-liner prints one line per unique sequence; IDs whose sequences are identical appear together as a comma-separated list:

$ cat f1.fasta f2.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | awk 'NR>1{ if(NR%2==0){gsub(">","",$1);h=$1} else {a[$1]++;b[$1]=b[$1]","h}} END { for (n in a) {gsub("^,","",b[n]);print b[n]} }'
F3
F4_C1,C1_F4
C4
C3
F2
C5
F5_C2,C2_F5
F1

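Here, the first awk statement linearizes the FASTA input so that headers and sequences alternate on separate lines. The second statement uses each sequence as an array key and appends the corresponding ID to the comma-separated list stored under that key. The output order is arbitrary, because awk does not guarantee the traversal order of for (n in a).
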
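To see what the linearization step produces on its own, here is a minimal sketch on a hypothetical two-record file (demo.fasta and its contents are made up for illustration). The awk statement is the same as the first one in the pipeline above.

```shell
# Hypothetical input: seq1 is wrapped over two lines, seq2 fits on one.
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nTTGA\n' > "$tmpdir/demo.fasta"

# Headers are printed on their own line; sequence lines are joined without newlines.
linear=$(awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' "$tmpdir/demo.fasta")
printf '%s\n' "$linear"

rm -rf "$tmpdir"
```

The output begins with an empty line, because a newline is printed before every header, including the first. That is why the second awk statement skips line 1 with NR>1 and then finds headers on even-numbered lines and sequences on odd-numbered ones.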
To list only the duplicates, append one more awk statement that keeps the lines with more than one comma-separated field:

$ cat f1.fasta f2.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | awk 'NR>1{ if(NR%2==0){gsub(">","",$1);h=$1} else {a[$1]++;b[$1]=b[$1]","h}} END { for (n in a) {gsub("^,","",b[n]);print b[n]} }' | awk -F, 'NF>1'
F4_C1,C1_F4
F5_C2,C2_F5
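
If the three-stage pipeline is hard to follow, the same idea can be sketched as a single awk program. This is an illustrative rewrite rather than the one-liner above: the file names a.fasta and b.fasta, their contents, and the helper function store are hypothetical, and it assumes that IDs contain no commas.

```shell
# Hypothetical inputs: A1 and B1 hold the same sequence, wrapped differently.
tmpdir=$(mktemp -d)
printf '>A1\nACGT\nAC\n>A2\nTTTT\n' > "$tmpdir/a.fasta"
printf '>B1\nACG\nTAC\n>B2\nGGGG\n' > "$tmpdir/b.fasta"

dups=$(cat "$tmpdir/a.fasta" "$tmpdir/b.fasta" | awk '
    # Append the current ID to the list kept for the current sequence.
    function store() {
        if (id == "") return
        if (seq in ids) ids[seq] = ids[seq] "," id
        else            ids[seq] = id
    }
    # Header line: save the previous record, then start a new one.
    /^>/ { store(); id = $1; sub(/^>/, "", id); seq = ""; next }
    # Sequence line: join wrapped lines into one string.
    { seq = seq $1 }
    END {
        store()
        # Print only the ID lists that contain more than one ID.
        for (s in ids) if (index(ids[s], ",") > 0) print ids[s]
    }
')
printf '%s\n' "$dups"

rm -rf "$tmpdir"
```

Like the original, this only groups sequences that match exactly after the wrapped lines are joined, so sequences that differ in length or in letter case are reported as distinct.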
This seems like a very elegant way to identify duplicates. However, I can't be certain that the duplicates in my two files are identical in length.
I have updated my original post to address this issue!