Remove duplicates in FASTA files based on a specific value with awk
3.2 years ago
Mgs3

I have a FASTA file organized as such:

>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAAGCAAAGAGTCCCTCTACGCACGTTCGTCCTATCTGCCGTATTAATTCTTATTACTGGTTGCTCGAAACCGGAAACCCAACCAGCCGCCGACGCCCCGGCGGAGAT
>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAATATCTCGAAATTCTTCATCGACCGGCCGATCTTCGCCGGCGTGCTTTCGATCCTGGTGTTGCTGGCGGGCATACTGGCCATGTTCCAGCTGCCCATTTCCGAGTACCCGGAAGTGGTGCCGCCGTCGGTGGTGGTGCGCGCGCAGTATCCGGGCGCCAACCCCAAGGTCATCGCCGAAACCGTGGCCTCGCCGCTGGAGGAG

I need to remove sequences that share the same ARO code (such as the two above), keeping only one. Is there a simple solution to this problem using awk? Alternatively, I can use Python.

sort fasta sed awk
3.2 years ago
awk -F '|' '/^>/ {printf("%s%s,%s\t",(N>0?"\n":""),$3,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | sort -t, -k1,1 -u | cut -d, -f2- | tr "\t" "\n"

The awk-magician at it again :)


This solution is perfect; I would very much appreciate a simple explanation of it.

1. awk -F '|' '/^>/ {printf("%s%s,%s\t",(N>0?"\n":""),$3,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa :

Prints three fields per record: the key to be compared (the ARO field), the sequence header, and the sequence itself. In the output, the first delimiter is a comma and the second is a tab. While consuming the input, the field delimiter is the pipe (|), and header lines are recognized by a leading > (the FASTA header). A sketch of the intermediate output appears after this list.

2. sort -t, -k1,1 -u :

Uses the comma as delimiter, sorts on the first field, and keeps only one line per distinct value of that field.

3. cut -d, -f2- :

Keeps field 2 onwards, with the comma as the cutting delimiter (this retains the sequence header and the sequence).

4. tr "\t" "\n" :

Replaces the tab with a newline, restoring the two-line FASTA record.
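
To make the stages concrete, here is the intermediate output of step 1 for the two example records (sequences truncated, <TAB> standing in for the tab character):

ARO:3004143,>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model<TAB>ATGAAGCAAAGA...
ARO:3004143,>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model<TAB>ATGAATATCTCG...

Both lines share the key ARO:3004143, so sort keeps only one of them; cut then drops the key, and tr turns the remaining tab back into a newline, yielding a normal two-line FASTA record.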

It can be further tightened (emitting all three fields tab-separated, so that the final awk can split on tabs):

$ awk -F '|' '/^>/ {printf("%s%s\t%s\t",(N>0?"\n":""),$3,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' test.fa | sort -k1,1 -u | awk -F'\t' -v OFS="\n" '{print $2,$3}'

However, this can be tricky, as you would not have control over which of the duplicate sequences is kept (e.g. the short vs. the long one). In those cases, I would suggest datamash.

Say you would like to keep the longer sequence among the duplicate records. Because datamash --full (-f) prints the first line of each group, the input is pre-sorted by ARO and then by descending length, so that the longest record is the first of its group (this assumes the FASTA sequences are on single lines):

$ awk -F '|' -v OFS="\t" '/^>/ {getline seq} {print $3,$0,seq,length(seq)}' test.fa | sort -t $'\t' -k1,1 -k4,4nr | datamash -f -g1 max 4 | awk -F "\t" -v OFS="\n" '{print $2,$3}'

To keep the shorter sequence instead, sort the length field in ascending order and take the minimum:

$ awk -F '|' -v OFS="\t" '/^>/ {getline seq} {print $3,$0,seq,length(seq)}' test.fa | sort -t $'\t' -k1,1 -k4,4n | datamash -f -g1 min 4 | awk -F "\t" -v OFS="\n" '{print $2,$3}'

However, the code above (from my post) works only if each sequence is on a single line. For converting multi-line FASTA records to single-line (flattened) records, there are awk one-liners, or you can use seqkit; see the sketch below.
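
A minimal flattening sketch, in awk or with seqkit seq -w 0 (where -w 0 disables line wrapping); multi.fa and single.fa are placeholder file names:

$ awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq=seq $0} END {if (seq) print seq}' multi.fa > single.fa
$ seqkit seq -w 0 multi.fa > single.fa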

3.2 years ago

with seqkit: either regular expression below captures the numeric ARO accession from the header, which seqkit rmdup then uses as the record ID for de-duplication:

$ seqkit rmdup --id-regexp "ARO:([0-9]+)" test.fa
$ seqkit rmdup --id-regexp "ARO:(\d+)" test.fa
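
As a quick sanity check, compare the number of records before and after de-duplication by counting header lines:

$ grep -c "^>" test.fa
$ seqkit rmdup --id-regexp "ARO:([0-9]+)" test.fa | grep -c "^>"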