3.2 years ago
Mgs3
I have a FASTA file organized as such:
>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAAGCAAAGAGTCCCTCTACGCACGTTCGTCCTATCTGCCGTATTAATTCTTATTACTGGTTGCTCGAAACCGGAAACCCAACCAGCCGCCGACGCCCCGGCGGAGAT
>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAATATCTCGAAATTCTTCATCGACCGGCCGATCTTCGCCGGCGTGCTTTCGATCCTGGTGTTGCTGGCGGGCATACTGGCCATGTTCCAGCTGCCCATTTCCGAGTACCCGGAAGTGGTGCCGCCGTCGGTGGTGGTGCGCGCGCAGTATCCGGGCGCCAACCCCAAGGTCATCGCCGAAACCGTGGCCTCGCCGCTGGAGGAG
I need to remove sequences that share the same ARO code (such as those above), keeping only one per code. Is there a simple solution to this problem using awk? Alternatively, I can use Python.
The awk-magician at it again :)
This solution is perfect; I would very much appreciate a simple explanation of it.
The first step prints three fields: the string to be compared, the sequence header, and the sequence itself. In its output, the first delimiter is a comma and the second is a tab. While reading the input, the field delimiter is the pipe (|), and only lines starting with > (FASTA headers) are matched.
The next step uses the comma as delimiter, sorts on the first field, and prints unique lines based on the first field.
The next step cuts from field 2 onwards, with the comma as the delimiter for cutting (this keeps the sequence header and the sequence).
The last step replaces each tab with a newline.
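Since the question mentions Python as an alternative, the steps described above can also be sketched there. This is a minimal sketch, not the pipeline from the answer: it assumes every header is followed by exactly one sequence line and that the ARO code is the third pipe-delimited field of the header. The function name `dedupe_by_aro` and the shortened example sequences are made up for illustration. Unlike a sort-based approach, it keeps the first record seen for each code and preserves input order.

```python
def dedupe_by_aro(lines):
    """Keep the first record seen for each ARO code.

    Assumes single-line sequences (header, sequence, header, sequence, ...)
    and that the ARO code is the third pipe-delimited header field.
    """
    seen = set()
    kept = []
    # Drop blank lines and trailing newlines before pairing lines up.
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for header, seq in zip(lines[0::2], lines[1::2]):
        aro = header.split("|")[2]  # e.g. "ARO:3004143"
        if aro not in seen:
            seen.add(aro)
            kept.extend([header, seq])
    return kept


# Headers taken from the question; sequences shortened for illustration.
example = [
    ">Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model",
    "ATGAAGCAAAGAG",
    ">Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model",
    "ATGAATATCTCGAAATTC",
]
result = dedupe_by_aro(example)
print("\n".join(result))
```

To run it on a file, pass an open file handle, for example `dedupe_by_aro(open("input.fasta"))`, where `input.fasta` is a placeholder name.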
It can be further tightened:
However, this can be tricky, as you have no control over which sequence is kept (e.g. short vs. long). In those cases, I would suggest datamash.
Let us say you would like to keep the longer sequence (among duplicate records); use this (assuming that each FASTA sequence is on a single line):
Let us say you would like to keep the shorter sequence (among duplicate records); use this:
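For readers who prefer Python over datamash, choosing by length can be sketched as follows. This is an illustrative sketch under the same assumptions as the rest of the thread (single-line sequences, ARO code in the third pipe-delimited header field); `pick_by_length`, its `keep` parameter, and the toy records are hypothetical names and data, not part of any tool mentioned here. Ties are resolved in favour of the record seen first.

```python
def pick_by_length(lines, keep="longest"):
    """Return one record per ARO code, chosen by sequence length.

    keep: "longest" or "shortest". Ties keep the record seen first.
    Assumes single-line sequences and the ARO code in header field 3.
    """
    if keep not in ("longest", "shortest"):
        raise ValueError("keep must be 'longest' or 'shortest'")
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    best = {}  # ARO code -> (header, sequence); dicts preserve insertion order
    for header, seq in zip(lines[0::2], lines[1::2]):
        aro = header.split("|")[2]
        if aro not in best:
            best[aro] = (header, seq)
            continue
        current = len(best[aro][1])
        better = len(seq) > current if keep == "longest" else len(seq) < current
        if better:
            best[aro] = (header, seq)
    # Flatten back into alternating header / sequence lines.
    return [item for record in best.values() for item in record]


# Toy records made up for illustration.
records = [
    ">ID:1|ARO_Name:a|ARO:1|Model:x", "ATGAAA",
    ">ID:2|ARO_Name:b|ARO:1|Model:x", "ATGAAACCC",
    ">ID:3|ARO_Name:c|ARO:2|Model:x", "ATG",
]
longest = pick_by_length(records, keep="longest")
shortest = pick_by_length(records, keep="shortest")
print("\n".join(longest))
```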
However, the above code (from my post) works only if each sequence is on a single line. To convert multi-line FASTA records to single-line records (flattened format), there are awk scripts, or you can use seqkit.
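If neither option is at hand, flattening is also easy to sketch in Python. The function below is a minimal illustration, and `flatten_fasta` is a hypothetical name: it joins all sequence lines that follow a header into one line and skips blank lines. It assumes the input starts with a header line.

```python
def flatten_fasta(lines):
    """Join wrapped sequence lines so each record is header + one sequence line."""
    flat = []
    chunks = []  # sequence pieces collected for the current record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if chunks:
                flat.append("".join(chunks))
                chunks = []
            flat.append(line)
        else:
            chunks.append(line)
    if chunks:
        flat.append("".join(chunks))  # flush the last record
    return flat


# Toy input made up for illustration.
wrapped = [">a", "ATG", "CCC", "", ">b", "GG"]
flattened = flatten_fasta(wrapped)
print("\n".join(flattened))
```

The flattened output can then be fed to any of the single-line approaches discussed above.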