Hello,
I'm trying to remove duplicates from a text file. Here is an extract:
>KJ636215.1_Tripyla_glomerans
ATGTCTAAGCACAGCCCTTGAATGGTAAAGCCGCGAATGGCTCATTACAACAGCCACAGTTTATTGGGTC TCCTTTTACTTGGATAACTGAGCTAATTGTTGAGCTAATACACGCACCAAAGCTTCGACCTCACGGAAGG AGCGCATTTATTAGAACAAAACCAATCGGACTTCGGTCCGTCCATTGGTGAATCTAAATAACTCGGCCGA TCGCATGGTCTCGCACCGGCGACGCACCTTTCAAATGTCTGCCTTATCACCTTTCGATGGTAGTTTATAC
>KJ636220.1_Chromadorina_bioculata
ATGTCTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTACAACAGCCATAGTTTATTGGATC TAATATCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACACGCACTCAAGCCCCGACTTCGGAAAG GGCGCATTTATTAGAACAAGACCAATTGGCTTCGGCCATCTATTGGTGAATCTGAATAACTACGCAGATC GCACAGGCTTGTCCTGGCGACATATCCTTCAAGTGTCTGCCTTATCAACTGTCGATGGTAGTTTATTGGA
>KJ636220.1_Chromadorina_bioculata
ATGTCTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTACAACAGCCATAGTTTATTGGATC TAATATCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACACGCACTCAAGCCCCGACTTCGGAAAG GGCGCATTTATTAGAACAAGACCAATTGGCTTCGGCCATCTATTGGTGAATCTGAATAACTACGCAGATC GCACAGGCTTGTCCTGGCGACATATCCTTCAAGTGTCTGCCTTATCAACTGTCGATGGTAGTTTATTGGA CTACCATGGTTGTAACGGGTAACGGAGAATTAGGGTTCGACTCCGGAGAGGGAGCCTGAGATACGGCTAC
>FJ040471.1_Chromadorina_sp
ATGTGTAAGAATAAACCGAATATGGTAAATCCGCGAATGGCTCATTATTCAGCCTCAATTTATTAGATCT AATCAGTTACTTGGATAACTGTTCAAAAGGAAGAGCTAAGACATGCCTCGAAAGTGTAGCGCAAGCTATA CTGCACTTCTTAGAAAAAACCGATTGGCTTCGGCCATCCATTGGTGAATCTTCTGAAATTCGCAGATCGC
To do this, I am writing a Python script. I am trying to build a regex that captures the accession number in a first group and the DNA sequence corresponding to that accession number in a second group. Unfortunately, I can't get this regex to work.
Here is the beginning of my code:
import re
from collections import defaultdict
with open ("Mixed-Sequences.txt","r") as f1:
    for lignes in f1:
        lignes=lignes.rstrip("\n")
        match=re.search("^(>..........)_\S+\n([ATCG]+\n)+",lignes)
        if match:
            print(match)
Can you help me, please?
Thanks
You had already used the code button to format your code, but it helps to format your input example with it as well. I have done this for you, but if the format does not match what you actually have (if the file is not plain FASTA), then please edit the original post again and change it as needed.
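One likely reason the regex in the question never matches: the loop hands `re.search` a single line with its trailing "\n" stripped, while the pattern requires a header line followed by a newline and sequence lines. A pattern that spans several lines has to be applied to the whole file contents. Below is a minimal sketch of that idea, not a definitive solution. It assumes that "duplicate" means a repeated accession number and that the first record seen for each accession is the one to keep; the names `RECORD` and `dedupe_fasta` are hypothetical. Any whitespace inside the sequence is dropped.

```python
import re

# One record = a header line (">accession_name") followed by every
# consecutive non-empty line that does not start with ">".
RECORD = re.compile(
    r"^>([^_\s]+)_(\S+)[^\n]*\n"        # group 1: accession, group 2: rest of the name
    r"((?:[^>\n][^\n]*(?:\n|\Z))*)",    # group 3: the sequence lines
    re.MULTILINE,
)

def dedupe_fasta(text):
    """Return {accession: (header, sequence)}.

    Assumption: when an accession occurs more than once,
    only the first record is kept.
    """
    records = {}
    for match in RECORD.finditer(text):
        accession, name, body = match.groups()
        sequence = "".join(body.split())  # remove newlines and spaces
        if accession not in records:
            records[accession] = (f"{accession}_{name}", sequence)
    return records

# Small made-up example (not real data), just to show the call.
example = ">A1.1_x_y\nATG C\nGGA\n>B2.1_z\nTTT\n>B2.1_z\nTTTAAA\n"
for header, sequence in dedupe_fasta(example).values():
    print(f">{header}\n{sequence}")
```

To try it on the real file, read everything at once, for example `dedupe_fasta(open("Mixed-Sequences.txt").read())`. If the longest sequence per accession should be kept instead of the first, the `if accession not in records` test is the only line that needs to change.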