Hi guys :D
I'm working with distance matrices produced by Clustal Omega for moderately large FASTA files, each combining sequences from two different plant species.
I was about to finish the script and code the final pipeline step, which retrieves the actual sequences corresponding to the IDs given in the distance matrices using the Biopython function SeqIO.index(),
when I realized that the original FASTA files have duplicate IDs for different sequences. The duplicates come from different SSR positions on the same sequence, for which I extracted the left and right flanking regions of each SSR. SeqIO.index() fails with:
Traceback (most recent call last):
File "C:\Users\Al-Hammad\Desktop\Test Sample\dictionary.py", line 9, in <module>
dictionary=SeqIO.index("Left(Brachypodium_Brachypodium).fasta","fasta",IUPAC.unambiguous_dna)
File "C:\Python34\lib\site-packages\Bio\SeqIO\__init__.py", line 856, in index
key_function, repr, "SeqRecord")
File "C:\Python34\lib\site-packages\Bio\File.py", line 275, in __init__
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'BRADI5G06067.1'
Tool completed with exit code 1
Here's a sample from one of my files:
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=24 Startpos_here=24 Length=26
ATGTATCTCCAACAACAACAACA
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=54 Startpos_here=54 Length=34
ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACG
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=102 Startpos_here=102 Length=26
ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACGACAACAACAACAACAACAACAACAACAACAACAAGAACGACGACGACG
My question is: what is the best, safest, and most efficient way to rename the duplicate IDs for different sequences? And do I have to recompute the distance matrices with the unique IDs after renaming, or can I simply map the duplicates onto their new unique IDs in the existing matrices?
I'm really confused about that, and a little worried about recomputing, since it takes nearly 4 days to produce the matrices.
I found this: http://stackoverflow.com/questions/7815553/iterate-through-fasta-entries-and-rename-duplicates/7836747#7836747 but it wasn't useful in my case. I'm working on Windows 7 64-bit with Python 3.4.
I also found this: "Is There A Way To Skip Existing Keys In Seq.Io.To_Dict? Or Is There A Better Way Altogether?", but I believe it covers the opposite of my case. I tried it anyway, and it ran on my files without ever finishing. It wasn't that clear to me, unfortunately :\
I desperately need this :( 😔
Any help would be appreciated, thanks in advance.
@RamRS ... Would you please explain more?!
I didn't quite get how replacing the spaces with double underscores would solve the problem!
When a header contains blank spaces, FASTA parsers split it into ID and description at the first blank space. When there are no spaces in the header, the entire header is taken as the ID.
All your headers are unique, but the ID segments parsed by the FASTA parser are not. Once the spaces are replaced with underscores, the whole header becomes the ID, so the IDs are unique and the duplicate ID problem goes away.
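To make the ID/description split concrete, here is a minimal sketch in plain Python. The helper `split_header` is hypothetical and only mimics the "split at the first blank" behaviour described above; it is not Biopython's actual parser. The two headers are shortened versions of the ones in the question.

```python
# Hypothetical helper: mimics how FASTA parsers typically split a header.
def split_header(header_line):
    """Return (record_id, description) for a '>'-prefixed FASTA header."""
    text = header_line.lstrip(">").rstrip("\n")
    parts = text.split(None, 1)  # split once, at the first run of whitespace
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return record_id, description


# Shortened versions of the headers from the question (assumption: the
# omitted fields are identical between records and do not matter here).
headers = [
    ">BRADI5G06067.1 cdna:novel Startpos_in_parent=24 Length=26",
    ">BRADI5G06067.1 cdna:novel Startpos_in_parent=54 Length=34",
]

ids = [split_header(h)[0] for h in headers]
joined = [h.lstrip(">").replace(" ", "__") for h in headers]

print(len(set(ids)))     # 1: both records share the same parsed ID
print(len(set(joined)))  # 2: the full headers are still distinct
```

So the collision comes only from the part before the first space; once the spaces are gone, the parser has nothing to split on and the whole (unique) header is used.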
@RamRS ... So you're saying that, with the approach you suggested, the header will be taken as a whole and will therefore be unique? If so, can I access the ID segments later with str.split("__") or simply seq.id? I need them written to a CSV file as a reference to their sequences. BTW: how can I do what you suggested in Python? Obviously, I can't do that manually.
Like Brian said, step through the file line by line: if a line starts with '>', replace the blank spaces with __ and write it to the output file; otherwise, write the input line unchanged.
And you can access the individual segments by
seq.id.split('__')
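A minimal sketch of that line-by-line rewrite, in plain Python so it should also run on Python 3.4 without Biopython. The function name `underscore_headers` and the output file name in the usage comment are made up for illustration.

```python
def underscore_headers(in_path, out_path, sep="__"):
    """Copy a FASTA file, replacing blanks in header lines with `sep`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Collapse each run of whitespace into one separator.
                line = sep.join(line.split()) + "\n"
            dst.write(line)  # sequence lines are copied unchanged


# Usage (hypothetical output name):
# underscore_headers("Left(Brachypodium_Brachypodium).fasta",
#                    "Left_unique_ids.fasta")
```

One caveat: splitting the ID back into fields with `split('__')` only works cleanly if none of the original fields already contains a double underscore.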
Also, if you're doing bioinformatics on Windows, you're doing it wrong. Dual boot Ubuntu or install it on a VM. It will mark your official entry into the world of bioinformatics :)
@RamRS ... I managed to solve this issue below based on your guidance and instructions. Seriously, I can't thank you enough :)
Regarding your golden advice about Ubuntu, bioinformatics best practices, and supportive platforms ... I really appreciate it and will take it into serious consideration in my next PhD research, God willing :D
You can use the "key_function" parameter of SeqIO.to_dict to substitute the white space with double underscores.
This looks like a good option, Siva. But this will still parse duplicate IDs and a description with spaces replaced by __s. I think you might need to join ID and desc with a __ as well.
Thanks for the input, Ram. What you are saying makes sense. However, it seems the whole FASTA header line except the > symbol is read as the description (without the need for concatenating the ID). I used the sample FASTA sequences from the OP and this is the output of
And this is the output of printing the value of the first key (EDIT: the first sequence in the example; I had to escape the double quotes)
That is so weird! This is the first time I'm seeing ID being recorded as part of the description!
@RamRS ... Why do you think that's happening?
I wish I knew. I'll probably take a look into the why when I get some free time. Thank you for this unique case, BTW!
@Siva ... I truly appreciate your effort in this solution, it seems very promising.
I will try it on my files some other time when I'm not in a rush for fast solutions.
Many thanks.