Rename Duplicate ID's for Different Fasta Sequences on Windows 7
3
1
Entering edit mode
9.9 years ago
Bara'a ▴ 270

Hi guys :D

I'm working with distance matrices produced by clustal omega for moderately large fasta files combining sequences of two different plant species in each .

When I was about to finish the script and code the final pipeline step ; which is retrieving the actual sequences corresponding to ID's given in the distance matrices using the biopython function SeqIO.index() ... I realized that the original fasta files have duplicate ID's for different sequences resulting from different positions of SSR's on the same sequence , in which I extracted the left and right flanking regions for each SSR .

Traceback (most recent call last):
  File "C:\Users\Al-Hammad\Desktop\Test Sample\dictionary.py", line 9, in <module>
    dictionary=SeqIO.index("Left(Brachypodium_Brachypodium).fasta","fasta",IUPAC.unambiguous_dna)
  File "C:\Python34\lib\site-packages\Bio\SeqIO\__init__.py", line 856, in index
    key_function, repr, "SeqRecord")
  File "C:\Python34\lib\site-packages\Bio\File.py", line 275, in __init__
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'BRADI5G06067.1'

Tool completed with exit code 1

Here's a sample of one of my files :

>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=24 Startpos_here=24 Length=26 
ATGTATCTCCAACAACAACAACA 
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=54 Startpos_here=54 Length=34 
ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACG 
>BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=102 Startpos_here=102 Length=26 ATGTATCTCCAACAACAACAACAACGACGACGACGACGACGACGACGACAACGACAACAACAACAACAACAACAACAACAACAACAAGAACGACGACGACG

My question is: What is the best, safest and most efficient way to rename the duplicate ID's for different sequences ?! and do I have to recompute the distance matrices again with the unique ID's after renaming or can I simply map the duplicates with their corresponding new unique values on the surface ?!

I'm really confused about that , and a little worried about the recomputing if considered since it's time consuming and takes nearly 4 days to produce the matrices .

I found this: http://stackoverflow.com/questions/7815553/iterate-through-fasta-entries-and-rename-duplicates/7836747#7836747 but it wasn't useful in my case, I'm working on a windows 7 64bit platform and python 3.4

Also I found this: Is There A Way To Skip Existing Keys In Seq.Io.To_Dict? Or Is There A Better Way Altogether? but I believe it was the opposite of my case , I tried it though and ran on my files infinitely !! It wasn't that clear to me , for my bad luck :\

I desperately need this :( 😔

Any help would be appreciated, thanks in advance.

windows7 biopython SeqIO.index duplicate • 7.4k views
ADD COMMENT
4
Entering edit mode
9.9 years ago
Ram 44k

For the moment, I'd suggest replacing blank spaces in the header with double underscore characters (just a placeholder, you can use anything but underscores are among the most docile of characters)

You can install GnuWin32 tools for windows. Then, use this command:

sed -re "/^>/s/ /__/g" input.fa | sed -e "s/\_\_$//" >trInput.fa
ADD COMMENT
0
Entering edit mode

@RamRS ... Would you please explain more?!

I didn't get it clearly of how replacing with double underscore would solve the problem!!

ADD REPLY
1
Entering edit mode

With black spaces, FASTA parsers split the header into ID and description at the first blank space. When no spaces are encountered in the header, the entire header is taken as the ID.

All your headers are unique, the ID segments parsed by the FASTA parser are not. Replaced with underscores, the IDs become unique and get rid of the duplicate ID problem.

ADD REPLY
0
Entering edit mode

@RamRS... So, you're saying that following the approach you provided the header will be taken as whole to be unique, if so then can I access the ID segments later by str.split("__") or simply seq.id?! I need them to be written to a csv file as a reference to their sequences.

BTW: how can I do what you suggested in python?! obviously, I can't do that manually.

ADD REPLY
2
Entering edit mode

Like Brian said, step line by line in the file, if the line starts with a '>', replace blank spaces with __ and write to output file. Else, write input line as such.

And you can access the individual segments by seq.id.split('__')

Also, if you're doing bioinformatics on Windows, you're doing it wrong. Dual boot Ubuntu or install it on a VM. It will mark your official entry into the world of bioinformatics :)

ADD REPLY
0
Entering edit mode

@RamRS ... I managed to solve this issue below based on your guidance and instructions , seriously I can't thank you enough :)

Regarding your golden advice about Ubuntu and bioinformatics best practices and supportive platforms ... I really appreciate it and will take it into serious consideration in my next Phd. research with GOD willing :D

ADD REPLY
2
Entering edit mode

You can use the "key_function" parameter of SeqIO.to_dict to substitute the white space with double underscores.

from Bio import SeqIO
handle = open("yourFile.fa", "rU")
record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"), key_function = lambda rec : rec.description.replace(" ", "__"))
handle.close()
print(sorted(record_dict))
ADD REPLY
1
Entering edit mode

This looks like a good option, Siva. But this will still parse duplicate IDs and a description with spaces replaced by __s. I think you might need to join ID and desc with a __ as well.

record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"), key_function = lambda rec : rec.id + "__" + rec.description.replace(" ", "__")
ADD REPLY
1
Entering edit mode

Thanks for the input, Ram. What you are saying makes sense. However, it seems the whole FASTA header line except the > symbol is read as description (without the need for concatenating the ID). I used the sample FASTA sequences from the OP and this is the output of

print(sorted(record_dict))

['BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:""__Startpos_in_parent=102__Startpos_here=102__Length=26', 'BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:""__Startpos_in_parent=24__Startpos_here=24__Length=26', 'BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:""__Startpos_in_parent=54__Startpos_here=54__Length=34']

And this is the output of printing the value of first key (EDIT: first sequence in the example. I had to escape the double quotes)

print record_dict["BRADI5G06067.1__cdna:novel__chromosome:v1.0:5:7642747:7642899:-1__gene:BRADI5G06067__transcript:BRADI5G06067.1__description:\"\"__Startpos_in_parent=24__Startpos_here
=24__Length=26"]

ID: BRADI5G06067.1
Name: BRADI5G06067.1
Description: BRADI5G06067.1 cdna:novel chromosome:v1.0:5:7642747:7642899:-1 gene:BRADI5G06067 transcript:BRADI5G06067.1 description:"" Startpos_in_parent=24 Startpos_here=24 Length=26
Number of features: 0
Seq('ATGTATCTCCAACAACAACAACA', SingleLetterAlphabet())
ADD REPLY
1
Entering edit mode

That is so weird! This is the first time I'm seeing ID being recorded as part of the description!

ADD REPLY
0
Entering edit mode

@RamRS ... why do you think that's happening in your opinion ?!

ADD REPLY
1
Entering edit mode

I wish I knew. I'll probably take a look into the why when I get some free time. Thank you for this unique case, BTW!

ADD REPLY
0
Entering edit mode

@Siva ... I truly appreciate your effort in this solution, it seems very promising.

I will try it on my files some other time when I'm not in a rush for fast solutions.

Many thanks.

ADD REPLY
3
Entering edit mode
9.9 years ago

You can do this with the BBMap package's reformat tool. If a file already has unique names it won't do anything. Otherwise, it will append _2 to the second instance of a name, and so forth.

In Linux:

reformat.sh in=sequences.fasta out=renamed.fasta uniquenames

In Windows you would replace "reformat.sh" like this:

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.ReformatReads in=sequences.fasta out=renamed.fasta uniquenames

...where that would be the correct command if you extracted BBMap to the C: root.

Edit - looking back, I'm not sure if this is relevant or not. My comment is for situations where you actually have identical names. If the problem is just that a parser is truncating your names due to the presence of whitespace, then RamRS's suggestion of replacing whitespace with underscores would be the way to go.

ADD COMMENT
0
Entering edit mode

@BrianBushnell ... I don't think it's a matter of truncating name's whitespaces, it's about identical names (identifiers) for different sequences.

I need them to be renamed, so I can refer to them as dictionaries ... that's all.

Is there anyway rather than installing external tools ... python related solutions, maybe?! like modules with renaming facilities .

ADD REPLY
1
Entering edit mode

If you don't want to install addition tools, then you could just write a program step through the file line-by-line, and add "_(line number)" to lines that start with ">".

ADD REPLY
0
Entering edit mode

@Brian Bushnell ... that's exactly how I solved this issue, thank you very much :)

ADD REPLY
3
Entering edit mode
9.9 years ago
Bara'a ▴ 270

So , After looking at the options I have so far from the answers above ... I picked this approach to solve the issue for the reasons listed below and it works very fine for me :D

from Bio import SeqIO

from Bio.Alphabet import IUPAC

files=["Left(Aestivum_Japonica).fasta" , "Right(Aestivum_Japonica).fasta"]

for i in range(len(files)):
    output_handle = open("$"+files[I],"w")
    for idx,record in enumerate(SeqIO.parse(files[I],"fasta",IUPAC.unambiguous_dna)):
        record.id=record.id+"_"+str(idx)
        SeqIO.write(record,output_handle,"fasta")
    output_handle.close()

dictionary=SeqIO.index("$Left(Brachypodium_Brachypodium).fasta","fasta",IUPAC.unambiguous_dna)

Reasons why I picked this solution:

  1. It's more robust, integrated and consistent with the pythonic complicated pipeline I'm working on.
  2. It's more safe - to me - to rename the duplicates than taking the risk of assigning the whole header as a unique identifier like suggested by RamRS (I hesitated 'cause I also found it a little confusing to see the ID as part of the description)!!
  3. Renaming all sequences based on sequence No. in the file is much faster and more efficient - to me - than checking every single id in the file for duplicates and renaming with unique ones.

I'm pretty sure that the other provided solutions will work just as expected, I haven't tried them though.

Again , thank you guys so much ... I'm so grateful for your help.

ADD COMMENT

Login before adding your answer.

Traffic: 3526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6