Dear All!
It's a trivial question, but I have not found anything similar on biostars.org.
Probably my search was not good enough, sorry!
I have several hundred bacterial genomes. Let's look at one of them.
All its chromosomal protein sequences are stored in a FASTA file, NC_\d+.faa.
All its "pre-translation" sequences (CDS) are stored in a *.ffn FASTA file, with the corresponding sequences in the same order and with the same file name, NC_\d+.ffn. (I am working with NCBI files.)
Only the file extensions differ. I need a tool to get to the nucleotide sequence quickly, given a protein sequence. E-utilities didn't help me. The best idea I have so far is to count the entries in both FASTA files and build a hash (in Perl) with the protein sequence as the key and the corresponding nucleotide, "pre-translation" sequence as the value. But I will lose the file name in this case.
Another way is to take the file name without its extension as the hash key and the pair of sequences with the same entry number as the hash value. But then I will have to put each pair of sequences into an array and make a reference to that array the hash value. That is too complicated in my opinion. Could you give me a hint on how to make the procedure shorter and easier? Python advice is also fine, but Biopython does not work with my genomes; I have already tried it. Many thanks!
Sincerely yours,
Natasha
It is not clear whether you want to retrieve the nucleotide sequence using a protein sequence query or a name-based query. However, to answer your question directly, you can have a hash of arrays, or a dictionary of lists, in both Perl and Python. Using your second approach, for example, you can use the file name as the key, and the value will be an array where the first element is the protein sequence and the second element is the nucleotide sequence.
If this approach works for you, try it. I can help you with the code if there is any syntax-specific problem.
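A minimal Python sketch of this file-name-keyed layout, using only the standard library since the question says Biopython did not work. It assumes every `.faa` file sits next to an `.ffn` file with the same stem and the same record order. Because each file holds many records, the value here is a list of [protein, nucleotide] pairs rather than a single pair. The names `read_fasta` and `load_genome_pairs` are made up for illustration.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples in file order."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: store the previous one, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def load_genome_pairs(directory):
    """Map genome name (file name without extension) to [protein, cds] pairs."""
    genomes = {}
    for faa in sorted(Path(directory).glob("*.faa")):
        ffn = faa.with_suffix(".ffn")
        if not ffn.exists():
            continue  # no matching CDS file for this genome
        proteins = read_fasta(faa)
        cds = read_fasta(ffn)
        if len(proteins) != len(cds):
            raise ValueError(
                f"{faa.stem}: {len(proteins)} proteins but {len(cds)} CDS records"
            )
        # Pair records purely by position (assumption taken from the question).
        genomes[faa.stem] = [
            [p_seq, n_seq] for (_, p_seq), (_, n_seq) in zip(proteins, cds)
        ]
    return genomes
```

With this layout, `genomes[name][i][0]` is the i-th protein of a genome and `genomes[name][i][1]` is its nucleotide sequence, so the file name is never lost.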
Edit:
Alternatively, you can have two separate hashes, both keyed by the file name: one holding the protein sequence and the other holding the nucleotide sequence.
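The two-hash variant could look roughly like this in Python. It is only a sketch on toy data: the genome name is made up, and since one file contains many records, the shared key is assumed here to be a (file name, position) tuple rather than the bare file name. `build_parallel_tables` is a hypothetical helper name.

```python
def build_parallel_tables(genome, protein_seqs, cds_seqs):
    """Return two dicts that share the same (genome, position) keys."""
    if len(protein_seqs) != len(cds_seqs):
        raise ValueError("protein and CDS lists must have the same length")
    protein_of = {}
    cds_of = {}
    for i, (prot, nuc) in enumerate(zip(protein_seqs, cds_seqs)):
        key = (genome, i)
        protein_of[key] = prot
        cds_of[key] = nuc
    return protein_of, cds_of


protein_of, cds_of = build_parallel_tables(
    "NC_000001",              # made-up genome name
    ["MKV", "MA"],            # toy protein sequences
    ["ATGAAAGTT", "ATGGCT"],  # toy CDS records in the same order
)
key = ("NC_000001", 1)
print(protein_of[key], cds_of[key])  # MA ATGGCT
```

The trade-off is that the two tables must always be filled together; a single table of pairs cannot drift out of sync in the same way.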
Edit 2:
I was planning to add a reply to your comment, but don't have enough reputation to post. So here is what I had to say:
By 2-index arrays, do you mean a matrix? Yes, matrices (i.e. arrays of arrays) are also allowed in Perl. One of the key philosophies of Perl is not to impose unnecessary restrictions on the size and shape of data structures. However, in this case you don't need an array of arrays. What you need is a hash with 2-element arrays as values. I have posted a simple example based on BioPerl as an answer. Please see if that helps.
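For comparison, a plain-Python sketch of a hash whose values are 2-element arrays. This variation turns the lookup around and is an assumption about what is wanted: the key is the protein sequence and each value entry is a [file name, nucleotide sequence] pair, so a protein query returns the CDS without losing the file name. The genome names and sequences are toy data, and `index_by_protein` is a hypothetical name.

```python
from collections import defaultdict


def index_by_protein(genomes):
    """Turn {name: [[protein, cds], ...]} into {protein: [[name, cds], ...]}."""
    index = defaultdict(list)
    for name, pairs in genomes.items():
        for protein, cds in pairs:
            # The same protein may occur in several genomes, so keep a list.
            index[protein].append([name, cds])
    return dict(index)


# Toy data with made-up genome names; "MKV" occurs in both genomes.
genomes = {
    "NC_000001": [["MKV", "ATGAAAGTT"], ["MA", "ATGGCT"]],
    "NC_000002": [["MKV", "ATGAAGGTA"]],
}
index = index_by_protein(genomes)
for name, cds in index["MKV"]:
    print(name, cds)
```

Holding several hundred genomes this way keeps every sequence in memory, which is usually acceptable for bacterial genomes but worth checking on your machine.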
Cheers
Dear Tej,
Exactly, I was not clear here. I want to retrieve the nucleotide sequence using a protein sequence query, providing the corresponding key. The key is in the header of my nucleotide FASTA file as well as my protein file; only the extension is different. The key is actually the file name without the extension, you were right. I don't want to make a database of nucleotide sequences and search inside it. I would like to have both sequences (protein and nucleotide) together somehow, and I don't know a fast way to get the nucleotide sequence when I choose a protein with a key in its header just after the > sign. I would use "the file name as the key, and the value will be an array where the first element will be the protein sequence, and the second element will be the nucleotide sequence"; it's a very nice suggestion. Does Perl tolerate 2-index arrays? I have some doubts. But Python does. The code hints will be greatly appreciated! Many thanks! Natasha