7.5 years ago
r.tor
When I use the "Sequence Alignment application" of Matlab on large genes, it stops with an error about the array size: the requested array exceeds the maximum array size preference. For instance, it tries to create a "300020x300020" array, which would need 83.8GB of RAM to process!
Is there an alternative approach that can deal with this?
I would appreciate it if someone could guide me!
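For reference, the 83.8GB figure can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: the function name is made up, and it assumes one byte per matrix cell, which is what the reported number is consistent with (I do not know how Matlab actually stores the matrix).

```python
def dp_matrix_gib(len_a: int, len_b: int, bytes_per_cell: int = 1) -> float:
    """Memory in GiB needed to hold a full len_a x len_b alignment matrix.

    bytes_per_cell=1 is an assumption chosen to match the reported figure.
    """
    return len_a * len_b * bytes_per_cell / 2**30


# The matrix grows with the product of the two sequence lengths.
print(f"{dp_matrix_gib(300020, 300020):.1f} GiB")
```

Because the size is quadratic in sequence length, halving both sequences cuts the requirement to a quarter.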
And what are you trying to achieve? I'm not familiar with sequence alignment in matlab, there might very well be a better alternative.
Is there a reason you're using matlab? There are a number of more standard tools for this sort of thing.
Thanks for the response. I am going to use it as part of an application that converts between two different genome coordinate systems. The script is written in Matlab. The query and subject sequences are fetched from NCBI, with some modifications, by a Perl script, and the other parts, including the app's GUI, are done in Matlab. The app works properly, but the problem occurs during alignment when the extracted sequences are too long, because the matrix it creates is too big, as I mentioned above. I am looking for an alternative approach to deal with this.
To add to the other comments, depending on the task, alignment algorithms can be very memory-hungry and slow. Many tools have been developed and optimized for different cases. I doubt that generic Matlab implementations can compete with them. In general, Matlab is not used very much in the bioinformatics community.
If you have a Matlab-specific question, you should maybe ask Matlab's support; after all, you paid for the software and for the additional bioinformatics toolbox.
Thanks for the response. Is there any specific code or package you know of that can deal with this issue, one that is optimized for memory, or a specific algorithm you are familiar with? I would very much appreciate it if you could guide me.
You don't say which alignment algorithm you use. My guess is that Matlab uses a vanilla dynamic programming approach whereas bioinformatics tools often include heuristics for speed and reduced memory usage. There are plenty of tools, some with optimization for specific cases. Here are a few:
- NCBI's BLAST
- exonerate
- The EMBOSS suite
See this Wikipedia page for more.
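To illustrate the memory point, here is a minimal sketch of global (Needleman-Wunsch style) scoring that keeps only two rows of the dynamic programming matrix instead of the whole thing. The function name and the scoring values are made up for the example, and this is not what Matlab or any of the tools above actually run. It returns only the score; recovering the alignment itself in linear memory takes more work (Hirschberg's algorithm is the classic way).

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of a and b using O(len(b)) memory.

    The scoring values are illustrative defaults, not taken from any tool.
    """
    # Row 0: aligning the empty prefix of a against each prefix of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[len(b)]


print(nw_score("GATTACA", "GATACA"))
```

Run time is still proportional to the product of the two lengths, so for sequences of several hundred kilobases a dedicated, heuristic tool is likely to be the more practical choice.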
I use the "sequence alignment application" of the Matlab Biology package to perform a global alignment. The query is the whole sequence of a gene in hg19 coordinates and the subject is the sequence of the same gene in NG RefSeq coordinates. I want to match the two so that I can identify, point by point, the NG coordinate of each nucleotide in hg19 and vice versa. This is a tool to convert coordinates from the hg19 assembly to NG RefSeq. Actually, I've already tried EMBOSS, but it failed for some genes, for instance CACNA1A. Now I am wondering about splitting the sequences in Matlab, but I do not know how to do that. Could you please give me more guidance?
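The point-by-point mapping described here can be sketched as a walk over the two gapped alignment strings, whichever aligner produced them. This is only an illustration: the function and parameter names are made up, it assumes gaps are written as "-", and it assumes coordinates increase along both sequences (a gene on the minus strand of hg19 would need the query coordinate to count down instead).

```python
def coordinate_map(aligned_query: str, aligned_subject: str,
                   query_start: int = 1, subject_start: int = 1) -> dict:
    """Map query coordinates to subject coordinates for aligned columns.

    Both inputs are rows of one pairwise alignment, with '-' marking gaps.
    Positions that face a gap have no counterpart and are left out.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    q, s = query_start, subject_start
    for q_char, s_char in zip(aligned_query, aligned_subject):
        if q_char != "-" and s_char != "-":
            mapping[q] = s
        # Advance each coordinate only when its sequence has a residue here.
        if q_char != "-":
            q += 1
        if s_char != "-":
            s += 1
    return mapping


print(coordinate_map("ACG-T", "A-GCT"))  # {1: 1, 3: 2, 4: 4}
```

Passing the genomic start of the gene as `query_start` would give hg19 positions directly, and swapping the two arguments gives the reverse map.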
If tools from EMBOSS fail, you should probably investigate why: which one did you use, was there an error message, and if so, which one? This could give you some clue about what to expect in your data, because the problem may reappear with other tools. The other issue is that RefSeq may not contain genomic regions for all human genes, so this could explain why some genes fail to map.
For a task like this, my first option would be to check whether a database already has the information. Failing that, I would first identify which NG_ record contains my gene of interest and then, as a second step, do the detailed mapping. You could probably do both steps in one go using blastn and post-processing the results.
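As a rough idea of what the post-processing could look like, the sketch below parses blastn tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps the highest-scoring hit per query. The helper names and the sample rows are made up for illustration; real output would come from your own blastn run.

```python
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_hits(lines):
    """Turn tab-separated outfmt 6 lines into a list of dicts with typed values."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def best_hit_per_query(hits):
    """Keep, for each query id, the hit with the highest bit score."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows, only to show the expected shape of the input.
sample = [
    "gene_query\tNG_EXAMPLE_A\t99.5\t2000\t10\t0\t1\t2000\t5001\t7000\t0.0\t3600",
    "gene_query\tNG_EXAMPLE_B\t91.0\t500\t45\t2\t100\t599\t300\t799\t1e-150\t540",
]
top = best_hit_per_query(parse_hits(sample))["gene_query"]
print(top["sseqid"], top["qstart"], top["qend"], top["sstart"], top["send"])
```

The qstart/qend and sstart/send columns of each hit give matching intervals in the two coordinate systems, which is the information the detailed mapping step needs.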