Following up on this question: http://biostar.stackexchange.com/questions/8010/searching-200-400bp-matches-against-mammalian-genome-human-mouse-distance
I would like to know what is the most sensitive method to search sequences ranging from 100bp to 1000bp against a whole genome assembly of another species, for genetic distances similar to human-mouse.
As an example, if I have human sequences ranging from 100-1000bps, what is the most sensitive tool and parameter options to find hits in the mouse genome?
I've been recommended to look at blast and blast+, but it would be great to hear about specific parameter settings or different options.
First, I must say that I don't quite understand the difference between this question and the previous one (for which you provided some decent blast parameters yourself). You are asking for 'the most sensitive' method, not for the fastest one. Thus, I would probably discount blat, ssaha and their ilk, you might consider discontinuous megablast, but I would probably stick to blast+ for a general-purpuse comparison. I am sure that at human-mouse distance, you will find >95% or all relevant similarities by discontiguous megablast or by a simple blastn with default parameters.
If you want to get the other 5%, the strategy depends on why they were missed in the first place.
If the input sequence was rather
short (<200bp), it might be advisable
to lower the word size from 11 to 9
or even 7.
If there are repeats involved (ERVs,
Alus, transposons, whatever), it is
possible that dust and/or repmasker
will have killed everthing alignable.
These cases are inherently difficult.
It might be necessary to switch off
the repeat masking, but this could
mean sifting through thousands of
'signficant' alignments
If the sequences are just too
divergent, this could be another
tough problem. This can happen even
at human-mouse distance. If there is
the chance that your query sequence
has some coding bits, it might be
useful to also try tblastx.
In your situation I would probably set up a pipeline, first looking for matches by a fast and not-so-sensititve method. If nothing is found, I would probably try different things in parallel: using a shorter or longer word size, and also try using a different filter database (e.g. one that contains only repetitive elements found in both species). Repeats that are only present in one of the species will not generate spurious matches but might mask valuable sequence material in the query or target.
ADD COMMENT
• link
updated 5.2 years ago by
Ram
44k
•
written 13.5 years ago by
Lyco
★
2.3k
0
Entering edit mode
+1 for word size and repeat masker. I think Aviella should take subset from his dataset and fine-tune the parameters. In my experience, word size modifications may required if sequence length is <100bp.
Albert, you did not mention what speed you intend to achieve. If you only have a couple of sequences, you can just use Smith-Waterman. BWT-SW is also a good choice for mammalian genomes. It has a similar speed to blast for short sequences while gives identical alignment to Smith-Waterman. It is frequently overlooked. BWA-SW can be a reasonable choice when speed is critical. You need to increase "-z" (e.g. 10 or even 100) to achieve high sensitivity. SSAHA2 is also a decent choice. The default setting would not work. You need to ask Zemin about the options.
For mammalian genomes, blast is too slow. Algorithms indexing the genome are usually much faster. I do not know if blast+ gives significant boost. Probably not.
EDIT: People should read the BWT-SW paper [PMID:18227115]. It performs the exact SW algorithm while is faster than blast in the 100-1000bp range. The BWA-SW paper [PMID:20080505] shows that SSAHA2 performs well for 10% divergence. I do not see why it cannot be tuned to work for sequences with the human-mouse divergence. BWA-SW is using the default settings. It may work well with a more sensitive configuration. I am not sure about this bit, though, as I have not tried. Also, modern SSE2/CUDA powered SW is approaching the speed of blast (not yet). PatternHunter is known to work better than blast for cross-species alignment. A few groups (I cannot remember which) have improved blast, though these are less known. A lot happened in the past 10 years in the area of sequence analysis, but blast has not been changed much so far as I know. These efforts are not in vain.
I guess Albert is mapping transcriptome contigs. I think he needs a fast aligner.
blast+ is not much faster than blast. On a decent computer (with a couple of GB RAM) blast is ok for running a sequence fragments against a whole mammalian genome (I normally use overlapping 100-500kb fragments rather than chromosome-size pieces). I wouldn't use it routinely for genome-genome comparison. However, if you want to compare entire genomes, you should definitively consider conserved synteny. I assume that the question was about searching with unconnected pieces.
@lh3: "For mammalian genomes, blast is too slow." I am wondering how or why ? Do you have a reference for this ? Also, you may already know that you can index a blast database in fasta format. BLAST is an ideal option for cross-genome searches. Aviella's sequences are +100bp, so most of the searches on default option should work.
These are not transcriptome contigs but they do come from a next-gen sequencing pileup/assembly step. I end up with a range of 100-1000bp and about 1e5-1e6 of them, but so far blast+ and blastall take too long for what I would like to do at a bigger scale. For example, blastall takes 2758.1034 CPU hours for one example dataset, ncbi-blast+ blastn is around 2000 CPU hours, whereas "bwasw -z 100" has just finished in 126.0136 CPU hours on the same dataset. Still looking at the results...
@avilella: let me know if bwa-sw is sensitive enough. I have not tried given ~20% divergence. FYI: for infinite -z, bwa-sw is identical to SW in theory, but when -z goes beyond some threshold, it will be slower than BWT-SW. Practically, to employ heuristics, BWA-SW loses the ability to do fast exact SW alignment. BWT-SW is the best choice to perform such alignment.
I would use the classical concept of synteny or othology for such cross-genome comparisons. As you are already advised BLAST+ / older version of BLAST(BLAST 2.2.) are good choice for both.
Depending up on the length of your sequence, you may have use the right matrices and word-lenghth. If you are using the latest version of BLAST+, most of these features are already inbuilt to the algorithm (See the FAQ here, hope this is applicable to CLI as well. ).
You can also consult specialized cross-genome comparison tools like Vista tools, Cinteny etc.
If your query fraogments are from a mapped genome, I second the idea of considering synteny conservation. I had the impression that the query fragments are meant to be unconnected, though.
I think avilella is only using human-mouse as an example. Probably he does not have the genome sequence; otherwise there are much better ways than directly mapping these short sequences.
I would suggest MOSAIK aligner. It can acheive high sensitivity. It uses Smith-Waterman algorithm for alignemnet. Indexes both read and references and creates a jump database to speed up the alignment process. It has multiple options (for gap opening/extension; allowed mismatches; hash size...) to optimise the alignment process similar to that of blast. It also has option to use the IUPAC ambiguity codes during alignment.
To use mosaik, one needs a lot of parameter tuning. We are talking about 20% sequence divergence, while mosaik is designed for a few percent divergence. Actually when I wrote the BWA-SW paper, I did test mosaik, but it performance was clearly worse than bwasw/ssaha2/blat. I was afraid of doing something stupid, so dropped it in the end.
To use mosaik, one needs a lot of parameter tuning. We are talking about 20% sequence divergence, while mosaik is designed for a few percent divergence.
avilella is talking about the "human-mouse genetic distance". That as I remember is 80-85% nucleotide identity. I could be wrong, though. I have not looked at the human-mouse alignment for years.
Blat is not designed for 20% divergence, either. Unlike ssaha2, blat does not refine the final alignment with SW, which is particularly problematic for high divergence.
Do the 100bp to 1000bp sequences include intergenic regions?
How many sequences do you have?
every set ranges from 100K to several million sequences.