Most Sensitive Method To Search 100-1000Bp Sequences To Genome At Human-Mouse Genetic Distances
5
5
Entering edit mode
13.4 years ago

Following up on this question: http://biostar.stackexchange.com/questions/8010/searching-200-400bp-matches-against-mammalian-genome-human-mouse-distance
I would like to know what is the most sensitive method to search sequences ranging from 100bp to 1000bp against a whole genome assembly of another species, for genetic distances similar to human-mouse.
As an example, if I have human sequences ranging from 100-1000bps, what is the most sensitive tool and parameter options to find hits in the mouse genome?
I've been recommended to look at blast and blast+, but it would be great to hear about specific parameter settings or different options.

blast blast genome search homology • 5.5k views
ADD COMMENT
0
Entering edit mode

Do the 100bp to 1000bp sequences include intergenic regions?

ADD REPLY
0
Entering edit mode

How many sequences do you have?

ADD REPLY
0
Entering edit mode

every set ranges from 100K to several million sequences.

ADD REPLY
4
Entering edit mode
13.4 years ago
Lyco ★ 2.3k

First, I must say that I don't quite understand the difference between this question and the previous one (for which you provided some decent blast parameters yourself). You are asking for 'the most sensitive' method, not for the fastest one. Thus, I would probably discount blat, ssaha and their ilk, you might consider discontinuous megablast, but I would probably stick to blast+ for a general-purpuse comparison. I am sure that at human-mouse distance, you will find >95% or all relevant similarities by discontiguous megablast or by a simple blastn with default parameters. If you want to get the other 5%, the strategy depends on why they were missed in the first place.

  • If the input sequence was rather short (<200bp), it might be advisable to lower the word size from 11 to 9 or even 7.
  • If there are repeats involved (ERVs, Alus, transposons, whatever), it is possible that dust and/or repmasker will have killed everthing alignable. These cases are inherently difficult. It might be necessary to switch off the repeat masking, but this could mean sifting through thousands of 'signficant' alignments
  • If the sequences are just too divergent, this could be another tough problem. This can happen even at human-mouse distance. If there is the chance that your query sequence has some coding bits, it might be useful to also try tblastx.

In your situation I would probably set up a pipeline, first looking for matches by a fast and not-so-sensititve method. If nothing is found, I would probably try different things in parallel: using a shorter or longer word size, and also try using a different filter database (e.g. one that contains only repetitive elements found in both species). Repeats that are only present in one of the species will not generate spurious matches but might mask valuable sequence material in the query or target.

ADD COMMENT
0
Entering edit mode

+1 for word size and repeat masker. I think Aviella should take subset from his dataset and fine-tune the parameters. In my experience, word size modifications may required if sequence length is <100bp.

ADD REPLY
4
Entering edit mode
13.4 years ago
lh3 33k

Albert, you did not mention what speed you intend to achieve. If you only have a couple of sequences, you can just use Smith-Waterman. BWT-SW is also a good choice for mammalian genomes. It has a similar speed to blast for short sequences while gives identical alignment to Smith-Waterman. It is frequently overlooked. BWA-SW can be a reasonable choice when speed is critical. You need to increase "-z" (e.g. 10 or even 100) to achieve high sensitivity. SSAHA2 is also a decent choice. The default setting would not work. You need to ask Zemin about the options.

For mammalian genomes, blast is too slow. Algorithms indexing the genome are usually much faster. I do not know if blast+ gives significant boost. Probably not.

EDIT: People should read the BWT-SW paper [PMID:18227115]. It performs the exact SW algorithm while is faster than blast in the 100-1000bp range. The BWA-SW paper [PMID:20080505] shows that SSAHA2 performs well for 10% divergence. I do not see why it cannot be tuned to work for sequences with the human-mouse divergence. BWA-SW is using the default settings. It may work well with a more sensitive configuration. I am not sure about this bit, though, as I have not tried. Also, modern SSE2/CUDA powered SW is approaching the speed of blast (not yet). PatternHunter is known to work better than blast for cross-species alignment. A few groups (I cannot remember which) have improved blast, though these are less known. A lot happened in the past 10 years in the area of sequence analysis, but blast has not been changed much so far as I know. These efforts are not in vain.

I guess Albert is mapping transcriptome contigs. I think he needs a fast aligner.

ADD COMMENT
1
Entering edit mode

blast+ is not much faster than blast. On a decent computer (with a couple of GB RAM) blast is ok for running a sequence fragments against a whole mammalian genome (I normally use overlapping 100-500kb fragments rather than chromosome-size pieces). I wouldn't use it routinely for genome-genome comparison. However, if you want to compare entire genomes, you should definitively consider conserved synteny. I assume that the question was about searching with unconnected pieces.

ADD REPLY
0
Entering edit mode

@lh3: "For mammalian genomes, blast is too slow." I am wondering how or why ? Do you have a reference for this ? Also, you may already know that you can index a blast database in fasta format. BLAST is an ideal option for cross-genome searches. Aviella's sequences are +100bp, so most of the searches on default option should work.

ADD REPLY
0
Entering edit mode

These are not transcriptome contigs but they do come from a next-gen sequencing pileup/assembly step. I end up with a range of 100-1000bp and about 1e5-1e6 of them, but so far blast+ and blastall take too long for what I would like to do at a bigger scale. For example, blastall takes 2758.1034 CPU hours for one example dataset, ncbi-blast+ blastn is around 2000 CPU hours, whereas "bwasw -z 100" has just finished in 126.0136 CPU hours on the same dataset. Still looking at the results...

ADD REPLY
0
Entering edit mode

@avilella: let me know if bwa-sw is sensitive enough. I have not tried given ~20% divergence. FYI: for infinite -z, bwa-sw is identical to SW in theory, but when -z goes beyond some threshold, it will be slower than BWT-SW. Practically, to employ heuristics, BWA-SW loses the ability to do fast exact SW alignment. BWT-SW is the best choice to perform such alignment.

ADD REPLY
1
Entering edit mode
13.4 years ago

I would use the classical concept of synteny or othology for such cross-genome comparisons. As you are already advised BLAST+ / older version of BLAST(BLAST 2.2.) are good choice for both.

Depending up on the length of your sequence, you may have use the right matrices and word-lenghth. If you are using the latest version of BLAST+, most of these features are already inbuilt to the algorithm (See the FAQ here, hope this is applicable to CLI as well. ).

You can also consult specialized cross-genome comparison tools like Vista tools, Cinteny etc.

ADD COMMENT
1
Entering edit mode

Avilella is only use human-mouse as an example. Probably he does not have the genome sequence.

ADD REPLY
0
Entering edit mode

If your query fraogments are from a mapped genome, I second the idea of considering synteny conservation. I had the impression that the query fragments are meant to be unconnected, though.

ADD REPLY
0
Entering edit mode

@Lyco: Yes, the search is between human and mouse.

ADD REPLY
0
Entering edit mode

I think avilella is only using human-mouse as an example. Probably he does not have the genome sequence; otherwise there are much better ways than directly mapping these short sequences.

ADD REPLY
0
Entering edit mode

Precisely as lh3 says, I have the target genomes but my short 100-1000bp queries are for a non-model species with no genome.

ADD REPLY
1
Entering edit mode
13.4 years ago
Rm 8.3k

I would suggest MOSAIK aligner. It can acheive high sensitivity. It uses Smith-Waterman algorithm for alignemnet. Indexes both read and references and creates a jump database to speed up the alignment process. It has multiple options (for gap opening/extension; allowed mismatches; hash size...) to optimise the alignment process similar to that of blast. It also has option to use the IUPAC ambiguity codes during alignment.

ADD COMMENT
1
Entering edit mode

Example options: MosaikAligner -in myreads.dat -out h_sapiens_aligned. dat -ia h.sapiens.dat -hs 15 -mm 4 -mhp 100 -act 20 -j h.sapiens_15 -p 10

ADD REPLY
1
Entering edit mode

To use mosaik, one needs a lot of parameter tuning. We are talking about 20% sequence divergence, while mosaik is designed for a few percent divergence. Actually when I wrote the BWA-SW paper, I did test mosaik, but it performance was clearly worse than bwasw/ssaha2/blat. I was afraid of doing something stupid, so dropped it in the end.

ADD REPLY
0
Entering edit mode

@lh3 how did you the number %20

ADD REPLY
0
Entering edit mode

@lh3 how did you get the number: 20% divergance?

ADD REPLY
0
Entering edit mode

To use mosaik, one needs a lot of parameter tuning. We are talking about 20% sequence divergence, while mosaik is designed for a few percent divergence.

ADD REPLY
0
Entering edit mode

avilella is talking about the "human-mouse genetic distance". That as I remember is 80-85% nucleotide identity. I could be wrong, though. I have not looked at the human-mouse alignment for years.

ADD REPLY
0
Entering edit mode
13.4 years ago

For this you can use Blat or MegaBlast.

ADD COMMENT
1
Entering edit mode

Blat is not designed for 20% divergence, either. Unlike ssaha2, blat does not refine the final alignment with SW, which is particularly problematic for high divergence.

ADD REPLY

Login before adding your answer.

Traffic: 1936 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6