Question

Issue With Alignment Of Short Dna Sequences

13.2 years ago

Arpssss ▴ 30

Can anybody tell me whether the algorithms for "alignment of short DNA sequences to the human genome" treat the repeats in the genome as junk or not? More concretely: do tools such as Bowtie and BLAT take repeats in the genome into account when building their index? Thanks.

blast dna sequencing genome bowtie • 2.5k views

ADD COMMENT • link updated 13.2 years ago by Vikas Bansal ★ 2.4k • written 13.2 years ago by Arpssss ▴ 30

score 4 · Answer 1 · 2012-03-22

13.2 years ago

Istvan Albert 102k

First of all, the concept of junk DNA is both outdated and misleading because of the secondary meaning of the word. So don't use it: there is no such thing as junk DNA, only DNA with as yet unknown function.

As for your question: a mapper's job is to align a read to a genome. Forget about index building; it is just an intermediate step that is not relevant to that function. When a read maps to multiple locations, the mapper may report all the mappings (usually subject to cutoffs) or report one location with a flag indicating that the read maps to other locations as well. This behavior is mapper-specific.
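
To make the two reporting styles concrete, here is a minimal Python sketch. It is a toy, not how any real mapper works: it uses exact string matching instead of an index, and the names `find_all_hits`, `report_all`, `report_one` and the `max_hits` cutoff are made up for illustration.

```python
def find_all_hits(reference, read):
    """Return every 0-based start where the read matches the reference exactly.

    Case is ignored here, so soft-masked (lower-case) sequence is searched too.
    """
    ref = reference.upper()
    query = read.upper()
    hits = []
    start = ref.find(query)
    while start != -1:
        hits.append(start)
        start = ref.find(query, start + 1)  # step by 1 so overlapping hits are kept
    return hits


def report_all(hits, max_hits=10):
    """Policy 1: report every location, up to a (hypothetical) cutoff."""
    return hits[:max_hits]


def report_one(hits):
    """Policy 2: report one location plus a flag saying whether others exist."""
    if not hits:
        return None
    return {"position": hits[0], "multi_mapped": len(hits) > 1, "n_hits": len(hits)}


if __name__ == "__main__":
    reference = "ACGTACGTTTACGT"
    hits = find_all_hits(reference, "ACGT")
    print(report_all(hits))
    print(report_one(hits))
```

Either way the repeat copies stay in the search space; the only difference is how much of the ambiguity is passed on to the user.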

@Istvan Albert, thanks. However, do mappers (mostly) take the repeats in the genome into consideration? For example, in http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ repeats are shown in lower case (a, c, g, t) and non-repeating sequence in upper case (A, C, G, T). Does the mapper use the lower-case a, c, g, t as well, or does it only consider the upper-case A, C, G, T? Thanks.

ADD REPLY • link 13.2 years ago by Arpssss ▴ 30
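
For reference, the lower-case convention described in this comment is easy to inspect directly. The sketch below is only an illustration (the function name `soft_masked_fraction` is made up); it reports what fraction of the A/C/G/T bases in a sequence string are lower case, i.e. marked as repeats.

```python
def soft_masked_fraction(sequence):
    """Fraction of A/C/G/T bases that are written in lower case (soft-masked).

    Characters other than A, C, G, T (for example N or newlines) are ignored.
    """
    bases = [c for c in sequence if c in "ACGTacgt"]
    if not bases:
        return 0.0
    masked = sum(1 for c in bases if c.islower())
    return masked / len(bases)


if __name__ == "__main__":
    print(soft_masked_fraction("ACGTacgtNN"))
```

The case only marks the repeat annotation; the bases themselves are the same, so a tool that upper-cases the reference on input sees the full sequence.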

I think in most cases one is better off aligning the reads and then dealing with the reads that map to multiple locations, rather than ignoring parts of the genome.

ADD REPLY • link 13.2 years ago by Istvan Albert 102k

score 1 · Answer 2 · 2012-03-23

Yes, Istvan is right. It depends on the mapper and the parameters you are using. If you do not want to map your reads to repeats, just mask them with N's (or download pre-masked genomes available at RepeatMasker <http://www.repeatmasker.org>), as most mapping tools will skip N's.
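
If you start from a soft-masked FASTA (repeats in lower case, as in the UCSC files mentioned above), masking with N's can be sketched in a few lines of Python. This is a minimal illustration, assuming plain FASTA lines as input; the function names `hard_mask` and `hard_mask_fasta` are made up here.

```python
def hard_mask(sequence):
    """Replace every lower-case (soft-masked) character with N."""
    return "".join("N" if c.islower() else c for c in sequence)


def hard_mask_fasta(lines):
    """Yield FASTA lines with sequence lines hard-masked and headers untouched."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header lines may contain lower case; leave them alone
        else:
            yield hard_mask(line)


if __name__ == "__main__":
    example = [">chr_example\n", "ACGTacgtAC\n"]
    for out_line in hard_mask_fasta(example):
        print(out_line)
```

Keep in mind that this throws the repeat sequence away completely, so reads that truly come from repeats can no longer be placed at all.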

@Istvan: Maybe I am wrong, but although you said "forget about index building", I think it is sometimes important. In this case, as Arpssss said, repeats are shown in lower case, and Novoalign says that the "indexing process can optionally ignore lowercase nucleotides in the reference genome". So if someone wants to ignore lower-case sequence, they should do the indexing carefully.

I read about Novoalign here and it says:

Question: How does novoalign handle lowercase masking?

Answer: The indexing process can optionally ignore lowercase nucleotides in the reference genome/nucleotide database. This means the lower case letters cannot be used to initiate an alignment; however, the Needleman-Wunsch alignment process can extend into lower case sequence as long as an alignment was initiated in indexed upper case sequence.
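
The behaviour described in that answer can be illustrated with a toy seed-and-extend sketch in Python. This is not Novoalign's implementation: the k-mer index, the function names `build_index` and `align`, and the parameters are all assumptions for illustration, and the extension step is a simple ungapped mismatch count rather than Needleman-Wunsch.

```python
from collections import defaultdict


def build_index(reference, k, skip_lowercase=True):
    """Map each k-mer to its start positions in the reference.

    With skip_lowercase=True, any k-mer containing a lower-case base is left
    out, so soft-masked sequence can never seed an alignment.
    """
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        if skip_lowercase and not kmer.isupper():
            continue
        index[kmer.upper()].append(i)
    return dict(index)


def align(reference, index, read, k, max_mismatches=0):
    """Seed with indexed k-mers, then extend over the whole read.

    The extension compares against the upper-cased reference, so it may run
    into lower-case sequence even though seeding could not start there.
    """
    read = read.upper()
    ref_upper = reference.upper()
    found = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = ref_upper[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                found.add(start)
    return sorted(found)


if __name__ == "__main__":
    ref = "TTGACCAGTacgtacgtGGCATCAA"
    masked_index = build_index(ref, 4)
    print(align(ref, masked_index, "CAGTACGT", 4))  # seeds in upper case, extends into the repeat
    print(align(ref, masked_index, "ACGTACGT", 4))  # lies entirely inside the repeat
```

In this toy example the first read is placed at position 5 because its first k-mer is in upper-case sequence, while the second read, which falls entirely in lower-case sequence, finds no seed and is not aligned. Building the index with `skip_lowercase=False` places it at position 9.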