Can anybody tell me whether the algorithms for "alignment of short DNA sequences to the human genome" treat repeats in the genome as junk or not? More specifically, as an example: do Bowtie, BLAT, etc. take repeats in the genome into account while building their index? Thanks.
First of all, the concept of junk DNA is both outdated and misleading because of the secondary meaning of the word. So don't use it: there is no such thing as junk DNA, only DNA with as yet unknown function.
As for your question, a mapper's job is to align a read to a genome. Forget about index building; that is just an intermediate step that is not relevant to the mapper's function. When a read maps to multiple locations, the mapper may report all the mappings (usually with cutoffs) or report one location with a flag indicating that the read maps to other locations as well. This behavior is mapper-specific.
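To make the two reporting styles concrete, here is a toy sketch that uses exact string matching in place of a real aligner. The function names, the `mode` argument, and the `max_hits` cutoff are all made up for illustration and do not correspond to the options of any particular mapper.

```python
def find_all(reference, read):
    """Return every start position where read occurs exactly in reference."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def report_mappings(reference, read, mode="all", max_hits=10):
    """Toy reporting policy (hypothetical, not any real mapper's behavior).

    mode="all": report every hit, up to the max_hits cutoff.
    mode="one": report a single hit plus a flag saying whether others exist.
    """
    hits = find_all(reference, read)
    multi = len(hits) > 1
    if mode == "all":
        return {"positions": hits[:max_hits], "multi_mapped": multi}
    return {"positions": hits[:1], "multi_mapped": multi}


reference = "ACGTACGTTTGACGT"
print(report_mappings(reference, "ACGT", mode="all"))
print(report_mappings(reference, "ACGT", mode="one"))
print(report_mappings(reference, "TTG", mode="one"))
```

Real mappers tolerate mismatches and gaps and usually express the "maps elsewhere too" information through a mapping quality or tags, so treat this only as a picture of the two policies.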
@Istvan Albert, thanks. However, do mappers (mostly) take the repeats in the genome into consideration? Example: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ here repeats are shown in lower case (a,c,g,t). Does the mapper use those a,c,g,t, or does it only consider A,C,G,T (non-repeating sequence is shown in upper case)? Thanks.
I think in most cases one is better off aligning the reads and then dealing with the reads that map to multiple locations, rather than ignoring parts of the genome.
Yes, Istvan is right. It depends on the mapper and the parameters you are using. If you do not want to map your reads to repeats, just mask them with N's (or download pre-masked genomes available at RepeatMasker <http://www.repeatmasker.org>), as most mapping tools will skip N's.
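If your reference is soft-masked (repeats in lower case, as in the UCSC files), one way to get the N-masked version is to replace every lowercase base with N yourself. Below is a minimal sketch, assuming a plain FASTA where header lines start with ">"; the function name `hard_mask` is just for illustration.

```python
def hard_mask(fasta_lines):
    """Yield FASTA lines with lowercase (soft-masked) bases replaced by N."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Header lines are passed through unchanged.
            yield line
        else:
            # Any lowercase character in a sequence line becomes N.
            yield "".join("N" if ch.islower() else ch for ch in line)


example = [">chrTest", "ACGTacgtACGT", "ttttGGGG"]
masked = list(hard_mask(example))
print("\n".join(masked))
```

For a real genome you would stream the lines from the input file and write them to a new file instead of building a list in memory.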
@Istvan: Maybe I am wrong, but regarding "forget about index building", I think sometimes it is important. In this case, as Arpssss said, repeats are shown in lower case, and Novoalign says that "indexing process can optionally ignore lowercase nucleotides in the reference genome". So if someone wants to ignore lower case, he/she should do the indexing carefully.
Question: How does Novoalign handle lowercase masking?
Answer: The indexing process can optionally ignore lowercase nucleotides in the reference genome/nucleotide database. This means the lower case letters cannot be used to initiate an alignment; however, the Needleman-Wunsch alignment process can extend into lower case sequence as long as an alignment was initiated in indexed upper case sequence.
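To picture what "cannot be used to initiate an alignment" means, here is a toy k-mer index that leaves out any k-mer touching a lowercase base. This is only an illustration of the idea, not Novoalign's actual index; `build_seed_index` and its `skip_lowercase` argument are hypothetical names.

```python
from collections import defaultdict


def build_seed_index(reference, k, skip_lowercase=True):
    """Map each k-mer to its start positions in reference.

    With skip_lowercase=True, k-mers containing any lowercase
    (soft-masked) base are not indexed, so they can never seed a hit.
    """
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        if skip_lowercase and any(ch.islower() for ch in kmer):
            continue
        index[kmer.upper()].append(i)
    return index


# Upper case flanks around a lowercase (repeat) stretch.
reference = "ACGTTGCAacgtacgtGGATCC"
masked_index = build_seed_index(reference, k=4)
full_index = build_seed_index(reference, k=4, skip_lowercase=False)
print(masked_index["ACGT"])
print(full_index["ACGT"])
```

A read can then only be seeded from the uppercase positions in `masked_index`, while a later extension step working on the full reference string is still free to run into the lowercase region.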
What I wanted to say is that index building is a tool-specific step that may or may not be required. It is similar to setting a certain parameter or cutoff: there is no generic advice that we could give on how to index in general.