Question

BLAT: how to select best hit at one genomic position? (queries are repeats)

8.7 years ago

goubert.clement ▴ 30

Hello everyone,

I have searched several forums without finding a solution to my problem, although I'm sure this has been done before. Here is the problem: I'm mapping a library of repeat sequences (transposable elements, TEs) onto a genome, so one query can have multiple matches in the genome. However, at a given genomic position I can have matches from several different TEs in my library, and I want to filter the output file to keep only the best hit at each genomic location.

Below is an example as seen in the genome browser: the darker long bar represents the best repeat matching one genomic position. Those are the hits I want to select from among all the hits.

Blast out

Usually RepeatMasker is used for this kind of analysis, but I'm not totally satisfied with the results I get, and the way it works is something of a black box. I'm considering a sliding-window approach to decide, at each base, which TE hit is the best among the different possibilities, but I have no idea how to do that (and no expertise!).

Thanks a lot for your help and advice,

Best,

Clément
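One way to approach the selection described in the question is a greedy filter over the PSL hits: on each target sequence, rank hits by a score and keep a hit only if it does not overlap a better hit that was already kept. The sketch below is only an illustration under stated assumptions. It assumes headerless, tab-separated PSL (for example produced with BLAT's `-noHead` option), and the score `matches - misMatches` is a simplification chosen for this example, not BLAT's own scoring. The function names `parse_psl_line`, `hit_score` and `best_hits` are hypothetical.

```python
from collections import defaultdict


def parse_psl_line(line):
    """Extract the PSL columns needed here (standard 21-column PSL layout)."""
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]),
        "mismatches": int(f[1]),
        "q_name": f[9],
        "t_name": f[13],
        "t_start": int(f[15]),  # 0-based, half-open target coordinates
        "t_end": int(f[16]),
    }


def hit_score(hit):
    # Assumption: a simple score for illustration, not BLAT's own formula.
    return hit["matches"] - hit["mismatches"]


def best_hits(hits):
    """Greedily keep the highest-scoring non-overlapping hits per target."""
    by_target = defaultdict(list)
    for h in hits:
        by_target[h["t_name"]].append(h)

    kept = []
    for t_name in sorted(by_target):
        chosen = []
        # Visit hits from best to worst; reject any that overlap a kept hit.
        for h in sorted(by_target[t_name], key=hit_score, reverse=True):
            no_overlap = all(
                h["t_end"] <= c["t_start"] or h["t_start"] >= c["t_end"]
                for c in chosen
            )
            if no_overlap:
                chosen.append(h)
        kept.extend(sorted(chosen, key=lambda h: h["t_start"]))
    return kept
```

Note that this discards a lower-scoring hit entirely as soon as it overlaps a better one. For nested or fragmented TEs it may be preferable to trim the overlapping part instead, which this sketch does not attempt.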

Blat .psl best hit genome Transposable Elements • 3.3k views

ADD COMMENT • link updated 8.7 years ago by Amitm ★ 2.3k • written 8.7 years ago by goubert.clement ▴ 30

Can you tell us what you are using as the query? Is it FASTA derived from FASTQ reads or something else?

ADD REPLY • link 8.7 years ago by GenoMax 148k

Hello, these are FASTA sequences (sizes range from 200 to 10,000 bp). It is a manually curated TE library.

ADD REPLY • link 8.7 years ago by goubert.clement ▴ 30

Are the query sequences full length? Are you trying to locate their positions in the scaffolds? Do the query sequences share long stretches of similarity (i.e. can some of the smaller sequences be fully contained in larger ones)?

ADD REPLY • link 8.7 years ago by GenoMax 148k

Are the query sequences full length? --> YES and NO: some are, others are only partial pieces of TEs.

Are you trying to locate their positions in the scaffolds? --> YES

Do the query sequences share long stretches of similarity (i.e. can some of the smaller sequences be fully contained in larger ones)? --> I clustered them beforehand to avoid that as much as possible. Within a cluster, sequences share at least 80% identity, and a shorter sequence is assigned to a cluster's reference sequence if at least 80% of its own length is in the alignment. I then keep only the reference sequences and map them to the genome. The problem is that pieces of different clusters sometimes match the same genomic position.

Thanks!

ADD REPLY • link 8.7 years ago by goubert.clement ▴ 30

Interesting and difficult problem. I take it that the reference used to cluster the TEs is different from the scaffolds being searched against.

If you stick to local alignments, then look for hits that cover 100% of the query (or close to it), in addition to increasing gap open/extension penalties to filter out some of the partial matches. I am also wondering whether you need to start looking at a program that does global alignments instead of local ones.
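The coverage filter suggested above could look roughly like this for PSL output. This is a minimal sketch, assuming headerless tab-separated PSL lines; the 0.9 threshold and the function names `query_coverage` and `filter_by_coverage` are hypothetical choices for illustration. Coverage is taken as the aligned query span (`qEnd - qStart`) divided by `qSize`, so gaps inside the span still count as covered.

```python
def query_coverage(psl_line):
    """Fraction of the query spanned by the alignment (gaps included)."""
    f = psl_line.rstrip("\n").split("\t")
    q_size, q_start, q_end = int(f[10]), int(f[11]), int(f[12])
    return (q_end - q_start) / q_size


def filter_by_coverage(psl_lines, min_cov=0.9):
    """Keep only hits whose query span covers at least min_cov of the query."""
    return [line for line in psl_lines if query_coverage(line) >= min_cov]
```

Since part of the library consists of partial TE pieces, a strict threshold may discard genuine fragmented copies, so the cutoff would need tuning on real data.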

ADD REPLY • link 8.7 years ago by GenoMax 148k

It would help to know what your goal is with this analysis, even in general terms. For example, are you trying to identify TEs in another genome using a custom repeat library, and if so, how divergent are the species you are comparing?

Whatever your goal is, BLAT is probably not the right tool for this task. You can tile across regions to get a contiguous set of hits with this output, but the assumptions of the program aren't going to be appropriate for most applications involving TEs (mainly due to divergence). Most of these issues have been handled by programs like RepeatMasker, so you might want to elaborate on what you are not happy with in that approach.

ADD REPLY • link 8.7 years ago by SES 8.6k

score 0 · Answer 1 · 2016-04-21

8.7 years ago

Amitm ★ 2.3k

Hi, I am assuming that the TEs align in one contiguous stretch, possibly with small gaps but nothing like intron-sized gaps. In that scenario, you could try aligning with Bowtie. Under default settings it reports only the best hit. Using the -D parameter to increase the search space would make it more sensitive when you suspect the query has multiple hits.

ADD COMMENT • link 8.7 years ago by Amitm ★ 2.3k

Hello Amitm,

Thanks for your answer. Actually, I have a really complex genome where TEs are highly fragmented and nested inside one another. This is why I use BLAT, which can handle gaps within a single sequence alignment. In addition, I want the mapping to be very sensitive, because the TE copies can be highly divergent from their consensus (< 80% identity), so I don't think Bowtie will be sensitive enough, but I'll give it a try!

ADD REPLY • link 8.7 years ago by goubert.clement ▴ 30

Keep in mind that bowtie v1 has an upper query size limit of ~1000 bp.

ADD REPLY • link 8.7 years ago by GenoMax 148k

Yes, the hyperlink actually points to Bowtie 2, which states that it has no limit on input length. But after reading the details of Clément's scenario, I think Bowtie 2 may be quite inadequate for resolving the mapping.