Hello everyone,
I have searched several forums and can't find how to solve my problem, although I'm sure this has been done before. Here is the problem: I'm mapping a library of repeat sequences (Transposable Elements, TEs) onto a genome, so one query can have multiple matches on the genome. However, at a given genomic position, I can have multiple matches from different TEs in my library, and I want to filter the output file to keep only the best hit at each location of the genome.
Below is an example seen in the genome browser: the darker long bar represents the best repeat matching one genomic position. I want to select only those among the hits.
Usually, RepeatMasker is used for this kind of analysis, but I'm not totally satisfied with the results I get, and the way it works is something of a black box. I'm considering a sliding-window approach to select, at each base, the best TE hit among the different possibilities, but I have no idea how to do that (and I lack the programming skills!).
Thanks a lot for your help and advice,
Best,
Clément
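For what it's worth, the "best hit at each base" selection described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the hits are already parsed into records with 0-based, half-open coordinates (as in BED) and a numeric score where higher is better. The `Hit` fields, the function name `best_hit_per_position` and the example coordinates are hypothetical and do not correspond to the output format of any particular aligner. Instead of a literal sliding window, it cuts each sequence at every hit boundary and keeps the highest-scoring hit on each resulting piece, which should give the same per-base answer.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str    # scaffold / chromosome name
    start: int    # 0-based, inclusive (assumption: BED-like coordinates)
    end: int      # exclusive
    name: str     # TE (query) name
    score: float  # alignment score, higher is better (assumption)


def best_hit_per_position(hits):
    """Return (chrom, start, end, name, score) segments such that every
    covered base is assigned to the highest-scoring hit overlapping it."""
    by_chrom = defaultdict(list)
    for h in hits:
        by_chrom[h.chrom].append(h)

    result = []
    for chrom in sorted(by_chrom):
        chrom_hits = by_chrom[chrom]
        # Every hit boundary is a point where the best hit may change.
        points = sorted({p for h in chrom_hits for p in (h.start, h.end)})
        segments = []
        for left, right in zip(points, points[1:]):
            # Hits that span the whole elementary interval [left, right).
            covering = [h for h in chrom_hits
                        if h.start <= left and h.end >= right]
            if not covering:
                continue  # gap between hits
            # Highest score wins; the longer hit breaks ties.
            best = max(covering, key=lambda h: (h.score, h.end - h.start))
            if segments and segments[-1][3] is best and segments[-1][2] == left:
                # Same hit as the previous piece: extend it.
                segments[-1] = (chrom, segments[-1][1], right, best)
            else:
                segments.append((chrom, left, right, best))
        result.extend((c, s, e, h.name, h.score) for c, s, e, h in segments)
    return result


# Made-up example: three overlapping TE hits on one scaffold.
example_hits = [
    Hit("chr1", 100, 500, "TE_A", 900),
    Hit("chr1", 300, 700, "TE_B", 400),
    Hit("chr1", 350, 450, "TE_C", 200),
]
for segment in best_hit_per_position(example_hits):
    print(*segment, sep="\t")
```

In this made-up example TE_A is kept over 100-500 and TE_B only over 500-700, while TE_C disappears because it is never the best hit at any base. The scan over all hits for every interval is quadratic in the worst case, so for a whole-genome hit table an interval tree or a sorted sweep would probably be needed.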
Can you tell us what you are using as query? Is it fasta derived from fastq reads or something else?
Hello, these are fasta sequences (sizes range from 200 to 10000 bp). It is a manually curated TE library.
Are the query sequences full length? Are you trying to locate their positions in the scaffolds? Do the query sequences share long stretches of similarity (i.e. can some of the smaller sequences be fully contained in larger ones)?
Are the query sequences full length? --> YES and NO, some are, others are only partial pieces of TEs
Are you trying to locate their positions in the scaffolds? --> YES
Do the query sequences share long stretches of similarity (i.e. can some of the smaller sequences be fully contained in larger ones)? --> I clustered them beforehand to avoid that as much as possible. They are clustered such that the sequences in one cluster have at least 80% identity; a shorter sequence is assigned to the reference sequence of a cluster if 80% of its own length is in the alignment. Then I keep only the reference sequences and map them onto the genome. The problem is that pieces of different clusters sometimes match the same genomic position.
Thanks!
Interesting and difficult problem. I take it that the reference used to cluster the TEs is different from the scaffolds being searched against.
If you stick to local alignments, then look for hits that cover 100% of the query (or close to it), and increase the gap open/extension penalties to filter out some of the partial matches. I am also wondering whether you need to start looking at a program that does global alignments instead of local ones.
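As a rough illustration of the coverage filter suggested above, here is a minimal Python sketch. It assumes each hit is a dictionary carrying the query length and the aligned query interval in 0-based, half-open coordinates; the keys `q_len`, `q_start`, `q_end`, the function names and the 0.95 cutoff are all hypothetical choices for the example and not fields or defaults of any specific aligner.

```python
def query_coverage(hit):
    """Fraction of the query sequence included in the alignment."""
    return (hit["q_end"] - hit["q_start"]) / hit["q_len"]


def filter_by_coverage(hits, min_coverage=0.95):
    """Keep only hits whose alignment covers at least min_coverage
    of the query (0.95 is an arbitrary example threshold)."""
    return [h for h in hits if query_coverage(h) >= min_coverage]


# Made-up hits: one near full-length match and one partial match.
example_hits = [
    {"name": "TE_A", "q_len": 1000, "q_start": 0, "q_end": 990},
    {"name": "TE_B", "q_len": 1000, "q_start": 200, "q_end": 700},
]
kept = filter_by_coverage(example_hits)
print([h["name"] for h in kept])
```

Since part of the library consists of partial TE pieces, a strict cutoff like this would favor hits where the library sequence aligns end to end, and would drop genomic copies that are themselves truncated, so the threshold would need to be chosen with that in mind.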
It would help to know what your goal is with this analysis, even in general terms. For example, are you trying to identify TEs in another genome using a custom repeat library, and if so, how divergent are the species you are comparing?
Whatever your goal is, BLAT is probably not the right tool for this task. You can tile across regions to get a contiguous set of hits with this output, but the assumptions of the program aren't going to be appropriate for most applications involving TEs (mainly due to divergence). Most of these issues have been handled by programs like RepeatMasker, so you might want to elaborate on what you are not happy with in that approach.