Hello, I have a fungal genome assembled into 222 contigs (contig.fasta). I would like to detect the characteristic telomeric repeats (TTAGGG/CCCTAA) in the contigs. My approach was to create an indexed database from the FASTA file containing the 222 contigs and run blastn with a very short query sequence (tel1.fasta, only 6 nucleotides):
>query sequence
TTAGGG
The following is the command I used. Surprisingly, it returns no matches, even though I know there are plenty. Can you explain why this is happening and how to prevent it?
blastn -query tel1.fasta -db contig.fasta -task "blastn-short" -outfmt 7 -max_target_seqs 10 -evalue 0.5 -perc_identity 90
Is there any other way to detect telomeres in a given sequence assembly?
Thanks
I don't think you can use the quotes ("). Try just blastn-short.
That works only for sequences 10 bp and longer according to this post: A: Blast Settings For Short Sequences
You may want to try fuzznuc from EMBOSS for real short sequences like that.
The minimal seed length is 7 for blast actually. Still too much for this, unless you blast on two or more tandem repeats. For instance:
That's a good one. Yes, I noticed that there were tandem repeats.
Thanks genomax. I need to try fuzznuc as you suggested.
Do you actually need alignment for this, or are the matches (reasonably) well conserved? Fuzzy matching, as already mentioned, or a regex approach might be sufficient?
Thanks for the suggestion. This sequence is a characteristic repeat. Regex worked!
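For anyone landing here later, here is a minimal sketch of what the regex approach could look like in Python. It is not the exact script used above. The function names (read_fasta, find_telomeric_repeats) are made up for illustration, and the threshold of two or more tandem copies is an assumption you may want to raise. It only finds exact copies of TTAGGG or CCCTAA, so degenerate repeats will be missed.

```python
import re
import sys

# Two or more exact tandem copies on either strand.
# The {2,} threshold is an assumption, not something fixed by the thread.
TELOMERE_RE = re.compile(r"(?:TTAGGG){2,}|(?:CCCTAA){2,}", re.IGNORECASE)


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Use the first word of the header as the contig name.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_telomeric_repeats(lines):
    """Map contig name -> list of (start, end, motif, copies).

    Coordinates are 0-based and end-exclusive.
    """
    hits = {}
    for name, seq in read_fasta(lines):
        for m in TELOMERE_RE.finditer(seq):
            run = m.group(0)
            motif = run[:6].upper()
            hits.setdefault(name, []).append(
                (m.start(), m.end(), motif, len(run) // 6)
            )
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python find_telomeres.py contig.fasta
    with open(sys.argv[1]) as handle:
        for contig, runs in find_telomeric_repeats(handle).items():
            for start, end, motif, copies in runs:
                print(contig, start, end, motif, copies, sep="\t")
```

Hits near position 0 or near the end of a contig are the ones most likely to be real telomeres, so it is worth comparing the reported coordinates with the contig lengths.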