I ran a local standalone BLAST of pre-miRNAs against a genome with tabular output (-m 8) and got the results. My next steps for refining the results include GC content analysis, RepeatMasker, etc. I am currently developing a Perl program to extract the part of the sequence that matched any pre-miRNA, using the columns of the tabular output. My logic includes:
- Matching the supercontig name with a specific sequence block name in the genome file.
- Extracting the matched region between the match start and end points given in the tabular file.
The tabular output gives start and end points for the match on the query sequence as well as on the subject sequence. Which pair should I use as the start and end points for sequence extraction: query or subject?
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node93.html
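To make the logic above concrete, here is a minimal sketch of the two steps. It is written in Python purely for illustration (the Perl version would follow the same steps). My working assumption is that, because the genome was the BLAST database, the supercontig name is in the subject column and the subject start/end are the positions on the genome; that is exactly the part I would like confirmed. The field names, function names, and toy data are my own and hypothetical.

```python
# Column order of BLAST tabular output (-m 8): 12 tab-separated fields.
# The field names below are my own labels, not official identifiers.
FIELDS = ("query", "subject", "pct_identity", "aln_length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score")


def parse_hit(line):
    """Split one -m 8 line into a dict and convert the coordinates to int."""
    hit = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("q_start", "q_end", "s_start", "s_end"):
        hit[key] = int(hit[key])
    return hit


def read_fasta(lines):
    """Return {name: sequence}; the name is the first word after '>'."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None and line:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def matched_region(hit, genome):
    """Slice the matched stretch out of the subject (genome) sequence.

    ASSUMPTION: the genome was the database, so subject coordinates apply.
    BLAST coordinates are 1-based and inclusive. A minus-strand hit is
    reported with s_start > s_end, so the pair is sorted before slicing;
    the returned sequence is always the plus strand as stored in the file.
    """
    lo, hi = sorted((hit["s_start"], hit["s_end"]))
    return genome[hit["subject"]][lo - 1:hi]


# Toy data (hypothetical): one supercontig of 20 nt and one hit at 11..16.
genome = read_fasta([">supercontig_1 toy", "ACGTACGTAC", "GGGTTTAAAC"])
hit = parse_hit("pre-miR-x\tsupercontig_1\t100.00\t6\t0\t0\t1\t6\t11\t16\t1e-05\t30.0")
print(matched_region(hit, genome))
```

For a minus-strand hit the slice would still need to be reverse-complemented to be in the same orientation as the pre-miRNA query.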
I have read published papers on miRNA where they mention a sliding window of 70 or 100 nucleotides on either side of the match area. I presume that these researchers extract 70 nucleotides before the start of the match area as well as 70 nucleotides after the end of the match area. Am I right in presuming this and should I be doing the same thing?
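In code, my presumption about the flanking region would look roughly like the sketch below (Python for illustration only). It assumes the start and end are 1-based inclusive positions on the genome sequence, as BLAST reports them, and that the flank should simply be cut short when the match lies near the end of a supercontig. The function name and the default of 70 are my own choices, not taken from any paper.

```python
def flanked_region(seq, start, end, flank=70):
    """Return (lo, hi, subsequence) for the match plus `flank` nt per side.

    `start` and `end` are 1-based inclusive; they may be given in either
    order (minus-strand hits are reported with start > end). The window is
    clamped so it never runs past either end of the sequence.
    """
    lo, hi = sorted((start, end))
    lo = max(1, lo - flank)
    hi = min(len(seq), hi + flank)
    return lo, hi, seq[lo - 1:hi]


# Toy sequence (hypothetical): 100 A, then a 20 nt "match" of C, then 100 G.
seq = "A" * 100 + "C" * 20 + "G" * 100

lo, hi, region = flanked_region(seq, 101, 120, flank=70)
print(lo, hi, len(region))  # 70 nt + 20 nt match + 70 nt

lo, hi, region = flanked_region(seq, 5, 24, flank=70)
print(lo, hi, len(region))  # upstream flank is truncated at position 1
```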
Please help.
An answer to even one part of the question would be helpful.