Entering edit mode
5.1 years ago
ancient_learner
▴
680
Hi
I hope you all are doing good. I am dealing with IGHD genes (Mouse genome) which are small in size. We have got our HTGTS-Seq done and have got the reads I want to particularly check for D segment in the IGH locus. So my reference file here is D gene segments file to which some of my reads shall align. If I am keeping e-value cutoff as 0.01
it is not returning all possible alignments for gene segments as some of the D gene segments are 10-15bps in length. But if I change it to 0.1
I am able to get some. So the question is how should I determine the cutoff?
I personally would say, don't use e-value as a "cutoff" kind of parameter. Use the default, which is 10.0 on top of my head. And use values like identity and coverage to filter the output. One value to calculate the e-value is the size of the database, so if you happen to find more reference sequences and add that to your database the e-value can change again. I don't know your expectations or goal but with such small sequences it is sometimes worth something the change the word_size.
I think II should explain it better. I have a 400 bp sequence (which is a query in my case) like this
I have a reference file in which each entry would be close to 20 bp or even less
So my question is when I run blastn what parameters should I consider to get the good hits.
You could use the longer sequence as the reference. Would you have reasons not to do that?
As I told it's IGH locus and am looking at IGHD gene segments which are small
Not sure why that changes anything. If you use seq1 as the reference and blast Dgene1 against it you will get a match or not. If the match is perfect (100% coverage and identity) you know that seq1 contains a IGHD segment.
Besides, blast uses the query coverage to calculate scores. If you blast a much longer sequence against a shorter one this will be extremely low. Let's say that you have a perfect match Dgene2 and your sequence is 150pb. 11/150*100=7.3% the coverage will be 7.3% which is really low. But it is a 100% percent match.
In case of Dgene1 it will be 11.3% coverage, so you will have two perfectly matching segments, one with a coverage of 7.3 and the other 11.3... If you have two perfectly matching sequences you should have two times 100%. Otherwise it is impossible to set thesholds.