I am interested in assembling only a part of the genome, not the whole genome. I have the read file for the whole genome. Is there any way I can get the subset of reads for the part of the genome that I am interested in, and then run the assembly on that subset?
I am a newbie in this field. Can you point me to literature where people have done this?
I also thought of the same. But then we would be unable to capture the diversity in the unknown genome region. If we filter the reads by mapping position for the region in the reference genome and then assemble that subset of reads, we will just get back the region of the reference genome itself. How can we avoid this?
You could lower the mapping stringency and also include multi-mapped reads that cover this region.
I can lower the mapping stringency, but is there any way to quantify that? How much lower? Because the more I lower it, the more reads I will get.
Maybe this gives you an idea. However, I guess it is, unfortunately, trial and error. Sorry about that.
You can also, rather than mapping, use kmer-matching. This can be more sensitive, depending on the parameters. For example:
This will capture all the reads that share a 27-mer with the region, allowing one mismatch.
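To make the idea concrete, here is a minimal Python sketch of k-mer matching with one allowed mismatch. It is only an illustration of the concept, not how BBDuk is implemented. The function names, the toy region sequence and k=11 (instead of 27, to keep the toy example short) are all made up for this sketch, and it assumes uppercase A/C/G/T input with no N bases.

```python
# Sketch: keep a read if any of its k-mers is within one mismatch
# of a k-mer from the region of interest (either strand).
# All names below are hypothetical; this is not BBDuk's code.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (nothing if seq is shorter than k)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def one_mismatch_variants(kmer):
    """Yield the k-mer itself plus every k-mer at Hamming distance 1."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def build_index(region, k):
    """Index all k-mers of both strands of the region, with 1-mismatch variants."""
    index = set()
    for strand in (region, reverse_complement(region)):
        for kmer in kmers(strand, k):
            index.update(one_mismatch_variants(kmer))
    return index


def read_matches(read, index, k):
    """True if at least one k-mer of the read is in the index."""
    return any(kmer in index for kmer in kmers(read, k))


if __name__ == "__main__":
    region = "ACGTTGCAAGGCTTAACCGGATCGATTACGGCATGC"  # toy region of interest
    k = 11
    index = build_index(region, k)
    # A read taken from the region with one substituted base in the middle.
    variant_read = region[5:15] + "T" + region[16:25]
    print(read_matches(variant_read, index, k))
    print(read_matches("T" * 20, index, k))
```

Storing every one-mismatch variant costs about 3k+1 index entries per reference k-mer, which is fine for a single region but would get large for a whole genome.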
Thanks for the reply. Can you elaborate on bbduk.sh?
I've described it here. You can run the script with no parameters (or edit it) to get a list of parameters and their meanings.
Essentially, in this mode, it will retain every read that has a 27-mer match to the reference. You can also use the flag mkf=0.5, for example, which stands for "min kmer fraction", to require reads to share at least 50% of their kmers with the reference.
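As a rough illustration of what a "min kmer fraction" filter means, here is a small Python sketch. It is an assumption about the general idea, not BBDuk's actual code, and the exact way BBDuk counts kmers may differ. The helper names, the toy region and k=11 are hypothetical, and input is assumed to be uppercase A/C/G/T.

```python
# Sketch: keep a read only if a minimum fraction of its k-mers
# occurs in the reference region (either strand, exact matches).
# All names below are hypothetical; this is not BBDuk's code.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_list(seq, k):
    """Return every overlapping k-mer of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reference_kmers(region, k):
    """Collect the k-mers of both strands of the region into a set."""
    return set(kmer_list(region, k)) | set(kmer_list(reverse_complement(region), k))


def kmer_fraction(read, ref_kmers, k):
    """Fraction of the read's k-mers that are present in ref_kmers."""
    read_kmers = kmer_list(read, k)
    if not read_kmers:
        return 0.0  # read shorter than k: nothing can match
    hits = sum(1 for kmer in read_kmers if kmer in ref_kmers)
    return hits / len(read_kmers)


def passes_min_fraction(read, ref_kmers, k, min_fraction=0.5):
    """True if the read shares at least min_fraction of its k-mers with the reference."""
    return kmer_fraction(read, ref_kmers, k) >= min_fraction


if __name__ == "__main__":
    region = "ACGTTGCAAGGCTTAACCGGATCGATTACGGCATGC"  # toy region of interest
    k = 11
    ref = reference_kmers(region, k)
    # A read that starts inside the region and runs off into unrelated sequence.
    partial_read = region[0:25] + "TTTTT"
    print(kmer_fraction(partial_read, ref, k))
    print(passes_min_fraction(partial_read, ref, k, min_fraction=0.5))
```

A higher threshold makes the filter stricter: reads that only touch the region at one end get dropped, at the cost of losing reads that carry real variation.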