Hi all,
I have a read that consists of two parts as shown in the picture.
The first part is a viral sequence (from either virus A or virus B) and the second part (H) is a human DNA sequence. Only part of the virus is inserted into the DNA, and the length of this part varies across cells. I know the exact start point of the virus sequence, but not the end. The whole read (A+H) is about 500 bp long. The exact sequences of virus A and B are available, and I use them as the reference. I want to determine which type of virus is present in my DNA sequence. How can I use BWA to align my reads to the reference and detect the type of virus?
Thank you in advance.
Shiva.
If you use virus A or B as the sole reference, bwa should align the part of the read that matches the virus and soft-clip the rest. You could look for soft-clipped reads that show a full match on part of the read. Is it of interest where the sequence aligns? Or is the question just "Do I have virus A and/or virus B sequence in my reads?"
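To make the soft-clipping idea concrete, here is a minimal Python sketch that takes a CIGAR string from the bwa SAM output and reports how many read bases were soft-clipped at each end and how many were aligned. The function name `split_by_soft_clip` and the example CIGAR are made up for illustration, and ignoring hard clips is an assumption that should be fine for this use case.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def split_by_soft_clip(cigar):
    """Return (left_clip, aligned_read_bases, right_clip) for a CIGAR string."""
    if cigar == "*":  # unmapped read, no alignment available
        return 0, 0, 0
    # Hard-clipped bases (H) are not present in SEQ, so drop them here.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    # M, I, = and X are the operations that consume aligned read bases.
    aligned = sum(n for n, op in ops if op in "MI=X")
    return left, aligned, right

# Hypothetical read: 180 bases match the virus, 320 bases are soft-clipped.
print(split_by_soft_clip("180M320S"))  # prints (0, 180, 320)
```

Keep in mind that SEQ in a SAM record is stored reverse-complemented when the read maps to the reverse strand, so "left" and "right" refer to the sequence as written in the SAM file.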
If you only need to know whether virus A and/or virus B is present, bbduk might be helpful. See the usage example about "Kmer filtering".
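The general idea behind k-mer matching can be shown with a toy Python sketch: count how many k-mers of a read also occur in each viral reference and pick the reference with the most hits. This is only an illustration of the concept and not how bbduk is implemented; the sequences, the function names, and the values of `k` are all invented for the example.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """Set of all k-mers in seq and in its reverse complement."""
    seq = seq.upper()
    both = (seq, reverse_complement(seq))
    return {s[i:i + k] for s in both for i in range(len(s) - k + 1)}

def count_shared_kmers(read, references, k=21):
    """Map each reference name to the number of distinct read k-mers it contains."""
    read = read.upper()
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return {name: len(read_kmers & kmers(ref, k))
            for name, ref in references.items()}

# Toy sequences, invented for this example only.
references = {
    "virus_A": "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCC",
    "virus_B": "TTTTCCCCGGGGAAAATTTTCCCCGGGGAAAA",
}
# A chimeric read: 20 bases of virus_A followed by a made-up "human" part.
read = references["virus_A"][:20] + "GATTACAGATTACAGATTACA"

# k=8 only because the toy sequences are short; real data needs a larger k.
counts = count_shared_kmers(read, references, k=8)
best = max(counts, key=counts.get)
print(counts, best)
```

Because only the viral part of the read shares k-mers with the reference, the human part does not interfere with the decision.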
Thank you for your reply. BBDuk seems very useful!
My final goal is to find the location of the "Y" sequence in the genome. This sequence is important to me because it was followed by the "X" sequence. My plan is to first align the whole read to A and B individually to detect the type of virus, then cut the "X" part from the read and align only the "Y" part to the human reference genome to find its location. Do you think this approach is suitable? Do you know any tools that can do this analysis in one step?
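The cutting step of this two-step plan could look roughly like the Python sketch below. It assumes the read has already been aligned to the virus reference with bwa and that you know how many bases were soft-clipped at each end (as reported in the CIGAR string). The function name `clipped_fastq_record` and the `min_len` cutoff of 20 bases are hypothetical choices, not part of any existing tool.

```python
def clipped_fastq_record(name, seq, qual, left_clip, right_clip, min_len=20):
    """Return the longer soft-clipped end of a read as a FASTQ record.

    Returns None if that end is shorter than min_len bases, since very
    short fragments are unlikely to align uniquely (min_len is an assumption).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    left = (seq[:left_clip], qual[:left_clip])
    # Slice from an explicit index so that right_clip == 0 gives an empty string.
    right = (seq[len(seq) - right_clip:], qual[len(qual) - right_clip:])
    part_seq, part_qual = max(left, right, key=lambda part: len(part[0]))
    if len(part_seq) < min_len:
        return None
    return f"@{name}\n{part_seq}\n+\n{part_qual}\n"

# Made-up read: 10 virus bases followed by 30 bases that did not match the virus.
record = clipped_fastq_record("read1", "A" * 10 + "C" * 30, "I" * 40,
                              left_clip=0, right_clip=30)
print(record)
```

The records produced this way can be collected in a FASTQ file and aligned to the human reference in a second bwa run.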