5.7 years ago
bas1993
I have PacBio and MinION reads and I want to know whether a genetically modified region is present once or multiple times. Normal mapping doesn't give me the answers I want: the reads are soft-clipped against the known sequence, so I get no information about what else is in the read. So instead of using the known sequence as the reference and the reads as the query, I was thinking about using the reads, in a multi-FASTA file, as the reference and the known sequence as the query. Which tool is best for this? A local BLAST? Or am I thinking about this the wrong way?
minimap2 should be the aligner of choice in this case. Is that what you already used?
You could do it with BLAST, but depending on what kind of local alignments you see (if there are smaller repeat regions) you may have to play with the BLAST settings to get the result you need. If you expect the 7 kb sequence (or parts of it) to be present as is in the reads, then even blat may be useful to find them.
But will I not get the same problem as with other mappers? I know the genetic modification is present in the reads, but I would like to know if it is present multiple times (or once completely and then half of the sequence again).
Do you not need to assemble first to figure this out? How will you (easily) detect multiple genetically modified versions when you will already get many, many reads mapping?
I already did an assembly, and in the .gfa files it looked like a circular contig that is connected to the chromosome. So I hypothesized that the assembly may have been influenced by some of the reads that are shorter than the modification. That is why I extracted all the reads that are longer than the modification and wanted to do some kind of alignment.
I did not do this with a 7 kb sequence but with a test sequence (there are two copies in the second read). I see the following.
Using the following, I see the two copies of the query sequence.
So you essentially knocked in a certain sequence which is not part of the genome, you don't know its location and want to know how many integration locations you have?
To give some more background information: I didn't knock in the gene myself. An external company integrated a whole plasmid into the chromosome of a food bacterium. The documentation of this modification is not correct, and that is why we sequenced it. The plasmid is around 7000 base pairs and I know the location. So far I have extracted some reads manually and BLASTed them. For some reads I could see 2 plasmids, but other times I got conflicting results, and it is not feasible to do this manually for thousands of reads.
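For what it's worth, the per-read check could be scripted instead of done by hand. Below is a minimal Python sketch, not a tested pipeline. It assumes the alignments are in PAF (minimap2's default output format) with the plasmid as the query and the long reads as the targets, as proposed in the question. The function names and the `min_fraction` cutoff are hypothetical, chosen only for illustration. Overlapping hits on the same read are not merged here, so they would be counted twice.

```python
from collections import defaultdict


def parse_paf_line(line):
    """Parse the nine leading mandatory PAF columns that this sketch needs."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0], "qlen": int(f[1]), "qstart": int(f[2]), "qend": int(f[3]),
        "strand": f[4],
        "tname": f[5], "tlen": int(f[6]), "tstart": int(f[7]), "tend": int(f[8]),
    }


def copies_per_read(paf_lines, min_fraction=0.1):
    """Per read, list the plasmid hits and sum the plasmid fraction they cover.

    Assumption: query = plasmid, target = read. min_fraction is an arbitrary
    illustrative cutoff for ignoring very short hits.
    """
    hits = defaultdict(list)
    for line in paf_lines:
        if not line.strip():
            continue
        rec = parse_paf_line(line)
        # Fraction of the plasmid (query) covered by this single alignment.
        frac = (rec["qend"] - rec["qstart"]) / rec["qlen"]
        if frac >= min_fraction:
            hits[rec["tname"]].append(
                (rec["tstart"], rec["tend"], rec["strand"], round(frac, 2))
            )

    summary = {}
    for read, read_hits in hits.items():
        read_hits.sort()  # order hits by their start position on the read
        summary[read] = {
            "hits": read_hits,
            "copy_estimate": round(sum(h[3] for h in read_hits), 2),
        }
    return summary


if __name__ == "__main__":
    # Made-up example records: read1 carries 1.5 copies, read2 carries 1 copy.
    demo = [
        "plasmid\t7000\t0\t7000\t+\tread1\t20000\t1000\t8000\t7000\t7000\t60",
        "plasmid\t7000\t0\t3500\t+\tread1\t20000\t8000\t11500\t3500\t3500\t60",
        "plasmid\t7000\t0\t7000\t-\tread2\t15000\t2000\t9000\t7000\t7000\t60",
    ]
    for read, info in copies_per_read(demo).items():
        print(read, info["copy_estimate"], info["hits"])
```

A read with a copy estimate near 2, or near 1.5, would then be a candidate for "two copies" or "one and a half copies", and the hit coordinates show where on the read each copy sits. The aligner has to be allowed to report more than one hit of the plasmid per read for this to work, so check its settings for secondary or multiple alignments.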