I've been given a project to find a plasmid insertion in a small genome that has been sequenced with Illumina and has a reference sequence available. The plasmids used are large (10 kb).
I'm used to mapping with Novoalign3 and calling SNPs and INDELs with GATK, but I'm not sure such a large insertion will be detected this way. I'll attempt it, but I was wondering whether there is a more appropriate method. I was thinking possibly of de novo assembly followed by comparison with the reference using MUMmer. Any thoughts on the best method to detect plasmid insertions of 1-10 kb?
Thanks for the quick response, but I'm not sure I understand it. Do you mean map the reads to the plasmid sequence rather than the reference, then take the overhanging sequences on either side of the plasmid from the reads that mapped, join them, and search for the result in the reference?
I meant: use both sequences (plasmid + genome) as your BWA reference, and BWA will tell you where some reads overlap a junction.
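As a rough illustration of the suggestion above, here is a minimal Python sketch that builds one combined reference from the genome and plasmid FASTA files. The function name `combine_references` and the file names in the usage note are hypothetical. The duplicate-name check is an added precaution, not something the reply asks for. The combined file would then be indexed and used as the reference for mapping.

```python
def combine_references(fasta_paths, out_path):
    """Concatenate FASTA files into one reference file.

    Raises ValueError if two sequences share a name, since the mapper
    reports hits by sequence name. Returns the sorted sequence names.
    """
    seen = set()
    with open(out_path, "w") as out:
        for path in fasta_paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # The sequence name is the first word after '>'.
                        parts = line[1:].split()
                        name = parts[0] if parts else ""
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name}")
                        seen.add(name)
                    # Make sure every line ends with a newline so that
                    # files without a trailing newline do not run together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return sorted(seen)
```

For example, `combine_references(["genome.fa", "plasmid.fa"], "combined.fa")` (hypothetical file names) would write both sequences into `combined.fa`.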
Ah, Pierre, I replied assuming that the sequence of the plasmid was unknown. I'm sure Rob234 can clarify.
Thanks, it's still good to know that it could be done without the plasmid sequence. Yes, I know what the plasmid sequence should be, but it is inserted at a random position in the genome. I'm not sure what is meant by "overlap a junction". If I add the plasmid to the reference genome, it is treated as a separate contig, and doesn't the mapper avoid mapping a read across two contigs? Also, the plasmid isn't attached to either end of the genome; it is inserted within it. Most likely I don't understand how BWA-MEM works: does it use a special flag to report reads that can be split and mapped to both contigs (junctions)?
If you are using paired-end reads, then the mapper will tell you if one end maps to a chromosome and the other maps to the plasmid.
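To make the paired-end idea concrete, here is a minimal sketch that scans SAM text for pairs where one mate is on a genome contig and the other is on the plasmid. It assumes the plasmid contig is named `plasmid` in the combined reference. The function name `plasmid_genome_pairs` is made up for illustration. It relies only on the standard SAM columns (FLAG, RNAME, POS, RNEXT).

```python
def plasmid_genome_pairs(sam_lines, plasmid_name="plasmid"):
    """Yield (read_name, contig, pos) for genome-mapped reads whose mate
    maps to the plasmid. `plasmid_name` is an assumed contig name."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        rnext = fields[6]
        if not flag & 0x1:
            continue  # read is not part of a pair
        if flag & 0x4 or flag & 0x8:
            continue  # this read or its mate is unmapped
        if flag & 0x900:
            continue  # skip secondary and supplementary records
        # In SAM, RNEXT '=' means the mate is on the same contig as the read.
        mate_contig = rname if rnext == "=" else rnext
        if rname != plasmid_name and mate_contig == plasmid_name:
            yield qname, rname, pos
```

The positions of the reported genome-side mates should cluster around the insertion site, within roughly one fragment length on either side.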
BWA will tell you in the SAM output if ONE read maps to two regions: the best hit is in the regular record (say chr1:12345 cigar:50M50S) and the 2nd hit is in the metadata (plasmid:6789 cigar:50S50M).
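A minimal sketch of how such split reads could be pulled out of a SAM file. It assumes the second hit is stored in the `SA:Z:` optional tag, which is where BWA-MEM reports the other parts of a chimeric alignment, in the form `rname,pos,strand,CIGAR,mapQ,NM;`. The function name `split_read_junctions` and the contig name `plasmid` are assumptions for illustration.

```python
def split_read_junctions(sam_lines, plasmid_name="plasmid"):
    """Yield (read, contig, pos, cigar, sa_contig, sa_pos, sa_cigar) for
    primary records whose SA tag points across the genome/plasmid boundary.
    `plasmid_name` is an assumed contig name."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x900:
            continue  # unmapped, secondary, or supplementary record
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        for tag in fields[11:]:
            if not tag.startswith("SA:Z:"):
                continue
            # Entries are separated by ';' and the tag ends with a trailing ';'.
            for entry in tag[5:].split(";"):
                parts = entry.split(",")
                if len(parts) < 4:
                    continue  # empty or malformed entry
                sa_rname, sa_pos, sa_cigar = parts[0], int(parts[1]), parts[3]
                # Keep only reads with exactly one part on the plasmid.
                if (rname == plasmid_name) != (sa_rname == plasmid_name):
                    yield fields[0], rname, pos, cigar, sa_rname, sa_pos, sa_cigar
```

With the example above, a record at chr1:12345 with CIGAR 50M50S and `SA:Z:plasmid,6789,+,50S50M,60,0;` would be reported as one junction read.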