Hello,
I want to store detect my variants in a VCF format. For that I need to map my PE reads (100x2) to reference. What I have discovered however is that for a lot of my PE reads, the fragments size is smaller than 200bp, which means some PE reads overlap, increasing the coverage of a region artificially.
I thought I would merge the PE reads with SeqPrep to place all PE reads that overlap in a merged.fastq file.
My question is, how do I then map both unmerged PE reads and merged SE reads to the genome with bwa? I only know how to do it if it's only SE or only PE.
Also, does this approach sound reasonable? Adrian
At least GATK is supposed to handle this properly. In fact, I'd have serious misgivings about a variant caller that handled this situation incorrectly.
@dpryan79 : very useful, thanks.
" increasing the coverage of a region artificially. " why ? here, a fragment of the genome is sequenced twice -> it increases your confidence of the calling.
Yes, the caller does not know that, it thinks 2 separate reads are calling one variant. What if there is more than 1 variant at that position? What if it's a bi-allelic SNP, and one alleles happens to get more calls because of fragments being sequenced twice.