I'm trying to align some MiSeq reads (paired end) with bwa aln.
The size of the data is relatively small but when I get to the point that I want to convert the SAI to SAM files in the process of alignment, I get stuck at bwa_sai2sam_pe_core
part. But what strikes the most is the very large inferred maximum insert size! For example for an amplicon of 250 b, sometimes I get inferred insert sizes of over 40,000!
This is an example of one run:
Sai to SAM
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (298, 311, 13759)
[infer_isize] low and high boundaries: 250 and 40681 for estimating avg and std
[infer_isize] inferred external isize from 21882 pairs: 4924.136 +/- 6392.622
[infer_isize] skewness: 0.661; kurtosis: -1.563; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 45262 (6.31 sigma)
[bwa_sai2sam_pe_core] time elapses: 1.43 sec
[bwa_sai2sam_pe_core] changing coordinates of 0 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
Any thoughts will be appreciated.
First, I'd review in a browser. I have seen this kind of thing happen if there is a mis-pairing in the FASTQ files, resulting in lots of chimeras and large insert sizes. Checking the number of reads in read1 and read2 is a quick first pass look.
I think there must be something with the read files (not having the same number of reads or even not same length). These reads are the output of a trimming pipeline. The reason I think this way is this kind of error only happens with some of my samples not all!
Differing read lengths are not a problem, but reads that are dropped by the trimming software because they are too short will lead to what I describe above. Sounds like you'll need to be a bit more clever about your read trimming.
What's the reference to which you are mapping reads? Is it some well studied genome or your own assembly?
It's the reference human genome assembly we use at our center. I am certain it's not about the reference genome.