Very large inferred maximum insert size!
10.2 years ago
Nikleotide ▴ 130

I'm trying to align some MiSeq reads (paired end) with bwa aln.

The dataset is relatively small, but when I convert the SAI files to SAM during alignment, I get stuck at the bwa_sai2sam_pe_core step. What strikes me most is the very large inferred maximum insert size: for an amplicon of 250 bp, I sometimes get inferred insert sizes of over 40,000!

This is an example of one run:

Sai to SAM

[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (298, 311, 13759)
[infer_isize] low and high boundaries: 250 and 40681 for estimating avg and std
[infer_isize] inferred external isize from 21882 pairs: 4924.136 +/- 6392.622
[infer_isize] skewness: 0.661; kurtosis: -1.563; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 45262 (6.31 sigma)
[bwa_sai2sam_pe_core] time elapses: 1.43 sec
[bwa_sai2sam_pe_core] changing coordinates of 0 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...

Any thoughts will be appreciated.

sai2sam miseq bwa
ADD COMMENT
3

First, I'd review the alignments in a genome browser. I have seen this kind of thing happen when there is a mis-pairing in the FASTQ files, resulting in lots of chimeras and large insert sizes. Checking that read1 and read2 contain the same number of reads is a quick first-pass check.
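A quick way to do that check from the shell (file names below are placeholders for your own inputs):

```shell
# A FASTQ record is exactly 4 lines, so reads = lines / 4.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Example (substitute your own file names):
# count_reads sample_R1.fastq
# count_reads sample_R2.fastq
```

If the two counts differ, the mate files are out of sync and bwa sampe will pair unrelated reads, which produces exactly this kind of insert-size estimate.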

ADD REPLY
0

I think there must be something wrong with the read files (not having the same number of reads, or even different read lengths). These reads are the output of a trimming pipeline. I suspect this because the error only happens with some of my samples, not all of them!

ADD REPLY
2

Differing read lengths are not a problem, but reads that are dropped by the trimming software because they are too short will lead to what I describe above. Sounds like you'll need to be a bit more clever about your read trimming.
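One way to avoid this is to trim both mates together, so that a pair is dropped as a unit whenever either mate fails the filter and the output files stay in sync. For example, with cutadapt in paired mode (the file names and cutoffs here are only illustrative):

```shell
# -q trims low-quality bases, -m drops reads shorter than 30 bp;
# with paired output (-o/-p), a pair is discarded as a unit,
# so R1 and R2 keep the same number of reads.
cutadapt -q 20 -m 30 \
    -o trimmed_R1.fastq -p trimmed_R2.fastq \
    input_R1.fastq input_R2.fastq
```

Trimmomatic's PE mode achieves the same thing by writing surviving pairs and orphaned singletons to separate output files.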

ADD REPLY
1

What's the reference to which you are mapping reads? Is it some well studied genome or your own assembly?

ADD REPLY
0

It's the reference human genome assembly we use at our center. I am certain it's not about the reference genome.

ADD REPLY
1
10.2 years ago
Rad ▴ 810

I had that problem once. Check your command, or share it here. Are you using both the R1 and R2 reads? It is most likely a mis-pairing issue; check your command/code for bugs.

ADD COMMENT
0

Most probably it is a problem with an unequal number of reads in R1 and R2, since I don't see this problem with some other samples' paired-end reads. There must be something wrong with the quality trimming or the original raw reads. I will check this out and post an update here.

ADD REPLY
1

I would do a sanity check with FastQC or something similar.
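For example (file names are placeholders for your own inputs):

```shell
# Generates an HTML QC report per file; compare the "Total Sequences"
# figure between the two mates to confirm they match.
fastqc sample_R1.fastq sample_R2.fastq
```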

ADD REPLY
0

Thanks Rad. After checking different inputs, I realized that the problem is definitely an uneven number of reads in Read1 and Read2 after quality filtering. Do you have any suggestions on how to remove the unpaired reads from the Read1 and Read2 FASTQ files?
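Dedicated tools exist for exactly this (BBMap's repair.sh, for instance), and those are the robust choice. As a rough sketch of the idea, the two files can be re-synced on read ID with standard Unix tools; the `sync_pairs` function and file names below are hypothetical, and note that the output is sorted by ID, so the original read order is not preserved:

```shell
# Keep only reads whose IDs occur in both mate files. The ID is the first
# whitespace-separated token of the header, with any trailing /1 or /2 removed.
sync_pairs() {
    local r1=$1 r2=$2
    for f in "$r1" "$r2"; do
        # Flatten each 4-line record to one tab-separated line, keyed by ID.
        awk 'NR%4==1 { id=$1; sub(/\/[12]$/, "", id); printf "%s", id }
             { printf "\t%s", $0 }
             NR%4==0 { print "" }' "$f" | sort -k1,1 > "$f.tab"
    done
    cut -f1 "$r1.tab" > ids1.txt
    cut -f1 "$r2.tab" > ids2.txt
    comm -12 ids1.txt ids2.txt > common.ids   # IDs present in both files
    for f in "$r1" "$r2"; do
        # Keep common records, then unflatten back to 4-line FASTQ.
        join -t "$(printf '\t')" common.ids "$f.tab" \
            | cut -f2- | tr '\t' '\n' > "${f%.fastq}_synced.fastq"
    done
    rm -f "$r1.tab" "$r2.tab" ids1.txt ids2.txt common.ids
}

# Usage: sync_pairs sample_R1.fastq sample_R2.fastq
# writes sample_R1_synced.fastq and sample_R2_synced.fastq
```

Since both files are sorted by the same key and filtered against the same ID list, the surviving records come out in the same order in both outputs, which is what bwa sampe needs.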

ADD REPLY
1

It is a good idea to check the FASTQ files before trimming. My guess is that the trimming step is removing reads. If that is the case, you may have to adjust your trimming strategy a bit. If the FASTQ files do not agree before any processing, then I think I would spend some time trying to figure out why since this generally shouldn't happen from a sequencer.

ADD REPLY
1

Check this; it may be what you're looking for.

ADD REPLY
