Estimaing insert size from a pair-end library using bbmap
0
2
Entering edit mode
4.2 years ago

Hi biostars.

I'm trying to estimate the insert size of a pair-end illumina library. I have limited information about library preparation and sequencing protocol (2x125). To estimate the insert size I used mapped the reads to a genome using bbmap. The reads were processed using trimmomatic (to remove adaptors and only keep reads with size = 125).

java -jar trimmomatic-0.39.jar PE -threads 10 -phred33 R1_001.fastq.gz R2_001.fastq.gz R1.adaptor.cut.fastq.gz R1.unpaired.fastq.gzip R2.adaptor.cut.fastq.gz R2.unpaired.fastq.gzip NexteraPE-PE.fa:2:30:10 MINLEN:125 

bbmap.sh ref=cro_genome.fasta in1=R1.pair.fastq.gz in2=R2.pair.fastq.gz -ihist=ihist.txt reads=5000000

From the result file:

Mean 427.278

Median 295

Mode 234

STDev 536.271

The mean value seems to correspond to expected values. However, I find the std deviation value a little bit too high. It this the typical standard deviation in a pair-end library?

RNA-Seq bbmap • 2.4k views
ADD COMMENT
1
Entering edit mode

I assume you left out the reference option (ref=) in your bbmap.sh command? I suggest that you do the insert size estimation using original data. Use one of the three methods @Brian mentions here.

ADD REPLY
0
Entering edit mode

Yes I forgot to paste the ref= option. Why use original data instead of data without adaptors? (just for reference, only a small amount of reads pairs are discarded after trimmomatic filters by size.

I seems however that the pairlen=2000 option is important. Here is the reestimation:

Mean 392.938

Median295

Mode 232

STDev 317.279

ADD REPLY
1
Entering edit mode

Is the reference genome a high-quality, chromosome-level genome? Or is it a draft genome, with hundreds or thousands of contigs? Is the reference genome from the same species as the reads, or a close, but different species? What are the mapping stats?

ADD REPLY
0
Entering edit mode

Its a draft genome still, with 2090 contigs, from the same species. BBMAP is able to map around 95% of the reads. I have obtained the same values with illumina

ADD REPLY
0
Entering edit mode

Using the merge option gives a good estimate of insert size (in case of long reads) so having the adapters allows identification of fragments which may have small inserts. pairlen= is the max distance a read pair would allowed to be mapped apart to be considered properly paired. See if tightening that number further improves the stats.

ADD REPLY
0
Entering edit mode

okay I tested two settups:

Original untrimmed data with merge:

bbmerge.sh in1=R1.untrimmed.fastq.gz in2=R2.untrimmed.fastq.gz reads=2m ihist=ihist.txt

Pairs: 2000000

Joined: 674961 33.748%

Ambiguous: 1323904 66.195%

No Solution: 1135 0.057%

Too Short: 0 0.000%

Avg Insert: 170.3

Standard Deviation: 51.3

Mode: 232

Despite the good metrics only 33% of the reads overlapped. I tried to decrease the pairlen option to 400 with the bbmap above:

Mean 298.182

Median 277

Mode 229

STDev 118.369

PercentOfPairs 98.642

ADD REPLY
1
Entering edit mode

I would not worry about the STDev since you don't have a good reference with a small number of chromosomes. Your insert sizes otherwise should be good with the mapping protocol you did above. I suggest that you go with that value i.e. mean of around 400 bp.

ADD REPLY
0
Entering edit mode

Hi again. It seems I forgot one important detail in the question. The data comes from RNA-seq data. I believe since there could be introns, the min length should be decreased. Or maybe should I used an entirely different tool. Sorry about that it has been a long day

ADD REPLY
0
Entering edit mode

Min length for what? You are doing fine. No need to be sorry.

ADD REPLY
0
Entering edit mode

I meant the pairlen= option.

I've checked this link here http://seqanswers.com/forums/showpost.php?p=158099&postcount=2 and I ran the the bbmap command as follows:

bbmap.sh ref=cro_genome.fasta in1=R1.pair.fastq.gz in2=R2.pair.fastq.gz -ihist=ihist.txt reads=1M maxindel=200000 pairlen=500

I'm not sure if running with both those options makes sense in this case.

I also aligned the same sample to my genome using hisat2 and used samtools stats:

hisat2 -q --dta -p -x cro_index -1 R1_cut_paired.fastq.gz -2 R2_cut_paired.fastq.gz -S idio_1.sam
samtools view -Su idio_1.sam | samtools sort -o idio_1_sorted.bam
samtools stats

In both cases, the estimated average insert was 315 and the std was 145. I've decided to accept these values for the following analysis for now

ADD REPLY

Login before adding your answer.

Traffic: 1938 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6