Entering edit mode
4.2 years ago
Rogerio Ribeiro
▴
110
Hi biostars.
I'm trying to estimate the insert size of a pair-end illumina library. I have limited information about library preparation and sequencing protocol (2x125). To estimate the insert size I used mapped the reads to a genome using bbmap. The reads were processed using trimmomatic (to remove adaptors and only keep reads with size = 125).
java -jar trimmomatic-0.39.jar PE -threads 10 -phred33 R1_001.fastq.gz R2_001.fastq.gz R1.adaptor.cut.fastq.gz R1.unpaired.fastq.gzip R2.adaptor.cut.fastq.gz R2.unpaired.fastq.gzip NexteraPE-PE.fa:2:30:10 MINLEN:125
bbmap.sh ref=cro_genome.fasta in1=R1.pair.fastq.gz in2=R2.pair.fastq.gz -ihist=ihist.txt reads=5000000
From the result file:
Mean 427.278
Median 295
Mode 234
STDev 536.271
The mean value seems to correspond to expected values. However, I find the std deviation value a little bit too high. It this the typical standard deviation in a pair-end library?
I assume you left out the reference option (
ref=
) in yourbbmap.sh
command? I suggest that you do the insert size estimation using original data. Use one of the three methods @Brian mentions here.Yes I forgot to paste the ref= option. Why use original data instead of data without adaptors? (just for reference, only a small amount of reads pairs are discarded after trimmomatic filters by size.
I seems however that the pairlen=2000 option is important. Here is the reestimation:
Mean 392.938
Median295
Mode 232
STDev 317.279
Is the reference genome a high-quality, chromosome-level genome? Or is it a draft genome, with hundreds or thousands of contigs? Is the reference genome from the same species as the reads, or a close, but different species? What are the mapping stats?
Its a draft genome still, with 2090 contigs, from the same species. BBMAP is able to map around 95% of the reads. I have obtained the same values with illumina
Using the merge option gives a good estimate of insert size (in case of long reads) so having the adapters allows identification of fragments which may have small inserts.
pairlen=
is the max distance a read pair would allowed to be mapped apart to be considered properly paired. See if tightening that number further improves the stats.okay I tested two settups:
Original untrimmed data with merge:
Pairs: 2000000
Joined: 674961 33.748%
Ambiguous: 1323904 66.195%
No Solution: 1135 0.057%
Too Short: 0 0.000%
Avg Insert: 170.3
Standard Deviation: 51.3
Mode: 232
Despite the good metrics only 33% of the reads overlapped. I tried to decrease the
pairlen
option to 400 with the bbmap above:Mean 298.182
Median 277
Mode 229
STDev 118.369
PercentOfPairs 98.642
I would not worry about the STDev since you don't have a good reference with a small number of chromosomes. Your insert sizes otherwise should be good with the mapping protocol you did above. I suggest that you go with that value i.e. mean of around 400 bp.
Hi again. It seems I forgot one important detail in the question. The data comes from RNA-seq data. I believe since there could be introns, the min length should be decreased. Or maybe should I used an entirely different tool. Sorry about that it has been a long day
Min length for what? You are doing fine. No need to be sorry.
I meant the
pairlen=
option.I've checked this link here http://seqanswers.com/forums/showpost.php?p=158099&postcount=2 and I ran the the bbmap command as follows:
I'm not sure if running with both those options makes sense in this case.
I also aligned the same sample to my genome using hisat2 and used samtools stats:
In both cases, the estimated average insert was 315 and the std was 145. I've decided to accept these values for the following analysis for now