Question

Different Insert Size Value For Rna-Seq Data Between Illumina Protocol And Picard

1

Entering edit mode

13.9 years ago

Junfeng ▴ 330

Hi, we have RNA-Seq data from Illumina HiSeq 2000. Based on Illumina RNA-seq library protocol that we follow, the range of insert length is 120-210 bp, median insert length: 155 bp.

After doing an alignment with tophat, I got the stats on the insert based on the resulting BAM file.Based on the result, it seems that the median insert size is larger than 155bp. For tumor, the median insert size is 188bp, and for control sample, it is 280bp. The detailed results are as follows,

For tumor sample,

MEDIAN_INSERT_SIZE    MIN_INSERT_SIZE    MAX_INSERT_SIZE    MEAN_INSERT_SIZE    STANDARD_DEVIATION    READ_PAIRS    PAIR_ORIENTATION    WIDTH_OF_10_PERCENT    WIDTH_OF_20_PERCENT    WIDTH_OF_30_PERCENT    WIDTH_OF_40_PERCENT    WIDTH_OF_50_PERCENT    WIDTH_OF_60_PERCENT    WIDTH_OF_70_PERCENT    WIDTH_OF_80_PERCENT    WIDTH_OF_90_PERCENT    WIDTH_OF_99_PERCENT
188    75    227724412    721.727402    1451.554057    21582855    FR    25    49    71    91    113    143    253    1269    4501    41471

For control sample,

MEDIAN_INSERT_SIZE    MIN_INSERT_SIZE    MAX_INSERT_SIZE    MEAN_INSERT_SIZE    STANDARD_DEVIATION    READ_PAIRS    PAIR_ORIENTATION    WIDTH_OF_10_PERCENT    WIDTH_OF_20_PERCENT    WIDTH_OF_30_PERCENT    WIDTH_OF_40_PERCENT    WIDTH_OF_50_PERCENT    WIDTH_OF_60_PERCENT    WIDTH_OF_70_PERCENT    WIDTH_OF_80_PERCENT    WIDTH_OF_90_PERCENT    WIDTH_OF_99_PERCENT
280    74    242555584    394.905075    468.383792    50410660    FR    21    45    79    131    193    245    287    339    1491    14367
3762    74    242542706    3676.159079    492.235063    1560925    RF    11    19    27    41    65    79    101    6921    7299    20051069

So I wondered which insert size value I should use? And do I need to run TopHat again based on the insert size value from Picard? Thank you in advance.

rna paired • 12k views

ADD COMMENT • link updated 13.9 years ago by Arun 2.4k • written 13.9 years ago by Junfeng ▴ 330

score 6 · Answer 1 · 2011-10-15

In order to use tophat, you'll have to have used the "-r" parameter. I guess you used -r value as 155 as the mean inner distance. However, its equally important to also provide the "--mate-std-dev".

The way I go about figuring these 2 parameters is this:
I use BWA and align the paired end reads and obtain PE.sam file. From that, I use picard tools to obtain only those reads that are aligned to the reference genome. From this sam file, I calculate the inner distance of all pairs. From the vector of all such inner distances, I compute the 1st and 3rd quantiles, Q1 and Q3 and then calculate the inter quartile range IQ = Q3-Q1. From here, I take all the values that fall between Q1 - 2 * IQ to Q3 + 2 * IQ. I compute the mean and standard deviation of all these values (this is similar to what BWA does) and provide it to tophat. It seems to work great. Most of my reads are mapped under the given inner distance and standard deviation and the other (few) reads with larger inner distance. There'll always be a few reads that'll have larger inner distance and I consider them outliers and discard in the computation of mean and SD.

Hope this helps.

score 1 · Answer 2 · 2011-10-14

1

Entering edit mode

13.9 years ago

learnerforever ▴ 520

The first thing to do might be to work with the wet-lab and understand the library size, adapter lengths, etc. and get an idea of the estimated insert size by doing math, such as fragment_lenght - primer lengths - adapter lengths etc.

In my experience, tophat is sensitive to the "inner mate distance", so specifying appropriate value for each T and N sample would be better.

Not a comprehensive answer, but just my 2c.

ADD COMMENT • link 13.9 years ago by learnerforever ▴ 520

0

Entering edit mode

Thanks. The wet-lab told us that they followed the standard fragmentation procedure and median insert length should be 155bp according to this protocol. So here I am not sure which insert size value we should use.

ADD REPLY • link 13.9 years ago by Junfeng ▴ 330

0

Entering edit mode

Keep in mind that "insert size" can also mean different things to different programs. It sometimes includes the entire length of both reads and sometimes one of the reads. The TopHat "mate inner distance" does not include any of the reads but only includes the distance between them.

ADD REPLY • link 13.9 years ago by Mikael Huss 4.8k

0

Entering edit mode

Mikael, I have use -r value as 1 (155bp-77bp*2) since the read length is 77bp. Thanks for your comment.

ADD REPLY • link 13.9 years ago by Junfeng ▴ 330

score 1 · Answer 3 · 2011-10-15

1

Entering edit mode

13.9 years ago

brentp 24k

First, I'm not sure it makes sense to do this for RNA-Seq. Your insert sizes are going to be skewed by reads that span an intron--though I guess the median should be OK.

Second, your max insert size is 215 or 245 million bases and your standard deviations are pretty high. What percentage of reads were mapped?

I'd suggest you try with another RNA-Seq tool such as GSNAP.

Either way, I think if you map the reads separately, then use the reported median and it should work fine. If you're mapping them together, then use larger value or something in between.

ADD COMMENT • link 13.9 years ago by brentp 24k

0

Entering edit mode

brentp, tophat specifically asks for average inner distance between the two paired end reads from the fragment. Assuming your fragments are of size 400bp and reads 50bp, your inner distance is 300bp. Since this is not possible, we calculate the average and estimate the dispersion (as SD) and provide both parameters.

ADD REPLY • link 13.9 years ago by Arun 2.4k

0

Entering edit mode

[junfenx@bsbl samtools-0.1.17]$ ./samtools flagstat /scratch/junfenx/RNA_Seq/tophat_1/accepted_hits.bam 52068218 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 52068218 + 0 mapped (100.00%:nan%) 52068218 + 0 paired in sequencing 26140275 + 0 read1 25927943 + 0 read2 31988230 + 0 properly paired (61.44%:nan%) 44955682 + 0 with itself and mate mapped 7112536 + 0 singletons (13.66%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 wi

ADD REPLY • link 13.9 years ago by Junfeng ▴ 330

0

Entering edit mode

brentp, what you mean is that I can use the insert size value from Picard and run TopHat again, is it correct?

ADD REPLY • link 13.9 years ago by Junfeng ▴ 330

0

Entering edit mode

Here is the tumor results of samtools flastat, samtools flagstat /tumor/accepted_hits.bam 52068218 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 52068218 + 0 mapped (100.00%:nan%) 52068218 + 0 paired in sequencing 26140275 + 0 read1 25927943 + 0 read2 31988230 + 0 properly paired (61.44%:nan%) 44955682 + 0 with itself and mate mapped 7112536 + 0 singletons (13.66%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 13.9 years ago by Junfeng ▴ 330

0

Entering edit mode

Here is the tumor results of samtools flastat, samtools flagstat /tumor/accepted_hits.bam 52068218 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 52068218 + 0 mapped (100.00%:nan%) 52068218 + 0 paired in sequencing 26140275 + 0 read1 25927943 + 0 read2 31988230 + 0 properly paired (61.44%:nan%) 44955682 + 0 with itself and mate mapped 7112536 + 0 singletons (13.66%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5

ADD REPLY • link 13.9 years ago by Junfeng ▴ 330

0

Entering edit mode

@Junfeng, that's what I was suggesting (re running tophat with parameters) though I'd guess it will not make too much difference.