Entering edit mode
8.7 years ago
Antonio Camargo
▴
160
I'm aligning some mouse RNA-Seq libraries against the genome using STAR. After that, I started to analyse the insert size with Picard. I noticed two types of histogram.
The one showed in Histogram A have a single peak, around 180 bp. The type represented in Histogram B, however, have multiple peaks.
As the aligment was done against the genome, I expected that all libraries would have displayed a pattern like the one in Histogram B, as the spliced inserts would produce "artificial" inserts sizes. So, why did I got these two kinds of histograms?
Thanks a lot!
Histogram A:
Histogram B:
Are all of the samples that produce histrogram like that in (A) also total RNAseq rather than mRNAseq? The GEO entry doesn't mention any sort of ribo-depletion.
Two libraries produce histograms like B, they were both produced by the same team. The remaining libraries (3) have histograms like the A.
I suppose that they're all poly-A positive, but this information is not explicit in all libraries.
Do you get the reason of Histogram B? How do you solve the problem? I have a similar problem...
Mapping against the transcriptome should give a better representation of the actual insert-size distribution. Also, if your reads are paired and mostly overlapping, you can get an alignment-free insert-size distribution with BBMerge:
Or, in fact, you can get that even if they are NOT mostly overlapping, using kmer frequencies to bridge the gap in the middle (this is slower and uses more memory, and requires sufficient depth):