Question

Properly displaying read length distribution of different datasets together

7.8 years ago

mz1101 • 0

Hi,

What would be a proper way to plot read length distributions from different datasets together (e.g. raw reads vs. reads after quality filtering, where one dataset is twice as large as the other)?

Plotting the density distributions works, but it doesn't show the low counts (lengths > 40,000) very well:

[Figure: density plot of read lengths for both datasets]

But I'd prefer a plot where you can eyeball the number of reads. I'd also like the finest possible binning, with one bin per read length. I tried to get that by specifying an absurdly large number of bins, e.g. breaks=100000 in hist. But if I use the same breaks value for both datasets, the result doesn't look correct:

[Figure: histograms of both datasets with the same breaks value]

The "before" dataset has 300k reads and the "after" dataset has 192k reads. The frequencies per bin seem to be totally off. Why do the bins look so different? I've read that a single number passed to breaks= is only a suggestion, and that hist computes the actual breakpoints from the data.
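A small sketch of why this might happen, using made-up lengths rather than the real reads: with a single number for breaks, hist() picks "pretty" breakpoints from the range of each dataset, so two datasets with different length ranges can end up with different bin widths even though the breaks value is identical. The vectors before and after below are hypothetical.

```r
# Hypothetical read lengths (NOT the real data): "after" covers a narrower range
before <- seq(1, 100000, by = 25)
after  <- before[before >= 1000 & before <= 20000]

# Same breaks value for both; it is only a suggested number of bins
h_before <- hist(before, breaks = 100000, plot = FALSE)
h_after  <- hist(after,  breaks = 100000, plot = FALSE)

# Bin widths that hist() actually used for each dataset
width_before <- diff(h_before$breaks)[1]
width_after  <- diff(h_after$breaks)[1]
print(c(before = width_before, after = width_after))
```

If the widths differ, the per-bin counts of the two histograms are not comparable, which would be consistent with the frequencies looking "off".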

Decreasing the breaks value for the "after" dataset makes it look better, but at this point I have no idea what is going on:

[Figure: histograms with a smaller breaks value for the "after" dataset]

I want the binning to be as fine as possible because, by zooming into different regions, I can show that certain spurious reads are gone, e.g. after quality filtering:

[Figure: zoomed-in region of the histogram, before vs. after filtering]

So I guess my question is: how do I choose an appropriate bin size to plot the frequency of every single read length, or is there a better way than using histograms?
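One binning-free possibility, sketched here on a made-up vector lens (hypothetical data), would be to count each distinct length with table() and draw the counts as vertical lines. Since read lengths are integers, this gives exactly one bar per length without any breaks argument.

```r
# Hypothetical read lengths, for illustration only
lens <- c(150, 150, 151, 300, 300, 300, 42000)

# Exact number of reads for every observed length; no binning involved
counts <- table(lens)

# Vertical-line plot: x = read length, y = number of reads
plot(as.integer(names(counts)), as.vector(counts), type = "h",
     xlab = "Read length", ylab = "Number of reads")
```

Zooming into a region could then be done with xlim, e.g. xlim = c(100, 400) in the plot() call.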

Thanks!

R sequence • 2.3k views

In your graphs, which axis represents the read length and which the number of reads? Why should the size of the dataset matter if you are plotting length distributions?

7.8 years ago by GenoMax • 151k

x-axis: read length
y-axis: freq/count or density

I thought the size would matter when you specify the number of bins. In this case, where I want one bin per length, shouldn't the larger dataset have more bins (because it contains lengths that have been filtered out of the other dataset)? If I am plotting the distributions with bins, I suppose the bins should be the same size for the comparison to be valid, no?
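To make the "same bins" idea concrete, here is a minimal sketch that passes one explicit vector of breakpoints to hist() for both datasets, with one unit-wide bin centred on every integer length. The vectors before and after are made up, and shared_breaks is just an illustrative name.

```r
# Hypothetical read lengths, for illustration only
before <- c(150, 150, 151, 300, 300, 300, 42000)
after  <- c(150, 150, 151, 300, 300)

# One shared set of breakpoints covering both datasets,
# one bin of width 1 centred on each integer length
rng <- range(c(before, after))
shared_breaks <- seq(rng[1] - 0.5, rng[2] + 0.5, by = 1)

h_before <- hist(before, breaks = shared_breaks, plot = FALSE)
h_after  <- hist(after,  breaks = shared_breaks, plot = FALSE)

# Identical bins, so the raw counts can be overlaid directly
plot(h_before$mids, h_before$counts, type = "h", col = "grey60",
     xlab = "Read length", ylab = "Number of reads")
lines(h_after$mids, h_after$counts, type = "h", col = "red")
```

When breaks is given as a vector, hist() uses those breakpoints as they are, so both histograms share the same bins regardless of how many reads each dataset has.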

7.8 years ago by mz1101 • 0