Question

Applying the negative binomial distribution (NBD) to ribosome profiling data


9.7 years ago

xiangwulu ▴ 120

Hi,

I want to apply the negative binomial distribution in my ribo-seq data simulation so that the simulated data mimic real data.

I am doing this because I want to compare the simulation with the analysis and results of real human ribo-seq data, for another part of my work.

I have:

a number of RefSeq human transcripts (e.g. the NM_ entries) as the source for the simulation
a read length distribution of 26-32 bp (derived from real ribo-seq data)

Real ribo-seq data have the characteristic that the footprint counts along a transcript differ between the sub-codon positions and reflect the correct open reading frame (e.g. http://lapti.ucc.ie/bicoding/Known_frameshift/NM_001172437.png).

I think the distribution should mainly capture this.

But I am very confused about where to start, e.g. how to map the distribution model onto my case. I would appreciate any hints or advice on this, thanks.

negative-binomial-distribution ribo-seq • 2.8k views

updated 2.5 years ago by Ram 45k • written 9.7 years ago by xiangwulu ▴ 120


The NBD is usually applied to the total read count summed over a gene. You can define that gene however you like, but we shouldn't be talking about codons, read lengths, or even the number of transcripts. I think you're conflating several issues. Which of these numbers did you mean to simulate?

updated 2.5 years ago by Ram 45k • written 9.7 years ago by karl.stamm 4.1k


@karl Hi, thanks for your reply. Maybe I didn't explain my problem clearly, sorry about that.

The read length and the number of transcripts are secondary; there is no need to apply the NBD to them.

I think the codons, or rather "the number of reads that fall in each reading frame", is the question I am thinking about.

If the reads are sampled randomly, then after aligning them to the reference the read footprint could look like this:

https://www.dropbox.com/s/aia5tc5hzxbm21v/NM_01825.png?dl=0 (from the SAM file, count the number of alignments at each position; the 3 colors indicate the different reading frames (+1, +2, +3))

But ideally the reads do not fall randomly across the whole transcript; they have high counts at some positions and low or zero counts at others, e.g.

http://lapti.ucc.ie/bicoding/Known_frameshift/NM_001172437.png

http://lapti.ucc.ie/bicoding/AT_AS/NM_000883.png

(from the SAM file, count the number of alignments at each position; the 3 reading frames are shown in 3 separate plots)

updated 2.5 years ago by Ram 45k • written 9.7 years ago by xiangwulu ▴ 120
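
For readers who want to reproduce the per-position, per-frame counting described in the comment above, here is a minimal sketch. It is not the poster's actual script. It assumes reads were aligned to transcript sequences, that the 5' end of each forward-strand alignment (the SAM POS column) is the position being counted, and that no P-site offset is applied. The function name, the `cds_start`/`cds_end` parameters, and the toy reads are all hypothetical. Frames are indexed 0, 1, 2 here, corresponding to the +1, +2, +3 labels used in the plots.

```python
from collections import Counter


def count_five_prime_ends(sam_lines, transcript, cds_start, cds_end):
    """Count read 5' ends per position and per reading frame for one transcript.

    sam_lines  : iterable of SAM text lines (header lines start with '@')
    transcript : reference name to keep (SAM RNAME column)
    cds_start  : 1-based position of the first CDS nucleotide (assumed known)
    cds_end    : 1-based position of the last CDS nucleotide (assumed known)
    """
    per_position = Counter()
    per_frame = [0, 0, 0]
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4 or flag & 16:
            continue  # skip unmapped reads and reverse-strand alignments
        if rname != transcript:
            continue
        per_position[pos] += 1
        if cds_start <= pos <= cds_end:
            per_frame[(pos - cds_start) % 3] += 1
    return per_position, per_frame


# Toy SAM records, made up for illustration only
toy_sam = [
    "@SQ\tSN:NM_TOY\tLN:300",
    "r1\t0\tNM_TOY\t10\t255\t28M\t*\t0\t0\t*\t*",
    "r2\t0\tNM_TOY\t10\t255\t30M\t*\t0\t0\t*\t*",
    "r3\t0\tNM_TOY\t14\t255\t29M\t*\t0\t0\t*\t*",
    "r4\t16\tNM_TOY\t13\t255\t28M\t*\t0\t0\t*\t*",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
positions, frames = count_five_prime_ends(toy_sam, "NM_TOY", cds_start=10, cds_end=200)
```

With the toy input, `positions` holds the count at each start position and `frames` holds one total per reading frame; plotting the three frames separately gives the kind of sub-codon profile shown in the linked images.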

Ram · Answer 1 · 2015-09-10

Sorry for the confusion in my question, I was confused for a while too. Now I have figured it out.

Look at the plot: https://www.dropbox.com/s/wxrua0k52nbycm3/NM_005321.footprint.tiff?dl=0

Comparison of profiles from real human ribo-seq data and NBD-sampled variates. Typically, the footprint of real ribo-seq data (top plot) has 0 counts at many positions, along with peaks and explicit (or implicit) triplet periodicity.

I want to run some tests with simulated ribo-seq data, and I want the profile of the simulated data to look like the real data (middle plot).

Not like this (data simulated with another RNA-seq simulator): https://www.dropbox.com/s/072ag1q9kwpcdqv/NM_005321.subcodon_simulated.tiff?dl=0

When I talked about ORFs and codons, I meant that the profiles of the 3 separate frames in an ORF differ depending on whether it is translated (top plot: red, green, blue). So in the simulation, the data should be simulated separately for each individual frame (bottom plot), to reflect the real data (ideally).
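
As a rough illustration of the per-frame idea above (not the exact procedure used for the plots), the sketch below draws a count for every CDS position from a negative binomial, using a different mean for each of the 3 frames. It uses only the Python standard library and samples the NBD as a gamma-Poisson mixture. The frame means, the `size` (dispersion) value, and all function names are assumptions chosen for illustration; in practice they would be estimated from real ribo-seq data.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (suited to small lam only)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mean, size, rng):
    """Negative binomial draw as a gamma-Poisson mixture.

    With mean mu and size r, the variance is mu + mu**2 / r,
    so a small r gives many zeros and occasional tall peaks.
    """
    if mean <= 0:
        return 0
    lam = rng.gammavariate(size, mean / size)  # gamma with shape r and mean mu
    return sample_poisson(lam, rng)


def simulate_frame_profile(cds_length, frame_means, size, seed=0):
    """Simulate per-position counts, drawing each reading frame with its own mean."""
    rng = random.Random(seed)
    return [sample_nb(frame_means[pos % 3], size, rng) for pos in range(cds_length)]


# Illustrative values only (not estimated from any data set)
profile = simulate_frame_profile(cds_length=600, frame_means=(8.0, 1.0, 0.5), size=0.5)
frame_totals = [sum(profile[f::3]) for f in range(3)]
```

Giving one frame a much larger mean produces the triplet periodicity, while the small `size` produces the zeros and peaks. The simple Poisson sampler here becomes inaccurate for very large rates, so a library sampler would be a better choice for deep coverage.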