Hi,
I want to apply a negative binomial distribution (NBD) in my ribo-seq data simulation so that the simulated data mimic real data.
I am doing this because, for another part of my work, I want to compare against the analysis and results of real human ribo-seq data.
I have:
- a set of human RefSeq transcripts (e.g. the NM_ accessions) as the source for the simulation
- a read length distribution of 26-32 bp (derived from real ribo-seq data)
Real ribo-seq data has the characteristic that the footprint coverage of a transcript differs between the sub-codon positions and reflects the correct open reading frame (e.g. http://lapti.ucc.ie/bicoding/Known_frameshift/NM_001172437.png).
I thought the distribution would mainly need to reflect this.
But I am confused about where to start, e.g. how to map the distribution model onto my case. I would be grateful for any hints or advice on this, thanks.
The NBD is usually applied to the total read count found at a gene. You can define that gene however you like, but we shouldn't be talking about codons, read lengths, or even the number of transcripts. I think you're confusing several issues. Which of these numbers did you mean to simulate?
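To make the gene-level usage concrete, here is a minimal sketch of drawing one negative binomial read count per gene with NumPy. It assumes the common mean/dispersion parametrization (variance = mu + dispersion * mu^2); the function name, the mean counts, and the dispersion value are made up for illustration and are not estimated from any real data.

```python
import numpy as np

def simulate_gene_counts(mean_counts, dispersion, rng):
    """Draw one negative binomial read count per gene.

    Assumed parametrization: Var = mu + dispersion * mu**2.
    NumPy's negative_binomial(n, p) has mean n * (1 - p) / p,
    so n = 1 / dispersion and p = n / (n + mu) give mean mu.
    """
    mu = np.asarray(mean_counts, dtype=float)
    n = 1.0 / dispersion
    p = n / (n + mu)
    return rng.negative_binomial(n, p)

rng = np.random.default_rng(0)
# Hypothetical expected read counts for three genes.
mean_counts = np.array([50.0, 200.0, 1000.0])
counts = simulate_gene_counts(mean_counts, dispersion=0.2, rng=rng)
print(counts)
```

In practice the means and the dispersion would be estimated from the real ribo-seq counts rather than chosen by hand.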
@karl Hi, thanks for your reply. Maybe I didn't explain my problem clearly, sorry about that.
The read length and the number of transcripts are secondary; there is no need to apply the NBD to them.
I think the codons, or rather "the number of reads that fall in each reading frame", is the question I am thinking about.
If the reads are sampled randomly, then after aligning them to the reference the read footprint could look like this:
https://www.dropbox.com/s/aia5tc5hzxbm21v/NM_01825.png?dl=0 (from the SAM file, count the number of alignments at each position; the 3 colors mark the different reading frames: +1, +2, +3)
Ideally, however, the reads should not fall randomly across the whole transcript. They should have high counts at some positions and low or zero counts at others, e.g.
http://lapti.ucc.ie/bicoding/Known_frameshift/NM_001172437.png
http://lapti.ucc.ie/bicoding/AT_AS/NM_000883.png
(From the SAM file, count the number of alignments at each position; the 3 reading frames are shown in 3 separate plots.)
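One possible way to get that kind of uneven, frame-dependent profile is sketched below. This is only a rough illustration under assumptions of my own: each position's expected count is the transcript total split according to hypothetical per-frame weights, and the count at each position is then drawn from a negative binomial with a large dispersion so that some positions spike and others drop to zero. The function name, the weights (0.7, 0.2, 0.1), and the dispersion are invented for the example, not taken from real data.

```python
import numpy as np

def simulate_position_counts(cds_length, total_reads, frame_weights, dispersion, rng):
    """Draw a per-position read count along a coding sequence.

    Position i belongs to frame i % 3. The expected count at each
    position is total_reads shared out in proportion to the weight
    of its frame; the actual count is negative binomial with
    Var = mu + dispersion * mu**2 (assumed parametrization).
    """
    frames = np.arange(cds_length) % 3
    weights = np.asarray(frame_weights, dtype=float)[frames]
    mu = total_reads * weights / weights.sum()
    n = 1.0 / dispersion
    p = n / (n + mu)
    return rng.negative_binomial(n, p)

rng = np.random.default_rng(1)
# Hypothetical values: 900 nt CDS, about 9000 reads, frame 0 dominant.
counts = simulate_position_counts(900, 9000, (0.7, 0.2, 0.1), 2.0, rng)
per_frame = [int(counts[f::3].sum()) for f in range(3)]
print(per_frame)
```

The realized total only matches total_reads on average. If an exact total is needed, the drawn counts could be used as weights for a multinomial draw instead.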