I am currently investigating the different spliceforms in an experimental sample. To get a better understanding of how the various spliceform-finding programs work, I wrote a program that generates fake FASTQ data.
Here is how it works:
- Read in GTF file
- Select transcript of interest from the GTF file.
- Generate random numbers for the start position of the read. So if my random number is 54, my read will start at position 54.
Step 3 is where I get into trouble. I'm not sure how to handle the end of the transcript. For example, say that I want 100-base reads in my FASTQ file and the transcript of interest is 2000 bases long. If I draw a random number between 1 and 1900, I am fine. However, if I draw a number between 1901 and 2000, say 1950, I get into trouble because I don't know what to use for the remaining 50 bases of the read.
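To make the arithmetic above concrete, here is a minimal sketch that reports how many bases of a read would fall past the end of the transcript. It assumes 0-based start offsets (Python-style slicing), which is a choice made for this illustration; the function name `bases_past_end` is hypothetical and not part of the original program.

```python
def bases_past_end(transcript_len, read_len, start):
    """Return how many read bases fall beyond the transcript end.

    Assumption: `start` is a 0-based offset into the transcript, so a read
    covers offsets start .. start + read_len - 1.
    """
    overhang = start + read_len - transcript_len
    return max(0, overhang)


# Example from the post: 100-base reads on a 2000-base transcript.
print(bases_past_end(2000, 100, 1900))  # read fits entirely
print(bases_past_end(2000, 100, 1950))  # read runs off the 3' end
```

Under this convention the last start offset that still yields a full-length read is `transcript_len - read_len`. With 1-based coordinates the boundary shifts by one, so the exact cutoff depends on which convention the generator uses.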
A couple of potential solutions I thought of:
- Randomly add sequences to the 3' end
- Pretend that I read into the Illumina (or similar) adapter.
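The second option could be sketched roughly as follows: take whatever transcript sequence is left, then continue into an adapter, and pad with random bases if the adapter itself runs out. Everything here is an assumption for illustration: the function name `simulate_read` is hypothetical, the adapter string is the commonly cited start of the Illumina TruSeq adapter, starts are 0-based offsets, and random padding after the adapter is just one way to fill the rest of the read.

```python
import random

# Assumption: commonly cited prefix of the Illumina TruSeq adapter.
# Substitute the adapter that matches the library prep being simulated.
ADAPTER = "AGATCGGAAGAGC"


def simulate_read(transcript, start, read_len, adapter=ADAPTER, rng=None):
    """Build one simulated read starting at 0-based offset `start`.

    If the transcript ends before the read is full, continue into the
    adapter, then pad with random bases (an illustrative choice).
    """
    rng = rng or random.Random()
    # Slicing past the end simply returns the bases that are available.
    read = transcript[start:start + read_len]
    # Read into the adapter for as many bases as are still missing.
    read += adapter[:read_len - len(read)]
    # Pad with random bases if transcript plus adapter is still too short.
    missing = read_len - len(read)
    read += "".join(rng.choice("ACGT") for _ in range(missing))
    return read


transcript = "ACGT" * 500  # toy 2000-base transcript
read = simulate_read(transcript, 1950, 100, rng=random.Random(0))
print(len(read))
print(read[50:63])  # adapter portion of the read
```

Passing a seeded `random.Random` keeps the padding reproducible, which helps when comparing how different tools handle the same simulated reads.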
What experimentally happens in this situation? Is there a bias against the ends of transcripts when doing size selection in RNA-Seq?
Actually, the bias is towards the 3' end if one is doing poly-A selection. I don't think #1 is a good idea; it would not be biologically relevant. You could look into the 3'-UTR (or are you already including those?) and/or doing #2.
You could also look at published datasets where the truth is known (to some extent). Someone here may be able to provide a good example.