Question

Creating a FASTQ generator: how to handle the 3' ends of transcripts

9.2 years ago

irritable_phd_syndrome ▴ 130

I am currently investigating the different splice forms in an experimental sample. To get a better understanding of how the different splice-form-finding programs work, I created a program that generates fake FASTQ data.

Here is how it works:

1. Read in the GTF file.
2. Select the transcript of interest from the GTF file.
3. Generate a random number for the start position of the read. So if my random number is 54, my read will start at position 54.
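The three steps above could be sketched roughly as follows. This is a minimal illustration, not the poster's actual program: the function names (`exons_for_transcript`, `transcript_length`, `draw_read_start`), the inline GTF lines, and the transcript ID `tx1` are all made up for the example, and the code only computes transcript-relative start positions rather than pulling sequence from a genome.

```python
import random

# Tiny made-up GTF snippet (tab-separated, 9 columns); a real file would be read from disk.
EXAMPLE_GTF = "\n".join([
    'chr1\tdemo\texon\t100\t199\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
    'chr1\tdemo\texon\t300\t449\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
    'chr1\tdemo\texon\t500\t599\t.\t+\t.\tgene_id "g1"; transcript_id "tx2";',
])


def exons_for_transcript(gtf_text, transcript_id):
    """Steps 1 and 2: return sorted (start, end) exon intervals for one transcript."""
    exons = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # GTF attributes look like: key "value"; key "value";
        if f'transcript_id "{transcript_id}"' in fields[8]:
            exons.append((int(fields[3]), int(fields[4])))
    return sorted(exons)


def transcript_length(exons):
    """GTF coordinates are 1-based and inclusive, so an exon spans end - start + 1 bases."""
    return sum(end - start + 1 for start, end in exons)


def draw_read_start(length, rng=random):
    """Step 3: draw a 1-based start position anywhere along the spliced transcript."""
    return rng.randint(1, length)


exons = exons_for_transcript(EXAMPLE_GTF, "tx1")
tx_len = transcript_length(exons)  # 100 + 150 bases for the two tx1 exons
start = draw_read_start(tx_len, random.Random(0))
print(tx_len, start)
```

Note that `draw_read_start` deliberately allows starts near the 3' end, which is exactly the case the question is about.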

Step 3 is where I get into trouble. I'm not sure how to handle the end of the transcript. For example, say that I want 100-base reads in my FASTQ file and the transcript of interest is 2000 bases long. If I draw a start position between 1 and 1901, I am fine, because the whole read fits within the transcript. However, if I draw a position between 1902 and 2000, say 1950, I get into trouble because the transcript supplies only 51 bases and I don't know what to make the remaining 49 bases of the read.

A couple of potential solutions I thought of:

1. Randomly add sequence to the 3' end.
2. Pretend that I read into the Illumina (or similar) adapter.
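Both options can be sketched in a few lines. This is only an illustration under stated assumptions: `make_read` and its parameters are hypothetical names, positions are taken to be 1-based, the adapter string is the commonly published Illumina TruSeq adapter prefix used here as a placeholder (swap in whatever matches the library prep being simulated), and padding with `N` once the adapter runs out is a simplification of mine.

```python
import random

# Assumed placeholder adapter (commonly cited TruSeq prefix); substitute your kit's sequence.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def make_read(transcript, start, read_len, mode="adapter", rng=random):
    """Return a read of exactly read_len bases beginning at 1-based position start.

    If the read runs past the 3' end of the transcript, the missing bases are
    filled according to mode:
      "random"  - option 1: uniformly random bases
      "adapter" - option 2: adapter sequence, then N once the adapter is used up
    """
    body = transcript[start - 1:start - 1 + read_len]
    missing = read_len - len(body)
    if missing == 0:
        return body
    if mode == "random":
        filler = "".join(rng.choice("ACGT") for _ in range(missing))
    elif mode == "adapter":
        filler = (ADAPTER + "N" * missing)[:missing]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return body + filler


rng = random.Random(1)
transcript = "".join(rng.choice("ACGT") for _ in range(2000))  # fake 2000-base transcript

inside = make_read(transcript, 1000, 100)    # fits entirely within the transcript
overhang = make_read(transcript, 1990, 100)  # only 11 transcript bases remain
print(len(inside), len(overhang), overhang[:20])
```

Keeping the fill strategy behind a `mode` argument makes it easy to generate one dataset per strategy and compare how the splice-form tools react to each.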

What happens experimentally in this situation? Is there a bias against the ends of transcripts when doing size selection in RNA-Seq?

RNA-Seq transcript • 1.8k views

Actually, the bias is towards the 3' end if one is doing poly-A selection. I don't think #1 is a good idea; it would not be biologically relevant. You could look into the 3' UTRs (or are you already including those?) and/or do #2.
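If the generator should mimic that kind of 3' bias, one option is to draw start positions with weights that grow toward the 3' end instead of uniformly. The sketch below is purely illustrative: the function name `draw_biased_start` and the linear weighting are assumptions made for the example, not a measured model of poly-A selection.

```python
import random


def draw_biased_start(length, rng=random):
    """Draw a 1-based start position with probability proportional to the position,
    so positions near the 3' end are chosen more often (assumed linear weighting)."""
    positions = range(1, length + 1)
    return rng.choices(positions, weights=positions, k=1)[0]


rng = random.Random(42)
starts = [draw_biased_start(2000, rng) for _ in range(2000)]
three_prime_fraction = sum(s > 1000 for s in starts) / len(starts)
print(three_prime_fraction)  # should land near 0.75 under linear weights
```

A steeper or shallower weighting could be substituted by changing the `weights` argument; the right shape would have to come from real data.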

You could also look at published datasets where the truth is known (to some extent). Someone here may be able to provide a good example.

9.2 years ago by GenoMax 153k