What Is The Best Way To Simulate Reads From Reference Transcriptome With Certain Error Rate?
13.5 years ago
Geparada ★ 1.5k

Hi!

I need to test how well some mappers align reads with different error rates (mismatches and indels). That's why I want to simulate pools of reads with different error rates from a reference transcriptome. Do you know of a tool which can help me?

Thanks !!

next-gen sequencing read alignment • 5.3k views
13.5 years ago
brentp 24k

If, as you suggest, you have a reference transcriptome, then it's no different than doing a whole genome sequence simulation. Try DNAA's wgsim with your transcriptome as the input reference.
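As a rough sketch of what that could look like, the snippet below builds one wgsim command line per error rate. It assumes Heng Li's standalone wgsim (the one distributed alongside samtools); the DNAA variant may use different flags, so check its help output first. The file names and the chosen rates are hypothetical.

```python
def wgsim_command(reference, out1, out2, error_rate,
                  n_pairs=100000, read_len=100,
                  mutation_rate=0.0, indel_fraction=0.0, seed=1):
    """Return the argument list for one wgsim run (does not execute it)."""
    return [
        "wgsim",
        "-e", str(error_rate),      # per-base sequencing error rate (substitutions)
        "-r", str(mutation_rate),   # rate of mutations introduced into the reference
        "-R", str(indel_fraction),  # fraction of those mutations that are indels
        "-N", str(n_pairs),         # number of read pairs
        "-1", str(read_len),        # length of the first read
        "-2", str(read_len),        # length of the second read
        "-S", str(seed),            # random seed, for reproducible pools
        reference, out1, out2,
    ]

# Hypothetical file names; one read pool per error rate.
commands = [
    wgsim_command("transcriptome.fa", f"sim_e{rate}_1.fq", f"sim_e{rate}_2.fq", rate)
    for rate in (0.01, 0.02, 0.05)
]
for cmd in commands:
    print(" ".join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # once wgsim is on PATH
```

Note that in wgsim the `-e` errors are substitutions only; indels come in through `-r` and `-R`, so raise those if you want to test indel handling. Transcripts shorter than the fragment size may be skipped, so you may also want to lower the outer distance with `-d`.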


Actually, depending on why you are doing this, my answer may be less than helpful. If you want to simulate different transcript and gene frequencies, wgsim won't do much for you. It'll just sample (with your chosen error rate) from what's there in the transcriptome.
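If you do need different transcript frequencies, one option is to pick read origins yourself before adding errors. This is only a minimal sketch, not something wgsim does: the transcript names, sequences and abundance values are made up, and weighting by abundance times the number of possible start positions is an assumption about how you want expression modelled.

```python
import random

def sample_read_origins(transcripts, abundances, n_reads, read_len, seed=0):
    """Pick (transcript_id, start) pairs, weighted by abundance * usable length."""
    rng = random.Random(seed)
    # Transcripts shorter than one read cannot produce a full-length read.
    ids = [t for t in transcripts if len(transcripts[t]) >= read_len]
    weights = [abundances[t] * (len(transcripts[t]) - read_len + 1) for t in ids]
    origins = []
    for tid in rng.choices(ids, weights=weights, k=n_reads):
        start = rng.randrange(len(transcripts[tid]) - read_len + 1)
        origins.append((tid, start))
    return origins

# Toy transcriptome with hypothetical relative abundances.
transcripts = {"tx_high": "ACGT" * 50, "tx_low": "TTGCA" * 40, "tx_off": "GGCC" * 50}
abundances = {"tx_high": 90.0, "tx_low": 10.0, "tx_off": 0.0}

origins = sample_read_origins(transcripts, abundances, n_reads=1000, read_len=50)
reads = [transcripts[tid][start:start + 50] for tid, start in origins]
```

The error-free reads in `reads` can then be passed through whatever error model you are testing.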


Thanks brentp, think I'll try it

13.5 years ago

For Illumina RNA-seq data, I've used simLibrary and simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) in the past to simulate transcriptome reads. The following commands simulate an FRTseq (--bias 1) library construction with an average insert size of 300 (--readlen 300) from a transcript CDS sequence, producing the input to be sequenced. The gel_cut option makes sure you only sequence fragments above 200 bp and below 1000 bp. You can add UTRs or introns to the transcript CDS sequence if you want:

[?]
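To picture what the gel cut does to the fragment sizes, here is a small sketch. It does not reproduce the simLibrary commands or its fragmentation model: the normal distribution and the standard deviation of 150 are assumptions for illustration, and only the 300 average and the 200 to 1000 bp window come from the description above.

```python
import random

def sample_fragment_lengths(n, mean=300, sd=150, gel_min=200, gel_max=1000, seed=0):
    """Draw fragment lengths and keep only those inside the gel-cut window."""
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = round(rng.gauss(mean, sd))  # assumed size distribution
        if gel_min <= length <= gel_max:     # the gel cut: discard everything else
            lengths.append(length)
    return lengths

lengths = sample_fragment_lengths(500)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

Because the cut removes the short fragments, the mean of what is actually sequenced ends up above the nominal insert size.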

After that, I rename the sequences for tracking (scripts available here http://github.com/avilella/hashbrown) and run them through simNGS for 125 cycles, paired-end, producing the output fastq files.

[?]

The runfiles contain information about the intensity values given by a real machine for a real run. There are different example runfiles available in simNGS, both from real Illumina GA2 machines and Illumina HiSeq2000 machines. The runfiles have comments about when and where the sequencing was done, how well it went, etc. If you want to simulate data as close as possible to sequencing runs that you've already done in your facility, you can build your own runfiles using AYB against example .cif files from your own sequencer.

Hope it helps.


thanks for your guide avilella!

13.5 years ago
Benm ▴ 710

I wrote a script for this. It can simulate mismatches, indels and also SVs, and I uploaded it to SourceForge: http://sourceforge.net/projects/simulateseq/files/0.2.2
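For anyone who can't reach the download, the basic idea of adding mismatches and indels at chosen rates can be sketched in a few lines. This is not the simulateseq script, just a minimal illustration: the function name and parameters are hypothetical, the rates are treated as independent per-base probabilities, indels are a single base long, and SVs are not covered.

```python
import random

BASES = "ACGT"

def add_errors(read, mismatch_rate, insertion_rate, deletion_rate, rng):
    """Return a copy of `read` with random substitutions and 1-bp indels."""
    out = []
    for base in read:
        r = rng.random()
        if r < deletion_rate:
            continue  # deletion: drop this base
        if r < deletion_rate + insertion_rate:
            out.append(rng.choice(BASES))  # insertion: random base before this one
            out.append(base)
        elif r < deletion_rate + insertion_rate + mismatch_rate:
            # mismatch: substitute a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(42)
read = "ACGTACGTACGTACGTACGT"
for rate in (0.0, 0.02, 0.10):
    noisy = add_errors(read, mismatch_rate=rate, insertion_rate=rate / 4,
                       deletion_rate=rate / 4, rng=rng)
    print(rate, noisy)
```

Running the same reads through several rates gives one pool per error level, which is what the mapper comparison needs.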


Hi BENM. Nice to post a link to your script :) I have two suggestions. Take'em or leave'em, they are really just suggestions. All your code, including the comments and documentation, uses long lines and won't display well on terminals and even on the sourceforge page, in fact. You can think of line-wrapping it in order for it to display in a more readable way. I suggest 80 characters per line or less (78 displays well everywhere). Second suggestion, maybe you could update your website in the user area? Thanks again for the link. Cheers


the link says: "We are unable to display the page you requested".


nice script, thanks !


Thank you for your kind advice. I am too busy these days, but I will update it soon according to your suggestions.
