Question

What Is The Best Way To Simulate Reads From Reference Transcriptome With Certain Error Rate?

13.6 years ago

Geparada ★ 1.5k

Hi!

I need to test how well several mappers align reads with different error rates (mismatches and indels). To do that, I want to simulate pools of reads with different error rates from a reference transcriptome. Do you know of a tool which can help me?

Thanks !!

next-gen sequencing read alignment • 5.3k views

updated 13.6 years ago by 2184687-1231-83- ★ 5.1k • written 13.6 years ago by Geparada ★ 1.5k

score 4 · Answer 1 · 2011-06-09

13.6 years ago

brentp 24k

If, as you suggest, you have a reference transcriptome, then it's no different than doing a whole genome sequence simulation. Try DNAA's wgsim with your transcriptome as the input reference.
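
To make concrete what this kind of simulation does, here is a minimal Python sketch that samples reads uniformly from a set of transcripts and adds mismatches and indels at chosen rates. It is not wgsim's implementation; the function names (`add_errors`, `simulate_reads`) and the simple per-base error model are assumptions made for illustration.

```python
import random

BASES = "ACGT"


def add_errors(read, mismatch_rate, indel_rate, rng):
    """Return a copy of `read` with random substitutions, insertions and deletions."""
    out = []
    for base in read:
        r = rng.random()
        if r < mismatch_rate:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif r < mismatch_rate + indel_rate / 2:
            # Deletion: drop this base.
            continue
        elif r < mismatch_rate + indel_rate:
            # Insertion: keep the base and add a random one after it.
            out.append(base)
            out.append(rng.choice(BASES))
        else:
            out.append(base)
    return "".join(out)


def simulate_reads(transcripts, n_reads, read_len, mismatch_rate, indel_rate, seed=0):
    """Sample reads uniformly from transcripts that can hold a full read."""
    rng = random.Random(seed)
    usable = [(name, seq) for name, seq in transcripts.items() if len(seq) >= read_len]
    if not usable:
        raise ValueError("no transcript is at least read_len long")
    reads = []
    for i in range(n_reads):
        name, seq = rng.choice(usable)
        start = rng.randrange(len(seq) - read_len + 1)
        fragment = seq[start:start + read_len]
        # The read name records the true origin, useful when scoring a mapper.
        reads.append((f"{name}:{start}:{i}", add_errors(fragment, mismatch_rate, indel_rate, rng)))
    return reads
```

Calling `simulate_reads` several times with different `mismatch_rate` and `indel_rate` values gives one pool of reads per error rate, each traceable back to its transcript and start position.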


Actually, depending on why you are doing this, my answer may be less than helpful. If you want to simulate different transcript and gene frequencies, wgsim won't do much for you. It'll just sample (with your chosen error rate) from what's there in the transcriptome.

13.6 years ago by brentp 24k
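
As an illustration of the transcript-frequency point in the reply above, one could choose the source transcript of each read with a weight instead of uniformly. The sketch below is an assumption-laden example, not part of wgsim: the function name `sample_transcripts`, the `abundance` and `lengths` dictionaries, and the choice of weighting by abundance times length are all hypothetical.

```python
import random


def sample_transcripts(abundance, lengths, n_reads, seed=0):
    """Pick a source transcript for each read.

    Each transcript is weighted by abundance * length, on the assumption
    that longer and more highly expressed transcripts yield more fragments.
    """
    rng = random.Random(seed)
    names = list(abundance)
    weights = [abundance[name] * lengths[name] for name in names]
    return rng.choices(names, weights=weights, k=n_reads)
```

The returned list of transcript names can then drive whichever read simulator is used, so that the simulated pool reflects the chosen expression levels.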

Thanks brentp, think I'll try it

13.6 years ago by Geparada ★ 1.5k

Ram · Answer 2 · 2011-06-09

For Illumina RNA-seq data, I've used simLibrary and simNGS (http://www.ebi.ac.uk/goldman-srv/simNGS/) in the past to simulate transcriptome reads. The following commands will simulate a FRTseq (--bias 1) library construction of average insert size 300 (--readlen 300) of a transcript CDS sequence, producing the input to be sequenced. The gel_cut option makes sure you only sequence above 200bp and below 1000bp. In the transcript CDS sequence you can add UTRs or introns if you want:

[?]
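
The original commands are not recoverable here. As a rough picture of what the insert-size and gel-cut settings described above mean, the following Python sketch draws fragment lengths around a mean of 300 and keeps only those between 200 and 1000 bp. It does not reproduce simLibrary's model: the normal distribution, the standard deviation of 50, and the function name `draw_fragment_lengths` are assumptions.

```python
import random


def draw_fragment_lengths(n, mean_len=300, sd=50, gel_cut=(200, 1000), seed=0):
    """Draw n fragment lengths and keep only those inside the gel-cut window.

    Lengths come from a normal distribution (an assumption); values outside
    the window are discarded and redrawn, which mimics cutting a band from a gel.
    """
    rng = random.Random(seed)
    low, high = gel_cut
    lengths = []
    while len(lengths) < n:
        length = round(rng.gauss(mean_len, sd))
        if low <= length <= high:
            lengths.append(length)
    return lengths
```

Each accepted length would correspond to one fragment cut from a transcript, whose two ends are then sequenced as a read pair.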

After that, I rename the sequences for tracking (scripts available here http://github.com/avilella/hashbrown) and run them through simNGS for 125 cycles, paired-end, producing the output fastq files.

[?]

The runfiles contain information about the intensity values given by a real machine for a real run. There are different example runfiles available in simNGS, both from real Illumina GA2 machines and Illumina HiSeq2000 machines. The runfiles have comments about when and where the sequencing was done, how well it went, etc. If you want to simulate data as close as possible to existing sequencing runs that you've already done in your facility, you can build your own runfiles using AYB against example .cif files from your own sequencer.

Hope it helps.

score 2 · Answer 3 · 2011-06-09

13.6 years ago

Benm ▴ 710

I wrote a script for it. It can simulate mismatches, indels and also SVs, and it is available on SourceForge: http://sourceforge.net/projects/simulateseq/files/0.2.2


Hi BENM. Nice of you to post a link to your script :) I have two suggestions. Take'em or leave'em, they are really just suggestions. First, all your code, including the comments and documentation, uses long lines and won't display well on terminals, or even on the SourceForge page, in fact. You could line-wrap it so that it displays in a more readable way. I suggest 80 characters per line or less (78 displays well everywhere). Second suggestion: maybe you could update your website in the user area? Thanks again for the link. Cheers

13.6 years ago by Eric Normandeau 11k

The link says: "We are unable to display the page you requested".

13.6 years ago by Geparada ★ 1.5k

nice script, thanks !

13.6 years ago by Geparada ★ 1.5k

Thank you for your kind advice. I am too busy these days, but I will update it soon according to your suggestions.

13.6 years ago by Benm ▴ 710