Question

Effect of Read Length?

9.3 years ago

jfontana317 • 0

For the past month or so I've been using discoSNP++ on test data sets to see whether it will work for my particular needs (it works great!). The test data sets I have been using contain 100 bp reads, and I am wondering whether my future experimental data sets need 100 bp reads too. It would be easier to piggyback my sequencing on other experiments if I used 50 bp reads, but I'm unsure how this would affect the program's ability to detect the SNPs I'm looking for. Can you comment on the effects of using 50 bp vs 100 bp reads (let's assume ~40M reads per set)? If it matters for your answer, I've been using -b 1 -D 0 -P 1 -k 31 -c 4 -C 2147483647 -d 1

Thanks

gatb discosnp • 2.6k views

updated 2.2 years ago by Ram 44k • written 9.3 years ago by jfontana317 • 0

Added a "gatb" tag.

9.3 years ago by lh3 33k

Ram · Answer 1 · 2015-08-31

Hi,

Thanks for the question,

The variant prediction phase of discoSnp relies only on k-mers (k=31 by default). Thus, if every k-mer found in the 100 bp reads is also found in the 50 bp reads, the result should be the same.

However, 50 bp reads need higher coverage than 100 bp reads to yield a similar set of k-mers, because a read of length L contains L-k+1 k-mers of length k.

This means that, with k=31, a read of length 100 contains 70 k-mers while a read of length 50 contains only 20 k-mers. Thus, in broad terms, you need about 3.5 times as many reads with L=50 as with L=100 (roughly 1.75 times as many sequenced bases) to obtain the same results.
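
The L-k+1 arithmetic is easy to check for any read length. Below is a minimal Python sketch; the function name `kmers_per_read` is only illustrative and is not part of discoSnp. It applies the formula and derives how many more short reads (and sequenced bases) would be needed to produce the same number of k-mers, assuming reads sample the genome uniformly.

```python
def kmers_per_read(read_length: int, k: int = 31) -> int:
    """Number of k-mers in one read: L - k + 1, or 0 if the read is shorter than k."""
    return max(read_length - k + 1, 0)


k = 31
long_count = kmers_per_read(100, k)   # 100 - 31 + 1
short_count = kmers_per_read(50, k)   # 50 - 31 + 1

# How many 50 bp reads give as many k-mers as one 100 bp read.
read_ratio = long_count / short_count

# The same requirement expressed in sequenced bases (50 bp vs 100 bp per read).
base_ratio = read_ratio * 50 / 100

print(long_count, short_count, read_ratio, base_ratio)
```

This only counts k-mers; it says nothing about repeats or sequencing errors, which may also behave differently at shorter read lengths.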

Best,
Pierre

Ram · Answer 2 · 2015-08-28

9.3 years ago

Chris Miller 22k

There are many portions of the genome that are unalignable with 50bp reads. I imagine these are more problematic when trying to do reference-free assembly.

If you really want the answer, though, you should run a test. Take one of your current data sets, chop each read down to 50 bp, then run the algorithm again and compare to the original results.
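
One way to do the chopping is sketched below in Python. It assumes uncompressed FASTQ with strict 4-line records (header, sequence, `+`, quality); the function names and file paths are placeholders, not anything provided by discoSnp. It keeps the first 50 bases of each read and trims the quality string to match.

```python
from typing import Iterable, Iterator


def trim_fastq_lines(lines: Iterable[str], length: int = 50) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality strings cut to `length` bases.

    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Within each record, lines 1 and 3 (0-based) hold sequence and quality.
        if i % 4 in (1, 3):
            line = line[:length]
        yield line + "\n"


def trim_fastq_file(in_path: str, out_path: str, length: int = 50) -> None:
    """Trim every read in an uncompressed FASTQ file and write the result."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(trim_fastq_lines(src, length))
```

For example, `trim_fastq_file("reads_100bp.fastq", "reads_50bp.fastq")` (hypothetical file names) would produce the shortened data set. Existing read-trimming tools can do the same job and also handle compressed input.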

updated 2.2 years ago by Ram 44k • written 9.3 years ago by Chris Miller 22k

Thank you for the suggestion, Chris. That was a great idea. When re-run with all the same parameters, the 50 bp data sets picked up only one tenth of the SNPs that the 100 bp data sets did. They also missed 25% of my artificially introduced SNPs. I will play a bit with the parameters and see what happens, but that really helped a lot. Thanks again.
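
For anyone repeating this kind of comparison, the fraction of introduced SNPs that was recovered can be computed with plain set operations. The Python sketch below assumes the SNPs have already been parsed into `(sequence name, position, alternate base)` tuples; that representation, the function name `recall`, and the toy data are all invented for illustration, and parsing discoSnp output is not shown.

```python
from typing import Set, Tuple

# Assumed representation of a SNP: (sequence name, position, alternate base).
Snp = Tuple[str, int, str]


def recall(called: Set[Snp], truth: Set[Snp]) -> float:
    """Fraction of the truth set that also appears in the called set."""
    if not truth:
        raise ValueError("truth set is empty")
    return len(called & truth) / len(truth)


# Toy data, invented purely for illustration.
truth = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"), ("chr2", 900, "C")}
called_50bp = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"), ("chr3", 7, "A")}

print(recall(called_50bp, truth))   # share of introduced SNPs that were found
print(sorted(truth - called_50bp))  # introduced SNPs that were missed
```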

updated 2.2 years ago by Ram 44k • written 9.3 years ago by jfontana317 • 0