Question

Assemble short reads based on k-mers

0

9.4 years ago

venu 7.1k

Hello all,

I am completely new to this kind of task. I have data like this:

>in0
GATCCTCGAAGTTACACGGG
>in1
TACGTCGACGTCAATCCGGG
>in2
TACACGGGCCGCTCCTGGGC
>in3
ACGGGGTACTACGAGACGCG
>in4
AGGGGGAATGTGGTCCACAT
>in5
TCCACATGGCTTGCTCCTGA
>in6
CTTGACGTTATGAATTTCGC

and so on. I need to assemble these short reads, and I want to use Perl for this. I just need pseudocode showing how to do this, or a pointer to a good resource. In the end I need a single string containing the consensus sequence.

perl k-mer assembly • 3.4k views

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by venu 7.1k

2

Is there a reason you want to reinvent the assembly wheel? There are a good number of assemblers already written, so why bother writing yet another one without a good reason?

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

0

Can you give me some examples, so that I can look them up directly on the internet?

ADD REPLY • link 9.4 years ago by venu 7.1k

2

You could try Google.

ADD REPLY • link updated 9.4 years ago by Devon Ryan 104k • written 9.4 years ago by dylan.storey ▴ 60

0

As orange said, SOAPdenovo is one option. Others include Trinity and Minia. There are quite a few of these if you just search PubMed for "DNA assembler" or "DNA assemble".

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

score 0 · Answer 1 · 2015-07-21

0

9.4 years ago

orange ▴ 30

Why not try SOAPdenovo instead of Perl? Have you not assembled a genome sequence before?

ADD COMMENT • link 9.4 years ago by orange ▴ 30

Ram · Answer 2 · 2015-07-21

0

9.4 years ago

thackl ★ 3.0k

Assuming you want to stick to Perl for educational purposes:

Here is some code to create k-mers quite efficiently with Perl: https://github.com/thackl/perl5lib-Kmer/blob/master/lib/Kmer.pm.

Perl is not really made for handling graph structures, but there is one module that you could use to set up a de Bruijn graph: http://search.cpan.org/~jhi/Graph-0.96/lib/Graph.pod. I played around with it some time ago but did not follow through.
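To make the k-mer and graph idea concrete, here is a minimal sketch. It is in Python rather than Perl, and it is not the code from the linked Kmer.pm or the Graph module; the function names `kmers` and `de_bruijn` and the choice of k = 8 are assumptions made for illustration only.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Build a de Bruijn graph as a dict of sets.

    Nodes are (k-1)-mers; every k-mer contributes one edge
    from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two reads from the question that share the 8-base overlap TACACGGG.
reads = ["GATCCTCGAAGTTACACGGG", "TACACGGGCCGCTCCTGGGC"]
graph = de_bruijn(reads, 8)  # k = 8 is an arbitrary choice for this toy data

for node in sorted(graph):
    print(node, "->", ", ".join(sorted(graph[node])))
```

Walking unbranched paths through such a graph spells out contigs. A real assembler additionally has to deal with sequencing errors, reverse complements and repeats, none of which this sketch attempts.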

ADD COMMENT • link 9.4 years ago by thackl ★ 3.0k

0

I just need a single consensus sequence string from the file shown above.

ADD REPLY • link 9.4 years ago by venu 7.1k

0

But definitely in Perl?

ADD REPLY • link 9.4 years ago by thackl ★ 3.0k

0

Not exactly. Any simple program that takes the above file as input and outputs the consensus string would do. I can understand Perl code easily, hence the tag.

ADD REPLY • link 9.4 years ago by venu 7.1k

0

But your data sets are small - you want to do some form of microassembly?
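For a data set this small, one simple approach is greedy overlap merging: repeatedly join the two sequences that share the longest suffix-prefix overlap. The following is only a rough Python sketch under strong assumptions (error-free reads, all on the same strand, no repeats); the function names and the `min_len` threshold are made up for illustration, and this is not how k-mer based assemblers work internally.

```python
def read_fasta(text):
    """Parse FASTA text into a list of sequences (headers are dropped)."""
    seqs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        seqs.append("".join(current))
    return seqs


def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # one or more contigs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)


# Four of the reads from the question.
fasta = """>in0
GATCCTCGAAGTTACACGGG
>in2
TACACGGGCCGCTCCTGGGC
>in4
AGGGGGAATGTGGTCCACAT
>in5
TCCACATGGCTTGCTCCTGA
"""
for contig in greedy_assemble(read_fasta(fasta)):
    print(contig)
```

The all-against-all comparison is quadratic in the number of sequences, which is acceptable for a few hundred short reads but not for real sequencing runs. Short overlaps can also be spurious, so a low `min_len` risks wrong joins.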

ADD REPLY • link 9.4 years ago by thackl ★ 3.0k

0

Yes. I have 200 such reads in a file and I want output like this:

 GGCATTTAACCGAAGCCGGTGGGTTAGACTATGATCCTCGAAGTTACACGGGCCGCTCCTGGGCGTGGCTGCTCCCAGCCCTAGCCCCAATGTAATATAAAGGTCGTGCCCAGTTAGCGTTAAGCAAGAGGTGTTACAAATATCTTGGAGAGTCATGTCGCAATTCTTGACGTTATGAATTTCGCGGTGAACAATGTCGCCCAGAATGGCAGGTCATGAAAAGCTTCAGCGGGAACCAGCAC....

ADD REPLY • link 9.4 years ago by venu 7.1k

0

What is the coverage of the reads and the length?
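In case the term is unfamiliar: average coverage is usually estimated as number of reads times read length divided by the length of the target sequence. A tiny Python sketch of that arithmetic follows; the target length of 400 bases is a purely hypothetical number, since the real one is not given in this thread.

```python
def coverage(num_reads, read_length, target_length):
    """Average coverage: total sequenced bases divided by target length."""
    return num_reads * read_length / target_length


# 200 reads of 20 bases each; 400 is a hypothetical target length.
print(coverage(200, 20, 400))
```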

ADD REPLY • link 9.4 years ago by thackl ★ 3.0k

0

I am doing this kind of work for the first time. What I know is each read has different lengths of k-mers.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by venu 7.1k

2

Take a look at this: http://www.homolog.us/Tutorials/index.php?p=1.1&s=2

This should give you a basic overview of how assembly works.

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.4 years ago by nterhoeven ▴ 120

Ram · Answer 3 · 2015-07-21

You can use SPAdes:

spades.py --only-assembler --sc -k 33 -s in.fq -o asm
  # results are in asm/contigs.fasta

This should also work for lowish coverage (5-10X), but it assumes little to no error in your data. Also, there can be multiple contigs, and regions with very low coverage (just 1-3 reads at a given position, e.g. the ends of your target sequence) will be missing.
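Since you may get several contigs rather than one string, it helps to check how many there are and how long each is. Below is a small Python sketch that does this for any FASTA input; the function name `contig_lengths` and the example headers are hypothetical, and to use it on real results you would pass it the lines of the contigs file from the SPAdes output directory.

```python
def contig_lengths(lines):
    """Map each FASTA header (without '>') to the total length of its sequence."""
    lengths, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up example records; real contig headers will look different.
example = [">contig_1", "GATCCTCGAAGT", "TACACGGG", ">contig_2", "AGGGGGAATG"]
for name, n in contig_lengths(example).items():
    print(name, n)
```

To run it on a file, open the file and pass the handle, e.g. `contig_lengths(open(path))`, where `path` points at the contigs file.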