Question

Large Variation In Base Space Vs Color Space Assembly

1

12.1 years ago

lin.barnum ▴ 230

I used the pipeline described at Solid Software Tools: DeNovo Assembly/XSQ Tools pipeline (mirrored at BioStar) to perform a SOLiD assembly. It ran successfully. However, the statistics of the nucleotide-space assembly and the double-encoded colorspace assembly differ significantly. What is the reason?

contigs$ cat n50.stats.txt 

perc A               : 31
perc C               : 22
perc G               : 20
perc T               : 24
perc N               :  0
Sum contig length    : 182066280
Num contigs          : 1204729
Mean contig length   : 151
Median contig length : 128
N50 value            : 154
Max                  : 5517

nt_contigs$ cat n50.stats.txt 

perc A               : 55
perc C               :  0
perc G               :  0
perc T               : 44
perc N               :  0
Sum contig length    : 199569293
Num contigs          : 1204729
Mean contig length   : 165
Median contig length : 140
N50 value            : 166
Max                  : 5013

scaffolds$ cat n50.stats.txt 

perc A               : 10
perc C               :  7
perc G               :  6
perc T               :  7
perc N               : 67
Sum contig length    : 563660388
Num contigs          : 855887
Mean contig length   : 658
Median contig length : 140
N50 value            : 3997
Max                  : 74154

nt_scaffolds$ cat n50.stats.txt 

perc A               : 55
perc C               :  0
perc G               :  0
perc T               : 44
perc N               :  0
Sum contig length    : 200084049
Num contigs          : 855887
Mean contig length   : 233
Median contig length : 146
N50 value            : 242
Max                  : 18952

The N50 value for the scaffolds differs greatly between the two assemblies (3997 vs. 242). Also, the C and G percentages in nt_contigs and nt_scaffolds are zero, which is odd.

solid assembly velvet • 2.8k views

updated 12.1 years ago by Istvan Albert 102k • written 12.1 years ago by lin.barnum ▴ 230

score 2 · Answer 1 · 2013-06-17

2

12.1 years ago

Istvan Albert 102k

Remember that the double encoded colorspace is a redundant representation.

Two entirely different-looking double-encoded sequences could represent identical base-space sequences.

That being said, getting zero percent for the G and C bases does look like something went wrong, unless you have reason to expect that.

12.1 years ago by Istvan Albert 102k

0

While that is true, why are the max length and the other statistics different? Shouldn't the colorspace sequence be just 1 shorter than the basespace sequence?

12.1 years ago by lin.barnum ▴ 230

1

Once you convert to colorspace the sequences change altogether; two different-looking sequences may convert to the same sequence. See also: Transforming and manipulating color space reads

12.1 years ago by Istvan Albert 102k
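
To make the conversion concrete, here is a minimal sketch, assuming the standard SOLiD two-base color code (each color is the XOR of the 2-bit codes of two adjacent bases, with A=0, C=1, G=2, T=3) and the usual double encoding (0, 1, 2, 3 written as A, C, G, T). The function names `to_colors` and `double_encode` are made up for illustration and are not part of the pipeline from the question.

```python
# Assumed 2-bit codes for the standard SOLiD color scheme.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
# Double encoding: colors 0..3 are rewritten as letters so that
# base-space tools will accept colorspace data.
COLOR_TO_LETTER = {0: "A", 1: "C", 2: "G", 3: "T"}


def to_colors(bases):
    """Return one color per pair of adjacent bases (N bases give N-1 colors)."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(bases, bases[1:])]


def double_encode(colors):
    """Rewrite a list of colors as a pseudo-nucleotide string."""
    return "".join(COLOR_TO_LETTER[c] for c in colors)


seq = "ATGGCA"
colors = to_colors(seq)
print(colors)                 # [3, 1, 0, 3, 1]
print(double_encode(colors))  # TCATC
```

The double-encoded string `TCATC` shares no obvious resemblance with `ATGGCA`, so base composition computed on double-encoded contigs says nothing about the real A/C/G/T content.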

0

Another question here. How could two different colorspace sequences represent the same basespace sequence? I can see that each colorspace sequence could represent 4 different basespace sequences. A toy example would be useful here. Thanks.

12.1 years ago by lin.barnum ▴ 230
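
A toy example of the "one colorspace sequence, four basespace sequences" case could look like the sketch below. It assumes the standard SOLiD two-base color code (color = XOR of the 2-bit codes of adjacent bases, A=0, C=1, G=2, T=3); `decode` is a hypothetical helper written for this illustration, not a function from any SOLiD tool.

```python
# Index in this string is the assumed 2-bit code of each base.
BITS_TO_BASE = "ACGT"


def decode(colors, first_base):
    """Decode a string of color digits, given the base the sequence starts with."""
    current = BITS_TO_BASE.index(first_base)
    out = [first_base]
    for c in colors:
        # Each color tells us how to step from the current base to the next one.
        current ^= int(c)
        out.append(BITS_TO_BASE[current])
    return "".join(out)


colors = "31031"
for first in "ACGT":
    print(first, decode(colors, first))
# A ATGGCA
# C CGTTAC
# G GCAATG
# T TACCGT
```

All four candidates are equally consistent with the colors alone; only a known starting base selects one of them.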