I used the pipeline given at Solid Software Tools: DeNovo Assembly/XSQ Tools pipeline (mirrored at BioStar) to perform a SOLiD assembly. It ran successfully. However, when I compare the statistics of the nucleotide-space assembly with those of the double-encoded colorspace assembly, they differ significantly. What's the reason?
contigs$ cat n50.stats.txt
perc A : 31
perc C : 22
perc G : 20
perc T : 24
perc N : 0
Sum contig length : 182066280
Num contigs : 1204729
Mean contig length : 151
Median contig length : 128
N50 value : 154
Max : 5517
nt_contigs$ cat n50.stats.txt
perc A : 55
perc C : 0
perc G : 0
perc T : 44
perc N : 0
Sum contig length : 199569293
Num contigs : 1204729
Mean contig length : 165
Median contig length : 140
N50 value : 166
Max : 5013
scaffolds$ cat n50.stats.txt
perc A : 10
perc C : 7
perc G : 6
perc T : 7
perc N : 67
Sum contig length : 563660388
Num contigs : 855887
Mean contig length : 658
Median contig length : 140
N50 value : 3997
Max : 74154
nt_scaffolds$ cat n50.stats.txt
perc A : 55
perc C : 0
perc G : 0
perc T : 44
perc N : 0
Sum contig length : 200084049
Num contigs : 855887
Mean contig length : 233
Median contig length : 146
N50 value : 242
Max : 18952
The N50 values for the scaffolds are far apart (3997 for scaffolds vs. 242 for nt_scaffolds). Also, the GC% in nt_contigs and nt_scaffolds is zero (C and G are both 0%), which is odd.
While that is true, why do the max length and the other statistics differ? Shouldn't a colorspace sequence be just 1 shorter than the corresponding basespace sequence?
Once you convert to colorspace, the sequences change altogether: two different-looking basespace sequences may convert to the same colorspace sequence. See also: Transforming and manipulating color space reads
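To illustrate that point, here is a minimal Python sketch. It assumes the standard SOLiD two-base encoding, where each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function name `encode` and the example reads are made up for illustration and are not part of the pipeline.

```python
# Toy colorspace encoder, assuming the standard SOLiD two-base code:
# color = code(base_i) XOR code(base_i+1), with A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Return the color string for a basespace sequence (no primer base)."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))

# Four different-looking basespace reads (hypothetical examples).
reads = ["ATGGC", "CGTTA", "GCAAT", "TACCG"]
for read in reads:
    print(read, "->", encode(read))
# Each line prints the same color string, 3103, which is one
# character shorter than the 5-base input.
```

So for a single read the colorspace form is indeed one shorter than the basespace form (ignoring the primer base), but the color string alone does not tell you which of the four basespace reads it came from.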
Another question here. How could two different colorspace sequences represent the same basespace sequence? I can see that each colorspace sequence could represent 4 different basespace sequences. A toy example would be useful here. Thanks.
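A toy example may help here. The sketch below assumes the standard SOLiD two-base code (A=0, C=1, G=2, T=3, color = XOR of adjacent bases) and the usual double-encoding convention of writing colors 0, 1, 2, 3 as A, C, G, T. The names `decode` and `double_encode` are hypothetical helpers, not functions from the pipeline.

```python
# Toy colorspace decoder, assuming the standard SOLiD two-base code.
BASES = "ACGT"  # index in this string = 2-bit code of the base

def decode(first_base, colors):
    """Translate a color string to bases, given the first base."""
    seq = [first_base]
    for c in colors:
        # next base code = previous base code XOR color
        seq.append(BASES[BASES.index(seq[-1]) ^ int(c)])
    return "".join(seq)

def double_encode(colors):
    """Write colors 0123 as the letters ACGT (assumed convention)."""
    return colors.translate(str.maketrans("0123", "ACGT"))

colors = "3103"
for b in BASES:
    print(b, "+", colors, "->", decode(b, colors))
# A + 3103 -> ATGGC
# C + 3103 -> CGTTA
# G + 3103 -> GCAAT
# T + 3103 -> TACCG

print(double_encode(colors))  # TCAT: letters that stand for colors, not bases
```

Under these assumptions the ambiguity runs in one direction only: one color string corresponds to four basespace sequences (one per choice of first base), while any given basespace sequence encodes to exactly one color string. Note also that a double-encoded sequence such as TCAT only looks like nucleotides, so its base composition says nothing about the real A/C/G/T content.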