Question

Is Conversion From Colorspace To Base Space Lossy?

5

12.5 years ago

Daniel Standage 4.1k

This week I have had my first experience working with ABI SOLiD data and all of the wonderful subtleties of colorspace data, double-encoded FASTA, etc. I know there are tools that support processing data in these formats, which makes me wonder... why? Why would you want to work with colorspace or double-encoded data if you could just convert to the traditional base space? Is this conversion lossy? Is there some benefit to encoding dinucleotides as opposed to single nucleotides?

solid • 6.0k views

ADD COMMENT • link updated 12.5 years ago by Niek De Klein ★ 2.6k • written 12.5 years ago by Daniel Standage 4.1k

0

we have a tutorial for this: http://www.biostars.org/post/show/43855/transforming-and-manipulating-color-space-reads/

ADD REPLY • link 12.5 years ago by Istvan Albert 101k

0

Excellent tutorial by the way. BioStar is the first place I came looking for answers, and that tutorial is what gave me my first lesson in colorspace, double encoding, et al. I guess I just missed the significance of errors during my first read-through, and how this is much more manageable if the reads are kept in colorspace.

ADD REPLY • link 12.5 years ago by Daniel Standage 4.1k

score 3 · Answer 1 · 2012-05-24

3

12.5 years ago

Niek De Klein ★ 2.6k

The double coloring is an accuracy check. Because of the double coloring, each base is interrogated twice. If you have the sequence AGGC, you get a color for AG, GG, and GC: AG would be yellow, GG blue, and GC red. Now, if you got the colors yellow (AG), green (TG), and red (GC), it is more likely that there was a wrong color call for the second color than an actual mutation, because yellow says the second letter has to be a G, and green says the second letter has to be a T.
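The encoding step in this example can be sketched in a few lines. This is a minimal illustration, assuming the standard SOLiD color code, in which the color digit is the XOR of the 2-bit base codes (A=0, C=1, G=2, T=3). The color names follow this thread; the function name `encode_colors` is made up for the example.

```python
# Assumption: standard SOLiD 2-base code, color = XOR of the 2-bit base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
# Color names as used in this thread; SOLiD files store the digits 0-3.
COLOR_NAME = {0: "blue", 1: "green", 2: "yellow", 3: "red"}


def encode_colors(bases):
    """Return one color digit per adjacent pair of bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


colors = encode_colors("AGGC")
print([COLOR_NAME[c] for c in colors])  # ['yellow', 'blue', 'red']
```

Every interior base contributes to two adjacent colors, which is what "interrogated twice" refers to.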

Apparently it is lossy, as Brent has commented.

It's been a while since I learnt this, so I would advise reading this article: http://bioinformatics.oxfordjournals.org/content/26/6/849.full and this document: http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_058265.pdf , in which it is explained more accurately (and/or hope for someone else to explain it more completely here).

ADD COMMENT • link 12.5 years ago by Niek De Klein ★ 2.6k

0

Just to clarify, the naive conversion to base-space is lossy. The conversion to double-encoded (a la BWA) is not.
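To illustrate why double encoding loses nothing: it only relabels the color digits 0, 1, 2, 3 as the letters A, C, G, T so that base-space tools accept the string. The sketch below shows that relabeling and its inverse. The function names are hypothetical, and how a given tool treats the leading primer base of a read is not covered here.

```python
# Double encoding is a one-to-one relabeling of color digits as letters.
# The letters are NOT real bases; they are colors in disguise.
DOUBLE_ENCODE = str.maketrans("0123", "ACGT")
DOUBLE_DECODE = str.maketrans("ACGT", "0123")


def double_encode(color_digits):
    """Relabel a string of color digits as pseudo-bases."""
    return color_digits.translate(DOUBLE_ENCODE)


def double_decode(pseudo_bases):
    """Recover the original color digits from a double-encoded string."""
    return pseudo_bases.translate(DOUBLE_DECODE)


colors = "2130"
encoded = double_encode(colors)
print(encoded)                 # GCTA
print(double_decode(encoded))  # 2130
```

Because the mapping is invertible, the round trip always returns the original colors, unlike a naive decode to real bases.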

ADD REPLY • link 12.5 years ago by brentp 24k

0

I'm not sure I follow. When they say each base is interrogated twice, aren't they simply referring to the fact that any particular base will be involved in two "transitions" (or dinucleotides, which is what the colors represent)?

I really don't understand your example either. If we know the first base is A and then we have "yellow", "green", "red", how can we tell that's an error? You say green codes for TG, but it also codes for GT (and AC and CA), so that color sequence is a perfectly valid encoding for the nucleotides AGTA.
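The decoding argument above can be checked with a short sketch, again assuming the standard SOLiD color code (color = XOR of the 2-bit base codes). Given a known first base, each color determines the next base, so any color sequence decodes to some valid base sequence. `decode_colors` is a name invented for this example.

```python
# Assumption: standard SOLiD 2-base code, color = XOR of the 2-bit base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"
COLOR_DIGIT = {"blue": 0, "green": 1, "yellow": 2, "red": 3}


def decode_colors(first_base, colors):
    """Naively walk the colors from a known first base to a base sequence."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


original = [COLOR_DIGIT[n] for n in ("yellow", "blue", "red")]
disputed = [COLOR_DIGIT[n] for n in ("yellow", "green", "red")]
print(decode_colors("A", original))  # AGGC
print(decode_colors("A", disputed))  # AGTA
```

Both color strings decode cleanly, so the colors alone cannot tell a miscall from a genuinely different sequence; that distinction needs a reference or overlapping reads.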

ADD REPLY • link 12.5 years ago by Daniel Standage 4.1k

0

The error we mentioned is not in the "translate" step; it arises in the "color call" step, where the machine reads the light and decides which color it is. Sometimes, especially in the last few runs, the color is really hard to tell, so it may misread green as yellow.

ADD REPLY • link 12.5 years ago by GAO Yang ▴ 250

0

I understand that the error is in color calling, not in translating. I was just pointing out what I think is a flaw in his example--he says that the colors "yellow", "green", and "red" likely contain a calling error. But how can we distinguish such a calling error from, for example, the sequence AGTA which would have the same color encoding?

ADD REPLY • link 12.5 years ago by Daniel Standage 4.1k

0

Yeah, base-space reads may also have errors, which is why we need the quality files (.qual for SOLiD). If such an error happens in a base-space read, it only affects one base each time. But if it's a color-space read and you translate the colors to bases, the error will mess up every base after the error position. That's why we shouldn't translate them at the beginning.

ADD REPLY • link 12.5 years ago by GAO Yang ▴ 250

0

Yes, which is exactly what Jeremy said...which is why I ended up accepting his answer. I'm still struggling to see the relevance (and perhaps even accuracy) of Niek's example.

ADD REPLY • link 12.5 years ago by Daniel Standage 4.1k

1

So, you want to discuss why color space exists at all? It actually has one advantage: because each position appears in two adjacent colors (it is "double-checked", so to speak), the accuracy is a little better than on other platforms. But the inconvenience that comes with color space is also great, which makes SOLiD not very popular, and I think this color-space format will soon disappear.

ADD REPLY • link 12.5 years ago by GAO Yang ▴ 250

score 2 · Answer 2 · 2012-05-24

2

12.5 years ago

Jeremy Leipzig 22k

The conversion is "lossy" in the sense that you can easily get lost - one incorrect color/transition will propagate such that all the remaining bases will be incorrect upon conversion. Good stuff.
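A small sketch of that propagation, assuming the standard SOLiD color code (color = XOR of the 2-bit base codes). The read, the miscalled position, and the helper names are all made up for illustration.

```python
# Assumption: standard SOLiD 2-base code, color = XOR of the 2-bit base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(bases):
    """One color digit per adjacent pair of bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


def decode_colors(first_base, colors):
    """Naive conversion: walk the colors from a known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


truth = "ACGGTCATGCA"             # hypothetical true sequence
true_colors = encode_colors(truth)
called = list(true_colors)
called[3] = 3                     # simulate one miscalled color (true value is 1)

converted = decode_colors(truth[0], called)
print(converted)                       # ACGGCTGCATG
print(mismatches(true_colors, called))  # 1
print(mismatches(truth, converted))     # 7
```

The single wrong color stays a single mismatch in color space, but after naive conversion every base downstream of it is wrong.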

ADD COMMENT • link 12.5 years ago by Jeremy Leipzig 22k

0

Yes, I understand the consequences of an incorrect transition early in the read. So I guess the main advantage of keeping the data in colorspace for assembly, mapping, etc, is the low probability that the same error will occur in the same position on multiple reads?

ADD REPLY • link 12.5 years ago by Daniel Standage 4.1k

0

Yes, that is one way of putting it. When aligning to a colorspace reference, a misread transition in colorspace behaves like a misread base in a normal basespace alignment - no biggie. A misread in colorspace converted to basespace is pure garbage - utterly corrupt.

ADD REPLY • link 12.5 years ago by Jeremy Leipzig 22k

score 0 · Answer 3 · 2012-05-24

0

12.5 years ago

Ashutosh Pandey 12k

Daniel,

I am sure by now you know why it is not advisable to convert csfasta files to nucleotide reads directly. Aligners including SHRiMP2, NovoalignCS, BWA, and MAQ will do it for you: the output of these programs will be a BAM file that contains the nucleotide sequence for the aligned reads. This conversion is not a direct one as you suggested; it considers factors such as whether a color call was an error or a real SNP. Once you have a BAM file you can use it for anything.
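One signal that can separate a real SNP from a color miscall: under the standard SOLiD color code (color = XOR of the 2-bit base codes), changing a single interior base always changes the two colors that flank it, whereas an isolated miscall changes only one color. The sketch below shows just that signature. It is not how any particular aligner is implemented, and the sequences and names are hypothetical.

```python
# Assumption: standard SOLiD 2-base code, color = XOR of the 2-bit base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(bases):
    """One color digit per adjacent pair of bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


def changed_positions(a, b):
    """Indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "ACGGTCATGCA"                          # hypothetical reference
snp_read = reference[:5] + "G" + reference[6:]     # real C>G change at base 5

ref_colors = encode_colors(reference)
snp_colors = encode_colors(snp_read)
print(changed_positions(ref_colors, snp_colors))   # [4, 5]

miscall = list(ref_colors)
miscall[4] = 0                                     # one wrong color call, no SNP
print(changed_positions(ref_colors, miscall))      # [4]
```

Two adjacent color changes are consistent with a SNP, while a lone color change is more plausibly a calling error, which is the kind of evidence a colorspace-aware aligner can weigh when producing base-space output.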

ADD COMMENT • link 12.5 years ago by Ashutosh Pandey 12k

0

I have indeed noticed that many aligners support colorspace, but not as much for (transcriptome) assembly. I am experimenting with velvet right now, but I've stumbled a few times along the way...motivating this question!

ADD REPLY • link 12.5 years ago by Daniel Standage 4.1k

0

If you want to assemble color-space reads, denovo2 (a tool written by SOLiD support) will be a fine choice. It's also based on Velvet.

ADD REPLY • link 12.5 years ago by GAO Yang ▴ 250