I would like to make a fake quality file for a Velvet de novo assembly. What is the average quality of a velvet assembly?
I've been told that Velvet assemblies are basically infallible. If they are, then for an assembly I might say that there is less than one error in the entire genome,
phred=-10*log(1/G)
Where G is the size of the genome in nucleotides. If the size of my bacterial genome is 2 MB, then the quality of the assembly would be greater than -10log(1/(210^6)), or 63.
So, does anyone have a good idea what a Velvet quality file would look like? Do you agree with my "63" assessment?
A colleague told me that he resequenced a genome using Illumina. He found 3 errors that even a Sanger whole genome sequence project missed, which he verified by looking at original chromatograms.
Does Velvet take base caller confidence scores into account?
No it currently does not, although it would be easy to implement. The reason we have not done it yet is because a lot can be inferred from coverage alone.
where E is the error rate (probailty of a correct call), and Q is phred quality. So, Q=10 means E=0.1, Q=20 gives E=0.01 (1% errors). Conversely
Q = -10*log_10(E)
Which is the formula you used, so your analysis and arithmetic looks good to me.
That aside, I'm not sure why you want to do this. The point of having quality associated with an assembly is to identify the variations, some areas are bound to have lower coverage or an accumulation of poor quality reads - which the quality score can help you to identify.
My problem with Velvet is that it does not output quality scores. It might be a significant project to produce accurate cumulative quality scores in the final assembly. However, a fake quality file such as the one above would be all right. Or, if I could produce a quality file based on coverage it might be even better.
The biggest reason I want this quality file is so that I can merge its assembly more accurately with another assembly to produce a comprehensive final assembly.
So you would say to look at the coverage and create a quality score based on how many reads cover a base? Has this been done before with Velvet output?
I'm not familiar with Velvet, but most sequencing technologies produce reads with associated quality of some sort. The assembler should use this quality to derive a new quality for each position, so that if the bases of all (or most) reads agree, this should result in a higher quality base call in the assembly, than if more reads disagree.
It looks like the quality could actually be in the afg file of the Velvet output. Can anyone confirm? I had to use the latest version of BioPerl to get some of the methods.
ADD REPLY
• link
updated 5.2 years ago by
Ram
44k
•
written 14.1 years ago by
Lee Katz
★
3.2k
0
Entering edit mode
I have difficulties making sense out of this in conjunction with Falcatrua's answer. Are these real qualities, or not? Anyway, I just recalculate qualities by independently mapping reads back to the assembly.
Just noticed that DiGuistini et al. used Velvet to pre-assemble Illumina reads (as Forge apparently couldn't handle the size of the dataset), and they assigned a flat Phred 20 quality to the assemblies in further processing (midway down in the left column, second to last page of the paper)
Basicaly, what Velvet does with quality values when converting .afg to .ace (amos2ace) for files without phred qualities is convert the sequence (bases) to ASCII characters.
You can check this by looking in a .ace file created by amos2ace for assembly files without phred values. Get a contig sequence in the ace and look at it`s quality values, same bases always have the same values.
Velvet does this thing because when there are no quality values in your reads submitted to assembly, the quality values in the .afg file is the sequence itself (.seq and .qlt are redundant)
You can also check amos2ace code (in perl), and you will see that it gets this .qlt values (same thing as the sequence) and convert the sequence bases to ascII characters using ord (), that is a perl function to do that conversion.
Let me get this straight: it assigns a fixed quality value to each nucleotide - so that e.g. all 'A's in the sequence will have the same quality, and all 'C's will have the same value, but different from 'A'? I hope I'm misunderstanding you here.
It sounds like a strange claim to me that Velvet assemblies are "basically infallible". What is this statement based on?
A colleague told me that he resequenced a genome using Illumina. He found 3 errors that even a Sanger whole genome sequence project missed, which he verified by looking at original chromatograms.
Also, in the Velvet website (http://www.ebi.ac.uk/~zerbino/velvet/):
Does Velvet take base caller confidence scores into account?
No it currently does not, although it would be easy to implement. The reason we have not done it yet is because a lot can be inferred from coverage alone.
I'm spawning a new question out of this instead. Velvet: Retain Read Names In Afg File