Hi all,
I've recently downloaded the simple somatic mutation (SSM) file for clear cell renal cell carcinoma (ccRCC) from the ICGC Data Repository, but I've been having some trouble interpreting the quality score column.
Below is a snippet of my data (.tsv file):
chromosome chromosome_start chromosome_end chromosome_strand mutation_type reference_genome_allele mutated_from_allele mutated_to_allele quality_score probability total_read_count
1 224822287 224822287 1 single base substitution T T G 223 46 26
1 224822287 224822287 1 single base substitution T T G 223 46 26
However, I'm not sure why the quality score is so high. For every entry the quality score is between 100 and 223. Some say that Phred scores can in fact range from 0 to infinity (http://gatkforums.broadinstitute.org/discussion/4260/how-should-i-interpret-phred-scaled-quality-scores), while others say that scores in the 200 range probably mean that the signal was too low (http://seqanswers.com/forums/showthread.php?t=23770).
The ICGC website describes the quality score column as the quality of the mutation call, not of the alignment etc. (http://docs.icgc.org/simple-somatic-mutations-ssm-primary-analysis-file-p).
The remaining columns indicate that samtools pileup was used for the raw variant calls, along with other analysis tools such as GATK, Picard, VCF tools etc. For all calls, no verification with an orthogonal platform or biological validation was carried out.
Can anyone confirm whether this does in fact imply high quality, or whether I should be looking out for something else?
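For reference, this is how I'm reading the numbers, assuming the column is Phred-scaled (which I haven't been able to confirm for this file). Under the standard definition Q = -10 * log10(P), a rough sketch of the conversion looks like this; the function names are my own, not from any tool:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# ASSUMPTION: the quality_score column is Phred-scaled.
# 100 and 223 are the lowest and highest values I see in my file.
for q in (20, 30, 100, 223):
    print(f"Q={q:>3}  implied P(error)={phred_to_error_prob(q):.3g}")
```

If that reading is right, even the lowest score in my file would imply a vanishingly small error probability, which is why the values look suspicious to me.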
Thanks in advance,
Tracey
Thank you for your response, Ying, but I used the EU/FR data set since they carried out whole genome sequencing (https://dcc.icgc.org/repository/current/Projects/RECA-EU). They used samtools mpileup for variant calling. Thank you for going to the trouble of pasting the link to the VarScan documentation.
If you also have some experience with samtools, I would be happy to hear your thoughts on the quality scores.
tbh I'm not very sure how samtools pileup/mpileup outputs quality values, or which one is being used for the SSM file. There are multiple posts on this website asking about samtools pileup/mpileup and quality values. To go back to your original question, I would assume that the high quality values mean the calls are good enough for your purposes, since they are being distributed; the lower-quality variants were probably filtered out. If you don't trust it, you would have to look for the raw data and do the variant calling yourself (which you will have to get authorization for, since tumor/normal BAM files are protected patient data). I was under the impression that the data on the ICGC website will eventually have normalized variant calling data using the same pipeline.
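One thing you can check without the raw data is how the scores are distributed in the file you already have. Below is a minimal sketch, assuming the file is tab-separated with a header row that includes the quality_score column from your snippet; the function name is made up for illustration. If a large share of calls sits exactly at the maximum value, that might suggest the score is capped rather than measured, though that is only a guess on my part.

```python
import csv
import io

def summarize_quality(handle, column="quality_score"):
    """Summarize a numeric column of a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    scores = []
    for row in reader:
        value = row.get(column)
        if value in (None, ""):
            continue  # skip rows with no score
        try:
            scores.append(float(value))
        except ValueError:
            continue  # skip non-numeric entries
    if not scores:
        return None
    top = max(scores)
    return {
        "n": len(scores),
        "min": min(scores),
        "max": top,
        "at_max": sum(1 for s in scores if s == top),  # calls sitting at the top value
    }

# Demo on the two rows from the post (a real run would pass an open file handle).
header = ["chromosome", "chromosome_start", "chromosome_end", "chromosome_strand",
          "mutation_type", "reference_genome_allele", "mutated_from_allele",
          "mutated_to_allele", "quality_score", "probability", "total_read_count"]
row = ["1", "224822287", "224822287", "1", "single base substitution",
       "T", "T", "G", "223", "46", "26"]
sample = "\n".join("\t".join(fields) for fields in (header, row, row)) + "\n"
summary = summarize_quality(io.StringIO(sample))
print(summary)
```

On the full file you would call it as summarize_quality(open(path, newline="")), where path is wherever you saved the download.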
Hi Ying,
I've gone through the questions about samtools/mpileup but none of them seem to address the issue of the quality score. I did write to ICGC about two weeks ago and again today. I'm awaiting a response. Thank you.