Hello everyone,
I run into weird results when calling a list of variants from a gold standard genome NA12878. My goal is to evaluate how well perform two different calling tools, so I aim to produce a vcf containing a list of variants from a known genome and then compare it with the high confidence list produced by Illumina in Platinium Genomes (https://www.illumina.com/platinumgenomes.html) to find the true positive, false positive etc... It is quite similar with the work done in those two papers: http://www.nature.com/articles/srep17875 http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html
To do so, I have downloaded the sequences located here : ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/ So from what I understand, it corresponds to one of the sequencing runs for NA12878.
I then have merged the fastq in two files (paired end), trimmed the first 10 bp and aligned them using BowTie and the human genome 38 reference on Ensembl. ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz The output was a bam and a bai. So far so good. I hope.
Now I want to call variants so I compare them with the variants list provided by Illumina for NA12878 at this adress: ftp://platgene_ro@ussd-ftp.illumina.com/2016-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz
After importing the BAM/BAI, I use the annotation file located on Ensmbl : ftp://ftp.ensembl.org/pub/release-87/regulation/homo_sapiens/homo_sapiens.GRCh38.motiffeatures.20161111.gff.gz and run the calling software.
Two problem arise:
- The Variant Effect Annotation step tells me the annotation contains no protein gene records. Alright, I can do without this information as long as I have a list of SNPs and indels as an output but that's weird.
- The actual SNP/indel output has identified something like 700 SNPs with default parameters. With parameters not stringent at all, I can find up to 3000 SNPs but nothing even remotely close to the millions found by Illumina.
Any comparison at this point is quite irrelevant. I probably have a problem with the various files I input, but can't figure out where is the problem coming from. Any thoughts on this question ? Is my "roadmap" to produce the list of SNPs and compare it to Illumina's one correct?
Thanks a lot for your input!
Thanks for all those links; I found them helpful! Although, I've actually run into a lot of trouble determining what reference NIST used for aligning its reads. Their reference links are dead, and the description is extremely vague, so their VCF is useless to me for now. Instead I'm using Illumina's VCF. Note that you can't trivially wget Illumina's files - they require a blank password.
If anyone knows where to download the reference NIST is using for alignment, I'd be happy to hear it... it is NOT this:
ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
I don't know what that is, but please don't use it for anything. It contains 48 Gbp of sequence, 94% of which is N. For reference, the actual human genome is around 3 Gbp. On the upside, it does compress very well.
On the other hand, Illumina is using HG19 which is very easy to acquire. For some reason they decided to omit chromosome MT, so of course you should not follow their example exactly because they don't seem to follow best practices.
Hello Brian, Thanks a lot for you reply!
You are completely right for the HG38 file! After checking more in details, it contains 44807691997 N characters out of 48669025517 total characters (!). So if I understand, I should:
Is this alright? It's really hard not to get confused by the different builds/reference, so I prefer to be sure. Another question, do you know where I can find the variants called for NA12878 by NIST ? I can't find the corresponding vcf anywhere. If the NIST list exists, has anyone compared it with the list produced by Illumina (always for NA12878) ?
Cheers,
EDIT: oh, and on a side note, I found this program to compare VCF, seems very useful : https://github.com/Illumina/hap.py
From Ensembl description of your file (README file in the directory of GRCh38 link you provided):
What you need is a set of chromosomes with repeats and low complexity regions masked for alignment. If you want to use alternative haplotypes (ie patches) here is a good tutorial http://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38
I disagree. Masking prior to alignment is a bad idea, if you care about reads mapping correctly. It causes false positives, false negatives, and incorrect mapq calculation.
I just corresponded with NIST, and they now have a new location for the reference:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
Personally, I recommend mapping to chr1-22 plus X, Y, and MT, and discarding alt/unplaced sequences, which are low-quality and often nonhuman. Whoever is responsible for curating the human reference genome is not doing a good job.
As for your question - I'd stay away from ensembl as they seem to be making a corrupt human reference publicly available, unless I'm misunderstanding something. Illumina's VCFs are very good, but have Illumina biases; NIST has VCFs that list multiple different platforms (on a per-variant basis it lists which platforms support that variant), and are thus probably better.
Thank you both Petr and Brian for your help.
It baffles me that NIST / Ensembl are not curating human reference genome properly. I can't imagine the time lost by research labs across the world because of that issue. Anyway, I will proceed with the new reference you provided. Thanks a lot for the link!