First post here, complete newbie. I'm not sure if this is the relevant place to ask this, so please excuse me if I'm wrong.
I have done WGS at Dante Labs and imported the BAM file (100GB) into sequencing.com. The import into Promethease is still pending. I would like to be able to explore "my own" configuration of all SNPs known to date. For example, if I'm reading an article that mentions rs10994415, rs1006737 and other SNPs, I expect to be able to copy-paste them into sequencing.com (selecting the "Variant" column) and see my genotype. However, some SNPs are present while others aren't.
What am I missing? Initially I had a .VCF.GZ file, which was supposed to be the "difference" between my genome and the reference genome. My assumption was that the missing SNPs were absent because they were the same as the reference genome and hence not shown in sequencing.com/Promethease. That's why I requested the BAM file, expecting it to be "my own data aligned with the reference genome", so all possible SNPs would be included.
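For what it's worth, this is how I checked whether a given SNP is present in the VCF at all (a minimal sketch; "my_dante.vcf.gz" is a placeholder for the actual file name, and this only finds SNPs whose ID column is annotated with an rs number):

# Prints the matching line(s) if the rsID was called as a variant; prints nothing otherwise
zgrep -w -e "rs10994415" -e "rs1006737" my_dante.vcf.gz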
Please excuse my lack of knowledge. I'm trying to deal with something outside my domain of expertise, and for weeks I've been unable to understand what's going on.
German.M.Demidov, thank you! I'm waiting for the third post in your great Medium series: https://medium.com/@german.m.demidov/how-to-analyse-your-own-dna-a-personal-experience-c4057d41753f
I would appreciate it if you could answer some of my questions here. Please also correct any mistakes I may have made. Your contribution will be awesome and will benefit a lot of people. I'm sure this post will remain useful for ages.
Over the last month my knowledge has improved. I spent countless hours reading and figuring out what I should do. Not to mention that my humble machine spent the whole month processing data 24/7.
What I did so far:
I decided to start from Dante's FASTQ files and not use the BAM file. Why? I wanted to use GRCh38 rather than GRCh37. Why? Because it will probably hold up better over time.
The reference FASTA I chose is: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh38.p13.genome.fa.gz
Then I used minimap2 to align my FASTQ files to the reference genome. Why not bwa-mem? Because it complained that "paired reads have different names". Then I sorted the output with samtools. I'm skipping small steps like creating indexes and dictionaries. I'm preparing a blog post where I will explain the details.
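Roughly, those steps looked like this (file names and thread counts are placeholders, and I'm reconstructing the flags from memory, so treat it as a sketch):

# Align paired-end reads with minimap2's short-read preset; a read group is added
# because HaplotypeCaller refuses BAMs without one (SM:bar is where the "bar"
# sample column in my gVCF below comes from):
minimap2 -ax sr -t 8 -R '@RG\tID:dante\tSM:bar' \
  GRCh38.p13.genome.fa.gz reads_R1.fastq.gz reads_R2.fastq.gz \
  | samtools sort -@ 8 -o myBam.bam -
samtools index myBam.bam
# The "small steps" I skipped: the GENCODE reference ships as plain gzip and has to be
# recompressed with bgzip before samtools/GATK can index it, then .fai + .dict:
zcat GRCh38.p13.genome.fa.gz | bgzip -@ 8 > genome.tmp && mv genome.tmp GRCh38.p13.genome.fa.gz
samtools faidx GRCh38.p13.genome.fa.gz
gatk-4.1.4.1/gatk CreateSequenceDictionary -R GRCh38.p13.genome.fa.gz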
Then I downloaded a relatively recent build of dbSNP: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz
Newer builds like 153 produced an "incompatible contigs" error in the recent GATK 4.1.4.1 but, strangely enough, seemed to work in GATK 3.8, which I later dropped from "my pipeline".
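If I understood the error correctly, the 153 clash is a contig-naming mismatch: NCBI's dbSNP names contigs by RefSeq accession (e.g. NC_000001.11) while my reference uses chr1. If so, renaming the contigs should be enough; something like this (untested by me, so consider it an assumption):

# chr_map.txt holds one "old_name<TAB>new_name" pair per line, e.g. NC_000001.11 -> chr1
bcftools annotate --rename-chrs chr_map.txt dbsnp_153.vcf.gz -Oz -o dbsnp_153.renamed.vcf.gz
tabix -p vcf dbsnp_153.renamed.vcf.gz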
For variant calling I used GATK:

gatk-4.1.4.1/gatk --java-options "-Xmx12g" HaplotypeCaller \
  --native-pair-hmm-threads 8 \
  -R GRCh38.p13.genome.fa.gz \
  -I myBam.bam \
  -D dbsnp_146.hg38.vcf.gz \
  -O genome.g.vcf.gz \
  -ERC GVCF \
  --output-mode EMIT_ALL_ACTIVE_SITES
The result is a 4GB (gzip compressed) gVCF file.
Am I correct in assuming that the combination of "-ERC GVCF" and "--output-mode EMIT_ALL_ACTIVE_SITES" produced a gVCF file where my genotype at every possible position is present? If the answer is yes, I can probably assume that tools like Promethease will be able to know my genotype (letter) for every SNP known to humanity so far. Known SNPs should also be annotated with the proper rsXXXXX tag, which should be irrelevant for Promethease, but a good thing to have if you search the file directly to see your variant.
Also, as far as I understand, the gVCF file has multiple regions (called blocks?) which essentially group "equal to reference" stretches together to reduce the file size, while keeping the "different" parts like a regular VCF. See: https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf
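You can eyeball those blocks yourself. Assuming GATK wrote the usual .tbi index next to the output, something like:

# Records overlapping a region; reference blocks carry a <non_ref> ALT plus an END= tag:
tabix genome.g.vcf.gz chr1:10000-20000 | head
# Or just skim the column header and the first records:
zcat genome.g.vcf.gz | grep -v "^##" | head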
Is all this correct? I don't know, however, whether Promethease is smart enough to infer all it needs. I just uploaded the gVCF to Promethease, paid $12, and I'm expecting the results within a matter of minutes/hours.
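In case Promethease turns out not to understand gVCF blocks, my fallback plan is to collapse the gVCF into a regular VCF myself with GenotypeGVCFs (same file names as above; I haven't verified the output yet, so this is a sketch):

gatk-4.1.4.1/gatk --java-options "-Xmx12g" GenotypeGVCFs \
  -R GRCh38.p13.genome.fa.gz \
  -V genome.g.vcf.gz \
  -O genome.vcf.gz
# add --include-non-variant-sites to keep the "equal to reference" positions as explicit rows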
I examined the gVCF file and it contains two possible "line types":
1) "your DNA is equal to the reference with such and such confidence"
2) "your DNA is different that reference and here is how"
Here is an example of the first line type:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bar
chr1 10515 . A <non_ref> . . END=16004 GT:DP:GQ:MIN_DP:PL 0/0:15:36:15:0,36,540
This means that there is a block on chromosome 1, starting at position 10515 and running until position 16004, where my DNA is the same as the reference genome. The fields at the end, "0/0:15:36:15:0,36,540", follow the FORMAT column (GT:DP:GQ:MIN_DP:PL): genotype 0/0 (homozygous reference), read depth 15, genotype quality 36, minimum depth 15 within the block, and phred-scaled genotype likelihoods 0,36,540. In other words, they describe how confidently my DNA matches the reference in this region. There are many such lines for the "equal" parts of my DNA only because the statistics differ between blocks, and the statistics matter because they tell you how probable the "equal to reference" call is.
This should give Promethease "the assurance" to know that my "letters" are the same as the reference genome between those positions. Essentially, this information is missing from Dante's regular (non-genomic) VCF, as it shows only the differences and gives no statistics about the equal parts.
Here is an example of the second line type (difference = variant):
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bar
chr1 87021 rs188486692 T C,<non_ref> 420.60 . BaseQRankSum=-1.835;DB;DP=27;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.576;RAW_MQandDP=62908,27;ReadPosRankSum=1.117 GT:AD:DP:GQ:PL:SB 0/1:14,13,0:27:99:428,0,485,470,524,994:7,7,2,11
This means that in chromosome 1, at position 87021 (known as rs188486692 in dbSNP build 146) the reference letter is "T", but one of my two copies is "C" (the genotype 0/1 means heterozygous), with 420.60 quality (whatever that means) and some statistics describing how "probable" this is.
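Incidentally, this is also how I now look up a single position's genotype directly instead of waiting for Promethease (assuming the gVCF is indexed; bcftools is an extra tool, not part of my GATK steps above):

# Genotype fields for the rs188486692 example at chr1:87021:
bcftools query -f '%CHROM %POS %ID %REF %ALT [ %GT:%DP:%GQ]\n' -r chr1:87021 genome.g.vcf.gz
# Or search by rsID without an index (slower):
zgrep -w "rs188486692" genome.g.vcf.gz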
I went through the same steps with my friend, who is a programmer and not a bioinformatician, during the last month =) yeah, there are a lot of problems. I'll try to answer your questions once I have some time.
In the meantime, if you have a lot of computational resources, you can try https://github.com/imgag/megSAP - it is 300GB, but it's a clinically validated pipeline. If you fail to run it, better write here rather than asking the software engineer there - he does not have time =) I'll try to explain the different problems.
The post by Alex is also great.
I also stumbled upon this post: https://apeltzer.github.io/post/02-dante-wgs/
It describes a ready-to-use pipeline which probably automates much of what I did: https://github.com/SciLifeLab/Sarek
As far as I understand, you give it your FASTQ files and it does all the magic with the tools I've already figured out myself. Any comments will be welcome. I'm writing a blog post to explain all of this, but it has been a work in progress for half a year now...
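I haven't tried Sarek yet; from skimming its README the invocation is a Nextflow one-liner along these lines (the exact flags have changed between releases, so double-check the Sarek docs rather than trusting my sketch):

# Point the pipeline at a sample sheet describing the FASTQ pairs:
nextflow run SciLifeLab/Sarek --sample my_samples.tsv --genome GRCh38 --tools HaplotypeCaller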