as far as i know, GIAB provides High_Conf.bed + High_Conf.vcf (Available here ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/), No Fastq files are provided.
but Garvan data is available here (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) with the bed file where you need it for variants calling and benchmark process.
1- download the reference genome (1-22, X, Y and MT) from any database (i used USCS) join all fasta files into one fasta by 'cat' command in lunix.
2- index your reference
bwa index -a bwtsw 'filesName'.fasta
3- align your read by bwa
bwa mem 'filesName'.fasta 'FileName'_1.fastq 'FileName'_2.fastq > 'FileName'.sam
4-Convert Sam file to Bam file
samtools view -@ '#of Thread you want to use as int.' -bS 'FileName'.sam -o 'FileName'.bam
5-Add read group to bam files as needed by GATK
picard-tools AddOrReplaceReadGroups I='InFileName'.bam O='OutFileName'.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM='SampleName' SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT
6-Remove duplicated read
picard-tools MarkDuplicates I='InFileName'.bam O='OutFileName'.bam M=marked_dup_metrics.txt REMOVE_DUPLICATES=true
7-Index last Bam for Varients calling
samtools index -@ '#of Thread you want to use as int.' 'FileName'.bam
8-Varients Calling
gatk HaplotypeCaller --native-pair-hmm-threads ''#of Thread you want to use as int.'' -I='InFileName'.bam -O='OutFileName'.vcf -R='Ref_SeqName'.fasta -L='BedFileName'.bed
9-Now you are ready to compare you VCF (Quary.vcf) with GIAB VCF (True.vcf)
rtg vcfeval -b True.vcf regions='GIAB_BedFile'.bed -c Quary.vcf --evaluation-regions='YourBedFile' -o 'Path to output the reult' -t 'Path to SDF file'
Note:
- Do not forget to create SDF file from you reference sequence by rtg tools format
- Please see 'https://cdn.rawgit.com/RealTimeGenomics/rtg-tools/master/installer/resources/tools/RTGOperationsManual/index.html' for better understanding
- Use Fastqc 'or similar' to check the data quality and trim them if needed
after you achieve the result you want, try to make scrip from the above code in order to process sample in batches, but it the result still not reach you needs read in each tool documents and change the parameter accordingly.
I hope it was clear!
Best wishes!
Hi yussab, few months ago i were working on the same project, but the NA12878 fastq files is generated in house from our sequencer facility. i followed GATK best practice (bwa mem, picard tools and GATK haplotypecaller) and i use our bed file and VCF file with GIAB VCF and High_Conf.bed. to benchmark the variants by both rtg tools and Hap.py (Hap.py use vcfeval from rtg). and the result from both is same. lastly i made a bash script to process samples in batch by applying for loop into fastq files. Regarding Garvan samples it's ok to used them but i did not. happy to share my experience with you !, but which reference genome are going to use ? are going to include the alt, unplaced and unlocalized count ? are going to use bwa for alignments or other aligner ? gatk haplotypecaller or samtools mpileup ? what is you filtering createria ? and the most important question why you want to do this ?
Thank you for your kind answer. For the moment I'm not gonna include alt, unplaced and unlocalized count I'm using bwa and gatk haplotypecaller following the gatk best practices. I'm doing this analysis for learning purpose, I wanted to start from GIAB fastq down to their VCF.
Hi a.alnawfal.1992, can I have your contact (e-mail?) to ask some more information about benchmarking ? Many thanks
Hi emmanouil.a, it is my pleasure to share my email with you to discuss more in benchmarking, but I believe there is no way to send a personal massage in Biostars. just let me know how and I will share it with you!
Best Wishes! Abdullah
I found your e-mail clicking in your username ... I will send you a message by e-mail ... thank you very much for your time!
Great!, Looking forward to speaking with you.
Publications, bench-marking best practices is all linked from main GIAB project page.