Question

Reproduce Genome in a Bottle analysis from NA12878 fastq?

4

Entering edit mode

4.2 years ago

yussab ▴ 100

Hi Everyone,

I recently heard about the genome in a bottle project.

I found raw fastq data for NA12878 samples from their GitHub repository. Now I'm trying to reproduce their analysis.

https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina_HiSeq_Exome_Garvan_fastq_09252015

I would like to know if it's available a reference code to compare my pipeline (not only the final result but the code).

Thank you in advance,

AY

genome in a bottle NA12878 • 4.3k views

ADD COMMENT • link updated 11 days ago by jkim ▴ 190 • written 4.2 years ago by yussab ▴ 100

1

Entering edit mode

Hi yussab, few months ago i were working on the same project, but the NA12878 fastq files is generated in house from our sequencer facility. i followed GATK best practice (bwa mem, picard tools and GATK haplotypecaller) and i use our bed file and VCF file with GIAB VCF and High_Conf.bed. to benchmark the variants by both rtg tools and Hap.py (Hap.py use vcfeval from rtg). and the result from both is same. lastly i made a bash script to process samples in batch by applying for loop into fastq files. Regarding Garvan samples it's ok to used them but i did not. happy to share my experience with you !, but which reference genome are going to use ? are going to include the alt, unplaced and unlocalized count ? are going to use bwa for alignments or other aligner ? gatk haplotypecaller or samtools mpileup ? what is you filtering createria ? and the most important question why you want to do this ?

ADD REPLY • link 4.2 years ago by a.alnawfal.1992 ▴ 370

0

Entering edit mode

Thank you for your kind answer. For the moment I'm not gonna include alt, unplaced and unlocalized count I'm using bwa and gatk haplotypecaller following the gatk best practices. I'm doing this analysis for learning purpose, I wanted to start from GIAB fastq down to their VCF.

ADD REPLY • link 4.1 years ago by yussab ▴ 100

0

Entering edit mode

Hi a.alnawfal.1992, can I have your contact (e-mail?) to ask some more information about benchmarking ? Many thanks

ADD REPLY • link 3.5 years ago by emma.a ▴ 130

0

Entering edit mode

Hi emmanouil.a, it is my pleasure to share my email with you to discuss more in benchmarking, but I believe there is no way to send a personal massage in Biostars. just let me know how and I will share it with you!

Best Wishes! Abdullah

ADD REPLY • link 3.5 years ago by a.alnawfal.1992 ▴ 370

0

Entering edit mode

I found your e-mail clicking in your username ... I will send you a message by e-mail ... thank you very much for your time!

ADD REPLY • link 3.5 years ago by emma.a ▴ 130

0

Entering edit mode

Great!, Looking forward to speaking with you.

ADD REPLY • link 3.5 years ago by a.alnawfal.1992 ▴ 370

0

Entering edit mode

Publications, bench-marking best practices is all linked from main GIAB project page.

ADD REPLY • link 4.2 years ago by GenoMax 147k

5

Entering edit mode

4.1 years ago

a.alnawfal.1992 ▴ 370

as far as i know, GIAB provides High_Conf.bed + High_Conf.vcf (Available here ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/), No Fastq files are provided. but Garvan data is available here (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) with the bed file where you need it for variants calling and benchmark process.

1- download the reference genome (1-22, X, Y and MT) from any database (i used USCS) join all fasta files into one fasta by 'cat' command in lunix.

2- index your reference

bwa index -a bwtsw  'filesName'.fasta

3- align your read by bwa

bwa mem 'filesName'.fasta 'FileName'_1.fastq 'FileName'_2.fastq > 'FileName'.sam

4-Convert Sam file to Bam file

samtools view -@ '#of Thread you want to use as int.' -bS  'FileName'.sam -o  'FileName'.bam

5-Add read group to bam files as needed by GATK

picard-tools AddOrReplaceReadGroups I='InFileName'.bam O='OutFileName'.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM='SampleName' SORT_ORDER=coordinate CREATE_INDEX=TRUE VALIDATION_STRINGENCY=LENIENT

6-Remove duplicated read

picard-tools MarkDuplicates I='InFileName'.bam O='OutFileName'.bam M=marked_dup_metrics.txt REMOVE_DUPLICATES=true

7-Index last Bam for Varients calling

samtools index -@  '#of Thread you want to use as int.' 'FileName'.bam

8-Varients Calling

gatk HaplotypeCaller --native-pair-hmm-threads ''#of Thread you want to use as int.'' -I='InFileName'.bam -O='OutFileName'.vcf -R='Ref_SeqName'.fasta -L='BedFileName'.bed

9-Now you are ready to compare you VCF (Quary.vcf) with GIAB VCF (True.vcf)

rtg vcfeval  -b True.vcf regions='GIAB_BedFile'.bed -c Quary.vcf --evaluation-regions='YourBedFile' -o 'Path to output the reult' -t 'Path to SDF file'

Note:

Do not forget to create SDF file from you reference sequence by rtg tools format
Please see 'https://cdn.rawgit.com/RealTimeGenomics/rtg-tools/master/installer/resources/tools/RTGOperationsManual/index.html' for better understanding
Use Fastqc 'or similar' to check the data quality and trim them if needed

after you achieve the result you want, try to make scrip from the above code in order to process sample in batches, but it the result still not reach you needs read in each tool documents and change the parameter accordingly.

I hope it was clear! Best wishes!

ADD COMMENT • link 4.1 years ago by a.alnawfal.1992 ▴ 370

0

Entering edit mode

Hi a.alnawfal.1992, thank you for your kind answer... So it's not necessary to process the Variant Filtration/Refine Genotypes/Variant Annotation steps??? In this way you use raw vcf snps + indels??

ADD REPLY • link 4.1 years ago by yussab ▴ 100

0

Entering edit mode

Hi, it's completely depends on what you want to achieve. For example if you Decided to filter variants that have DP less than 5 , so your result is only for variants that have DP more than 5 in terms of false positive, false negative,..etc.. regarding annotation it's very important for detecting disease causes mutation, but you experiment for benchmark the variants not for detecting the disease causing mutation.

ADD REPLY • link 4.1 years ago by a.alnawfal.1992 ▴ 370

score 7 · Accepted Answer · 2020-10-01

Hi Yussab, as you did not replay, so i do not know what are you planning to do exactly. it seems that you do have your own pipeline which mean you have the final VCF where it's generated from NA12878. i know these 2 tools to compare you result with GIAB Data set. (I do not know if there is something else available)

1- Hap.py (https://github.com/Illumina/hap.py):

hap.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa

2-RTG tools (https://github.com/RealTimeGenomics/rtg-tools):

generate SDF file from Fasta file (Your reference):

rtg format "Your_Fasta_File".fasta -o "File_Name_ToOut_Put_Result"

Compare the tow VCF file:

rtg vcfeval -b  truth.vcf -c query.vcf -o Directary_To_OUTPUT_THE_RESULT -t PATH_TO_SDF_FILE -e GIAB_BED_FILE.bed  --region= Your_Bed_File.bed

I highly recommend you to read the tools manual for better understanding.