Hi,
I am validating an in-house pipeline for calling SNPs and INDELs in small genomes. For this purpose I am using the GIAB NA12878 HiSeq 2500 300X coverage dataset: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/
I have downloaded all FASTQ files from this folder and merged the forward reads from the two lanes into a single FASTQ file, and did the same for the reverse reads.
- Do I need to download and merge reads from the other folders as well, for example Sample_U0b, Sample_U0c and so on, together with Sample_U0a? Will the files from Sample_U0a alone give 300X coverage? I cannot find any explanation of whether to merge the files from all samples or to use just one sample.
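To frame the coverage question, here is a rough sketch of the usual estimate: mean depth is the total number of sequenced bases divided by the length of the target. The function names and every number in the usage note are placeholders I made up for illustration; they are not taken from the GIAB files.

```python
import math


def expected_coverage(n_read_pairs, read_length, genome_size):
    """Mean depth = total sequenced bases / target length.

    Assumes paired-end reads of equal length and ignores unmapped,
    duplicate, and trimmed reads, so it is an upper bound in practice.
    """
    total_bases = n_read_pairs * 2 * read_length
    return total_bases / genome_size


def pairs_needed(target_coverage, read_length, genome_size):
    """Number of read pairs needed to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size / (2 * read_length))


if __name__ == "__main__":
    # Hypothetical values: 150 bp reads, a 12,000 bp plasmid, 300X target.
    print(pairs_needed(300, 150, 12_000))
    # Hypothetical values: 1,000,000 pairs spread over a 3,000,000 bp target.
    print(expected_coverage(1_000_000, 150, 3_000_000))
```

The number of read pairs in a FASTQ file is its line count divided by four, so the same arithmetic can be applied to the merged Sample_U0a files to see how far they are from 300X on the full human genome.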
The samples I usually deal with are ssDNA viruses or dsDNA plasmids, 3000 - 12000 bp long.
- Can I map the NA12878 reads to a specific gene (of the same length as above) and extract the subset of reads that map to that region, to use for further SNP and INDEL analysis? Will this approach work for validation?
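To make the subsetting idea concrete, here is a minimal sketch of what I have in mind, written against plain-text SAM records. The function name and the region in the usage note are hypothetical. As a simplification it keeps a read only if its alignment start falls inside the region, so reads that begin upstream and overlap the region are missed; an indexed BAM queried with `samtools view` and a region string would be the more usual route.

```python
def reads_in_region(sam_lines, chrom, start, end):
    """Yield SAM records whose alignment start lies in [start, end] on chrom.

    Coordinates are 1-based and inclusive, as in the SAM POS field.
    Header lines and unmapped reads are skipped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        if fields[2] == chrom and start <= int(fields[3]) <= end:
            yield line


if __name__ == "__main__":
    # Hypothetical file name and region, for illustration only.
    with open("aligned.sam") as handle:
        for record in reads_in_region(handle, "chr21", 1_000_000, 1_012_000):
            print(record, end="")
```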
Previously I was using the PhiX dataset from Illumina to validate the pipeline. The problem is that only SNPs are validated for it, not INDELs.
- Is there any plasmid/viral dataset other than PhiX that I can use for validation? It should contain both SNPs and INDELs (10 bp or longer) at different variant frequencies.
Thanks in advance!!
@wouterDeCoster,
Thank you for your input. However, we don't have the system capacity to analyze whole-genome data and map it to the entire human reference genome. I mapped the data from the above link to chr21, which gives a very low average per-base coverage of 3; that is too low for downstream variant calling.
Can I use a read simulator like dwgsim (https://github.com/nh13/DWGSIM) for SNP and INDEL pipeline validation?
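To show what I mean by simulation-based validation, here is a minimal sketch that plants SNPs and 10 bp INDELs into a reference sequence and records them as a truth set. This is not how dwgsim works internally; it is a plain-Python stand-in, and the function name, the tuple layout of the truth set, and the toy reference are all assumptions of mine.

```python
import random


def spike_variants(reference, n_snps, n_indels, indel_len=10, seed=0):
    """Return (mutated_sequence, truth) with known SNPs and INDELs planted.

    truth is a list of (pos, kind, ref, alt) tuples, where pos is 1-based
    on the ORIGINAL reference and kind is "SNP", "INS" or "DEL". The
    reference is split into equal blocks with one variant per block, so
    variants never overlap. Assumes an uppercase A/C/G/T reference.
    """
    rng = random.Random(seed)
    n_total = n_snps + n_indels
    if n_total == 0:
        return reference, []
    block = len(reference) // n_total
    if block <= 2 * indel_len:
        raise ValueError("reference too short for the requested variants")

    kinds = ["SNP"] * n_snps
    kinds += ["INS" if i % 2 == 0 else "DEL" for i in range(n_indels)]
    rng.shuffle(kinds)

    seq = list(reference)
    truth = []
    # Edit from right to left so earlier coordinates stay valid.
    for i in reversed(range(n_total)):
        pos = i * block + rng.randrange(block - indel_len)  # 0-based
        kind = kinds[i]
        if kind == "SNP":
            ref = seq[pos]
            alt = rng.choice([b for b in "ACGT" if b != ref])
            seq[pos] = alt
        elif kind == "INS":
            ref = ""
            alt = "".join(rng.choice("ACGT") for _ in range(indel_len))
            seq[pos:pos] = list(alt)  # inserted before original base pos
        else:  # DEL
            ref = "".join(seq[pos:pos + indel_len])
            alt = ""
            del seq[pos:pos + indel_len]
        truth.append((pos + 1, kind, ref, alt))
    truth.reverse()
    return "".join(seq), truth


if __name__ == "__main__":
    toy_reference = "ACGT" * 1000  # hypothetical 4000 bp stand-in genome
    mutated, truth = spike_variants(toy_reference, n_snps=5, n_indels=4)
    for variant in truth:
        print(variant)
```

Reads simulated from the mutated sequence can then be run through the pipeline and the calls compared against `truth`. Mixing reads simulated from the original and the mutated sequence in chosen proportions should, in principle, give the variants at different frequencies.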