Hi all,
I am working with Illumina HiSeq 2000 DNA-Seq data from 2 human saliva samples.
Two runs were performed for each sample: cleaned (human DNA with the bacterial DNA removed) and non-cleaned.
My aim is to extract SNPs with rsID.
To process the fastq.gz files I used the bcbio-nextgen package.
I ran two variant-calling analyses: the first with only the cleaned runs (both samples analyzed together), the second with the fastq files from both runs concatenated for each sample.
I used vcftools to extract non-indel polymorphisms and bash scripting to extract all SNPs with rsID.
Here I have a problem: the analysis of both runs yields about 5 times fewer polymorphisms than the analysis of the cleaned DNA runs alone.
See the table:
                                          Cleaned    All
Vcftools - #SNPs before filtering         6167677    1229010
Vcftools - #SNPs after removing indels    5213278    1024912
#SNPs with rs number                      4917067    978823
Do you have any idea what could cause it? Analysis parameters in bcbio were the same for both analyses, the only difference is in the input fastq.gz files.
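As a quick sanity check on the table, the Cleaned/All ratio can be computed at each stage. This is only a sketch: `fold_ratio` is a helper name made up for illustration, and the numbers are the ones from the table.

```shell
# Print the ratio of two counts with two decimals.
# fold_ratio is an illustrative helper, not part of vcftools or bcbio-nextgen.
fold_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

fold_ratio 6167677 1229010   # before filtering
fold_ratio 5213278 1024912   # after removing indels
fold_ratio 4917067 978823    # with rs number
```

The ratio stays close to 5 at every stage, which suggests the difference already exists in the VCF produced by the variant caller and is not introduced by the indel removal or the rsID filtering.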
Thanks,
Anastassiya
P.S. Here are the commands that I used to extract SNPs with rsID.
#remove indels
vcftools --vcf ket-gatk-haplotype.vcf --out ket-gatk-haplotype_noindel --remove-indels --recode --recode-INFO-all
#extract header lines (pattern quoted so the shell passes it to grep unchanged)
grep -E '^#' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.head
#extract rows with an rs ID (pattern quoted to prevent glob expansion)
grep -E '[[:space:]]rs' ket-gatk-haplotype_noindel.recode.vcf > ket_ngs.rs
#concatenate the two files
cat ket_ngs.head ket_ngs.rs > ket_ngs_rs.vcf
#create tped and tfam PLINK files
vcftools --vcf ket_ngs_rs.vcf --out ket_ngs_plink --plink-tped
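The grep on whitespace followed by "rs" matches any tab-separated field that starts with "rs", not only the ID column. A possible alternative, sketched here with awk, tests only the third column (the VCF ID field) and keeps the header lines in the same pass. The function name `extract_rs_records` is made up for illustration.

```shell
# Keep header lines plus records whose ID column (field 3) starts with "rs"
# followed by digits. Records whose ID field lists another identifier first
# (for example "COSM1;rs5") are NOT kept by this simple pattern.
# extract_rs_records is an illustrative name, not a vcftools command.
extract_rs_records() {
    awk -F'\t' '/^#/ || $3 ~ /^rs[0-9]+/' "$1"
}

# Example with the file names used above:
# extract_rs_records ket-gatk-haplotype_noindel.recode.vcf > ket_ngs_rs.vcf
```

This replaces the separate head/rows/concatenate steps with one command and should give the same records whenever "rs" appears only at the start of the ID column.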
Could you give us some more information about the library you sequenced, such as the mean coverage and whether it was genome or exome?
It was genome sequencing.
Here are values for both analyses for two samples:
For each of your samples you are removing ~1/2 of the reads as "contaminants". Could you explain this procedure a bit more? I have a feeling that including this extra coverage is somehow adding more noise to your samples. bcbio-nextgen has many configuration options. Perhaps you could post the YAML configuration file for the run?
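To put numbers on how many reads each run contributes, and to confirm that the concatenated paired files still hold the same number of reads, the reads in each fastq.gz file can be counted. A minimal sketch, assuming standard four-line FASTQ records; `count_fastq_reads` and the file name in the comment are hypothetical.

```shell
# Count reads in a gzipped FASTQ file, assuming 4 lines per record.
# count_fastq_reads is an illustrative helper name.
count_fastq_reads() {
    gzip -dc < "$1" | awk 'END { printf "%.0f\n", NR / 4 }'
}

# Example (hypothetical file name):
# count_fastq_reads sample1_cleaned_R1.fastq.gz
```

Comparing the counts for the cleaned, non-cleaned, and concatenated files of each sample would show how much of the combined input comes from the non-cleaned run.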