Hi all,
I've received a set of BAM files. The variants were called with bcftools:
${bcftools_exe} mpileup -Ou -f "${REF}" \
--bam-list "${bam_list}" \
--regions-file "${bedfile}" \
--annotate 'FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR' \
--redo-BAQ --adjust-MQ 50 --min-MQ 30 |\
${bcftools_exe} call \
--ploidy GRCh37 \
--multiallelic-caller \
--variants-only -O z -o "output.vcf.gz"
However, I suspect cross-contamination between the samples, because many of the HOM_REF genotypes contain a few ALT reads.
For example, some of the genotypes below are called HOM_REF but carry a few ALT reads:
+---------+---------+--------+-------+-------+-----+-----+-----------+----+
| Sample | Type | AD | ADF | ADR | DP | GT | PL | SP |
+---------+---------+--------+-------+-------+-----+-----+-----------+----+
| 28D0609 | HOM_REF | 206,15 | 97,9 | 109,6 | 221 | 0/0 | 0,255,255 | 4 |
| 37D1676 | HOM_REF | 154,10 | 89,5 | 65,5 | 164 | 0/0 | 0,229,255 | 1 |
| 13D0720 | HET | 170,59 | 92,27 | 78,32 | 229 | 0/1 | 134,0,255 | 5 |
| 37D1631 | HOM_REF | 155,16 | 73,8 | 82,8 | 171 | 0/0 | 0,76,255 | 0 |
| 57D1188 | HOM_REF | 85,0 | 39,0 | 46,0 | 85 | 0/0 | 0,255,255 | 0 |
| 14D2313 | HOM_REF | 101,0 | 50,0 | 51,0 | 101 | 0/0 | 0,255,255 | 0 |
| 24D2314 | HOM_REF | 48,0 | 18,0 | 30,0 | 48 | 0/0 | 0,144,255 | 0 |
| 24D0430 | HOM_REF | 64,0 | 31,0 | 33,0 | 64 | 0/0 | 0,193,255 | 0 |
| 18D0610 | HOM_REF | 55,0 | 29,0 | 26,0 | 55 | 0/0 | 0,166,255 | 0 |
+---------+---------+--------+-------+-------+-----+-----+-----------+----+
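To make the ALT signal in the table easier to compare across samples, the AD values can be turned into an ALT read fraction. The following is only a minimal sketch: the function name `alt_fraction` is made up for illustration, and it assumes biallelic sites where AD holds exactly a REF count and one ALT count.

```shell
#!/bin/sh
# alt_fraction (hypothetical helper): read "sample<TAB>REF_count,ALT_count"
# lines on stdin and print sample, depth (REF+ALT) and ALT read fraction.
# Assumption: biallelic sites only; extra ALT counts would be ignored.
alt_fraction() {
    awk -F '\t' '{
        split($2, ad, ",")
        depth = ad[1] + ad[2]
        frac = (depth > 0) ? ad[2] / depth : 0
        printf "%s\t%d\t%.3f\n", $1, depth, frac
    }'
}

# AD values copied from the first five rows of the table above.
printf '28D0609\t206,15\n37D1676\t154,10\n13D0720\t170,59\n37D1631\t155,16\n57D1188\t85,0\n' \
    | alt_fraction
```

On a real VCF the same input could probably be produced with something like `bcftools query -f '[%SAMPLE\t%AD\n]' -i 'GT="RR"' output.vcf.gz`, assuming a bcftools version that supports the `GT="RR"` expression. HOM_REF samples that repeatedly show a similar non-zero ALT fraction across many sites would be consistent with contamination rather than random sequencing error.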
Some samples were sequenced in the same flowcell/lane.
How can I test the hypothesis of cross-contamination?
It was suggested that I use verifyBamID but, as far as I understand, it needs another VCF called with another method (?)
I also tried GATK ContEst, but I have no idea what I'm doing...
java -jar GenomeAnalysisTK.jar -T ContEst -I bam.list -R human_g1k_v37.fasta -o out.metrics --genotypes my.vcf.gz -pf 1000G_phase1.snps.high_confidence.b37.vcf --min_genotype_depth 20 -L 22
INFO 10:17:00,850 ContEst - Total sites: 31803838
INFO 10:17:00,860 ContEst - Population informed sites: 310728
INFO 10:17:00,861 ContEst - Non homozygous variant sites: 310728
INFO 10:17:00,861 ContEst - Homozygous variant sites: 0
INFO 10:17:00,861 ContEst - Passed coverage: 0
INFO 10:17:00,861 ContEst - Results: 0
Any suggestions?
It was also suggested that I look for rare variants: they should not be shared between unrelated samples.
Ideally, if the original samples are available, independent SNP genotyping would be the way to verify sample identity.
verifyBamID does need a VCF, but it is a population reference VCF (1000 Genomes).
I've used it to detect contamination in a targeted panel with reasonable results. See my question on their user group page.
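As a rough sketch of how that could look per sample, the helper below only prints one verifyBamID command line per BAM (a dry run; nothing is executed). The function name `print_verifybamid_cmds`, the BAM file names and the `.vbid` output prefix are made up for illustration, and the population VCF name is simply the one from the ContEst command in the question. It assumes verifyBamID 1.x, whose `--vcf`, `--bam`, `--out` and `--ignoreRG` options are used here, and a VCF that carries allele frequencies.

```shell
#!/bin/sh
# print_verifybamid_cmds (hypothetical helper): for each BAM path read from
# stdin, print one verifyBamID command line. Dry run only.
# $1 = population reference VCF (assumed to carry allele frequencies)
print_verifybamid_cmds() {
    ref_vcf=$1
    while IFS= read -r bam; do
        [ -n "$bam" ] || continue              # skip blank lines
        prefix=$(basename "$bam" .bam)         # output prefix from BAM name
        printf 'verifyBamID --vcf %s --bam %s --out %s.vbid --ignoreRG\n' \
            "$ref_vcf" "$bam" "$prefix"
    done
}

# Example with two made-up BAM names; in practice the list of BAM paths
# would be fed in instead.
printf 'sampleA.bam\nsampleB.bam\n' \
    | print_verifybamid_cmds 1000G_phase1.snps.high_confidence.b37.vcf
```

After checking the printed commands, they can be piped to `sh` to run them. The contamination estimate should then be in the FREEMIX column of each `.selfSM` output file.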
from twitter: