to complete Brad's answer. Here is a part of my solution. Here I have used samtools mpileup rather than the GATK and some of my (experimental) tools.
Ok, first here is the Makefile that was used to generate the data for my students (the coverage is low as I didn't want to generate some large files and I forgot to discard the duplicates.
sam_dir=/usr/local/package/samtools-0.1.18
bwa_dir=/usr/local/package/bwa-0.6.0
bwa_bin=${bwa_dir}/bwa
sam_bin=${sam_dir}/samtools
KROM=chr22
VARKIT=/home/lindenb/src/variationtoolkit/bin
MAKEFILE=-f 20111205.mk
MPILEUP=## -B -e 10
#http://theory.uwinnipeg.ca/localfiles/infofiles/make/make_36.html
.SECONDARY:
%.txt.gz : %.txt
gzip -f --best $<
%.fq.gz : %.fq
gzip -f --best $<
%_1.fq %_2.fq:%.fa
${sam_dir}/misc/wgsim -N 500000 -r 0 -h $< $(subst .fa,_1.fq,$<) $(subst .fa,_2.fq,$<)
sed "s/^@.*/@${KROM}/" $(subst .fa,_1.fq,$<) > $(subst .fa,.tmp.fq,$<)
mv $(subst .fa,.tmp.fq,$<) $(subst .fa,_1.fq,$<)
sed "s/^@.*/@${KROM}/" $(subst .fa,_2.fq,$<) > $(subst .fa,.tmp.fq,$<)
mv $(subst .fa,.tmp.fq,$<) $(subst .fa,_2.fq,$<)
%.sai:%.fq.gz
${bwa_bin} aln ${KROM}db $< > $@
%.fai:%.fa
${sam_bin} faidx $<
%.sam:%_1.sai %_2.sai
echo $<
${bwa_bin} sampe -a 700 ${KROM}db \
$(subst .sam,_1.sai,$@) \
$(subst .sam,_2.sai,$@) \
$(subst .sam,_1.fq.gz,$@) \
$(subst .sam,_2.fq.gz,$@) > $@
%.bam:%.sam
${sam_bin} view -b -T ${KROM}.fa $< > $@.tmp
${sam_bin} sort $@.tmp $@.tmp.sorted
mv $@.tmp.sorted.bam $@
${sam_bin} index $@
%.bcf:%.bam
${sam_bin} mpileup ${MPILEUP} -uf ${KROM}.fa $(subst .bcf,.bam,$@) |\
${sam_dir}/bcftools/bcftools view -bvcg - > $@
%.vcf:%.bcf
${sam_dir}/bcftools/bcftools view $< > $@
${KROM}.fa:
${sam_dir}/samtools faidx /GENOTYPAGE/data/pubdb/ucsc/hg19/chromosomes/hg19.fa ${KROM} > ${KROM}.fa
${sam_dir}/samtools faidx ${KROM}.fa
${bwa_bin} index -a bwtsw -p ${KROM}db ${KROM}.fa
snps.txt:
mysql -N -u anonymous -D hg19 -e 'select distinct chrom,chromStart+1,"A","A",func from snp132 where chrom="chr22" and chromStart > 21304567 and chromStart < 31742029 and func="coding-synon" order by avHetSE limit 10' | head > $@
ignore.txt:
echo -e "chr22\t41696567\t41757151" > $@
capture.txt:
mysql -N -u anonymous -D hg19 -e "select chrom,exonCount,exonStarts,exonEnds from knownGene where chrom='${KROM}' and cdsStart<cdsEnd and="" txStart=""> 21304567 order by chrom " |\
awk -F ' ' '{n=int($$2); split($$3,B,"[,]");split($$4,E,"[,]"); for(i=1;i<=n;++i) printf("%s\t%d\t%d\n",$$1,int(B[i])-350,int(E[i])+350);}' |\
${VARKIT}/bedmerge > $@
father.dir: capture.txt ${KROM}.fa snps.txt ignore.txt
${VARKIT}/genomesim -o $@.tar -r 0.0001 -i ignore.txt -f capture.txt -u snps.txt -m chr22 41742029 A . ${KROM}.fa
tar xvf $@.tar
rm $@.tar
cat $@/homologous1.fa $@/homologous2.fa > $@/genome.fa
$(MAKE) $(MAKEFILE) $@/genome.vcf
grep 41742029 $@/genome.vcf
mother.dir: capture.txt ${KROM}.fa snps.txt ignore.txt
${VARKIT}/genomesim -o $@.tar -r 0.0001 -i ignore.txt -f capture.txt -u snps.txt -m chr22 41742047 G . ${KROM}.fa
tar xvf $@.tar
rm $@.tar
cat $@/homologous1.fa $@/homologous2.fa > $@/genome.fa
$(MAKE) $(MAKEFILE) $@/genome.vcf
grep 41742047 $@/genome.vcf
child.dir: father.dir mother.dir
mkdir $@
cat father.dir/homologous1.fa mother.dir/homologous1.fa > $@/genome.fa
${sam_dir}/misc/wgsim -N 500000 -r 0.00001 -h $@/genome.fa $@/genome_1.fq $@/genome_2.fq > $@/wgsim.tsv
$(MAKE) $(MAKEFILE) $@/genome.vcf
-grep 41742029 $@/genome.vcf
-grep 41742047 $@/genome.vcf
basic variants effect prediction with ucsc knownGenes:
$ cat father.dir/genome.vcf mother.dir/genome.vcf child.dir/genome.vcf |\
cut -d ' ' -f 1,2,3,4,5 | grep -v "##" |\
${VARKIT}/prediction --host localhost --port 3316 --user anonymous -f /path/to/hg19.fa |\
cut -d ' ' -f 1-20 | sort | uniq > predictions.txt
read each VCF, append the name of the SAMPLE, extract the field GT and discard the variation if it is homozygous, create a new key (chrom_pos_ref_alt):
echo "MOTHER\tmother.dir/genome.vcf\nFATHER\tfather.dir/genome.vcf\nCHILD\tchild.dir/genome.vcf" > vcf2sample.txt
${VARKIT}/scanvcf vcf2sample.txt | tr -d '\r' |\
${VARKIT}/extractformat -t GT |\
awk '(!($$11=="FATHER" && $$12=="1/1"))' |\
awk '(!($$11=="MOTHER" && $$12=="1/1"))' |\
awk '(!($$11=="CHILD" && $$12=="1/1"))' |\
awk '{printf("%s_%s_%s_%s\t%s\n",$$1,$$2,$$4,$$5,$$0);}' |\
sort -t ' ' -k1,1 > tmp1.vcf
create a new key (chrom_pos_ref_alt) for the predictions, keep the non synonymous and the stops gained/deleted:
awk '{printf("%s_%s_%s_%s\t%s\n",$$1,$$2,$$4,$$5,$$0);}' predictions.txt |\
tr -d '\r' | sort -t ' ' -k1,1 > tmp2.vcf
join the two files with the key chrom_pos_ref_alt, sort, group by gene, keep the gene having at least one mutation in the father/mother/child, keep the ucsc name, extract the gene symbols using ucsc kgXref, search the ncbi for those genes and get some information about those genes:
join -t ' ' -1 1 -2 1 tmp1.vcf tmp2.vcf | egrep '(#CHROM|NON_SYNO|STOP)' | cut -d ' ' -f2- |\
sort -t ' ' -k1,1 -k2,2 -k4,4 -k5,5 -k11 |\
${VARKIT}/groupbygene --gene 18 --sample 11 |\
awk '($$1=="GENE" || ($$7!="0" && $$8!="0" && $$9!="0"))' |\
cut -d ' ' -f 1|\
${VARKIT}/mysqlquery --host genome-mysql.cse.ucsc.edu --user genome -e 'select geneSymbol from kgXref where kgId="$$1"' |\
sort | uniq |\
awk '($$2!=".")' |\
${VARKIT}/ncbiesearch -D gene -q '"$$2"[GENE] "Homo Sapiens"[ORGN]' |\
${VARKIT}/ncbiefetch -D gene -c 3
Result:
uc002zww.2 BCR 613 BCR breakpoint cluster region HGNC=1014|Ensembl=ENSG00000186716|HPRD=01044|MIM=151410|Vega=OTTHUMG00000150655 A reciprocal translocation between chromosomes 22 and 9 produces the Philadelphia chromosome, which is often found in patients with chronic myelogenous leukemia. The chromosome 22 breakpoint for this translocation is located within the BCR gene. The translocation produces a fusion protein which is encoded by sequence from both BCR and ABL, the gene at the chromosome 9 breakpoint. Although the BCR-ABL fusion protein has been extensively studied, the function of the normal BCR gene product is not clear. The protein has serine/threonine kinase activity and is a GTPase-activating protein for p21rac. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
uc002zwx.2 BCR 613 BCR breakpoint cluster region HGNC=1014|Ensembl=ENSG00000186716|HPRD=01044|MIM=151410|Vega=OTTHUMG00000150655 A reciprocal translocation between chromosomes 22 and 9 produces the Philadelphia chromosome, which is often found in patients with chronic myelogenous leukemia. The chromosome 22 breakpoint for this translocation is located within the BCR gene. The translocation produces a fusion protein which is encoded by sequence from both BCR and ABL, the gene at the chromosome 9 breakpoint. Although the BCR-ABL fusion protein has been extensively studied, the function of the normal BCR gene product is not clear. The protein has serine/threonine kinase activity and is a GTPase-activating protein for p21rac. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
uc003azw.2 ZC3H7B 23264 ZC3H7B zinc finger CCCH-type containing 7B HGNC=30869|Ensembl=ENSG00000100403|HPRD=06711|Vega=OTTHUMG00000150969 This gene encodes a protein that contains a tetratricopeptide repeat domain. The encoded protein also interacts with the rotavirus non-structural protein NSP3. [provided by RefSeq, Jul 2008]
uc003biz.3 CRELD2 79174 CRELD2 cysteine-rich with EGF-like domains 2 HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292
uc003bja.2 CRELD2 79174 CRELD2 cysteine-rich with EGF-like domains 2 HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292
uc010haj.2 CRELD2 79174 CRELD2 cysteine-rich with EGF-like domains 2 HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292
uc010hak.2 CRELD2 79174 CRELD2 cysteine-rich with EGF-like domains 2 HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292
uc010hal.2 CRELD2 79174 CRELD2 cysteine-rich with EGF-like domains 2 HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292
uc010ham.2 CRELD2 79174 CRELD2 cysteine-rich with EGF-like domains 2 HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292
pubmed tells us (Rotavirus / gastroenteritis) that ZC3H7B is a good candidate and it is confirmed by looking at the variations chr22:41742029 and chr22:41742047 . We can find the domain of the protein affected by those mutations using uniprot (here, a zinc finger) :
$ grep uc003azw.2 predictions.txt |\
${VARKIT}/mysqlquery -q 'select spId from kgXref where kgId="$6"' |\
${VARKIT}/uniprot -p 14 -s 21
chr22 41742029 . T A uc003azw.2 + 41697566 41756150 41716664 41753433 EXON|EXON_STOP_GAINED 1481 494 Exon 14 . TGT TGA C * Q9UGR2-2 1 993 chain . Zinc finger CCCH domain-containing protein 7B . .
chr22 41742047 . C G uc003azw.2 + 41697566 41756150 41716664 41753433 EXON|EXON_CODING_NON_SYNONYMOUS 1499 500 Exon 14 . TGC TGG C W Q9UGR2-2 1 993 chain . Zinc finger CCCH domain-containing protein 7B . .
chr22 41742047 . C G uc003azw.2 + 41697566 41756150 41716664 41753433 EXON|EXON_CODING_NON_SYNONYMOUS 1499 500 Exon 14 . TGC TGG C W Q9UGR2-2 500 524 zinc finger region . C3H1-type 1 . .
I can understand that your students were a bit uncomfortable, I would think this is not quite trivial, even though it is quite obvious what you have to look for. What I find confusing is hint No 2: the child has never been affected by a gastroenteritis (Yep, that's the phenotype ! :-) ) vs. "Only the child is affected by the disease." What phenotype are you talking about, so there are at least two phenotypes? what do you mean by "that's the phenotype", implying there is only a single phenotype in play?
This is a fun problem, but I'd point out that normal people are quite busy the week before Christmas :)
So... what are you saying about my normality?
Nice assignment. Sounds fun.
Can we have a bed file?
How long are these reads...? BWA sampe is choking
erm... is this real data?
Are there reads from other chromosomes? Little more info please.
the quality scores are all the same. It's probably not real data :)
@zev.kronenberg wy do you want a BED ? what kind of error did you get with BWA
@russH no , the files have been generated using samtools wgsim
@zev.kronenberg you can limit your search to the human chr22 hg19
@DK: not real data: the reads have been generated with samtools wgsim, I've inserted some mutations in the fasta files of the 3 individuals.
@Pierre_Lindenbaum
1) I guess I don't need a bed if its fake data
2)
[zkronenb@browning child.dir]$ ~/tools/bwa-0.5.9/bwa sampe /data/ucsc/hg19/fasta/chr22.fa genome_1.sai genome_2.sai genome_1.fq genome_2.fq [bam_header_read] invalid BAM binary header (this is not a BAM file). [bam_header_read] invalid BAM binary header (this is not a BAM file). @SQ SN:chr22 LN:51304566 @PG ID:bwa PN:bwa VN:0.5.9-r16 [bwa_read_seq] the maximum barcode length is 15.
@Michael: Well, they were not alone. I was here to tell and explain them the commands to run. But it was not simple for them to simply move a file to another directory, using the tab-completion, etc....