Question

Missing data calling (I know, sounds weird...)

0

Entering edit mode

7.2 years ago

gerberd1990 ▴ 30

Hi everyone, I am working with ancient DNA, and I'd like to call a consensus fasta file from my mapped data (BAM). Since it is ancient DNA it is often happening that there are missing, or poorly covered positions in my BAMs, so when I want to call a consensus fasta from vcf I also want to call the missing positions as N's, and the poorly covered (DP=3>) as... something reliable. I am using bcf/samtools for the task but I have no idea, how to call missing positions for the fasta, or how to make a threshold for non-mutant positions. I know, there is schmutzi for mitochondrial data, but somehow I cant use that one (I would appreciate help for that too). Thanks in advance!

SNP sequence • 1.7k views

ADD COMMENT • link 7.2 years ago by gerberd1990 ▴ 30

0

Entering edit mode

@gerberd1990 I wrote schmutzi's endoCaller. It calls a consensus for mt data and can account for present-day human contamination and damage due to deamination. But, it is only for haploid data (mitochondrial or chloroplast) but not for nuclear data. Let me know if you need help using it.

ADD REPLY • link 7.1 years ago by Gabriel R. ★ 2.9k

score 0 · Answer 1 · 2017-10-21

You have done whole genome sequencing of the ancient DNA and have re-aligned it to the current reference genome?, and you notice many regions where there was alignment (or low alignment coverage) due to having none-to-little input DNA template?

I would see this as a de novo assembly opportunity and aim to assemble the reads you've got, which will allow you to produce a consensus set of contigs that may have gaps (Ns). You could then re-align your sample to the new new genome and call variants over it.

You could also mess around with these (or other scripts) that will easily allow you to see which regions in your BAM are at 0 or low coverage:

#Determine number of bases at 0 read depth
zero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)

#Determine number of bases at >0 read depth, i.e., non-zero bases
nonzero=$(bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)

#Calculate the percent of the reference genome covered by >0 read depth bases
#Round up to 6 decimal places
percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")

echo $percent

.

#Output all bases in the reference genome at >=3 read depth
bedtools genomecov -ibam BAM -g hg38.fasta -bga | awk '$4>2'