Soft-Clipping And Variant Calling
2
1
Entering edit mode
11.4 years ago
DoubleDecker ▴ 200

Having aligned my reads with BWA-mem, I noticed that a lot of alignments are dodgy, to put it mildly, e.g. with both ends massively soft-clipped. I now wonder if I need to filter my alignments to get rid of such soft-clipped alignments before performing variant calling or (preferably!) variant callers such as GATK, freebayes consider soft-clipping information in deciding whether to call a variant?

bwa gatk • 12k views
ADD COMMENT
1
Entering edit mode
11.4 years ago

No, soft/hard clipped alignments are ignored by the callers.

EDIT: but hard and soft clip can be used to detect structural variations.

EDIT2: a quick test (added 8 bases in 5' and 3 ' to all reads. Only 5 snps found)

sam.dir=/commun/data/packages/samtools-0.1.19
bwa=/commun/data/packages/bwa-0.7.6a/bwa
.PHONY:all reads
all: test.vcf
test.vcf: test.bam.bai
${sam.dir}/samtools mpileup -uf ex1.fa test.bam | ${sam.dir}/bcftools/bcftools view -vcg - > $@
test.bam.bai : ex1.fa.bwt f1.fq f2.fq
${bwa} mem ex1.fa f1.fq f2.fq |\
${sam.dir}/samtools view -Sbu - |\
${sam.dir}/samtools sort - test && \
${sam.dir}/samtools index test.bam
ex1.fa.bwt: ex1.fa
${bwa} index $<
f1.fq f2.fq: reads
reads: ex1.fa
${sam.dir}/misc/wgsim -N 2000 $< f1.fq f2.fq > wgsim.vcf
$(foreach F,1 2, cat f${F}.fq | awk '(NR%4==2) {printf("ATGCATGC%sATGCATGC\n",$$0);next;} (NR%4==0) {printf("22222222%s22222222\n",$$0);next;} {print;}' > a ; mv a f${F}.fq ; )
ex1.fa:
curl -o $@ "https://raw.github.com/samtools/samtools/develop/examples/$@"
view raw Makefile hosted with ❤ by GitHub
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##reference=file://ex1.fa
##contig=<ID=seq1,length=1575>
##contig=<ID=seq2,length=1584>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IS,Number=2,Type=Float,Description="Maximum number of reads supporting an indel and fraction of indel reads">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken
.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test.bam
seq1 993 . T A 61 . DP=136;VDB=5.035851e-03;RPB=2.097285e-01;AF1=0.5;AC1=1;DP4=44,29,35,27;MQ=60;FQ=63.5;PV4=0.73,1,1,1
GT:PL:GQ 0/1:91,0,100:93
seq2 303 . TC TCCC 103 . INDEL;IS=3,0.044118;DP=68;VDB=9.692566e-03;AF1=1;AC1=2;DP4=0,0,30,0;MQ=60;FQ=-125 GT:PL:GQ
1/1:144,90,0:99
seq2 304 . C CCA 214 . INDEL;IS=52,0.800000;DP=65;VDB=4.882629e-04;AF1=1;AC1=2;DP4=0,0,46,0;MQ=60;FQ=-173 GT:PL:GQ
1/1:255,138,0:99
seq2 837 . A C 68 . DP=128;VDB=2.587057e-02;RPB=3.599771e-01;AF1=0.5;AC1=1;DP4=41,25,30,32;MQ=60;FQ=67.5;PV4=0.15,1,1,1
GT:PL:GQ 0/1:98,0,97:97
seq2 893 . G C,A 143 . DP=110;VDB=4.352574e-03;RPB=-2.270745e+00;AF1=1;AC1=2;DP4=1,1,47,61;MQ=60;FQ=-282;PV4=1,1,1,1 GT:PL:GQ
1/1:176,255,0,178,255,175:99
view raw test.vcf hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode

That's really great, will save me a lot of disk space not to prepare yet another set of bam files.

EDIT . OK, I am scratching my head right now... so you only got only 5 additional SNPs having added an identical string to all the reads? But it seems to me you assigned low Phred quality values (2) to the added nucleotides, so SNP caller might be biased against them for that reason?

ADD REPLY
0
Entering edit mode
11.4 years ago
Mick ▴ 30

Forgive me if I have this wrong, but....

What happens in bwa mem is that often one gets the primary alignment, and lots of much shorter secondary alignments, many of which are hard/soft clipped. These secondary alignments are essentially local/split alignments wrt to the read, and are often not what SNP discovery projects are interested in.

GATK ignores everything except the primary alignment.

However, where reads do not have a good, long primary alignment, I am guessing it's possible that the short/clipped/local alignment is marked as the primary alignment and is taken into consideration by GATK when SNP calling.

For example, if one has PhiX in the reads, but not in the reference database, then one can get short/clipped/local alignments of the PhiX reads against the reference genome. These are clearly false alignments. However, they may be marked as primary alignments, and GATK may use them to call SNPs.

I think that was the focus of the question....

ADD COMMENT
0
Entering edit mode

Yes, that's what I meant. I did not use the option to output alternative alignments in BWA mem.

ADD REPLY
0
Entering edit mode

I'd go back and re-do alignments as per http://bio-bwa.sourceforge.net/

With BWA-MEM/BWA-SW, my tools are complaining about multiple primary alignments. Is it a bug? It is not. Multi-part alignments are possible in the presence of structural variations, gene fusion or reference misassembly. However, representing multi-part alignments in SAM has not been finalized. To make BWA work with your tools, please use option -M to flag extra hits as secondary.

ADD REPLY

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6