I called SNPs with samtools/bcftools from RNASeq data as follows, and then try to keep variants showing a strand bias p-value < 0.01 (Fischer's exact test).
#CALLING
samtools mpileup \
--fasta-ref my_ref.fasta \
--min-MQ 1\
--min-BQ 30 \
--uncompressed \
--output-tags AD,INFO/AD,DP,SP \
--skip-indels \
input.bam \
| \
bcftools call \
--consensus-caller \
--variants-only \
--skip-variants indels \
--output-type v \
--output call.vcf
#FILTERING
bcftools filter \
--include INFO/PV4[0] < 0.01 \
--mode x \
--output call_filtered.vcf
- Since some records are missing the INFO/PV4 tag, would it be the same to filter the variants based on the FORMAT/SP value?
According to the vcf header, SP is the "Phred-scaled strand bias P-value", whereas PV4[0] is the "P-values for strand bias".
Would it mean that PV4[0] < 0.01
is equivalent to SP > 20
(Q = -10*log10(0.01) = 20)?
Are PV4[0] and SP both calculated with Fischer's exact test?
- You can also read here that GATK keeps SNPs with
strand bias p-value < 60
for filtering SNP (equivalent top-value > 0.000001
). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.
In this case why they do not keep SNPs > 60 instead of < 60 (in my case above p-val < 0.01 should be equivalent to a strand bias Phred > 20 instead of < 20) ?