Hi - I am obtaining from a RNA-seq bam file the BCF output from samtools mpileup in order to corroborate variants seen in the exome.
Thus, for corroborating one single SNP, my command looks like this:
samtools mpileup -ug -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f GRCh37.fa --region "22:29907105-29907105" RNAsample.bam | bcftools view --min-alleles 3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RNAsample
22 29907105 . G A,<X> 0 . DP=47;I16=14,12,6,7,974,36674,484,18098,520,10400,260,5200,488,11274,261,5927;QS=0.666667,0.333333,0;VDB=0.280567;SGB=-0.683931;RPB=0.814533;MQB=1;MQSB=1;BQB=0.956592;MQ0F=0 PL:DP:DV:DP4 82,0,143,160,182,243:39:13:14,12,6,7
Because, samtools is always outputting "an non-ref base 'X' represents an base has not been seen from the alignment data" I need to filter for minimum 3 alleles (one ref, one alt and the X). I find this all counter-intuitive and would like to omit the non-ref base 'X' in the output from samtools mpileup as in this case it will not be needed for anything.
Does anybody know how to do this properly (excluding sed, awk and the like ;))
EDIT:
The pileup itself for this position looks as follows:
$ samtools mpileup -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f /home/mpschr/bin/bcbionextgen/data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --region "22:29907105-29907105" RNAsample.bam
22 29907105 G 39 .,...AAA..A,,a,a..a,a,,A..aA,,a,.,..a,. DDDDDDDDDDDDJFJJEGHIIJJIJJEJDEDDJDJHDDF
where does this 'X' come from ? can you show us a few reads overlaping that position "22:29907105" ?
and what is the version of samtools/bcftools please
HI, I put the pileup in the original questions. The versions I am using are the following:
Using htslib 1.2.1
Using htslib 1.2.1
I want to see some reads (samtools view RNAsample.bam 22:29907105-29907105 ) not the ouput of mpileup
aha ok, this is quite a lot in order to paste it as is, I think - are we I looking for something specific? In any case, a working solution has been proposed. Thanks for your time!