Question

Consensus sequence from multiple BAM files on a single position

0

Entering edit mode

5.2 years ago

Federico Giorgi ▴ 740

Hi all,

I have several thousands sorted BAM files and I would like to quickly extract the consensus sequence from all the reads on a specific position. Simply using samtools view provides all the reads (including those jumpoing the position due to splicing). I could not make it work via mpileup. Here is the single file call I have been working on:

samtools view -b file.bam "chr2:5666012" | samtools mpileup -

The ideal output for a single file should be (with G consensus):

....AAAACCTT

However, I get huge outputs as also the spliced reads are employed in the pileup generation.

Any idea is welcome. Thanks! :-)

samtools bam mpileup • 1.1k views

ADD COMMENT • link updated 5.2 years ago by jkbonfield ★ 1.3k • written 5.2 years ago by Federico Giorgi ▴ 740

1

Entering edit mode

Hello,

to get the consensus sequence you first have to call variants (mpileup isn't enough) and then create the the consensus sequence with the help of your vcf file and the reference sequence. See here: Generating consensus sequence from bam file

fin swimmer

ADD REPLY • link 5.2 years ago by finswimmer 16k

score 0 · Answer 1 · 2019-09-18

I have some old perl one-liner kicking around to do this from the samtools mpileup output, but it doesn't look like it'll handle multiple chromosomes. Likely it was a one-off hack for some task. Anyway maybe it can be adapted. Be warned: it's hideous! :-)

 samtools mpileup in.bam | perl -ane 'BEGIN {$l=0} $F[4]=~s/[-+](\d+)((??{".{$1}"}))|\^.|\$//g;$F[4]=~s/\*/N/g;%C=();foreach(split("",uc($F[4]))) {$C{$_}++} @C=sort {$C{$b}<=>$C{$a}} keys %C;print "N" x ($F[1]-$l-1),$C[0];$l=$F[1]'