Question

Allele Depth (~ Dp4) Values For All (Including Multi-Allelic Alt) Possible Alleles Reported?

0

Entering edit mode

11.2 years ago

Rm 8.3k

Is there a quick way to get allele read depth for all possible alleles say from bam or mpileup?

10      101023  .       A       G,T     220     .       DP=35;VDB=0.0407;AF1=1;AC1=2;DP4=0,0,14,11;MQ=43;FQ=-99 GT:PL:GQ        1/1:253,72,0,247,52,244:99
2       89159986        .       T       A,C     193     .       DP=52;VDB=6.418245e-02;AF1=1;AC1=2;DP4=0,0,29,23;MQ=23;FQ=-181  GT:PL:DP:SP:GQ  1/1:226,154,0,211,122,208:52:0:99

one possiblity is to parse the mpileup output?.

allele counts • 6.2k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 11.2 years ago by Rm 8.3k

Ram · Answer 1 · 2013-09-26

Here is the perl script i had to write to get the all possible strand specific allele counts (DP4) from the mpileup output:

Note: It skips the indels

#!/usr/bin/perl 
use strict;
#use warnings;
&usage if (@ARGV < 1);

my $command = shift(@ARGV);
my %func = (mpileup2allele=>\&mpileup2allele);

die("Unknown command \"$command\".\n") if (!defined($func{$command}));
&{$func{$command}};
exit(0);

# mpileup2allele
sub mpileup2allele {
  die(qq/
Usage: extract.allele.counts4m.mpileup.pl mpileup2allele <input.chr.raw-mpileup> 
/) if (@ARGV == 0 && -t STDIN);

my @qual;
my @line;
my $phred=10; # Phred quality threshold filter
my $sanger=33; # Quality Encoding: Sanger/Illumina 1.8+ =33 # Solexa/Illumina (1.3+) =64 

print "#CHR\tPOS\tREF\tA\ta\tC\tc\tG\tg\tT\tt\n" ;

# Reading input mpileup 
  while (<>) {
    @line = split;
      if ($line[2] eq "*"){next;} # skipping indel row #
      $line[4] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVNF]+//g; # Removes indels notation from read base lines
      $line[4] =~ s/\^.//g; # Removes unwanted symbols
      $line[4] =~ s/[0-9\@\'\&\!\?\_\$\*\:\;\F\"\+\#\-\=\%\)\(\/\~\}\[\>\<]+//g; # Removes additional symbols  
      $line[4] =~ s/\./$line[2]/g; # Substituting . with reference base 
      $line[4] =~ s/\,/lc($line[2])/ge; # Substituting, with lower case reference base 

      # Store quality scores as an array
      @qual = split(//, $line[5]);
      # Transform quality values from ASCII into Sanger format
      for( my $i = 0; $i < scalar @qual; $i++ ){
       $qual[$i] = ord($qual[$i]) - $sanger ;
           }

    my $char;
    my %chars=();
    my $chars;
    my @letters = split(//, $line[4]);
    my @case = qw/A a C c G g T t/;
    my $top=0;
    print "$line[0]\t$line[1]\t$line[2]\t"; # print nucleotide position

    # Quality comparison 
    foreach $char(@letters) { if ($qual[$top] >= $phred){ $chars{ $char }++; } $top++; }
    my @keys = sort { uc $a cmp uc $b } keys %chars; # case insensitive sort

    #printing output in the required order: A a C c G g T t #
        foreach my $case(@case){
         if ( ! exists( $chars{$case})) {
                print 0, "\t" ;
              }
         else {
                print $chars{$case},"\t";
                #print $case, "\t", $chars{$case},"\t"; 
                }
         }

    print "\n";

  } # End of while

 } # End of mpileup2allele 

##################

sub usage {
my $str ='Sample Input Pileup Data
10      56256   A      16      <>>.,+4gcgc,,,,,,..,...,,  1313?5??>5<9=71570';
  die(qq/
Usage: extract.allele.counts4m.mpileup.pl mpileup2allele <input.chr.raw-mpileup>  \n
$str \n/);
}

Output:

#CHR    POS    REF    A    a    C    c    G    g    T    t
1    970567    C    0    0    0    0    0    0    0    0    
1    1581958    T    0    0    0    2    0    0    5    2    
1    1644908    C    0    0    4    4    0    0    5    4    
1    1646113    G    0    0    1    0    2    5    0    0    
1    1654067    C    0    0    11    12    0    0    0    0    
1    2241376    C    0    0    3    1    0    0    0    0

score 0 · Answer 2 · 2013-09-25

0

Entering edit mode

11.2 years ago

Ashutosh Pandey 12k

You can use the older pileup function and get some output shown below:

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+

seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6

seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<

Now you can use fifth column to get the allele depth of all the alleles. I am sure you must be aware of this.

ADD COMMENT • link 11.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

@ashutoshmits thanks yes iam

ADD REPLY • link 11.2 years ago by Rm 8.3k

0

Entering edit mode

I do have a Python code similar to what you wrote in Perl. But it was kind of integrated between other codes. So i couldn't share it. But your Perl script is much advanced than mine. Just to let you know that ">" or "<" characters in 5th column (not shown in the example above) represent read that span between two exons which is typical for RNA-seq. It shouldn't be counted and ignored and you have already taken care of it in your code.

ADD REPLY • link 11.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

yes I have ignored ">" or "<"

ADD REPLY • link 11.2 years ago by Rm 8.3k

score 0 · Answer 3 · 2013-09-25

Just parse it. In whatever language you're using - Python, Bash, R - simply split the row (as in, python split()) by 'DP4=' to get it as an array. Split the 2nd element (as in, array[1]) by semicolon. Your first element in your final array is your a,b,c,d array depth.

You could do it the long way if for some reason you expect the string 'DP4=' to be present in the file somewhere else apart from the info field, for for some crazy reason (I can't imagine how though):

For each row: split row by tab (as in, python split() for example) and get it as an array.
Grab 8th element in that array, split it by semicolon
I recall (although I might be wrong) that the DP4 column might not necessarily always be in the same position for mpileup output, so you might have to iterate over the elements of the array from step 2, checking for the array element that begins with DP4/build a hash table/whatever. Or, if I'm mistaken and it is indeed in the same place always, just grab the 5th element.
Split the DP4 element by '=' sign, grab your depth.

And - put in all in a function or a shell script so you can call it once and be done with it, making it 'quick'. I have a feeling this is a short and beautiful awk script waiting to be written.