Question

Extracting The Base/Nucleotide Across All Reads At Particular Position/S.

5

Entering edit mode

12.5 years ago

Empyrean ▴ 170

Hello everyone. I will try to explain with an example of what i need.

We have vcf file to get some positions of snps.

#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ

and a bam file with the alignments to reference. the approximate depth is 20x and all are 454 reads. Now from this information, i wanted to extract all the bases at the particular positions.. I need output of something like this.

Read Name   Position1   Position2   Position3   Position4
            14370       17330       1110696     1230237
Read1         G           T             A           T
Read2         A           A             G           .
Read3         A           A             G           .
Read4         G           A             A           T
Read5         A           T             T           .
Read6         A           T             T           .
Read7         G           A             A           T

I would like to build such graph across all reads with specific position. Can this be possible by samtools mpileup/vcf tools or does anyone has any script written to solve such problem. From this information i will be extracting the haplotype information for specific genotypes.

samtools vcftools • 6.0k views

ADD COMMENT • link updated 12.5 years ago by Rm 8.3k • written 12.5 years ago by Empyrean ▴ 170

0

Entering edit mode

I did something similar a long time back. wouldn't the sff tools from 454 have something to help?

ADD REPLY • link 12.5 years ago by Bioinfosm ▴ 620

0

Entering edit mode

I looked in to it but its of not much help !! can we do this with samtools or vcftools?

ADD REPLY • link 12.5 years ago by Empyrean ▴ 170

score 3 · Answer 1 · 2012-05-22

I had an old script extract.allele.4m.strand.pileup.pl) based on samtools.pl and uses samtools "pileup" output: it prints the read sequence at a position and the allele counts too:

##################
#!/usr/bin/perl 
##########################################################################################
# Script to extract alleles from Pileup file.                                       #
# It is Based on samtools.pl:        #
##########################################################################################

use strict;
#use warnings;

&usage if (@ARGV < 1);

my $command = shift(@ARGV);
my %func = (pileup2allele=>\&pileup2allele);

die("Unknown command \"$command\".\n") if (!defined($func{$command}));
&{$func{$command}};
exit(0);

# pileup2allele

sub pileup2allele {
  die(qq/
Usage:   extract.allele.4m.strand.pileup.pl pileup2allele <input.chr.raw-pileup>
/) if (@ARGV == 0 && -t STDIN);

my @qual;
my @line;
my $phred=10; # Phred quality threshold
my $sanger=33; # Sanger quality=33 # solexa/illumina=64 

# Reading input pileup 
  while (<>) {
    @line = split;
      if ($line[2] eq "*"){next;} # skipping indel row #
      $line[8] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVNF]+//g; # Removes unwanted extra-indels notation from read base lines
      $line[8] =~ s/\^.//g; # Removes unwanted symbols
      $line[8] =~ s/[0-9\@\'\&\!\?\_\$\*\:\;\F\"\+\#\-\=\%\)\(\/\~\}\[]+//g; # Removes unwanted symbols 
      $line[8] =~ s/\./$line[2]/g; # Substituting . with reference base 
      $line[8] =~ s/\,/lc($line[2])/ge; # Substituting , with lower case reference base 

#      $line[8] = uc($line[8]); # Converting to UPPER case

      # Store quality scores as an array
      @qual = split(//, $line[9]);
      # Transform quality values from ASCII into Sanger format
      for( my $i = 0; $i < scalar @qual; $i++ ){
       $qual[$i] = ord($qual[$i]) - $sanger ;
           }


    my $char;
    my %chars=();
    my $chars;
    my @letters = split(//, $line[8]);
    my $top=0;
    print "$line[1]\t$line[8]\t"; # print nucleotide position and the Sequence from the READS

    # Quality comparison 
    foreach $char(@letters) { if ($qual[$top] >= $phred){ $chars{ $char }++; } $top++; }
    my @keys = sort { $chars{$b} <=> $chars{$a} } keys %chars; # sorting higher to lower value
    foreach my $key(@keys){ print $key, " ", $chars{$key},"\t";}
    print "\n";

  } # End of while

 } # End of pileup

##################

sub usage {
my $str ='Sample Input Pileup Data
10      56256   A       A       75      0       21      16      .,+4gcgc,,,,,,..,...,,  13?5??>5<9=71570';
  die(qq/
Usage:   extract.allele.4m.strand.pileup.pl pileup2allele <input.chr.raw-pileup>  \n
$str \n/);
}

# ----------------------------------------------------  

#2. If pipe the pileup through samtools output
##samtools pileup -cf reference.fasta  Input.bam | ./extract.allele.4m.strand.pileup.pl pileup2allele >OUTPUT.TXT

# ----------------------------------------------------  

## End of the Script ######

Sample Out put:

 Position Sequence_on_the_read(s)                                                 Allele_counts
    3308    TTttTctTTttTtTTtTTTtTtTTTtttTtTtTtttcctTttctttcctcTTttttTTttttTTtTt     t 33    T 26    c 7
    3309    cccccccccccc    c 12
    3311    CccCcCCgGCgcCCcCCCccccCCCCcGCcCCGGcCCgccccCgcCCCCcCcGc  C 25    c 20    G 5     g 4
    3312   CccCaCccccAcCAacCCaCcccACCCCCccCCcccCcCCCcccccCCCccCca  c 25    C 22    a 4     A 2