From BED-VAF file to extract common paired sample's SNP site info, prepare for sciclone:

Question

How to convert bam-readcount output to BED-VAF format

2

Entering edit mode

8.0 years ago

Haitao ▴ 30

Dear All,

When I use bam-readcount to count each SNP position's depth for making %VAF, Especially for those pair samples, only happened mutation in one sample but not in other paired sample, my final goal is for calculate BAF, sciclone input file or %VAF.

Here is the code:

#Step one:
#Extract SNP site info
cat $1.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2}' > $1.bed
cat $2.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2}' > $2.bed

#Step two:
#Example: 
cat $1.bed $2.bed > variants.positions_sorted.bed
sort -k1,1 -k2,2n -k3,3n -u variants.positions_sorted.bed > variants.positions_sorted.bed 

#Step three:
#Example: 
bam-readcount -b 15 -q 15 -w 1 \
-f reference/Mus_musculus.fa \
-l variants.positions_sorted.bed \
/bam_files/$1.bam \
> $1_variants.readcounts

I'm following what Chris mentioned in this post enter link description here, I got bam-readcount output file now like this:

GL456392.1      772     C       123     =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:103:30.22:36.83:1.46:46:57:0.57:0.06:145.63:46:0.46:64.46:0.33        G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:20:31.10:37.00:0.00:11:9:0.78:0.09:182.55:11:0.59:55.90:0.34  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
GL456378.1      780     T       2       =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:2:33.00:37.00:0.00:1:1:0.32:0.04:129.50:1:0.71:100.00:0.38    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

However, I search from google and here, I didn't find any comment for this purpose. I don't know which tool can help me to convert bam-readcount output to VAF tab-delimited like this:

chrom   POS REF ALT REF_DP  ALT_DP  TOTAL_DP
GL456378.1  772 C   A   103 20  123
GL456378.1  780 T   A   0   2   2
1   3000004 T   A,G 15  14,2    31

Thanks in advance, Haitao

SNP sciclone LOH sequencing bam-readcount • 5.7k views

ADD COMMENT • link updated 7.1 years ago by Ming Tommy Tang ★ 4.5k • written 8.0 years ago by Haitao ▴ 30

score 1 · Answer 1 · 2016-11-15

1

Entering edit mode

8.0 years ago

Chris Miller 22k

You'll need to write a small script to parse the output, match it up with the alleles and output. This script does something similar, and though it has dependencies on the rest of the GMS system, you can extract the core logic easily enough. https://github.com/genome/genome/blob/master/lib/perl/Genome/Model/Tools/Analysis/Coverage/BamReadcount.pm

ADD COMMENT • link 8.0 years ago by Chris Miller 22k

0

Entering edit mode

Wow, you need to sleep now : ), thanks for your answer so fast! Haitao

ADD REPLY • link 8.0 years ago by Haitao ▴ 30

0

Entering edit mode

Hi Chris,

Thanks for your advice, however, I'm not very familiar with perl, I can understand the script but very hard to write script by myself until now, I try to write script by myself but need some time, my level is application level at this moment. I'm work on bench before(~10 years), recent one year I have shifted a a little bit from bench to analysis, now is 50 / 50.

Can I use GMS to figure out my issue? I installed genome music and genome music2 before.

Thanks, Haitao

ADD REPLY • link 8.0 years ago by Haitao ▴ 30

1

Entering edit mode

Try this simpler code that I also had laying around: https://github.com/chrisamiller/bioinfoSnippets/blob/master/add-readcounts.pl

ADD REPLY • link 8.0 years ago by Chris Miller 22k

0

Entering edit mode

Thanks I will try this!

ADD REPLY • link 8.0 years ago by Haitao ▴ 30

score 1 · Answer 2 · 2016-11-17

I have just reformatted the bam-read-count output file in the format you need with some simple combination of sed and awk commands. I do wish to simplify it more but for now we can go ahead with this :-)

awk -F'[:\t]' '{if ($4 != 0) print $1,$2,$4,$3,$19,$20,$33,$34,$47,$48,$61,$62}' |\ #this will take desired cols

awk '{ if($4 == "A") print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12; # match ref

else if($4 == "C") print $1"\t"$2"\t"$3"\t"$7"\t"$8"\t"$5"\t"$6"\t"$9"\t"$10"\t"$11"\t"$12;

else if($4 == "G") print $1"\t"$2"\t"$3"\t"$9"\t"$10"\t"$5"\t"$6"\t"$7"\t"$8"\t"$11"\t"$12;

else if($4 == "T") print $1"\t"$2"\t"$3"\t"$11"\t"$12"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10;}' |\

awk '{print $1"$"$2"$"$3"$"$4"$"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}' |\ #(just here I lock few cols which are ready)

sed $'s/.\t0//g' | sed $'s/\t\t\t/\t/g' | sed $'s/\t\t/\t/g' |\ #replace tabs and remove zeros

awk '{print $1"\t"$2","$4","$6"\t"$3","$5","$7;}' | sed 's/\,\,//g' | sed 's/\,$//' | sed $'s/,\t/\t/' | sed $'s/\$/\t/g' |\ #print and replace

perl -ane 'if(!$F[5]){print $F[0],"\t",$F[1],"\t",$F[2],"\t",$F[3],"\t",$F[4],"\t",".","\t","0\n"}else{print $_;}' # final the way u want the file.

I feel there is a lot of redundancy and this can be simplified further..but I am still not very through with programming so this shell commands should do the trick!

cheers.. enjoy!

score 1 · Answer 3 · 2016-11-21

Dear All,

Based on Chris and Monica's advice, Here I have some update for deal with bam-readcount output file.

My Colleague and friend help me wrote perl script to convert bam-readcount file to Bed format VAF file.

bam-readcount2VAF.pl

use strict;
use warnings;
use LWP::Simple;
use Getopt::Long;
use Data::Dumper;
use FindBin qw($Bin $Script);
use File::Basename qw(basename dirname);
use URI::Escape;
use Pod::Usage;
## Version: bam-readcount convert to VAF format version 1
## Author：zhengqiang miao, email: zqmiao@umac.mo

###############  Start Time  ########################################################
my $Time_Start="";
my $TS=time();
$Time_Start = sub_format_datetime(localtime(time()));
print "\nStart Time :[$Time_Start]\n\n";
##################  Main     ######################################################
our $g_opts;
sub parse_opts{
   my $result = GetOptions(
                   "readcount_file=s" => \$g_opts->{'readcount_file'},#string
                   #"group_list=s" => \$g_opts->{'group_list'},#string
                   "DP_filter=i" => \$g_opts->{'DP_filter'},#string
                   "quiet"     => sub { $g_opts->{'verbose'} = 0 },
                   "help|?"    => \$g_opts->{'help'}
                 );
   if(!($g_opts->{'readcount_file'})){
       &pod2usage( -verbose => 1);#exit status will be 1
   }
   if($g_opts->{'help'}){
       &pod2usage( -verbose => 1);#exit status will be 1
   }
}

&parse_opts();

my $readcount_file=$g_opts->{'readcount_file'};
my $DP_filter=1;
$DP_filter=$g_opts->{'DP_filter'} if(($g_opts->{'DP_filter'}));
my @file=glob($readcount_file);

foreach my $in (@file){
    print "$in";
    open(IN,$in)||die;
    open(OUT,">$in.VAF")||die;
    print OUT "chrom\tPOS\tTOTAL_DP\tREF\tALT\tREF_DP\tALT_DP\n";
    while(my $t=<IN>){
        chomp $t;
        my @a=split(/\t/,$t);
        next if($a[3]<$DP_filter);
        print OUT "$a[0]\t$a[1]\t$a[3]\t$a[2]";
        my %h=();
        for(my $i=5;$i<@a;$i++){
            my @aa=split(/:/,$a[$i]);
            next if($aa[0] eq "N");
            next if($aa[1]==0);
            $h{$aa[0]}=$aa[1];
        }

        my $k=keys %h;
        if($k==0){
            print "ERROR:Code1\n";
        }
        if($k==1){
            my ($c)=keys %h;
            if($c eq $a[2]){
                print OUT "\t.\t$h{$c}\t0\n";
            }
            else{
                print OUT "\t$c\t0\t$h{$c}\n";
            }
        }
        if($k>1){
            my $REF_DP=0;
            my $ALT_DP="";
            my $ALT="";
            my $kk=0;
            foreach my $c (keys %h){
                if($c eq $a[2]){
                    $REF_DP=$h{$c};
                    $kk=1;
                }
                else{
                    $ALT.="$c,";
                    $ALT_DP.="$h{$c},";
                }
            }
            $ALT=~s/,$//;
            $ALT_DP=~s/,$//;

            print OUT "\t$ALT\t$REF_DP\t$ALT_DP\n";


        }

    }

}

sub round{
    my ($s,$c)=@_;#$c, how many decimals want to keep
    my $n1=10**$c;
    #print "$c\t$n1\n";
    return $s=int(($s*$n1)+0.5)/$n1;

}

###############################################################################
__END__
=head1 NAME
script - Using Getopt::Long and Pod::Usage
=head1 SYNOPSIS
script [options] [args ...]
Options: 

   -readcount_file           [file_pattern] readcounts file pattern like "/home/\*.readcounts" 
   -DP_filter                [Int] filter by depth, default is 1
   -help            brief help message

=head1 OPTIONS
=over 8
=item B<-help>
Print a brief help message and exits.
=back

=head1 DESCRIPTION
B<This program> will read the given input file(s) and do something
useful with the contents thereof.
=cut

Run perl script:

perl bam-readcount2VAF.pl  -readcount_file <readcount.file> -DP 5 (filter depths) > output.BED-VAF (bed format)

Desired output:

chrom   POS TOTAL_DP    REF ALT REF_DP  ALT_DP
2   90433340    32  A   C   28  4
2   90434511    46  A   +G  46  1
2   90450528    6   C   .   6   0
2   90450728    139 C   G,-C,+AA,-CA,+AAA,+A,A  134 1,1,4,1,1,11,2
2   91268528    5   T   G   2   3
2   91268534    5   T   G   2   3
2   91513592    10  A   G   9   1
2   91693774    16  A   .   16  0
2   91950203    88  A   -A,C,+C 86  1,1,3
2   91982340    8   C   .   8   0
2   92160021    51  C   G   3   48
2   92160047    56  G   A   3   53
2   92160058    57  G   A   4   53

Hope this can help you when you have same issue.

There are some other way to get desired VAF via samtools mpileup Question: Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File Only thing is it will take longer time to get desired results.

Thanks, Haitao

score 0 · Answer 4 · 2017-10-11

0

Entering edit mode

7.1 years ago

Ming Tommy Tang ★ 4.5k

Just for those who has the same problem, this may help https://github.com/sahilseth/bamreadcountr

ADD COMMENT • link 7.1 years ago by Ming Tommy Tang ★ 4.5k