How to convert bam-readcount output to BED-VAF format
4
2
Entering edit mode
8.2 years ago
Haitao ▴ 30

Dear All,

When I use bam-readcount to count each SNP position's depth for making %VAF, Especially for those pair samples, only happened mutation in one sample but not in other paired sample, my final goal is for calculate BAF, sciclone input file or %VAF.

Here is the code:

#Step one:
#Extract SNP site info
cat $1.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2}' > $1.bed
cat $2.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2}' > $2.bed

#Step two:
#Example: 
cat $1.bed $2.bed > variants.positions_sorted.bed
sort -k1,1 -k2,2n -k3,3n -u variants.positions_sorted.bed > variants.positions_sorted.bed 

#Step three:
#Example: 
bam-readcount -b 15 -q 15 -w 1 \
-f reference/Mus_musculus.fa \
-l variants.positions_sorted.bed \
/bam_files/$1.bam \
> $1_variants.readcounts

I'm following what Chris mentioned in this post enter link description here, I got bam-readcount output file now like this:

GL456392.1      772     C       123     =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  C:103:30.22:36.83:1.46:46:57:0.57:0.06:145.63:46:0.46:64.46:0.33        G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:20:31.10:37.00:0.00:11:9:0.78:0.09:182.55:11:0.59:55.90:0.34  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
GL456378.1      780     T       2       =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:2:33.00:37.00:0.00:1:1:0.32:0.04:129.50:1:0.71:100.00:0.38    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

However, I search from google and here, I didn't find any comment for this purpose. I don't know which tool can help me to convert bam-readcount output to VAF tab-delimited like this:

chrom   POS REF ALT REF_DP  ALT_DP  TOTAL_DP
GL456378.1  772 C   A   103 20  123
GL456378.1  780 T   A   0   2   2
1   3000004 T   A,G 15  14,2    31

Thanks in advance, Haitao

SNP sciclone LOH sequencing bam-readcount • 5.7k views
ADD COMMENT
1
Entering edit mode
8.2 years ago

You'll need to write a small script to parse the output, match it up with the alleles and output. This script does something similar, and though it has dependencies on the rest of the GMS system, you can extract the core logic easily enough. https://github.com/genome/genome/blob/master/lib/perl/Genome/Model/Tools/Analysis/Coverage/BamReadcount.pm

ADD COMMENT
0
Entering edit mode

Wow, you need to sleep now : ), thanks for your answer so fast! Haitao

ADD REPLY
0
Entering edit mode

Hi Chris,

Thanks for your advice, however, I'm not very familiar with perl, I can understand the script but very hard to write script by myself until now, I try to write script by myself but need some time, my level is application level at this moment. I'm work on bench before(~10 years), recent one year I have shifted a a little bit from bench to analysis, now is 50 / 50.

Can I use GMS to figure out my issue? I installed genome music and genome music2 before.

Thanks, Haitao

ADD REPLY
1
Entering edit mode

Try this simpler code that I also had laying around: https://github.com/chrisamiller/bioinfoSnippets/blob/master/add-readcounts.pl

ADD REPLY
0
Entering edit mode

Thanks I will try this!

ADD REPLY
1
Entering edit mode
8.1 years ago
linkmonica ▴ 10

I have just reformatted the bam-read-count output file in the format you need with some simple combination of sed and awk commands. I do wish to simplify it more but for now we can go ahead with this :-)

awk -F'[:\t]' '{if ($4 != 0) print $1,$2,$4,$3,$19,$20,$33,$34,$47,$48,$61,$62}' |\ #this will take desired cols

awk '{ if($4 == "A") print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12; # match ref

else if($4 == "C") print $1"\t"$2"\t"$3"\t"$7"\t"$8"\t"$5"\t"$6"\t"$9"\t"$10"\t"$11"\t"$12;

else if($4 == "G") print $1"\t"$2"\t"$3"\t"$9"\t"$10"\t"$5"\t"$6"\t"$7"\t"$8"\t"$11"\t"$12;

else if($4 == "T") print $1"\t"$2"\t"$3"\t"$11"\t"$12"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10;}' |\

awk '{print $1"$"$2"$"$3"$"$4"$"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}' |\ #(just here I lock few cols which are ready)

sed $'s/.\t0//g' | sed $'s/\t\t\t/\t/g' | sed $'s/\t\t/\t/g' |\ #replace tabs and remove zeros

awk '{print $1"\t"$2","$4","$6"\t"$3","$5","$7;}' | sed 's/\,\,//g' | sed 's/\,$//' | sed $'s/,\t/\t/' | sed $'s/\$/\t/g' |\ #print and replace

perl -ane 'if(!$F[5]){print $F[0],"\t",$F[1],"\t",$F[2],"\t",$F[3],"\t",$F[4],"\t",".","\t","0\n"}else{print $_;}' # final the way u want the file.

I feel there is a lot of redundancy and this can be simplified further..but I am still not very through with programming so this shell commands should do the trick!

cheers.. enjoy!

ADD COMMENT
0
Entering edit mode

Hi Monica,

Thanks for sharing this shell command. I test it, It's work well, this is only for SNP not include insert or deletion site. thanks again : )

Haitao

ADD REPLY
1
Entering edit mode
8.1 years ago
Haitao ▴ 30

Dear All,

Based on Chris and Monica's advice, Here I have some update for deal with bam-readcount output file.

My Colleague and friend help me wrote perl script to convert bam-readcount file to Bed format VAF file.

bam-readcount2VAF.pl

use strict;
use warnings;
use LWP::Simple;
use Getopt::Long;
use Data::Dumper;
use FindBin qw($Bin $Script);
use File::Basename qw(basename dirname);
use URI::Escape;
use Pod::Usage;
## Version: bam-readcount convert to VAF format version 1
## Author:zhengqiang miao, email: zqmiao@umac.mo

###############  Start Time  ########################################################
my $Time_Start="";
my $TS=time();
$Time_Start = sub_format_datetime(localtime(time()));
print "\nStart Time :[$Time_Start]\n\n";
##################  Main     ######################################################
our $g_opts;
sub parse_opts{
   my $result = GetOptions(
                   "readcount_file=s" => \$g_opts->{'readcount_file'},#string
                   #"group_list=s" => \$g_opts->{'group_list'},#string
                   "DP_filter=i" => \$g_opts->{'DP_filter'},#string
                   "quiet"     => sub { $g_opts->{'verbose'} = 0 },
                   "help|?"    => \$g_opts->{'help'}
                 );
   if(!($g_opts->{'readcount_file'})){
       &pod2usage( -verbose => 1);#exit status will be 1
   }
   if($g_opts->{'help'}){
       &pod2usage( -verbose => 1);#exit status will be 1
   }
}

&parse_opts();

my $readcount_file=$g_opts->{'readcount_file'};
my $DP_filter=1;
$DP_filter=$g_opts->{'DP_filter'} if(($g_opts->{'DP_filter'}));
my @file=glob($readcount_file);

foreach my $in (@file){
    print "$in";
    open(IN,$in)||die;
    open(OUT,">$in.VAF")||die;
    print OUT "chrom\tPOS\tTOTAL_DP\tREF\tALT\tREF_DP\tALT_DP\n";
    while(my $t=<IN>){
        chomp $t;
        my @a=split(/\t/,$t);
        next if($a[3]<$DP_filter);
        print OUT "$a[0]\t$a[1]\t$a[3]\t$a[2]";
        my %h=();
        for(my $i=5;$i<@a;$i++){
            my @aa=split(/:/,$a[$i]);
            next if($aa[0] eq "N");
            next if($aa[1]==0);
            $h{$aa[0]}=$aa[1];
        }

        my $k=keys %h;
        if($k==0){
            print "ERROR:Code1\n";
        }
        if($k==1){
            my ($c)=keys %h;
            if($c eq $a[2]){
                print OUT "\t.\t$h{$c}\t0\n";
            }
            else{
                print OUT "\t$c\t0\t$h{$c}\n";
            }
        }
        if($k>1){
            my $REF_DP=0;
            my $ALT_DP="";
            my $ALT="";
            my $kk=0;
            foreach my $c (keys %h){
                if($c eq $a[2]){
                    $REF_DP=$h{$c};
                    $kk=1;
                }
                else{
                    $ALT.="$c,";
                    $ALT_DP.="$h{$c},";
                }
            }
            $ALT=~s/,$//;
            $ALT_DP=~s/,$//;

            print OUT "\t$ALT\t$REF_DP\t$ALT_DP\n";


        }

    }

}

sub round{
    my ($s,$c)=@_;#$c, how many decimals want to keep
    my $n1=10**$c;
    #print "$c\t$n1\n";
    return $s=int(($s*$n1)+0.5)/$n1;

}

###############################################################################
__END__
=head1 NAME
script - Using Getopt::Long and Pod::Usage
=head1 SYNOPSIS
script [options] [args ...]
Options: 

   -readcount_file           [file_pattern] readcounts file pattern like "/home/\*.readcounts" 
   -DP_filter                [Int] filter by depth, default is 1
   -help            brief help message

=head1 OPTIONS
=over 8
=item B<-help>
Print a brief help message and exits.
=back

=head1 DESCRIPTION
B<This program> will read the given input file(s) and do something
useful with the contents thereof.
=cut

Run perl script:

perl bam-readcount2VAF.pl  -readcount_file <readcount.file> -DP 5 (filter depths) > output.BED-VAF (bed format)

Desired output:

chrom   POS TOTAL_DP    REF ALT REF_DP  ALT_DP
2   90433340    32  A   C   28  4
2   90434511    46  A   +G  46  1
2   90450528    6   C   .   6   0
2   90450728    139 C   G,-C,+AA,-CA,+AAA,+A,A  134 1,1,4,1,1,11,2
2   91268528    5   T   G   2   3
2   91268534    5   T   G   2   3
2   91513592    10  A   G   9   1
2   91693774    16  A   .   16  0
2   91950203    88  A   -A,C,+C 86  1,1,3
2   91982340    8   C   .   8   0
2   92160021    51  C   G   3   48
2   92160047    56  G   A   3   53
2   92160058    57  G   A   4   53

Hope this can help you when you have same issue.

There are some other way to get desired VAF via samtools mpileup Question: Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File Only thing is it will take longer time to get desired results.

Thanks, Haitao

ADD COMMENT
0
Entering edit mode

From BED-VAF file to extract common paired sample's SNP site info, prepare for sciclone:

Extract common SNP site info

cat $1.readcounts.bed | tail -n +2 | awk '{print $1"\t"$2"\t"$2}' > $1.bed
cat $2.readcounts.bed | tail -n +2 | awk '{print $1"\t"$2"\t"$2}' > $2.bed
cat $3.readcounts.bed | tail -n +2 | awk '{print $1"\t"$2"\t"$2}' > $3.bed

bedtools intersect -a $1.bed -b  $2.bed > common.bed
bedtools intersect -a $3.bed -b common.bed > common2.bed
sort -k1,1 -k2,2n -k3,3n -u common2.bed > positions_sorted.bed

Based on common bed file: first two column extract VAF info and add header

awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' positions_sorted.bed $1.readcounts.bed > $1.vaf.txt
echo -e 'chr\tposition\tTotalDP\tRef\tRefDP\tAlt\tAltDP' | cat - $1.vaf.txt > temp && mv temp $1.vaf.txt

then to use this output file as sciclone input file or other type of analysis.

Thanks, Haitao

ADD REPLY
0
Entering edit mode

Hi Haitao,

first of all thanks for the script. I would like to know which snvs you take in for sciclone input. In you first post, you mentioned merging the two bam files. which will then contain all variants from both vcfs. Then you get the readcounts. Afterwards you use the bedtools intersect to just take in the common site. In this case all variants will be common because you had a merged file at the beginning. Probably you skip making the merged file in you updated approach?

Best Mo

ADD REPLY
0
Entering edit mode

for 2 91950203 88 A -A,C,+C 86 1,1,3

why ALT_DP + REF_DP != TOTAL_DP?

86 + 1 + 1 + 3 =91 not 88?

Thanks!

ADD REPLY
0
Entering edit mode
7.2 years ago
Ming Tommy Tang ★ 4.5k

Just for those who has the same problem, this may help https://github.com/sahilseth/bamreadcountr

ADD COMMENT

Login before adding your answer.

Traffic: 2838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6