My initial thought is to use bedtools bamtobed
to make a BED file from your BAMs. Then, it's easy to use coding strand data and the start/end positions to get what you want.
Here's my reasoning:
Let's say you have the following segment of a BED file:
chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +
chr7 127473530 127474697 Pos3 0 +
chr7 127474697 127475864 Pos4 0 +
chr7 127475864 127477031 Neg1 0 -
chr7 127477031 127478198 Neg2 0 -
chr7 127478198 127479365 Neg3 0 -
chr7 127479365 127480532 Pos5 0 +
chr7 127480532 127481699 Neg4 0 -
Pos1 has its 5' end at the number in column 2: 127471196
Neg1 has its 5' end at the number in column 3: 127477031
In that way it's pretty easy to create a table of 5' starting points based on the BED info. Let me know if you need coding help.
EDIT: Here's a script I busted out which will give you what you need. It takes a BED file through STDIN and spits a two-column tab-delimited table to STDOUT. Call it like this: perl scriptName.pl < [yourBed] > [yourTableFile]
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my $lineCounter = 1;
my %startPoints;
while(<>){
my @columns = split /\t/, $_;
my $startPoint;
if($columns[5] eq '+'){
$startPoint = $columns[1];
++$startPoints{$startPoint};
} elsif ($columns[5] eq '-') {
$startPoint = $columns[2];
++$startPoints{$startPoint};
} else {
die "Goofy BED Data on line $lineCounter:\n$_";
}
++$lineCounter;
}
for my $key (sort {$a <=> $b} keys %startPoints) {
print "$key\t$startPoints{$key}\n";
}
<3
Hey Dario, I was working on a similar idea and I could not work out your way. I finally managed to make it work by changing the way you read your .bam file because resize() can only be applied to "GRanges" object:
allReads <- readGAlignments("your.bam")
allReads <- as(allReads, "GRanges")
firstBases <- resize(allReads, 1)
startCounts <- coverage(firstBases)
What I really want to find out however is a way to include another column that contains the nucleotide in each position, so we can have the count and also the reference nucleotide of the specific count? Is there any simple way to attach another column (ie. the reference in .fasta), or maintain the info while scanning the .bam file?
Yes, functions sometimes become defunct and other functions provide the same functionality. To get the TSS nucleotide, you can use getSeq from BSgenome with a relevant BSgenome object (eg. Hsapiens in BSgenome.Hsapiens.UCSC.hg19).