R is good for this sort of thing. Very easy to do and examine many genomic manipulations.
library(GenomicRanges)
library(rtracklayer)
# read in BAM file
foo <- import("foo.bam")
# convert from GenomicAlignments to GRanges
foo <- GRanges(foo)
# look at first 3 elements
foo[1:3]
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 28815-28854 -
[2] chr6 172095647-172095686 +
[3] chr7 101791-101830 -
-------
seqinfo: 24 sequences from an unspecified genome
# get just the first base
foo <- resize(foo, width=1, fix='start')
# examine after resizing to confirm strand
foo[1:3]
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 28854 -
[2] chr6 172095647 +
[3] chr7 101830 -
-------
seqinfo: 24 sequences from an unspecified genome
# export as bed
export(foo, "foo.bed")
# get coverage
foo_cov <- coverage(foo)
# export as bigwig
export(foo_cov, "foo_cov.bw")
If, for instance, you want to count how many reads overlap all your proteins of interest, you could import a bed file or GFF file describing the genomic locations of all your proteins and get counts as follows:
# count the number of overlapping reads for each protein
proteinCounts <- countOverlaps(allmyproteins, foo)
The result would be a vector with the same number of elements as "allmyproteins" but each position would hold how many reads overlapped with the protein coordinates.
If you want a normalized track, also fairly simple:
# get total reads
nreads <- length(foo)
# rpm coverage, limit to 3 signficant digits
foo_rpm <- lapply(foo_cov, function(x) signif(10^6 * x/nreads,3))
# recast as SimpleRleList
foo_rpm <- as(foo_rpm,"SimpleRleList")
export.bw(foo_rpm, "foo_rpm.bw")
Can you further clarify or illustrate?
Essentially I want to use the full-length of the read to accurately map the read to the genome, but only want the first base to contribute to coverage counts. So after mapping reads to the genome, I want to trim each read so only its first base remains.
For instance in the attached picture I've highlight two reads. One maps to the Forward strand and the other maps to the Reverse strand.
The Forward read would be trimmed so that only the 5' T is retained. The Reverse read would be trimmed so that only the 3' A is retained.
If these were the only two reads present, the coverage plot would have values of 1 at base # 590 and 642 and coverage values of 0 everywhere else.
If you're wondering why I want to do this, the technique I use ligates an adapter to the piece of DNA that is closest to a protein of interest. Libraries are sequenced from this adapter, so the first base represents the closest location to the protein of interest. The other bases increase background signal.
Thanks for any help.
Since you are interested in only position 1 where a mapping starts you could simply take that position from your BAM file (column 4 in SAM format) and add those up.
Is this something that could be done with the
samtools
package, or would I need to use text processing such asawk
? Does column 4 take strandedness into account? The description of column 4 says "leftmost mapping position" - does this mean the 5' base of the read?Since the reads are always in 5' --> 3' direction it will always be the 5' end. In case you need to take strand information into account then you can process the
bitwise flags
in column 2 (16 for reverse strand).