BAM File: Trim Reads So Only First Base Remains
3
1
Entering edit mode
17 months ago
vanbelj ▴ 40

I mapped single-end 50-bp reads to a genome and now have BAM files. I need to trim the reads within the BAM file so that only the first base remains (the first base called by the sequencer). Ideally it would output another BAM or bigwig file.

Does anyone know a package that can do this?

NGS BAM Trimming • 2.2k views
ADD COMMENT
0
Entering edit mode

I need to trim the reads within the BAM file so that only the first base remains (the first base called by the sequencer)

Can you further clarify or illustrate?

ADD REPLY
0
Entering edit mode

Essentially I want to use the full-length of the read to accurately map the read to the genome, but only want the first base to contribute to coverage counts. So after mapping reads to the genome, I want to trim each read so only its first base remains.

For instance in the attached picture I've highlight two reads. One maps to the Forward strand and the other maps to the Reverse strand.

The Forward read would be trimmed so that only the 5' T is retained. The Reverse read would be trimmed so that only the 3' A is retained.

If these were the only two reads present, the coverage plot would have values of 1 at base # 590 and 642 and coverage values of 0 everywhere else.

If you're wondering why I want to do this, the technique I use ligates an adapter to the piece of DNA that is closest to a protein of interest. Libraries are sequenced from this adapter, so the first base represents the closest location to the protein of interest. The other bases increase background signal.

Thanks for any help.

Example

ADD REPLY
0
Entering edit mode

Since you are interested in only position 1 where a mapping starts you could simply take that position from your BAM file (column 4 in SAM format) and add those up.

ADD REPLY
0
Entering edit mode

Is this something that could be done with the samtools package, or would I need to use text processing such as awk? Does column 4 take strandedness into account? The description of column 4 says "leftmost mapping position" - does this mean the 5' base of the read?

ADD REPLY
0
Entering edit mode

Since the reads are always in 5' --> 3' direction it will always be the 5' end. In case you need to take strand information into account then you can process the bitwise flags in column 2 (16 for reverse strand).

ADD REPLY
2
Entering edit mode
17 months ago
seidel 11k

R is good for this sort of thing. Very easy to do and examine many genomic manipulations.

library(GenomicRanges)
library(rtracklayer)

# read in BAM file
foo <- import("foo.bam")

# convert from GenomicAlignments to GRanges
foo <- GRanges(foo)

# look at first 3 elements
foo[1:3]

GRanges object with 3 ranges and 0 metadata columns:
  seqnames              ranges strand
<Rle>           <IRanges>  <Rle>
  [1]     chr1         28815-28854      -
  [2]     chr6 172095647-172095686      +
  [3]     chr7       101791-101830      -
  -------
  seqinfo: 24 sequences from an unspecified genome

# get just the first base
foo <- resize(foo, width=1, fix='start')

# examine after resizing to confirm strand
foo[1:3]

GRanges object with 3 ranges and 0 metadata columns:
  seqnames    ranges strand
<Rle> <IRanges>  <Rle>
  [1]     chr1     28854      -
  [2]     chr6 172095647      +
  [3]     chr7    101830      -
  -------
  seqinfo: 24 sequences from an unspecified genome

# export as bed
export(foo, "foo.bed")

# get coverage
foo_cov <- coverage(foo)

# export as bigwig
export(foo_cov, "foo_cov.bw")

If, for instance, you want to count how many reads overlap all your proteins of interest, you could import a bed file or GFF file describing the genomic locations of all your proteins and get counts as follows:

# count the number of overlapping reads for each protein    
proteinCounts <- countOverlaps(allmyproteins, foo)

The result would be a vector with the same number of elements as "allmyproteins" but each position would hold how many reads overlapped with the protein coordinates.

If you want a normalized track, also fairly simple:

# get total reads
nreads <- length(foo)

# rpm coverage, limit to 3 signficant digits
foo_rpm <- lapply(foo_cov, function(x) signif(10^6 * x/nreads,3))

# recast as SimpleRleList
foo_rpm <- as(foo_rpm,"SimpleRleList")

export.bw(foo_rpm, "foo_rpm.bw")
ADD COMMENT
0
Entering edit mode

Thanks, this works for the most part. I had to make two changes to bypass errors: 1) add quotes around start in foo <- resize(foo, width=1, fix='start'). 2) use coverage(foo) in place of cov(foo)

Is is possible to normalize coverage values to CPM? I'd like to make a CPM normalized version in addition to the raw counts version.

Thanks!

ADD REPLY
0
Entering edit mode

Oh great! Ack! Sorry about those errors. I must have messed it up while formatting between editors. (thanks for pointing them out. fixed above). I added a bit about making a reads per million normalized coverage track. CPM for your proteins would be easy to calculate as well (if that's what you're trying to do).

ADD REPLY
0
Entering edit mode
17 months ago
Jeremy ▴ 930

If you're willing to convert your file to SAM format, you can use the following AWK solution:

grep -v '^@' old.sam | awk '{$10=substr($10,1,1); print}' > new.sam

This will remove the header. To add the header back, you can use the following:

samtools view -H old.sam > header.sam
cat header.sam new.sam > final.sam
ADD COMMENT
0
Entering edit mode

While this may trim the sequence it is going to leave the quality string as is. So you will end up with a malformed SAM file. Also take into consideration the explanation that OP provided above.

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

While OP had originally asked for this on further clarification it does not appear like they need to do this after all. ( see: BAM File: Trim Reads So Only First Base Remains )

ADD REPLY

Login before adding your answer.

Traffic: 2610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6