Question

BAM File: Trim Reads So Only First Base Remains

1

Entering edit mode

22 months ago

vanbelj ▴ 40

I mapped single-end 50-bp reads to a genome and now have BAM files. I need to trim the reads within the BAM file so that only the first base remains (the first base called by the sequencer). Ideally it would output another BAM or bigwig file.

Does anyone know a package that can do this?

NGS BAM Trimming • 2.7k views

ADD COMMENT • link updated 22 months ago by GenoMax 150k • written 22 months ago by vanbelj ▴ 40

0

Entering edit mode

I need to trim the reads within the BAM file so that only the first base remains (the first base called by the sequencer)

Can you further clarify or illustrate?

ADD REPLY • link 22 months ago by GenoMax 150k

0

Entering edit mode

Essentially I want to use the full-length of the read to accurately map the read to the genome, but only want the first base to contribute to coverage counts. So after mapping reads to the genome, I want to trim each read so only its first base remains.

For instance in the attached picture I've highlight two reads. One maps to the Forward strand and the other maps to the Reverse strand.

The Forward read would be trimmed so that only the 5' T is retained. The Reverse read would be trimmed so that only the 3' A is retained.

If these were the only two reads present, the coverage plot would have values of 1 at base # 590 and 642 and coverage values of 0 everywhere else.

If you're wondering why I want to do this, the technique I use ligates an adapter to the piece of DNA that is closest to a protein of interest. Libraries are sequenced from this adapter, so the first base represents the closest location to the protein of interest. The other bases increase background signal.

Thanks for any help.

Example

ADD REPLY • link 22 months ago by vanbelj ▴ 40

0

Entering edit mode

Since you are interested in only position 1 where a mapping starts you could simply take that position from your BAM file (column 4 in SAM format) and add those up.

ADD REPLY • link 22 months ago by GenoMax 150k

0

Entering edit mode

Is this something that could be done with the samtools package, or would I need to use text processing such as awk? Does column 4 take strandedness into account? The description of column 4 says "leftmost mapping position" - does this mean the 5' base of the read?

ADD REPLY • link 22 months ago by vanbelj ▴ 40

0

Entering edit mode

Since the reads are always in 5' --> 3' direction it will always be the 5' end. In case you need to take strand information into account then you can process the bitwise flags in column 2 (16 for reverse strand).

ADD REPLY • link 22 months ago by GenoMax 150k

0

Entering edit mode

22 months ago

Jeremy ▴ 930

If you're willing to convert your file to SAM format, you can use the following AWK solution:

grep -v '^@' old.sam | awk '{$10=substr($10,1,1); print}' > new.sam

This will remove the header. To add the header back, you can use the following:

samtools view -H old.sam > header.sam
cat header.sam new.sam > final.sam

ADD COMMENT • link 22 months ago by Jeremy ▴ 930

0

Entering edit mode

While this may trim the sequence it is going to leave the quality string as is. So you will end up with a malformed SAM file. Also take into consideration the explanation that OP provided above.

ADD REPLY • link 22 months ago by GenoMax 150k

0

Entering edit mode

22 months ago

Pierre Lindenbaum 166k

I wrote https://jvarkit.readthedocs.io/en/latest/Biostar9566948/

$ java -jar dist/jvarkit.jar biostar9566948 -R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam   -S 
@HD	VN:1.6	SO:coordinate
@SQ	SN:RF01	LN:3302
@SQ	SN:RF02	LN:2687
@SQ	SN:RF03	LN:2592
@SQ	SN:RF04	LN:2362
@SQ	SN:RF05	LN:1579
@SQ	SN:RF06	LN:1356
@SQ	SN:RF07	LN:1074
@SQ	SN:RF08	LN:1059
@SQ	SN:RF09	LN:1062
@SQ	SN:RF10	LN:751
@SQ	SN:RF11	LN:666
@RG	ID:S1	SM:S1	LB:L1	CN:Nantes
@CO	biostar9566948. compilation:20230621100700 githash:8a9a881b htsjdk:3.0.4 date:20230621100755. cmd:-R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam -S
RF01_1_483_2:0:0_3:0:0_41	163	RF01	1	60	1=	=	414	483	G	2	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:60	XS:i:0
RF01_8_542_1:0:0_2:0:0_95	99	RF01	8	60	1=	=	473	535	A	2	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:69	XS:i:0
RF01_11_507_0:0:0_1:0:0_9e	99	RF01	11	60	1=	=	438	497	G	2	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:70	XS:i:0
RF01_12_501_0:0:0_2:0:0_62	99	RF01	12	60	1=	=	432	490	C	2	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:70	XS:i:0
RF01_27_590_3:0:0_1:0:0_68	163	RF01	27	60	1=	=	521	564	G	2	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:55	XS:i:0
RF01_44_622_1:0:0_1:0:0_3a	99	RF01	44	60	1=	=	553	579	C	2	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:65	XS:i:0

$ java -jar dist/jvarkit.jar biostar9566948 -R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam 
@HD	VN:1.6	SO:coordinate
@SQ	SN:RF01	LN:3302
@SQ	SN:RF02	LN:2687
@SQ	SN:RF03	LN:2592
@SQ	SN:RF04	LN:2362
@SQ	SN:RF05	LN:1579
@SQ	SN:RF06	LN:1356
@SQ	SN:RF07	LN:1074
@SQ	SN:RF08	LN:1059
@SQ	SN:RF09	LN:1062
@SQ	SN:RF10	LN:751
@SQ	SN:RF11	LN:666
@RG	ID:S1	SM:S1	LB:L1	CN:Nantes
@CO	biostar9566948. compilation:20230621100700 githash:8a9a881b htsjdk:3.0.4 date:20230621100859. cmd:-R src/test/resources/rotavirus_rf.fa src/test/resources/S1.bam
RF01_1_483_2:0:0_3:0:0_41	163	RF01	1	60	1=69S	=	414	483	GGCTATTAAAGCTATACAATGGGGCCGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACT	2222222222222222222222222222222222222222222222222222222222222222222222	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:60	XS:i:0
RF01_8_542_1:0:0_2:0:0_95	99	RF01	8	60	1=69S	=	473	535	AAAGCTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATG	2222222222222222222222222222222222222222222222222222222222222222222222	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:69	XS:i:0
RF01_11_507_0:0:0_1:0:0_9e	99	RF01	11	60	1=69S	=	438	497	GCTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATCCGC	2222222222222222222222222222222222222222222222222222222222222222222222	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:70	XS:i:0
RF01_12_501_0:0:0_2:0:0_62	99	RF01	12	60	1=69S	=	432	490	CTATACAATGGGGAAGTATAATCTAATCTTGTCAGAATATTTATCATTTATATATAACTCACAATCCGCA	2222222222222222222222222222222222222222222222222222222222222222222222	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:70	XS:i:0
RF01_27_590_3:0:0_1:0:0_68	163	RF01	27	60	1=69S	=	521	564	GTATCATCTAATCTTGTCATAATATTTATCATATATATATAACTCACAATCCGCAGTTCAAATTCCAATA	2222222222222222222222222222222222222222222222222222222222222222222222	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:55	XS:i:0
RF01_44_622_1:0:0_1:0:0_3a	99	RF01	44	60	1=69S	=	553	579	CAGAATATTTATCATTTATATATAACTCAGAATCCGCAGTTCAAATTCCAATATACTATTCTTCCAATAG	2222222222222222222222222222222222222222222222222222222222222222222222	MC:Z:M1	RG:Z:S1	NM:i:0	AS:i:65	XS:i:0

view raw Biostar9566948.md hosted with ❤ by GitHub

ADD COMMENT • link 22 months ago by Pierre Lindenbaum 166k

0

Entering edit mode

While OP had originally asked for this on further clarification it does not appear like they need to do this after all. ( see: BAM File: Trim Reads So Only First Base Remains )

ADD REPLY • link 22 months ago by GenoMax 150k

score 2 · Accepted Answer · 2023-06-20

R is good for this sort of thing. Very easy to do and examine many genomic manipulations.

library(GenomicRanges)
library(rtracklayer)

# read in BAM file
foo <- import("foo.bam")

# convert from GenomicAlignments to GRanges
foo <- GRanges(foo)

# look at first 3 elements
foo[1:3]

GRanges object with 3 ranges and 0 metadata columns:
  seqnames              ranges strand
<Rle>           <IRanges>  <Rle>
  [1]     chr1         28815-28854      -
  [2]     chr6 172095647-172095686      +
  [3]     chr7       101791-101830      -
  -------
  seqinfo: 24 sequences from an unspecified genome

# get just the first base
foo <- resize(foo, width=1, fix='start')

# examine after resizing to confirm strand
foo[1:3]

GRanges object with 3 ranges and 0 metadata columns:
  seqnames    ranges strand
<Rle> <IRanges>  <Rle>
  [1]     chr1     28854      -
  [2]     chr6 172095647      +
  [3]     chr7    101830      -
  -------
  seqinfo: 24 sequences from an unspecified genome

# export as bed
export(foo, "foo.bed")

# get coverage
foo_cov <- coverage(foo)

# export as bigwig
export(foo_cov, "foo_cov.bw")

If, for instance, you want to count how many reads overlap all your proteins of interest, you could import a bed file or GFF file describing the genomic locations of all your proteins and get counts as follows:

# count the number of overlapping reads for each protein    
proteinCounts <- countOverlaps(allmyproteins, foo)

The result would be a vector with the same number of elements as "allmyproteins" but each position would hold how many reads overlapped with the protein coordinates.

If you want a normalized track, also fairly simple:

# get total reads
nreads <- length(foo)

# rpm coverage, limit to 3 signficant digits
foo_rpm <- lapply(foo_cov, function(x) signif(10^6 * x/nreads,3))

# recast as SimpleRleList
foo_rpm <- as(foo_rpm,"SimpleRleList")

export.bw(foo_rpm, "foo_rpm.bw")