Add tags to BAM/SAM file

8.1 years ago

dzb • 0

I'm working with some RNA-seq data. I have alignments done in STAR and the resulting BAM file. I'd like to annotate these alignments with custom tags using data stored in a separate file. Each record in the second file contains a read ID, a UMI, and a barcode. I want to add the barcode and UMI to every read in the BAM file whose read ID matches one in the second file.

To summarise:

First file: BAM output from STAR

Second file: Read ID (matching those in STAR BAM file), UMI, barcode.

How do I get the UMI and barcode in file 2 tagged onto the reads in the BAM file?

Intensive Google and forum searching has yielded little info about this, but I have a feeling there's a simple answer. Can anyone help?

alignment RNA-Seq sequencing BAM SAM • 9.2k views

updated 8.1 years ago by Matt Shirley 10k • written 8.1 years ago by dzb • 0

Please post example lines from the two files.

8.1 years ago by GenoMax 153k

8.1 years ago

Matt Shirley 10k

I don't think there's a simple answer for this question, but if you want to write a script you might find simplesam (https://github.com/mdshw5/simplesam) useful:

	import simplesam

	# Map read ID -> (UMI, barcode). Holding this entire file in memory
	# could use a TON of RAM if you have lots of reads.
	barcodes = {}
	with open('read_id_barcode_umi.txt') as barcodes_file:
	    for line in barcodes_file:
	        # split() handles ' ' and \t; check the delimiter in this file
	        # and use split(',') instead if it's comma-separated
	        read_id, umi, barcode = line.rstrip().split()
	        barcodes[read_id] = (umi, barcode)

	# set the tag names - take a look at the SAM spec to pick appropriate ones
	barcode_tag = 'ZB'
	umi_tag = 'ZU'

	with simplesam.Reader(open('in.bam')) as in_bam:
	    with simplesam.Writer(open('out.sam', 'w'), in_bam.header) as out_sam:
	        for read in in_bam:
	            # only tag reads that have an entry in the barcode file;
	            # unmatched reads are written out unchanged
	            if read.qname in barcodes:
	                umi, barcode = barcodes[read.qname]
	                read[umi_tag] = umi
	                read[barcode_tag] = barcode
	            out_sam.write(read)
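
If you'd rather not depend on a library, the same idea can be sketched with plain text handling of SAM lines, since optional fields are just tab-separated TAG:TYPE:VALUE entries appended after the mandatory columns. This is only a rough sketch: the function names `load_barcodes` and `tag_sam_line` are made up for illustration, the `ZU`/`ZB` tag names are carried over from the script above, and the column order (read ID, UMI, barcode) is assumed from the question.

```python
import io


def load_barcodes(handle, delimiter=None):
    """Build a dict of read ID -> (UMI, barcode).

    Assumes three columns in the order read ID, UMI, barcode.
    delimiter=None splits on any whitespace; pass ',' for CSV.
    """
    barcodes = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        read_id, umi, barcode = line.split(delimiter)
        barcodes[read_id] = (umi, barcode)
    return barcodes


def tag_sam_line(line, barcodes, umi_tag="ZU", barcode_tag="ZB"):
    """Append UMI and barcode tags (string type 'Z') to one SAM line.

    Header lines (starting with '@') and reads without an entry
    in `barcodes` are returned unchanged.
    """
    if line.startswith("@"):
        return line
    record = line.rstrip("\n")
    qname = record.split("\t", 1)[0]  # QNAME is the first SAM column
    if qname not in barcodes:
        return line
    umi, barcode = barcodes[qname]
    return f"{record}\t{umi_tag}:Z:{umi}\t{barcode_tag}:Z:{barcode}\n"


# Small in-memory demo with made-up records
barcodes = load_barcodes(io.StringIO("read1\tACGTAC\tTTGCA\n"))
sam_lines = [
    "@HD\tVN:1.6\n",
    "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\n",
]
tagged = [tag_sam_line(line, barcodes) for line in sam_lines]
```

One way to use something like this would be to stream the alignments as text, for example piping `samtools view -h in.bam` through a script that calls `tag_sam_line` on each line of standard input, then converting back with `samtools view -b`. The lookup table still has to fit in memory, so the caveat from the script above applies here too.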