Question

UMI-tools deduplication of reads with UMI at the start of the line

0

Entering edit mode

2.5 years ago

Uri • 0

Hi,

I am trying to do a deduplication with UMI-tools. The data I am using arrived to me after UMI extraction, so I did not use UMI-tools extract.

As a result, the UMIs are the first 11 chars of the first fastq read line, and do not appear at the end of the line or on the read itself. For example, my fastq read looks like this, where the highlighted characters are the UMI:

@TGTAGGGAGTG:A01350:127:HJHTVDRX2:1:2101:21856:1016 1:N:0:AGTGACCT+CTCCTAGA
#^_________^
GNAGCCCTTCCGGATTCAGGATGTGCAGCATGTCGTCAAAGATGGGCTGCAAGTCACAGGGGTCCAGGCGGACCTGCCGCAACACT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The problem is that the UMI-tools dedup try to take the last chars of the line as the UMI. How do I ''tell'' UMI-tools to take the umis as the first 11 chars from the read name?

Alignment was done with STAR, sorting and indexing later on were done with samtools. The example above is of the fastq file and not from SAM file.

I'll appreciate any help!

Uri

UMI UMI-tools • 2.3k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 2.5 years ago by Uri • 0

0

Entering edit mode

Curious as to how the UMI got to the place where it is at. No program I know of does this so this may have been done by some custom manipulation. You may want to find the original data and go from there, if possible.

ADD REPLY • link 2.5 years ago by GenoMax 152k

score 2 · Answer 1 · 2023-01-11

2

Entering edit mode

2.5 years ago

i.sudbery 21k

No, currently there is no way built into UMI-tools to do this. The best thing to do would be to write a short script to move the uni, either in the fast before mapping and move it to the end of the read name. Or after mapping and move it to a tag.

ADD COMMENT • link 2.5 years ago by i.sudbery 21k

0

Entering edit mode

From the fastq, in python, with biopython:

import Bio.SeqIO

output = open("Modified.fastq", "w")

for record in Bio.SeqIO.parse("my_file.fastq"):
    name = record.id
    parts = name[1:].strip().split(":")
    name = @+":".join(part[1:] + part[0])
    record.id = name
    Bio.SeqIO.write(record, output, "fastq")

output.close()

Or for the BAM, in python, with pysam

import pysam

inbam = pysam.AlignmentFile("my_bam.bam")
outbam = pysam.AlignmentFile("modified.bam", "wb", template=inbam)

for read in inbam.fetch(until_eof=True):
    name = read.query_name
    parts = name.split(":")
    read.query_name = ":".join(parts[1:])
    read.set_tag("RX", parts[0])
    outbam.write(read)

outbam.close()

ADD REPLY • link 2.5 years ago by i.sudbery 21k

0

Entering edit mode

Thanks a lot for your help!

I have thechnical issues with pysam so I tried some variations of the fastq script and its looks good!

you think I should also change record.description?

ADD REPLY • link 2.5 years ago by Uri • 0

1

Entering edit mode

You can do if you like, but as far as I'm aware the description part of the I'd line is ignored by most aligners.

ADD REPLY • link 2.5 years ago by i.sudbery 21k