UMI-tools deduplication of reads with UMI at the start of the line
1
0
Entering edit mode
22 months ago
Uri • 0

Hi,

I am trying to do a deduplication with UMI-tools. The data I am using arrived to me after UMI extraction, so I did not use UMI-tools extract.

As a result, the UMIs are the first 11 chars of the first fastq read line, and do not appear at the end of the line or on the read itself. For example, my fastq read looks like this, where the highlighted characters are the UMI:

@TGTAGGGAGTG:A01350:127:HJHTVDRX2:1:2101:21856:1016 1:N:0:AGTGACCT+CTCCTAGA
#^_________^
GNAGCCCTTCCGGATTCAGGATGTGCAGCATGTCGTCAAAGATGGGCTGCAAGTCACAGGGGTCCAGGCGGACCTGCCGCAACACT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The problem is that the UMI-tools dedup try to take the last chars of the line as the UMI. How do I ''tell'' UMI-tools to take the umis as the first 11 chars from the read name?

Alignment was done with STAR, sorting and indexing later on were done with samtools. The example above is of the fastq file and not from SAM file.

I'll appreciate any help!

Uri

UMI UMI-tools • 1.8k views
ADD COMMENT
0
Entering edit mode

Curious as to how the UMI got to the place where it is at. No program I know of does this so this may have been done by some custom manipulation. You may want to find the original data and go from there, if possible.

ADD REPLY
2
Entering edit mode
22 months ago

No, currently there is no way built into UMI-tools to do this. The best thing to do would be to write a short script to move the uni, either in the fast before mapping and move it to the end of the read name. Or after mapping and move it to a tag.

ADD COMMENT
0
Entering edit mode

From the fastq, in python, with biopython:

import Bio.SeqIO

output = open("Modified.fastq", "w")

for record in Bio.SeqIO.parse("my_file.fastq"):
    name = record.id
    parts = name[1:].strip().split(":")
    name = @+":".join(part[1:] + part[0])
    record.id = name
    Bio.SeqIO.write(record, output, "fastq")

output.close()

Or for the BAM, in python, with pysam

import pysam

inbam = pysam.AlignmentFile("my_bam.bam")
outbam = pysam.AlignmentFile("modified.bam", "wb", template=inbam)

for read in inbam.fetch(until_eof=True):
    name = read.query_name
    parts = name.split(":")
    read.query_name = ":".join(parts[1:])
    read.set_tag("RX", parts[0])
    outbam.write(read)

outbam.close()
ADD REPLY
0
Entering edit mode

Thanks a lot for your help!

I have thechnical issues with pysam so I tried some variations of the fastq script and its looks good!

you think I should also change record.description?

ADD REPLY
1
Entering edit mode

You can do if you like, but as far as I'm aware the description part of the I'd line is ignored by most aligners.

ADD REPLY

Login before adding your answer.

Traffic: 2729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6