Hi,
I am trying to deduplicate reads with UMI-tools. The data had already been through UMI extraction when I received it, so I did not run UMI-tools extract.
As a result, the UMI is the first 11 characters of the read name (the first line of each FASTQ record); it does not appear at the end of that line or in the read sequence itself. For example, my FASTQ record looks like this, where the marked characters are the UMI:
@TGTAGGGAGTG:A01350:127:HJHTVDRX2:1:2101:21856:1016 1:N:0:AGTGACCT+CTCCTAGA
#^_________^
GNAGCCCTTCCGGATTCAGGATGTGCAGCATGTCGTCAAAGATGGGCTGCAAGTCACAGGGGTCCAGGCGGACCTGCCGCAACACT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The problem is that UMI-tools dedup tries to take the UMI from the last characters of the read name. How do I tell UMI-tools to take the UMI from the first 11 characters of the read name instead?
Alignment was done with STAR; sorting and indexing were then done with samtools. The example above is from the FASTQ file, not from the SAM file.
I'd appreciate any help!
Uri
I'm curious how the UMI ended up where it is. No program I know of puts it there, so this may be the result of some custom manipulation. If possible, you may want to find the original data and start from there.
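If the original data cannot be recovered, one possible workaround is to rewrite the read names so the UMI sits at the end of the read ID after an underscore, which is where umi_tools dedup looks by default (--extract-umi-method=read_id with --umi-separator=_). Below is a minimal Python sketch of that renaming. It assumes every header has the form @<11-base UMI>:<rest of read ID>, as in the example above; the function names move_umi_to_end and rewrite_fastq are made up for illustration and are not part of UMI-tools.

```python
from typing import Iterable, Iterator


def move_umi_to_end(header: str, umi_len: int = 11, sep: str = "_") -> str:
    """Move a leading UMI in a FASTQ header to the end of the read ID.

    Assumes the header looks like '@<UMI>:<read id> <optional comment>'.
    """
    # The read ID is everything before the first space; the rest is a comment.
    read_id, _, comment = header.rstrip("\n").partition(" ")
    if not read_id.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")

    umi = read_id[1:1 + umi_len]
    rest = read_id[1 + umi_len:]
    # Refuse to guess if the header does not match the assumed layout.
    if len(umi) != umi_len or not rest.startswith(":"):
        raise ValueError(f"no {umi_len}-base UMI prefix in: {header!r}")

    new_id = "@" + rest[1:] + sep + umi
    return new_id + (" " + comment if comment else "")


def rewrite_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with the header (every 4th line) rewritten."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield move_umi_to_end(line) if i % 4 == 0 else line


if __name__ == "__main__":
    example = ("@TGTAGGGAGTG:A01350:127:HJHTVDRX2:1:2101:21856:1016 "
               "1:N:0:AGTGACCT+CTCCTAGA")
    print(move_umi_to_end(example))
```

The renamed FASTQ would then have to be aligned, sorted and indexed again before running dedup, because the read names in the existing BAM still carry the UMI at the front. The same string manipulation could in principle be applied to the read names in the BAM instead, but that is not shown here.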