The read headers are long. I want to use UMI-tools. When I extract the UMI and add it to the header ('_' separated), it is appended to the read ID, @ST-E00205:943:HCF3YCCX2:4:1101:11495:1678, i.e. before the first space. So when I group the UMIs, they are treated as unique, probably because of the part of the header that follows. I therefore want to discard the part after the space (1:N:0:NCCACGCG+NGATCTCG). How can I do that? Thanks.
@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678 1:N:0:NCCACGCG+NGATCTCG
ACCGGATGGTAGACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG
+
7-<--7--7-7F-----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---
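One way to drop everything after the first space is a small script run over the FASTQ before UMI extraction. The following is only a minimal sketch, assuming plain 4-line FASTQ records (no wrapped sequences); the function name strip_header_comment is made up for illustration, and the sequence and quality strings in the demo are shortened from the read above.

```python
import io


def strip_header_comment(fastq_in, fastq_out):
    """Copy FASTQ records, keeping only the read ID on each header line.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    """
    for i, line in enumerate(fastq_in):
        if i % 4 == 0:  # header line of each record
            # Keep the text before the first space, drop the rest.
            line = line.rstrip("\n").split(" ", 1)[0] + "\n"
        fastq_out.write(line)


# Demo on the record from the question (sequence/qualities shortened).
record = (
    "@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678 1:N:0:NCCACGCG+NGATCTCG\n"
    "ACCGGATGGT\n"
    "+\n"
    "7-<--7--7-\n"
)
out = io.StringIO()
strip_header_comment(io.StringIO(record), out)
print(out.getvalue(), end="")
```

For real files the same function can be given open file handles (or gzip.open(..., "rt") handles for compressed input) instead of the in-memory strings used here.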
Hello Ian, old question, but I just ran into the same problem mentioned above without being able to find a suitable answer.
When I process my dual-indexed reads, the UMI is appended to the first part of the header, before the space.
However, this naming scheme (read:is filtered:control number:index barcodes) is retained in the BAM file:
If I look into the 'deduplicated_per_umi.tsv' file, the UMIs are labeled as:
Am I just misinterpreting the output or is there something else I should be doing?
That's odd, I've never seen an aligner do that before. What aligner are you using?
We use BBTools bbmap for the mapping, as it has performed most consistently with our data (gammaherpesvirus).
Upon further review of the data, I assume that the multiple entries with the same UMI in the 'deduplicated_per_umi.tsv' are due to differences in the dual indexes caused by sequencing errors. My plan is to run demuxbyname.sh from the BBMap suite prior to using UMI-tools. It won't address the naming scheme (read:is filtered:control number:index barcodes) being appended to the UMI, but should hopefully limit the number of entries.
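To illustrate why the naming scheme ends up attached to the UMI, here is a rough sketch. It assumes the UMI is read as the last '_'-separated field of the read name, and that the read name in the BAM still contains the part after the space. The UMI ACGTACGT and both function names are hypothetical, used only for the example.

```python
def umi_from_full_name(name, sep="_"):
    """Take the last sep-separated field of the whole read name."""
    return name.split(sep)[-1]


def umi_from_read_id(name, sep="_"):
    """Cut the name at the first space first, then take the last field."""
    read_id = name.split(" ", 1)[0]
    return read_id.split(sep)[-1]


# Hypothetical read name: real read ID + made-up UMI + original comment.
name = (
    "ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_ACGTACGT "
    "1:N:0:NCCACGCG+NGATCTCG"
)

print(umi_from_full_name(name))  # UMI with the index barcodes still attached
print(umi_from_read_id(name))    # UMI only
```

Under these assumptions, two reads with the same UMI but index reads that differ by a sequencing error would get different labels in the first case and the same label in the second.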