I know of no tool that does this, and there are probably good reasons for that.
In the case of most protocols that use UMIs, the UMI alone simply isn't diverse enough to uniquely identify a pre-PCR molecule. Consider: for deduplicating only on a UMI to work, it has to be far more likely that two reads with the same UMI are PCR duplicates than that two independent molecules got the same UMI. With a 10nt UMI there are about 1 million possible UMI sequences (4^10 = 1,048,576). A standard RNA-seq library, for example, might contain around 30 million reads, so each UMI sequence must be shared by roughly 30 reads on average. But the situation is worse than this: UMIs can contain sequencing errors, so software like UMI-tools doesn't just assume that two reads with the same UMI sequence are duplicates, but that two reads with similar UMI sequences are duplicates. Finally, usage of supposedly random UMI sequences is not actually random: some sequences are more likely to be used than others. Thus, to distinguish duplicates from two reads that just happen to have got the same UMI sequence, we need more information.
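As a rough back-of-the-envelope check of these numbers, here is a minimal sketch. It assumes perfectly uniform UMI usage, which, as noted above, is optimistic.

```python
# Rough arithmetic: how many reads share each possible UMI sequence?
UMI_LENGTH = 10        # nt, as in the example above
N_READS = 30_000_000   # approximate size of a standard RNA-seq library

n_possible_umis = 4 ** UMI_LENGTH          # 4 possible bases at each position
reads_per_umi = N_READS / n_possible_umis  # assumes uniform UMI usage

print(f"possible UMIs: {n_possible_umis:,}")       # possible UMIs: 1,048,576
print(f"mean reads per UMI: {reads_per_umi:.1f}")  # mean reads per UMI: 28.6
```

With tens of reads sharing every UMI sequence, a shared UMI on its own says very little about whether two reads are duplicates.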
What information is appropriate depends on the protocol that created the data. Simply put, if PCR happens after fragmentation, then reads with different mapping co-ordinates are likely to have come from different molecules. This applies to techniques like iCLIP, 4C, ChIP-seq and standard RNA-seq. In this case you might find duplicates by looking for reads with the same complete read sequence, both the cDNA part and the UMI part, but you will miss reads that simply have similar sequences due to a sequencing or PCR error.
In other cases fragmentation comes after PCR. In this case two reads from the same original molecule can have different mapping co-ordinates, but there are limits: in 3' end tagging RNA-seq (e.g. droplet-based single-cell RNA-seq) two reads coming from different genes cannot be PCR duplicates. In amplicon sequencing, two reads from different amplicons cannot be from the same molecule. In this case you cannot use the rest of the read to decide if something is a duplicate or not, and there really isn't anything you can do without mapping or transcript assignment of some sort.
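The difference between the two situations can be sketched as a choice of grouping key. This is an illustration only: the Read record and the function names are hypothetical, strand and UMI errors are ignored, and the chrom/pos/gene fields stand in for whatever mapping or transcript assignment provides.

```python
from collections import namedtuple

# Hypothetical minimal read record. In real data the chrom/pos/gene fields
# come from mapping or transcript assignment.
Read = namedtuple("Read", ["umi", "chrom", "pos", "gene"])

def position_key(read):
    """PCR after fragmentation: duplicates share mapping position and UMI."""
    return (read.chrom, read.pos, read.umi)

def gene_key(read):
    """Fragmentation after PCR (e.g. 3' end tagging): duplicates share gene and UMI."""
    return (read.gene, read.umi)

def count_molecules(reads, key):
    """Count distinct groups; each group is treated as one pre-PCR molecule."""
    return len({key(r) for r in reads})

reads = [
    Read("ATAT", "chr1", 100, "geneA"),
    Read("ATAT", "chr1", 100, "geneA"),  # same position and UMI
    Read("ATAT", "chr1", 160, "geneA"),  # same gene and UMI, different position
    Read("ATAT", "chr2", 500, "geneB"),  # same UMI, different gene
]

print(count_molecules(reads, position_key))  # 3
print(count_molecules(reads, gene_key))      # 2
```

Either key needs information that is not in the raw read, which is why the UMI alone is not enough.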
Solutions
I strongly recommend that you follow standard workflows. Otherwise you might try:
- If you are doing droplet-based single-cell RNA-seq (e.g. 10x Chromium or Drop-seq) and are looking for a low-resource way to process the data, you might like to try the newly released alevin, which takes fastqs and outputs gene counts, and does so with a 10x reduction in time and memory compared to tools like Cell Ranger. It incorporates a UMI-deduplication algorithm inspired by UMI-tools, but which properly deals with transcript ambiguity. alevin is part of the latest release of salmon and can be obtained at https://github.com/COMBINE-lab/salmon
- If your protocol has PCR and UMI addition only after fragmentation, then the tool tally will de-duplicate identical fastq records. You should be able to just pass in your raw reads. But be aware that any sequencing or PCR error will cause a read to be treated as unique when it is in fact a duplicate.
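To illustrate what de-duplicating on exact identity means, and why a single error defeats it, here is a minimal sketch. It is not how tally itself is implemented, and the function name and example reads are hypothetical.

```python
def dedup_exact(sequences):
    """Keep the first occurrence of each exact sequence, preserving order."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique

# Hypothetical reads: a 4 nt UMI followed by the cDNA sequence.
reads = [
    "ATATGGCTAGCTAA",
    "ATATGGCTAGCTAA",  # exact duplicate, removed
    "ATATGGCTAGCTAT",  # differs at one base (e.g. a sequencing error), kept
]
print(dedup_exact(reads))
```

The third read survives even if it is a copy of the same molecule, because exact matching has no notion of "similar".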
The latest versions of UMI-tools have an importable Python module which you can use for implementing your own deduplication procedure using UMI-tools' error-aware barcode collapsing algorithm:
from umi_tools.network import UMIClusterer
# umis is a list of UMI sequences, e.g. ["ATAT", "GTAT", "CCAT"]
# counts is a dictionary mapping these UMIs to counts, eg:
# {"ATAT":10, "GTAT":3, "CCAT": 5}
# threshold is the edit distance threshold at which to cluster.
clusterer = UMIClusterer(cluster_method="directional")
clusters = clusterer(umis, counts, threshold)
# clusters is now a list of lists, where each sub list is a cluster of
# umis we believe are PCR duplicates, e.g.
# [["ATAT", "GTAT"], ["CCAT"]]
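For intuition about what error-aware collapsing does, here is a dependency-free sketch of the directional idea as described in the UMI-tools paper: UMI a absorbs UMI b when they are within the edit-distance threshold and count(a) >= 2 * count(b) - 1. This is not UMI-tools' actual implementation, the function names are hypothetical, and Hamming distance stands in for edit distance.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def directional_clusters(counts, threshold=1):
    """Group UMIs into clusters, each seeded by its most abundant UMI."""
    # Directed edge a -> b: b plausibly arose from a by an error.
    children = {
        a: [b for b in counts
            if a != b
            and hamming(a, b) <= threshold
            and counts[a] >= 2 * counts[b] - 1]
        for a in counts
    }
    clusters, assigned = [], set()
    # Visit UMIs from most to least abundant so seeds are the likely originals.
    for seed in sorted(counts, key=lambda u: (-counts[u], u)):
        if seed in assigned:
            continue
        cluster, stack = [], [seed]
        assigned.add(seed)
        while stack:
            umi = stack.pop()
            cluster.append(umi)
            for child in children[umi]:
                if child not in assigned:
                    assigned.add(child)
                    stack.append(child)
        clusters.append(cluster)
    return clusters

counts = {"ATAT": 10, "GTAT": 3, "CCAT": 5}
print(directional_clusters(counts, threshold=1))
# [['ATAT', 'GTAT'], ['CCAT']]
```

GTAT is one mismatch from the much more abundant ATAT and is absorbed into its cluster; CCAT is two mismatches from both and stays separate.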
Use alevin if your data is of the right type; otherwise I do recommend the standard methods (i.e. not 1 or 2 above). With the exception of Cell Ranger, we find that actually processing the fastq file is the most time-consuming part of the pipeline and that memory usage is dominated by the mapper, which is independent of the size of the input, so you will not gain much, in terms of time or memory, by deduplicating before mapping anyway.
CoI: I am the author of UMI-tools and an author on the alevin paper.
Hi i.sudbery,
I was looking for clarification on tally. Does it de-duplicate identical reads based on the sequence or the UMI?