De-duplicate UMI at FASTQ level
2
3
Entering edit mode
6.1 years ago
jomo018 ▴ 730

I am looking for a tool to de-duplicate FASTQ files based on UMI which are known per each read. The tool would likely pool identical/similar UMI and check for high similarity between the reads of each pool.

The purpose is to reduce processing down the pipeline, in particular to allow mapping of rna-seq to transcriptome rather than to genome.

UMI fastq • 14k views
ADD COMMENT
12
Entering edit mode
6.1 years ago

I know of no tool that can do does this, and there are probably good reasons for that.

In the case of most protocols that use UMIs, the UMI alone simply isn't unique enough to uniquely identify a pre-PCR molecule. Consider: For deduplicating only on a UMI to work, it has to be far more likley that two reads with the same UMI are PCR duplicates than that two independent molecules got the same UMI. With a 10nt UMI there are 1 million different possible UMI sequences. A standard RNAseq library, for example, might contain around 30 million reads. But the situation is worst than this: UMIs can containing sequencing errors, thus sfotware like UMI-tools doesn't just assume that two reads with the same UMI sequences are duplicates, but that two reads with similar UMI sequences are duplicates. Finally usage of supposedly random UMI sequences is not actually random: some sequences are more likely to be used than others. Thus to distinguish duplicates from two reads that just happen to have got the same UMI sequence, we need more information.

What information is appropriate depends on the protocol that created the data. Simply put, if PCR happens after fragmentation, then reads with different mapping co-ordinates are likely to have come from different molecules. This applies to techniques like iCLIP, 4C, ChIP-seq and standard RNA-seq. In this case you might find that duplicates can the same complete read sequence, but the cDNA part and the UMI part, but you will missing things that simply have similar sequences do to a sequencing or PCR error.

In other cases fragmentation comes after PCR. In this case in this case two reads from the same original molecule can have different mapping co-ordinates, but there are limits: in 3' end tagging RNA seq (e.g. droplet based single cell RNA seq) two reads coming from different genes cannot be PCR duplicates. In amplicon sequencing, two reads from different amplicons cannot be from the same molecule. In this case you cannot use the rest of the read to decide if something is a duplicate or not, and there really isn't anything you can do without mapping or transcript assignment of some sort

Solutions

I strongly recommend that you follow standard workflows. Otherwise you might try:

  1. If you are doing droplet based single-cell RNA-seq (e.g. 10X chromium or drop-seq) and are looking for a low resource way to process the data, you might like to try the newly released alevin, which takes fastqs and outputs genecounts, and does so with a 10x time and memory reduction compared to tools like Cell Ranger. It encorporates an UMI-deduplication algo inspired by UMI-tools, but which properly deals with transcript ambiguity. alevin is part of the lastest release of salmon and can be obtained at https://github.com/COMBINE-lab/salmon
  2. If your protocol has PCR and UMI addition only after fragmentation, then the tool tally will de-duplicate identical fastq records. You should be able to just pass in your raw reads. But be aware that any sequencing or PCR errors will mark a read as not a duplicate when it is.
  3. The latest versions of UMI-tools has an importable python module which you can use for implementing your own deduplication proceedure using UMI-tools' error aware barcode collapsing algorithm:

    from umi_tools.network import UMIClusterer
    
    # umis is a list of UMI sequences, e.g. ["ATAT", "GTAT", "CCAT"]
    # counts is a dictionary mapping these UMIs to counts, eg:
    # {"ATAT":10, "GTAT":3, "CCAT": 5}
    # threshold is the edit distance threshold at which to cluster.
    
    clusterer = UMIClusterer(cluster_method="directional")
    clusters = clusterer(umis, counts, threshold)
    
    # clusters is now a list of lists, where each sub list is a cluster of 
    # umis we believe are PCR dupcliates. e.g. 
    # [["ATAT", "GTAT"], ["CCAT"]]
    

Use alevin if your data is of the right type, otherwise I do recommend the standard methods (i.e. not 1 or 2 above). With the exception of CellRanger, we find that actually processing the fastq file is the most time consuming part of the pipeline and that memory usage is dominated by the mapper, which is independent of the size of the input, so you will not gain much, in terms of time or memory, deduplicating before mapping anyway.

CoI: I am the author of UMI-tools and an author on the alevin paper.

ADD COMMENT
0
Entering edit mode

Hi i.sudbery,

I was looking for clarification on tally. Does it de-duplicate identical reads based on the sequence or the UMI?

ADD REPLY
0
Entering edit mode
5.9 years ago
Rituriya ▴ 50

Hi Ian,

I have been using UMI_tools to get the UMI counts for miRNA data. I require counts for differential expression of miRNA. I have done adapter trimming -> umi_tools extract -> mapping to genome/miRBase -> umi_tools dedup to remove PCR duplicates and get the UMI counts.

My confusion:

  1. Can I use this tool for miRNA sequenced reads? (Illumina machine + Library: QIASeq miRNA with 12 bp UMI, 75 bp Single end reads)
  2. I want to use the collapsed reads based on UMI for differential miRNA expression. As after running the dedup command, it will deduplicate the reads based on UMI which will be actually inflated than the usual number of reads for a miRNA mapped. Correct?

Because of this confusion, I feel that I could deduplicate reads based on UMI tags and then align reads to mature miRNA sequences. Any inputs from you would be really helpful as you are an expert on UMI technology.

Thank you, Rituriya.

ADD COMMENT
0
Entering edit mode

Please post this as a new thread/question and then come back here and delete.

ADD REPLY
0
Entering edit mode

But I have the exact same question: De-duplicate reads at FASTQ level. Shall I still post as new question?

ADD REPLY
0
Entering edit mode

You added this as an answer to the original question. If this is a follow-up question on @Ian's answer then you should add this a a comment to his answer above using ADD COMMENT there.

That said, you should be able to use umi_tools for miRNA (#1). A similar question was answered yesterday: Removing UMI with UMI tools?

As after running the dedup command, it will deduplicate the reads based on UMI which will be actually inflated than the usual number of reads for a miRNA mapped. Correct?

I am not sure what you mean by this. Can you not dedupe on UMI and then align?

ADD REPLY
0
Entering edit mode

Sorry. I always get confused with Add Reply/Add Comment.

I am not sure what you mean by this. Can you not dedupe on UMI and then align?

That is precisely what I want to perform but UMI_tools requires an aligned file as input for deduplication.

Thank you for your inputs, genomax.

ADD REPLY
0
Entering edit mode

Ah I see now. I thought umi_tools can dedupe based on just the UMI but looks like that is not the case.

I made @Ian aware of this thread. It is late in England but hopefully he will look at this thread tonight. Otherwise tomorrow.

ADD REPLY
0
Entering edit mode

Great, looking forward to hear from him. Thank you for your input, genomax.

ADD REPLY
0
Entering edit mode

You could think about extracting UMI's as a separate read. Dedupe them with clumpify.sh from BBMap and then recover a representative fastq read for further alignment?

ADD REPLY
0
Entering edit mode

I was just reading your answer in another post about clumpify.sh. I will surely give it a try and see how the downstream analysis performs. Thanks.

ADD REPLY
0
Entering edit mode

I'm not sure I understand what you mean by "UMI which will be actually inflated than the usual number of reads for a miRNA mapped"

After dedup, the number of reads will be lower than the number of reads if you had not dedupped.

I recommend just following the standard UMI-tools proceedure.

ADD REPLY
0
Entering edit mode

@Ian this is only speculation but I wonder if @Rituria has reads that have identical UMI but not the same start and end.

ADD REPLY
0
Entering edit mode

Then I would say they are unlikely to be PCR duplicates, unless start and end of fragments is determined after UMI addition, which would seem odd for an miRNA library.

If this is the case, could deduplicate on the basis of which miRNA is mapped to rather than start co-ordinate.

ADD REPLY
0
Entering edit mode

Hi Ian,

I think I did not explain correctly. Sorry. Following is the command I used to deduplicate the mature miRNA aligned BAM file:

umi_tools dedup --method=unique -I mature-miRNA-aligned.bam --output-stats=mature-miRNA_deduplicated-stats.txt -S mature-miRNA-dedup.bam -L dedup.log

I assume here that it takes which miRNA it has mapped and also its mapping position into consideration for deduplication. If that is wrong, which parameter am I missing here?

When I use the umi_tools count function and get the UMI counts from the outputted BAM from above step, is it biologically right to use that as the measure to find Differential Expression of miRNA? My supervisor suggests still to use the read counts for this step which I obtain using command:

samtools idxstats mature-miRNA-aligned.bam > mature-miRNA-readcounts.tsv

For eg. hsa-miR-16-5p = 1254894 (from BAM file BEFORE dedup step), while hsa-miR-16-5p = 703 (from BAM file AFTER dedup step)

There is a huge difference in these two results and hence the confusion as to which one to use.

Thank you, Rituriya.

ADD REPLY
0
Entering edit mode

You do not want to run dedup AND count, just one or the other. count effectively runs dedup and then calculates the read counts after deduplicating. While this might seem trivial, it was originally designed to function in scRNAseq applications where a single BAM might contain data from 1000s of cells. So at the moment you are deduplicating twice. Also, currently count enforces a "per-gene" deduplication policy (i.e. don't use position), which you probably don't want.

Secondly, why are you using --method=unique? This will not account for PCR and sequencing errors in the UMI, which is the whole point of UMI-tools. We mostly included --method=unique for benchmarking purposes.

Finally, by default dedup uses contig and position for deduplication. In some miRNA library preps it makes sense to use miRNA sequence length as well. This can be activated using --read-length.

The extact pipeline you should use depends on what you have aligned to. Given that you are counting using samtools idxstats mature-miRNA-aligned.bam > mature-miRNA-readcounts.tsv, I'm assuming you have aligned to the transcriptome rather than the genome. In which case, my suggestion is:

$ umi_tools dedup --read-length -I mature-miRNA-aligned.bam --output-stats=mature-miRNA_deduplicated-stats.txt -S mature-miRNA-dedup.bam -L dedup.log
$ samtools index mature-miRNA-dedup.bam
$ samtools idxstats mature-miRNA-dedup.bam > mature-miRNA-readcounts.tsv

and use these counts as inputs to your differential expression analysis.

These counts will be much lower than the non-deduplicated counts. This is the point, as dedup removes PCR duplicates that are not genuinely independent reads and so shouldn't be counted.

ADD REPLY
0
Entering edit mode

I ran the steps you mentioned above and ouptut from the samtools idxstats command, the miRNA named hsa-miR-16-5p=31 (which is very low when compared to reads). Out of ~2500 mature miRNAs in miRBase, only 32 miRNAs got a count > 10, which is quite drastic when compared to number of miRNA aligned based on read counts. Anyway, I will have a look at how they behave downstream.

Thank you for your useful guidance with UMI. I am new to it. -Rituriya.

ADD REPLY
0
Entering edit mode

While we should debug this separately what was the alignment % and what aligner did you use?

ADD REPLY
0
Entering edit mode

Adapter trimming -> UMI extraction using UMI_tools -> Align to PhiX -> take unmapped reads -> align to genome (99.5% alignment perc)-> take aligned reads and map to RFAM db and retain only miRNA related reads -> align with miRBase (13%) -> umi_tools dedup

I have used Bowtie1 everywhere but specifically for miRNA alignment I used:

Bowtie1 -m 3 --best --strata -v 0

as the read lengths were below 50 bp after trimming.

Raw reads = 21.6 million Reads aligned to miRBase = 2.8 million (13%)

I felt this to be the best approach. Would love to hear from your end.

ADD REPLY
0
Entering edit mode

That alignment % looks pretty low to me. Does the QIAseq kit have a specific adapter that it uses? Did you trim it off?

Let us wait to see what @Ian has to say on stats below.

ADD REPLY
0
Entering edit mode

So here 13% means 13% of total raw reads got aligned. If you measure from total reads of previous step (take aligned reads and map to RFAM db and retain only miRNA related reads), then it is 47.6% aligned to mature miRNA, which is not bad I feel. Right?

ADD REPLY
0
Entering edit mode

Did you check for presence of these two adapters in your data? Those would only be the correct reads if I am reading the manual right.

5'- adapter
3'- adapter.

ADD REPLY
0
Entering edit mode

Absolutely. That was the first step as I mentioned earlier. Also I kept only those reads which had the adapters in them, which is approx 47% of reads.

ADD REPLY
0
Entering edit mode

What is your UMI length, how many reads did you sequence and what is the content of your mature-miRNA_deduplicated-stats.txt

ADD REPLY
0
Entering edit mode

UMI length=12 bp, Raw reads around 20 million reads per sample and for the stats file, there are 3 files generated. Below is the edit_distance.tsv file:

directional     directional_null        edit_distance   unique  unique_null
3193    3193    Single_UMI      2173    2173
0       75      0       0       138
13      1       1       464     3
126     10      2       629     26
61      0       3       161     5
31      4       4       46      23
25      7       5       34      39
27      26      6       33      136
36      37      7       38      192
39      87      8       30      339
45      108     9       18      331
36      45      10      18      124
22      52      11      13      105
3       12      12      0       23
0       0       13      0       0

I am trying to understand if it is good or bad. Any insights?

ADD REPLY
0
Entering edit mode

Each row shous the number of locations with a given average edit distance between UMIs. For example in the fourth row its the number locations where the average edit distance between UMIs at that location is 1.

In your post to @genomax, you say that you map to the genome, then take those reads and map to RFAM and then map those to miRBase. This seems pretty conservative to me. You are basically forcing your miRNAs to have to precise mature sequence that is in miRBase. Is this the proceedure recommended by QIAGEN? It feels like things would be better if a few more of those 20M reads could be assigned to miRNAs - as they probably are offset from the major species from the locus by perhaps a base or two, these would deduplicate to seperate reads.

You might also try putting the deduplicate step between the genome and RFAM alignments, as reads mapping to different copies of the miRNA genes would then be deduplicated seperately.

ADD REPLY
0
Entering edit mode

I wanted to keep it conservative as we have already processed this data by a commercial software and I am trying to reproduce the same results. Now the problem is they have collapsed reads before alignment and I do not see anyone else doing that. There is a huge difference in the counts for each miRNA which is want I want to understand.

The miRNA expression pattern is almost similar but the relative abundances are different and this happens only when I use --method=unique. When I use --method=directional (default), the results are not even comparable as the counts drop drastically and the expression pattern changes too. I think I will have to figure this on my own. Thank you so much for your and genomax guidance. Sorry for making this thread so long with posts.

-Rituriya.

ADD REPLY
0
Entering edit mode

we have already processed this data by a commercial software and I am trying to reproduce the same results.

That can be a futile endeavor unless you know precisely what the commercial software did (aligner used, settings, version of miRBase etc).

If you want others to be able to reproduce your results then you will have to use open source/accessible software. If you just want to publish, then either analysis will work.

ADD REPLY
0
Entering edit mode

I have almost full clarity on these parameters (aligner used, settings, version of miRBase etc) just not the UMI based collapsing of reads which they do before alignment. But yes, I agree to your thoughts.

ADD REPLY
0
Entering edit mode

The commericial solution will almost certainly be using an UMI collapse algorithmn related to unique, as far as I am aware only UMI-tools and DropEst implement error correcting UMI deduplication. The commerical solution probably does something similar to tally mentioned in the above answer. This is almost certainly an under-collapsing.

Good luck.

ADD REPLY
0
Entering edit mode

Interesting. Thank you Ian.

ADD REPLY

Login before adding your answer.

Traffic: 2140 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6