availability: https://github.com/brwnj/umitools
umitools facilitates the processing of data that has incorporated a unique molecular identifier (UMI). It assumes the UMI is incorporated as part of the read.
Using the IUPAC sequence design of the UMI, strip the sequence from the 5' end of the fastq:
umitools trim --end 5 unprocessed_fastq.gz NNNNNV > out.fq
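To make the trimming step concrete, here is a minimal Python sketch of the idea, not the actual umitools source. It assumes that `N` matches only called bases (A, C, G, T), that reads whose prefix does not fit the design are discarded, and that the UMI is stored in the read name as `:UMI_<sequence>`. The function names and the naming convention are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
# Assumption: N matches only called bases, so an 'N' in the read is rejected.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def umi_regex(design):
    """Compile an IUPAC UMI design such as 'NNNNNV' into a regex."""
    return re.compile("".join(IUPAC[code] for code in design.upper()))


def trim_5prime(name, seq, qual, design):
    """Move a 5' UMI from the sequence into the read name.

    Returns (name, seq, qual), or None when the read is too short or its
    prefix does not match the design.
    """
    n = len(design)
    umi = seq[:n]
    if len(seq) <= n or not umi_regex(design).fullmatch(umi):
        return None
    # Hypothetical convention: tag the first token of the name, keep comments.
    head, sep, rest = name.partition(" ")
    return f"{head}:UMI_{umi}{sep}{rest}", seq[n:], qual[n:]
```

For example, with the design `NNNNNV` a read starting with `ACGTAG` is trimmed and tagged, while one starting with `ACGTAT` is dropped because `T` is not allowed by `V`.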
The UMI sequence of each read is appended to the read name and used again after the reads are mapped. Duplicate UMIs at any given start site need to be removed:
umitools rmdup unprocessed.bam out.bam > before_after.bed
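The deduplication logic can be sketched roughly as follows. This works on plain tuples rather than a BAM file and is only an illustration: the `Aln` record, the `:UMI_` name suffix, and the choice to key on chromosome, start, strand, and UMI are assumptions, not a description of the real implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a mapped read; a real tool would read these from BAM.
Aln = namedtuple("Aln", "name chrom start strand")


def umi_of(name):
    """Extract the UMI from a read name ending in ':UMI_<sequence>' (assumed convention)."""
    return name.rsplit("UMI_", 1)[1]


def rmdup(alignments):
    """Keep the first alignment seen for each (chrom, start, strand, UMI) combination."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.strand, umi_of(aln.name))
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept
```

Two reads at the same start site with the same UMI collapse to one, while reads with different UMIs at that site are all kept.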
EDIT:
I've updated this to account for mismatches among a given UMI sequence set at a start site. This allows the user to essentially merge very similar UMIs into fewer representative sequences.
umitools rmdup --mismatches 1 unprocessed.bam out.bam > before_after.bed
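One way to picture the merging of similar UMIs at a start site is a greedy pass over the UMIs, most abundant first, where each UMI is assigned to the first representative within the allowed Hamming distance. This is a sketch under those assumptions (equal-length UMIs, greedy assignment, ties broken alphabetically); the tool itself may resolve clusters differently.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def merge_umis(umis, mismatches=1):
    """Map each UMI observed at one start site to a representative UMI.

    UMIs are visited from most to least frequent; a UMI joins the first
    representative within `mismatches`, otherwise it becomes a new one.
    """
    counts = Counter(umis)
    representatives = []
    assignment = {}
    for umi, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in representatives:
            if hamming(umi, rep) <= mismatches:
                assignment[umi] = rep
                break
        else:
            representatives.append(umi)
            assignment[umi] = umi
    return assignment
```

With `mismatches=1`, a start site carrying `AAAAAC` twice, `AAAAAG` once, and `TTTTTC` once ends up with two representative UMIs, `AAAAAC` and `TTTTTC`.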
Does umitools support paired-end data? (PE is popular in NGS analysis.)
PE is popular? What are you trying to do? What's your UMI incorporation design?
Hello, in my PE reads, both 1.fq and 2.fq have UMIs. To take advantage of the UMIs, I should take both of them into consideration.
So, can umitools solve my problem?
I ran into an unexpected problem with this tool: the two reads of a pair end up with different names, which causes BWA-MEM to quit. What aligner do you use downstream of umitools that does not require paired reads to have the same name?
I could make this work on PE reads, but it's unclear how I would be counting the UMIs at a given start. Would you want to remove R1s independently of R2s?
If you were interested in sharing data with me I think we can get it worked out. If you've already solved it and made the code available somewhere, I'd love to check it out!
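As a starting point for discussion, here is one possible way to handle pairs where both mates carry a UMI: concatenate the two UMIs and write the same tagged name to both reads, so the mates still match for aligners such as BWA-MEM. This is only a sketch and not something umitools currently does; `tag_pair`, the fixed UMI length, and the `:UMI_` suffix are all hypothetical.

```python
def tag_pair(name1, seq1, name2, seq2, umi_len):
    """Strip a fixed-length 5' UMI from each mate and give both the same name.

    The shared name carries the R1 UMI followed by the R2 UMI, so duplicates
    can later be identified using both UMIs at once.
    """
    def base(name):
        # Drop any comment after whitespace and a trailing /1 or /2 mate suffix.
        token = name.split()[0]
        return token[:-2] if token.endswith(("/1", "/2")) else token

    if base(name1) != base(name2):
        raise ValueError("reads are not mates: %s vs %s" % (name1, name2))
    umi = seq1[:umi_len] + seq2[:umi_len]
    shared = f"{base(name1)}:UMI_{umi}"
    return (shared, seq1[umi_len:]), (shared, seq2[umi_len:])
```

Duplicate removal would then key on the combined UMI together with the start positions of both mates, though whether that is the right definition of a duplicate depends on the library design.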