fastp (https://github.com/OpenGene/fastp) is an open-source tool that provides fast, all-in-one preprocessing for FASTQ files. It is written in C++ and supports multithreading for high performance. Unique molecular identifier (UMI) preprocessing is one of its features.
UMIs are useful for removing duplicates and correcting errors by building a consensus from reads that originate from the same DNA fragment. They are typically used in deep sequencing applications such as ctDNA sequencing.
On Illumina platforms, UMIs are commonly placed in one of two locations: the index or the head of the read. To enable UMI processing, pass the -U or --umi option on the command line, and use --umi_loc to specify the UMI location, which can be one of:
index1: the first index is used as UMI. If the data is PE, this UMI will be used for both read1/read2.
index2: the second index is used as UMI. PE data only, this UMI will be used for both read1/read2.
read1: the head of read1 is used as UMI. If the data is PE, this UMI will be used for both read1/read2.
read2: the head of read2 is used as UMI. PE data only, this UMI will be used for both read1/read2.
per_index: read1 will use UMI extracted from index1, read2 will use UMI extracted from index2.
per_read: read1 will use UMI extracted from the head of read1, read2 will use UMI extracted from the head of read2.
If --umi_loc is specified as read1, read2 or per_read, the length of the UMI must be specified with --umi_len.
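The option rules above can be summarized in a small checker. This is only a Python sketch of the rules as stated in this post, not fastp's own argument parsing; the function check_umi_options and its parameters are hypothetical names made up for illustration.

```python
# Hypothetical helper: encodes the --umi_loc rules described in this post.
# It is not part of fastp and does not call fastp.

UMI_LOCATIONS = {"index1", "index2", "read1", "read2", "per_index", "per_read"}
NEEDS_UMI_LEN = {"read1", "read2", "per_read"}  # UMI sits at the head of a read
PE_ONLY = {"index2", "read2"}                   # marked "PE data only" above


def check_umi_options(umi_loc, umi_len=0, paired_end=False):
    """Return a list of problems with the given UMI options (empty if none)."""
    problems = []
    if umi_loc not in UMI_LOCATIONS:
        problems.append(f"unknown --umi_loc value: {umi_loc}")
        return problems
    if umi_loc in NEEDS_UMI_LEN and umi_len <= 0:
        problems.append(f"--umi_len must be given when --umi_loc={umi_loc}")
    if umi_loc in PE_ONLY and not paired_end:
        problems.append(f"--umi_loc={umi_loc} requires paired-end data")
    return problems


print(check_umi_options("read1", umi_len=8))  # no problems expected
print(check_umi_options("per_read"))          # missing --umi_len
```

A check like this could run before launching fastp in a pipeline, so that a missing --umi_len is caught early.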
fastp extracts the UMIs and appends them to the first part of the read names, so the UMIs also appear in SAM/BAM records. If the UMI is in the read, it is trimmed from the read, so the read becomes shorter. If the UMI is in the index, it is kept.
A prefix can be specified with --umi_prefix. If a prefix is specified, an underscore is used to join it to the UMI. For example, with UMI=AATTCCGG and prefix=UMI, the final string in the read name will be UMI_AATTCCGG.
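The naming rule can be illustrated with a few lines of Python. This is a sketch of the behavior described here, not fastp's code: umi_label and tag_read_name are hypothetical names, and the colon separator is taken from the example output shown in this post.

```python
def umi_label(umi, prefix=""):
    """Join an optional prefix and the UMI with an underscore."""
    return f"{prefix}_{umi}" if prefix else umi


def tag_read_name(name, label):
    """Append ':<label>' to the first space-delimited part of a read name."""
    first, sep, rest = name.partition(" ")
    return f"{first}:{label}{sep}{rest}"


label = umi_label("AATTCCGG", prefix="UMI")
print(label)
print(tag_read_name(
    "@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA",
    label,
))
```

Only the part of the name before the first space is changed, so the comment field that follows it stays as it was.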
UMI example
original read:
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
read processed with command: fastp -i testdata/R1.fq -o testdata/out.R1.fq -U --umi_loc=read1 --umi_len=8
@NS500713:64:HFKJJBGXY:1:11101:1675:1101:AAAAAAAA 1:N:0:TATAGCCT+GACCCCCA
GCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
EEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
You can see that AAAAAAAA has been trimmed from the read (along with the matching quality values), and the UMI label :AAAAAAAA has been added to the read name.
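To make the transformation concrete, the example above can be reproduced on a single record in Python. This is a minimal sketch of the --umi_loc=read1 behavior as shown in this post, not fastp's actual implementation; move_umi_to_name is a hypothetical name, and the record is handled as a plain four-field tuple.

```python
def move_umi_to_name(record, umi_len):
    """Move the first umi_len bases of a read into its name.

    record is a (name, sequence, plus, quality) tuple for one FASTQ read.
    """
    name, seq, plus, qual = record
    if umi_len <= 0 or umi_len > len(seq):
        raise ValueError("umi_len must be between 1 and the read length")
    umi = seq[:umi_len]
    # Tag only the first space-delimited part of the name.
    first, sep, rest = name.partition(" ")
    new_name = f"{first}:{umi}{sep}{rest}"
    # Trim sequence and quality together so they stay the same length.
    return (new_name, seq[umi_len:], plus, qual[umi_len:])


original = (
    "@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA",
    "AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA",
    "+",
    "6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA",
)
print("\n".join(move_umi_to_name(original, 8)))
```

The sketch covers a single read only; for real data, fastp itself should be used, since it also handles paired reads, index-based UMIs and compressed files.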
Nice!
Did you add a cross-reference to this tutorial in the fastp tool thread?
Thanks for this advice, I will update the fastp tool thread.