I used xGen UDI-UMI adapters from IDT to prepare my RNA-seq libraries and ran a single-end RNA-seq experiment. I received two FASTQ files for each sample: R1 from my 100 bp SE run and R2 with the 9 bp UMI sequence split off. I normally analyze my RNA-seq experiments by aligning to the transcriptome with STAR and calculating expression with RSEM. I would like to incorporate UMI-based deduplication into this workflow.
I've tried a few methods.
- I ignored R2 and used umi_tools extract with
--bc-pattern NNNNNNNNN
as instructed on the website, followed by STAR alignment and umi_tools dedup. In this case I obtained deduplicated files, but the file size was reduced to 1/20 of the original.
- I converted my R1 FASTQ file into an unmapped BAM using Picard FastqToSam, then incorporated the UMIs from the FASTQ using fgbio AnnotateBamWithUmis. I converted the uBAM with UMIs marked with RX back to FASTQ, and at this point I was able to see all my UMIs tagged with RX in the BAM file. Then I proceeded with STAR alignment to the transcriptome. After alignment, I ran umi_tools dedup. However, I then get a warning message that at least one read is missing the UMI and/or cell tag, and I'm left with a much smaller file than the original BAM.
Does anyone have experience with this situation? I guess I could also try Picard MarkDuplicates with the REMOVE_DUPLICATES=TRUE option instead of umi_tools dedup, but I'm concerned that I'm losing a big chunk of the file. I would like to stick to my already established STAR-RSEM pipeline as much as possible. I would appreciate any help! Thank you very much in advance!
For specialized kits like this, you should follow IDT's recommendations for analyzing the data (Appendix G here). You may be doing this already, but I wanted to check.
Interestingly, that manual doesn't mention UMIs...
I think they are extended adapters that can be used with the xGen RNA kit: https://www.idtdna.com/pages/support/faqs/how-do-i-sequence-the-umi-in-the-xgen-udi-umi-adapters
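On the first attempt in the question: when umi_tools extract is given only R1 with a 9-N pattern, it takes the first 9 bases of R1 as the UMI, so the real UMIs in R2 are never used. By default, umi_tools dedup looks for the UMI at the end of the read name, after the last underscore. Below is a minimal sketch of one way to move the R2 UMIs into the R1 read names before STAR. The function names (`read_fastq`, `tag_reads_with_umi`, `write_tagged_fastq`) are made up for illustration. It assumes R1 and R2 list the same reads in the same order and that the read name is everything before the first space. I have not tested it on xGen data.

```python
import gzip
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

# One FASTQ record: header, sequence, separator line, qualities.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip stray blank lines between records
        try:
            seq, plus, qual = next(it), next(it), next(it)
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        yield header, seq, plus, qual


def tag_reads_with_umi(r1_lines: Iterable[str],
                       r2_lines: Iterable[str],
                       sep: str = "_") -> Iterator[Record]:
    """Append the UMI sequence from R2 to the matching R1 read name."""
    pairs = zip_longest(read_fastq(r1_lines), read_fastq(r2_lines))
    for rec1, rec2 in pairs:
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 have different numbers of reads")
        h1, seq, plus, qual = rec1
        h2, umi = rec2[0], rec2[1]
        # Assumption: the read name is the part before the first space.
        name1, _, comment = h1.partition(" ")
        name2 = h2.split(" ", 1)[0]
        if name1 != name2:
            raise ValueError(f"read name mismatch: {name1} vs {name2}")
        header = f"{name1}{sep}{umi}"
        if comment:
            header += f" {comment}"
        yield header, seq, plus, qual


def write_tagged_fastq(r1_path: str, r2_path: str, out_path: str) -> None:
    """Read gzipped R1/R2 FASTQ files and write UMI-tagged R1 reads."""
    with gzip.open(r1_path, "rt") as r1, \
         gzip.open(r2_path, "rt") as r2, \
         gzip.open(out_path, "wt") as out:
        for rec in tag_reads_with_umi(r1, r2):
            out.write("\n".join(rec) + "\n")
```

Since STAR keeps only the part of the read name before the first space, the aligned reads should end up with names ending in `_<UMI>`, which is the layout umi_tools dedup expects by default. It would be worth checking a few read names in the aligned BAM before running dedup.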