Question

UMI-tools no tsv files at deduplication

0

Entering edit mode

10 months ago

Sarah • 0

Hi all,

I'm very new to bioinformatics and am currently using UMI-tools to look at my recently sequencing libraries. I have a few steps I'm struggling with:

1st question- Using the UMI-tools guide, I extracted the UMIs from my R1 file, mapped, indexed and sorted my file. Now, when I deduplicate, I only get a resulting bam file and no tsv files. When I did this using the UMI-tools example files, I got a bam file + 3 tsv files. How do I go about generating this? The command line I am using is:

$ umi_tools dedup -I Sorted.bam --output-stats=deduplicated -S deduplicated.bam

2nd question- I have seven libraries in total. I have a R1 and R2 fastq file generated from each library so 14 fastq files in total. Can I process all 14 at once, or at least the R1 and R2 of each library together? I used the command from the UMI-tools guide for paired reads but was unsuccessful. Their website provides the command below:

$ umi_tools extract -I pair.1.fastq.gz --bc-pattern=NNNXXXXNN \ 
  --read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz \
  --read2-out=processed.2.fastq.gz

Here is how I used it:

$ umi_tools extract -I FilenameR1.fastq.gz --bc-pattern=NNNXXXXNN \ 
  --read2-in=FilenameR2.fastq.gz --stdout=processed.FilenameR1.fastq.gz \
  --read2-out=processed.FilenameR2.fastq.gz

Have I made a mistake here?

Thanks in advance!!

deduplication NGS UMI DNA-seq UMI-tools • 1.1k views

ADD COMMENT • link updated 10 months ago by Ram 44k • written 10 months ago by Sarah • 0

0

Entering edit mode

When you say it was unsucessful, what do you mean? Was there an error message?

You are also going to want to use a --bc-pattern that matches your read geometry. And unless you are analysing an iCLIP experiment from circa 2012, I doubt its NNNXXXXNN

ADD REPLY • link 10 months ago by i.sudbery 20k

0

Entering edit mode

Also, for the dedup, check that the job is finishing. The logging output should end with # job finished. It is doesn't, that suggests that umi_tools isn't finishing before it produces the stats files. Could be due to being killed by an OOM killer, or by a cluster resource manager.

ADD REPLY • link 10 months ago by i.sudbery 20k

0

Entering edit mode

In regards to the dedup step for my unpaired file- yes, it appears that umi-tools isn't finsihing as I get the following error at the end of a message with mostly the file parameters:

ValueError: file does not have a valid header (mode='rb') - is it BAM/CRAM format?

My input file is in bam format and is not empty. I'm not sure why it wouldn't have a header if I followed the previous commands from the guideline correctly.

ADD REPLY • link 10 months ago by Sarah • 0

0

Entering edit mode

Hmmm... that is starange. Have you run samtools quickcheck on the BAM file? And if that passes, how about samtools view -H this should output the header if there is one.

ADD REPLY • link 10 months ago by i.sudbery 20k

0

Entering edit mode

Sorry, by unsuccessful, I meant that the first fastq file text appeared in my terminal as a response. The start of the message was:

    # UMI-tools version: 1.1.4
    # output generated by extract --stdin=NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --bc-pattern=NNNNNNNNNN

And the end was:

## 2024-01-17 16:00:14,317 INFO Reads output: 629590
# job finished in 11 seconds at Wed Jan 17 16:00:14 2024 --  7.48  1.97  0.00  0.00 -- ee347034-0ee5-4cff-b8a6-cefb72b040a8
zsh: command not found: --read2-in=

Also, yes, sorry, I hadn't put the correct pattern into the command above as I was using the example command. What I used was:

umi_tools extract -I NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --bc-pattern=NNNNNNNNNN \ 
  --read2-in= NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.fastq.gz --stdout= Processed_NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz \
  --read2-out=Processed_NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.2.fastq.gz

Today, I tried the following command which only generated a log file and a fastq.gz file for the R1 file:

umi_tools extract --stdin=NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --read2-in=NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.fastq.gz --bc-pattern= NNNNNNNNNN -L processed.log --stdout=processed_NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --read2-out=Processed_NPM1A-OCI-100PCT-CP01_S22_L001_R2_001.2.fastq.gz

I was expecting two output files but that might be incorrect as when I get to paired de-duplication, the command only includes one input file. Sorry if I've confused you!

ADD REPLY • link updated 10 months ago by i.sudbery 20k • written 10 months ago by Sarah • 0

0

Entering edit mode

Sarah why did you delete this post?

ADD REPLY • link 10 months ago by Ram 44k

score 0 · Answer 1 · 2024-01-17

Your error with extract is being caused by the whole command not being passed to UMI-tools.

You can tell this in two ways:

The first is the second line of the log file, which, according to your comment, reads:

# output generated by extract --stdin=NPM1A-OCI-100PCT-CP01_S22_L001_R1_001.fastq.gz --bc-pattern=NNNNNNNNNN

Not that the command stops after the bc-pattern option.

The second is the zsh error at the end of the output:

zsh: command not found: --read2-in=

This suggests that the second line of your statenent is being treated as a seperate command.

This usually happens if you have used a multi-line statement with line continuations (as the quick-start guide uses), but spaces after the \. That is, you have "\ " at the end of the line, rather than "\".