Removing UMI with UMI tools?
2
3
Entering edit mode
6.0 years ago
simplyphage ▴ 30

I have miRNA sequences where the 3' adapter is flanked with 12N(random) UMI sequence+RT primer+Universal adapter.

I have tried using regex function in UMI tools but I am not sure if it will work because of the sequence pattern I have.

Please let me know if I have any chance of processing it with UMI tools.

Bold - 3' Adapter  ||   Italics - RT primer   ||  In between both of them -UMI Sequence ||  After RT primer UNiversal adapter.
NGGGGTGTGACAATGGTGTTT**AACTGTAGGCACCATCAAT**ATACCCTATGCT*AGATCGGAAGAGCACACGTCT*GAA
NTCTGACATGTGTGCG**AACTGTAGGCACCATCAAT**ACGACGACCCTA*AGATCGGAAGAGCACACGTCT*GAACTCCA
RNA-Seq sequencing next-gen assembly • 8.3k views
ADD COMMENT
0
Entering edit mode

What does a selection of your reads look like? Can you post 3 or 4?

ADD REPLY
0
Entering edit mode

Sure:

@SN526:340:HTLWLBCXX:1:1103:1846:1930 1:N:0:ATCACG
NCTGAAAGGTGCTCTTGCCAATCTTCAATCTAGCCAACTGTAGGCACCATCAATTGAGACCCTCTTAGATCGGAAG
+
#<DDDIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIHHIIIIIIIIIIG
@SN526:340:HTLWLBCXX:1:1103:2401:1817 1:N:0:ATCACG
NAGAAAGATTGAACTGTAGGCACCATCAATTAAGCCGTCGTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
+
#<DDDIIIIIIIIHIIIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGF
@SN526:340:HTLWLBCXX:1:1103:2267:1834 1:N:0:ATCACG
NGCCGGTTAGCTCAGAACTGTAGGCACCATCAATAACCCGCTTACCAGATCGGAAGAGCACACGTCTGAACTCCAG
+
#<<@DHHHHHHIHIIIIIIIIIIIIIIIGIHIIHIIIHIIIGIHIIIIIIIHHHHDHEHECGHIIGHFIHHIIHHG
@SN526:340:HTLWLBCXX:1:1103:2513:1946 1:N:0:ATCACG
NCCGCGGGAACTGTAGGCACCATCAATATACTGACTACTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATC
+
#<<<@CHC?CEH<<<<@FHFCGEGHHH?<FHFEHIIIIFHHIHIE=CGCEHHHHHHDFEHCHHCCHHIEHIHHHE@
@SN526:340:HTLWLBCXX:1:1103:2705:1971 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATAGTGATGAACAGAGATCGGAAGAGCACACGTCTGAA
+
#<DDDIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIHHIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:2879:1843 1:N:0:ATCACG
NCGCTGTGATGATGAACTGTAGGCACCATCAATCATGCCGTGTTAAGATCGGAAGAGCACACGTCTGAACTCCAG
+
#<<DDIIIIIIIIIIGHHHIIIGIIIIIIIIIIIIIIIIIHIIIIIIIHIIIIIIHHIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3227:1874 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATTAGTTACAAATGAGATCGGAAGAGCACACGTCTGAA
+
#<DDDIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ADD REPLY
0
Entering edit mode

Tagging: i.sudbery

ADD REPLY
0
Entering edit mode

Thank you @genomax and @i.sudbery for your inputs.

ADD REPLY
7
Entering edit mode
6.0 years ago

I'm a little confused as to your read structure, but what I think your saying is:

  1. The read starts with a variable sequence that represents the miRNA sequence.
  2. This is followed by an 3' adaptor of sequence AACTGTAGGCACCATCAAT
  3. There is then a 12 base UMI
  4. Then an RT primer, of sequence AGATCGGAAGAGCACACGTCT
  5. Finally the universal adaptor of unknown length.

In the below I'm assuming that you wish to keep 1; discard 2, 4, 5 and extract 3 as the UMI. Let me know if thats not the case.

Its definitely possible with umi-tools. The UMI-tools regex read specification make heavy use of "named capture groups". In a regular expression, a "capture group" is a part of the pattern you want to capture and use again later. To define a part of a patter as a group, you enclose these parts of the pattern parentheses, e.g. ATA(.+)GCG caputres any sequence between ATA and GCG. Capture groups are usually reffered to by number, but it is possible to give then names. Thus ATA(?P<captured_group>.+)GCG will capture the same group as before, but assign it the name "captured_group".

In UMI-tools, you basically build a regex with named capture groups. Groups named discard_n are discarded and groups named cell_n or umi_n are added to the read name. Sequence not captured by a group is left on the read.

  • Because your sequence of interset is on the 5' end of the read, your regex wants to start with any number of bases that want to be retained on the read .+
  • Next you want to find and discard the 3' adaptor: (?P<discard_1>AACTGTAGGCACCATCAAT)
  • You then want to capture and keep the next 12 characters as the UMI (?P<umi_1>.{12})
  • Finally you want to match as discard the RT primer, and anythin that comes after it (?P<discard_2>AGATCGGAAGAGCACACGTCT.+) (the .+ basically means any number of any characters)

As the start of the read is not matched by the regex, it is left alone.

Thus, your regex will want to look something like:

.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)

This will only keep reads that have perfect matches to the RT primer and adaptor sequence. If you wish to allow some mismatches, you can do that by using the {s<=n} syntax. For example to allow one mismatch:

.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=1}(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+){s<=1}

Assuming that your data is single ended, the final umi_tools command without allowing mismatches might look like:

umi_tools extract --stdin=unprocessed_reads.fa.gz --stdout=processed_reads.fa.gz --extract-method=regex --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT)(?P<umi_1>.{12})(?P<discard_2>AGATCGGAAGAGCACACGTCT.+)'
ADD COMMENT
0
Entering edit mode

No, No. 2 is supposed to be the 3 primer Adapter. and No. 4 is supposed to be the RT primer RT Primer is followed by the Universal adapter, which varies in length.

If I remove 2 and 4 and extract UMI, I am still left with the Universal adapter, which varies in length.

ADD REPLY
0
Entering edit mode

You need to remove universal adapter using conventional scanning/trimming. Best option for you is to look for reads that contain RT primer and only process those further.

ADD REPLY
0
Entering edit mode

Yeah, that seems to be the only viable option.

ADD REPLY
0
Entering edit mode

Answer updated to reflect this.

ADD REPLY
0
Entering edit mode

I gave it a try, but the no. of regex matches i got is incredibly low !

2019-01-15 18:59:01,497 INFO Input Reads: 5442010 2019-01-15 18:59:01,497 INFO regex does not match read1: 5438362 2019-01-15 18:59:01,497 INFO regex matches read1: 3648 2019-01-15 18:59:01,497 INFO Reads output: 364

ADD REPLY
1
Entering edit mode

My bad, the search for the barcode anchors at the 5' end of the read, so you need to specify that you want to accept any number of any base at the start of the read before the 3' adaptor.

ADD REPLY
0
Entering edit mode

Final Command that worked for us:

umi_tools extract --stdin=BU243_S1_L001_R1_001.fastq.gz --log=withonesubs2.log --stdout=processed_reads_3.fa.gz --extract-method=regex --bc-pattern='.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.+)'
ADD REPLY
3
Entering edit mode
6.0 years ago
GenoMax 148k

You could trim the RT-primer and all sequence to the end of the read, using bbduk.sh literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 option. That would ensure that your UMI's will now be at the 3-end most end. Then you should be able to use the --3prime option from umi_tools.

Final solution based on thread below:

bbduk.sh in=your_file.fq.gz out=stdout.fq literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 | umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNNNN --3-prime | bbduk.sh in=stdin.fq out=final.fq literal=AACTGTAGGCACCATCAAT ktrim=r k=15 minlen=0
ADD COMMENT
0
Entering edit mode

Yeah, that seems rational. I am relatively new to the sequencing world. So, Thank you so much. I will give it a try. Just a quick question. Can i also do the same thing with trim galore as well?

ADD REPLY
0
Entering edit mode

See if this does what you need. I can move to an answer once you test:

bbduk.sh in=your_file.fq.gz out=stdout.fq literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 | umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNNNN --3-prime

results in

##  INFO Starting barcode extraction
@SN526:340:HTLWLBCXX:1:1103:1846:1930_TTAGATCGGAAG 1:N:0:ATCACG
NCTGAAAGGTGCTCTTGCCAATCTTCAATCTAGCCAACTGTAGGCACCATCAATTGAGACCCTC
+
!<DDDIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIH
@SN526:340:HTLWLBCXX:1:1103:2401:1817_TAAGCCGTCGTG 1:N:0:ATCACG
NAGAAAGATTGAACTGTAGGCACCATCAAT
+
!<DDDIIIIIIIIHIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:2267:1834_AACCCGCTTACC 1:N:0:ATCACG
NGCCGGTTAGCTCAGAACTGTAGGCACCATCAAT
+
!<<@DHHHHHHIHIIIIIIIIIIIIIIIGIHIIH
@SN526:340:HTLWLBCXX:1:1103:2513:1946_ATACTGACTACT 1:N:0:ATCACG
NCCGCGGGAACTGTAGGCACCATCAAT
+
!<<<@CHC?CEH<<<<@FHFCGEGHHH
@SN526:340:HTLWLBCXX:1:1103:2705:1971_AGTGATGAACAG 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAAT
+
!<DDDIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:2879:1843_CATGCCGTGTTA 1:N:0:ATCACG
NCGCTGTGATGATGAACTGTAGGCACCATCAAT
+
!<<DDIIIIIIIIIIGHHHIIIGIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3227:1874_TAGTTACAAATG 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAAT
+
!<DDDIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

This does not remove the reads that don't have the right adapter (for example read 1 in your test data above). You may need to do two step bbduk.sh to remove those.

ADD REPLY
0
Entering edit mode

Sorry, i did try to run it on my laptop but it turned off..due to lack of memory or something i believe. I will get back to you once i get access to the cluster.

ADD REPLY
0
Entering edit mode

Coudn't get access to the cluster..but i tried the same thing with Cutadapt and Trim_trim galore.

But there's a problem now...I noticed that some of the sequences (Example:The second sequence) don't have the RT primer attached. How should i proceed with these type of sequences?

SN526:340:HTLWLBCXX:1:1103:1684:1854 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTGTAACTGTAGGCACCATCAATAACCTTCATTC
+
#<DDDIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:1846:1930 1:N:0:ATCACG
NCTGAAAGGTGCTCTTGCCAATCTTCAATCTAGCCAACTGTAGGCACCATCAATTGAGACCCTCTT
+
#<DDDIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIHHI
@SN526:340:HTLWLBCXX:1:1103:2401:1817 1:N:0:ATCACG
NAGAAAGATTGAACTGTAGGCACCATCAATTAAGCCGTCGTG
+
#<DDDIIIIIIIIHIIIIIIIIIIIIIIIIIIIHIIIIIIIH
@SN526:340:HTLWLBCXX:1:1103:2267:1834 1:N:0:ATCACG
NGCCGGTTAGCTCAGAACTGTAGGCACCATCAATAACCCGCTTACC
+
#<<@DHHHHHHIHIIIIIIIIIIIIIIIGIHIIHIIIHIIIGIHII
@SN526:340:HTLWLBCXX:1:1103:2513:1946 1:N:0:ATCACG
NCCGCGGGAACTGTAGGCACCATCAATATACTGACTACT
+
#<<<@CHC?CEH<<<<@FHFCGEGHHH?<FHFEHIIIIF
@SN526:340:HTLWLBCXX:1:1103:2705:1971 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATAGTGATGAACAG
+
#<DDDIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIHHI
@SN526:340:HTLWLBCXX:1:1103:2879:1843 1:N:0:ATCACG
NCGCTGTGATGATGAACTGTAGGCACCATCAATCATGCCGTGTTA
+
#<<DDIIIIIIIIIIGHHHIIIGIIIIIIIIIIIIIIIIIHIIII
@SN526:340:HTLWLBCXX:1:1103:3227:1874 1:N:0:ATCACG
NGGAGTGTGACAATGGTGTTTAACTGTAGGCACCATCAATTAGTTACAAATG
+
#<DDDIIIHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3147:1888 1:N:0:ATCACG
NAAGGCTGTGCTGACCATCGATAACTGTAGGCACCATCAATGCAACATCAACG
+
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3141:1981 1:N:0:ATCACG
NTGAAGAAATAGAACTGTAGGCACCATCAATGACGACCGCCCT
+
#<<DDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SN526:340:HTLWLBCXX:1:1103:3713:1818 1:N:0:ATCACG
NCCGGGGAAACTGTAGGCACCATCAATCATACTAAAACT
ADD REPLY
0
Entering edit mode

I noticed that some of the sequences (Example:The second sequence) don't have the RT primer attached.

Thinking off the top of my head. You could set maxlength=N-1 in bbduk.sh options to weed out those reads that don't get trimmed at all (which presumably lack RT primer). Note: Where N is length of your sequencing.

ADD REPLY
0
Entering edit mode

Yeah, we will discard those. Going through the data 96% of the seq. contain RT primer, which is workable for us. Thanks again

ADD REPLY
1
Entering edit mode

Just to complete my solution you will need another bbduk.sh pass to remove the 3'-adapter.

bbduk.sh in=your_file.fq.gz out=stdout.fq literal=AGATCGGAAGAGCACACGTCT ktrim=r k=15 minlen=0 | umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNNNN --3-prime | bbduk.sh in=stdin.fq out=final.fq literal=AACTGTAGGCACCATCAAT ktrim=r k=15 minlen=0
ADD REPLY

Login before adding your answer.

Traffic: 1708 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6