Remove hundreds of probes
2
0
Entering edit mode
7.0 years ago
Floydian_slip ▴ 170

Hi, I have paired-end fastq file which contain the sequences of the target regions along with probes sequences on either side (in entirety or partially). I have sequences of all the 3,200 probes. I want to quickly trim the probe sequences from the fastq files if they occur on either side of the target region (only if they start with or end with the probes).

Does anyone have a suggestion for a tool to do the above? I have tried Trimmomatic and Cutadapt but they are too slow as they are designed with only a handful of probes (or adapters) in mind?

Is there anyway to remove 1000's of probes that occur at the end of the reads in paired-end fastq files? Thanks!

probes trimming • 2.1k views
ADD COMMENT
0
Entering edit mode
7.0 years ago
GenoMax 147k

You can use bbduk.sh from BBMap suite. Provide the sequences of the probes in multi-fasta format in a file. You will have to do ktrim=lr if the sequences are on either side (or may have to do a two-pass trim).

ADD COMMENT
0
Entering edit mode

Thanks, genomax. I did try bbduk.sh but it did not give me what I expected. I ran it to only look for probes on the 5' end with exact matches in the first 31 bp as follows:

bbduk.sh in1=L001_R1_001.fastq.gz in2=L001_R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_and_ULSO.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo

Added 8630 kmers; time:     0.225 seconds.
Memory: max=1457m, free=1396m, used=61m
Input is being processed as paired
Started output streams: 0.121 seconds.
Processing time:        14.583 seconds.
Input:                      1089512 reads       164516312 bases.
Contaminants:               0 reads (0.00%)     0 bases (0.00%)
Trimmed by overlap:         51452 reads (4.72%)     2452840 bases (1.49%)
Total Removed:              0 reads (0.00%)     2452840 bases (1.49%)
Result:                     1089512 reads (100.00%)     162063472 bases (98.51%)

As you can see, it did not remove any bases due to matches with the probes (although I checked that they are there) but only due to overlap. Moreover, my ref file (RC_DLSO_and_ULSO.fa) contains ~6,500 probes but the info above says that it added 8630 k-mers. So, I think that my ref file is not being read.

Then, I ran it with half the k-mers (only for R1):

bbduk.sh in1=R1_001.fastq.gz in2=R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_without_start.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo skipr2=t

Added 7164 kmers; time:     0.162 seconds.
Memory: max=1459m, free=1359m, used=100m
Input is being processed as paired
Started output streams: 0.123 seconds.
Processing time:        14.258 seconds.
Input:                      1089512 reads       164516312 bases.
Contaminants:               0 reads (0.00%)     0 bases (0.00%)
Trimmed by overlap:         51452 reads (4.72%)     2452840 bases (1.49%)
Total Removed:              0 reads (0.00%)     2452840 bases (1.49%)
Result:                     1089512 reads (100.00%)     162063472 bases (98.51%)

Above, it said it added 7164 k-mers intead of ~3250 in my ref file.

And it only took 14 secs for a file of 3,250 probes.

Can you help me figuring out with what is going on?

Thanks a lot in advance!

ADD REPLY
0
Entering edit mode

Can you add k=13 ktrim=l and remove minkmerfraction=1.0 in your command line and re-try? Are your probe sequences in fasta format in that file?

ADD REPLY
0
Entering edit mode

hi genomax, yes, that worked. What does k=13 mean?

Also, the reason why I had minmerfraction=1.0 is because I wanted to remove only those 5' parts that were 100% identical to the probe sequences. I hope that it still does that.

Yes, the probe sequences are in multi-fasta format. Thanks!

ADD REPLY
0
Entering edit mode

That is the k-mer size used for initial/seed matches. What do the stats look like?

ADD REPLY
0
Entering edit mode
7.0 years ago
chen ★ 2.5k

If you have a file contains all the probes, you can use buduk.sh from BBMap suite to remove them, as @genomax recommended.

But if you don't have a file contains all the probes, you can use fastp to remove all the probes, without the need to provide the probe sequences. fastp trims adapters (probes) by finding the insert size for paired end data, and treating the read tails beyond insert size as adapters. So fastp can do this without adapter sequences provided.

The command is simple:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

This tool is very fast (written in C++, with multithreading supported), you can get it from: https://github.com/OpenGene/fastp

ADD COMMENT
0
Entering edit mode

fastp trims adapters (probes) by finding the insert size for paired end data, and treating the read tails beyond insert size as adapters

@chen: Would you mind clarifying how this feature works if the PE reads do not overlap? They should not, in majority of cases with well made libraries. You do not appear to be using a reference in the above command line either.

ADD REPLY
0
Entering edit mode

The trick is:

PE is not overlapped --> insert DNA is longer than sequencing length --> no adapters are sequenced.

The only issue is sequencing errors will cause mismatches in the overlapped region, but it can be handled by error tolerant algorithms.

ADD REPLY
0
Entering edit mode

Hi Chen, I do have the sequences of the probes but I still tried it nonetheless - it worked but it did not remove all the probes (only about half of them were removed). I think its because it does not have all the info (the exact sequences of all the probes).

ADD REPLY
0
Entering edit mode

How did you know a half of probes were removed and rest were kept? Can you post the table of adapter trimming result?

ADD REPLY
0
Entering edit mode

Hi chen, actually, it does work - I loaded it in a browser and it looked good. Thanks for the suggestion!

ADD REPLY
0
Entering edit mode

I am glad that it helps:)

ADD REPLY

Login before adding your answer.

Traffic: 2888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6