Hi, I have paired-end FASTQ files that contain the sequences of the target regions along with probe sequences on either side (in their entirety or partially). I have the sequences of all 3,200 probes. I want to quickly trim the probe sequences from the FASTQ files if they occur on either side of the target region (only if the reads start or end with a probe).
Does anyone have a suggestion for a tool to do the above? I have tried Trimmomatic and Cutadapt, but they are too slow, as they are designed with only a handful of probes (or adapters) in mind.
Is there any way to remove thousands of probes that occur at the ends of the reads in paired-end FASTQ files? Thanks!
Thanks, genomax. I did try bbduk.sh but it did not give me what I expected. I ran it to look for probes only on the 5' end, with exact matches in the first 31 bp, as follows:
bbduk.sh in1=L001_R1_001.fastq.gz in2=L001_R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_and_ULSO.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo
As you can see, it did not remove any bases due to matches with the probes (although I checked that they are there) but only due to overlap. Moreover, my ref file (RC_DLSO_and_ULSO.fa) contains ~6,500 probes but the info above says that it added 8630 k-mers. So, I think that my ref file is not being read.
Then, I ran it with half the k-mers (only for R1):
bbduk.sh in1=R1_001.fastq.gz in2=R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_without_start.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo skipr2=t
Above, it said it added 7164 k-mers instead of the ~3,250 in my ref file.
And it only took 14 secs for a file of 3,250 probes.
Can you help me figure out what is going on?
Thanks a lot in advance!
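A note on the counts, as a possible explanation rather than a diagnosis: the number reported is k-mers stored, not sequences read. A probe of length L contributes up to L - k + 1 k-mers, and a k-mer shared by several probes is stored only once, so the k-mer count need not match the probe count. The sketch below illustrates this with toy sequences; `distinct_kmers` and the probes are hypothetical and not taken from the actual ref file, and only the forward strand is counted, which is how I understand rcomp=f to behave.

```python
def distinct_kmers(seqs, k):
    """Return the set of distinct forward-strand k-mers across all sequences."""
    kmers = set()
    for seq in seqs:
        seq = seq.upper()
        # A sequence of length L yields L - k + 1 k-mers (none if L < k).
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


# Toy probes (made up for illustration); the first two overlap heavily.
probes = ["ACGTACGTAC", "CGTACGTACG", "TTTTGGGGCC"]
kmers = distinct_kmers(probes, k=8)
print(len(probes), "probes ->", len(kmers), "distinct k-mers")
# prints: 3 probes -> 7 distinct k-mers
```

Running the same count on the real multi-FASTA file with the k that was actually used would show whether the reported number is plausible.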
Can you add k=13 ktrim=l and remove minkmerfraction=1.0 in your command line and re-try? Are your probe sequences in fasta format in that file?
Hi genomax, yes, that worked. What does k=13 mean?
Also, the reason why I had minkmerfraction=1.0 is that I wanted to remove only those 5' parts that were 100% identical to the probe sequences. I hope that it still does that.
Yes, the probe sequences are in multi-FASTA format. Thanks!
That is the k-mer size used for initial/seed matches. What do the stats look like?
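To make the seed idea concrete, here is a minimal sketch of k-mer-based left trimming. It is a simplified model of what I understand ktrim=l to do (remove the matching k-mer and everything to its left), not BBDuk's actual implementation; `trim_left`, `restrict_left`, and the toy probe are hypothetical names and data, and quality strings are ignored.

```python
def trim_left(read, kmers, k, restrict_left=None):
    """Cut the read just after the rightmost reference k-mer found in the
    search window; return the read unchanged if no k-mer matches."""
    window = read if restrict_left is None else read[:restrict_left]
    cut = 0
    for i in range(len(window) - k + 1):
        if window[i:i + k] in kmers:  # exact match only (like hdist=0)
            cut = i + k
    return read[cut:]


k = 13
probe = "ACGTACGTACGTAAGG"  # toy probe, made up for illustration
# Index every forward-strand 13-mer of the probe.
kmers = {probe[i:i + k] for i in range(len(probe) - k + 1)}

read = probe + "TTTTCCCCGGGG"  # probe at the 5' end, then target sequence
trimmed = trim_left(read, kmers, k, restrict_left=31)
print(trimmed)
# prints: TTTTCCCCGGGG
```

One consequence of this model: an exact match of a single 13-mer is enough to trigger trimming, so it does not by itself guarantee that the whole probe is 100% identical to the start of the read.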