I tried BBDuk on the cluster and it finally works, but the result is not what I want. I use paired-end fastq files as input.
The original read:
@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
TTCTTAAGATAGTGACAATA**AGATCTGCTCTTGCTCACCTAGGTAC**CCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF
What I want as output is everything from the primary restriction site, trimmed at the secondary restriction site:
@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
AGATCTGCTCTTGCTCACCTAG
+
BBBBBFFFFFFFFFFFFFFFFFF
or in the reverse reads:
@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA
GTGCGGTCGATATTGTGTACCTATACCCCTCAGTAGAATAGATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFFFF
and after trimming it should give this entry in the trim2 file:
@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA
GTACCTATACCCCTCAGTAGAAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFF
So in the end I want to have fastq files containing only reads like this:
@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
AGATCTGCTCTTGCTCACCTAG
+
BBBBBFFFFFFFFFFFFFFFFFF
@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA
GTACCTATACCCCTCAGTAGAAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFF
Is it possible to do this with BBDuk? I tried it on the command line and in other ways, and in the end I didn't get the correct results.
If I also want to cut a part from the start, from AGATCT until the GTAC, in order to finally have the read like this:
@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
AGATCTGCTCTTGCTCACCTAGGTAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFF
Should I run the awk one more time with AGATCT? Thank you in advance for your valuable help!
My Python script can easily be modified to do that ;-)
This script seems to work, but it is very slow on huge fastq files. I have a file which is 15 GB and it takes a day to finish. Any suggestion to make it faster?
Do you have multiple cores/processors? You could split the fastq and run it in parallel.
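If you want to stay inside Python rather than split the file by hand, one way to use several cores is to hand batches of reads to a multiprocessing pool. Here is a rough sketch of that idea; trim_read() is a placeholder for the trimming logic discussed in this thread, and the batch size, worker count, and file names are arbitrary assumptions:

# Rough sketch of the "split and run in parallel" idea inside Python itself.
import sys
from itertools import islice
from multiprocessing import Pool

from Bio import SeqIO


def trim_read(record):
    # Placeholder for the actual trimming logic; return None to drop a read.
    return record


def trim_batch(records):
    # Trim one batch of reads in a worker process.
    return [r for r in (trim_read(rec) for rec in records) if r is not None]


def batches(iterator, size=10000):
    # Yield lists of `size` records so the whole file never sits in memory.
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    reads = SeqIO.parse(sys.argv[1], "fastq")
    with Pool(processes=4) as pool, open("trimmed.fastq", "w") as out:
        for trimmed in pool.imap(trim_batch, batches(reads)):
            SeqIO.write(trimmed, out, "fastq")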
BBDuk will process a 15GB fastq file in a couple minutes...
Yes, but does BBDuk support using both a start recognition sequence and an end recognition sequence? If not, this could probably be done by running BBDuk twice, sequentially.
Kind of a single step:
bbduk.sh in=file.fq out=stdout.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=trimmed.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive
Should add a minute or two :)
BBDuk2 actually supports left and right trim simultaneously, but it only allows a single fixed kmer length which would make things rather difficult in this case :) I don't really ever use it because BBDuk is more flexible, as in Genomax's command.
Can we modify the Python script to search for everything from AGATCT to GTAC, from GTAC to AGATCT, from AGATCT to the end of the sequence (if there is no GTAC), and from GTAC to the end of the sequence (if there is no AGATCT)? Thank you in advance.
Yup that's perfectly possible, but it's still going to be slow for huge files.
Still better than nothing! At least this way it is going to work correctly. What will the new version look like?
I think the following should do the trick, but I didn't test it.
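The script itself isn't reproduced in this thread, but judging from the pieces quoted further down (trim_data.py, the findSite() helper, and the "None of the sites were found in read ..." message), a first version along these lines might have looked roughly like the sketch below; everything outside those quoted pieces (file handling, variable names) is my assumption, not the original gist.

# Minimal sketch, not the original gist. Needs Biopython; reads the fastq file
# named on the command line and writes the trimmed reads to stdout.
import re
import sys

from Bio import SeqIO

SITE1 = "AGATCT"   # primary cutter
SITE2 = "GTAC"     # secondary cutter


def findSite(read, pattern):
    # Start positions of every occurrence of pattern in the read.
    return [m.start() for m in re.finditer(pattern, read)]


for record in SeqIO.parse(sys.argv[1], "fastq"):
    seq = str(record.seq)                      # re works on plain strings
    hits = sorted(findSite(seq, SITE1) + findSite(seq, SITE2))
    if not hits:
        print("None of the sites were found in read %s" % record.id,
              file=sys.stderr)
        continue
    # Keep from the first site up to (not including) the next one, or to the
    # end of the read if only one site is present.
    stop = hits[1] if len(hits) > 1 else len(record)
    print(record[hits[0]:stop].format("fastq"), end="")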
I think it doesn't work! I tried it, but there is no match. It should check a read and take everything from AGATCT to GTAC or from GTAC to AGATCT, and if the second of those sites is missing, take everything until the end of the read (AGATCT to the end, or GTAC to the end).
So it tells you
None of the sites were found in read...
?
Yes, I always get that exit message.
Can you show a read which gives the exit message?
None of the sites were found in read HISEQ:267:CAAV9ANXX:4:1101:10050:2218. This is the message I get, and I checked the produced file, which contains only the read header and is otherwise empty!
Can you show me the actual sequence of that read?
@HISEQ:267:CAAV9ANXX:4:1101:10050:2218 1:N:0:AGTCAA
GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+
(quality line garbled in the forum post)
As I expected, that read DOESN'T contain AGATCT or GTAC so the error message you got is completely accurate.
No, because this is just one read from the fastq file. I ran the script on a normal fastq file with a lot of reads, and the output was the error message plus either an empty file or a file with 5 read headers and nothing else!
That's what you asked for. That's what you got from me too. I'm really bad at reading your mind.
This thread is getting confusing and you keep changing what you have and what you need. I'm hesitating to just close this down and let you think a while about your question and the solution that would help you. I'm also in doubt if all this makes biological sense and wonder if you know what you are doing. In addition, I can read your reactions perfectly fine without adding exclamation marks, thank you very much.
So, let's try one more time: what should then happen to reads which contain none of both restriction sites?
I am sorry if I didn't explain well what I want. Reads which contain neither restriction site will be discarded. We want a new fastq file, made from the original, which contains only reads that go from AGATCT until GTAC; if there is no GTAC (the secondary restriction site), the rest of the read is kept. Likewise, reads that go from GTAC until AGATCT; if there is no AGATCT (the secondary restriction site in this case), the read is kept until its end. I hope this is clear. Thank you, and I am sorry for the confusion.
So reads without the restriction sites get ignored, alright. Then we can do without the final lines; I updated the code:
This seems OK, but I have to test it on all my files! I will come back after testing. Thank you for the reply, and I am sorry for the misunderstanding.
Hey, I finished testing the script and it is working fine! I have one last question: how can we modify the script for reads with a second appearance of a restriction site in each case? Apparently I have some reads like that
and in this case I want to remove the second appearance and keep only this
or with the other cutter
and keep this
Thank you very much for your valuable help!
Does this only happen when the "other" restriction site is not found? Or always, even with the second cutter?
I found it when the second cutter was missing, but maybe there can also be a case in which one read has
In that case we want only
So actually, in every case, reads will start with a restriction site and end with a restriction site, as we did before, OR start with a restriction site and run to the end of the line when there is no other restriction site or second appearance of a restriction site.
I've updated the code to do:
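The updated gist is not reproduced in the thread; a hedged sketch of the rule as just described (start at the first restriction site, stop at the next occurrence of either site, so a second appearance of the same cutter also ends the fragment, otherwise keep until the end of the read, and drop reads with no site at all) could look like this, with the output file name as an assumption:

# Sketch of the final trimming rule, not the actual gist.
import re
import sys

from Bio import SeqIO

SITES = "AGATCT|GTAC"


def findSites(read):
    # Start positions of every AGATCT and GTAC occurrence, in order.
    return [m.start() for m in re.finditer(SITES, read)]


with open("trimmed.fastq", "w") as out:            # output name is an assumption
    for record in SeqIO.parse(sys.argv[1], "fastq"):
        hits = findSites(str(record.seq))
        if not hits:
            continue                               # no restriction site: discard
        # Stop at the next site of either kind, or keep until the end of the read.
        stop = hits[1] if len(hits) > 1 else len(record)
        out.write(record[hits[0]:stop].format("fastq"))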
Obviously untested.
I have an error:
python/3.5.1 load complete.
  File "trim_data.py", line 17
    print(record[hits[0]:hits[1]].format("fastq"))
        ^
IndentationError: expected an indented block
That means something got messed up in the tabs of the script; check that the indentation is the same as in the GitHub gist.
This is what I did, but there is still an error. Now it's something like:
Traceback (most recent call last):
  File "trim_data.py", line 13, in <module>
    hits = findSite(record.seq, pattern)
  File "trim_data.py", line 9, in findSite
    return [m.start() for m in re.finditer(pattern, read)]
  File "/opt/exp_soft/local/generic/python/3.5.1/lib/python3.5/re.py", line 220, in finditer
    return _compile(pattern, flags).finditer(string)
TypeError: expected string or bytes-like object
I've made a minor modification, can you try again?
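The actual modification isn't shown, but that TypeError is what re.finditer() raises when it is handed a Bio.Seq.Seq object instead of a plain string, so the likely one-line fix (an assumption on my part) is to convert the sequence before searching:

# Assumed fix: re.finditer() only accepts str/bytes, while record.seq is a
# Bio.Seq.Seq object, so convert it to a string before calling findSite().
hits = findSite(str(record.seq), pattern)   # was: findSite(record.seq, pattern)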
It looks OK now. Thanks, I will try it on the rest of my data!