Fastq Trimmer by pattern
4
0
Entering edit mode
7.9 years ago
dzisis1986 ▴ 70

I have a set of fastq files and I want to trim from the begining until the second restriction site (GTAC) found in the read. Is there any simple way to do it in terminal ?

Example:

Original read :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT TTCTTAAGATAGTGACAATAAGATCTGCTCTTGCTCACCTAGGTACCCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC +

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF

What I want as output :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

TTCTTAAGATAGTGACAATAAGATCTGCTCTTGCTCACCTAGGTAC

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

fastq trimming reads terminal • 5.5k views
ADD COMMENT
4
Entering edit mode
7.9 years ago

Using awk:

cat input.fq | paste - - - - | awk -F '\t' '{i=index($2,"GTAC"); OFS="\t";$2=(i==0?$2:substr($2,1,i+3));$4=(i==0?$4:substr($4,1,i+3)); print;}' | tr "\t" "\n"
ADD COMMENT
0
Entering edit mode

if want to cut also a part from the start from AGATCT until the GTAC in order to have finally the read like that :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAGGTAC

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

I should run one more time the awk with AGATCT ?? Thank you in advance for your valuable help !!!

ADD REPLY
1
Entering edit mode

My python script can easily get modified to get that done ;-)

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fastq"):
    try:    
        positionSTART = str(record.seq).index("AGATCT")
        positionEND = str(record.seq).index("GTAC") + 4
        print(record[positionSTART:positionEND].format("fastq"))
    except ValueError:
        pass #Errors shouldn't go silent but up to OP what to do with those
ADD REPLY
0
Entering edit mode

This script seems to work but it is very slow in huge fastq files. I have a file which is 15 Gb and it takes 1 day to finish. Any suggestion to make it faster ??

ADD REPLY
0
Entering edit mode

Do you have multiple cores/processors? You could split the fastq and run it in parallel.

ADD REPLY
0
Entering edit mode

BBDuk will process a 15GB fastq file in a couple minutes...

ADD REPLY
1
Entering edit mode

Yes, but does BBDuk support to use both a start recognition sequence and an end recognition sequence? If not, this probably can be done by running BBDuk twice/sequential.

ADD REPLY
3
Entering edit mode

Kind of single step. bbduk.sh in=file.fq out=stdout.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=trimmed.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive

Should add a minute or two :)

ADD REPLY
1
Entering edit mode

BBDuk2 actually supports left and right trim simultaneously, but it only allows a single fixed kmer length which would make things rather difficult in this case :) I don't really ever use it because BBDuk is more flexible, as in Genomax's command.

ADD REPLY
0
Entering edit mode

Can we modify the python script to search for : Everything that starts from AGATCT to GTAC, GTAC to AGATCT , AGATCT to the end of the sequence (if there is no GTAC) and GTAC to the end of the sequence (if there is not AGATCT) ?? Thank you in advance.

ADD REPLY
0
Entering edit mode

Yup that's perfectly possible, but it's still going to be slow for huge files.

ADD REPLY
0
Entering edit mode

still better than nothing ! At least with this is going to work correct !! How is going to be the new version ?

ADD REPLY
0
Entering edit mode

I think the following should do the trick, but I didn't test it.

ADD REPLY
0
Entering edit mode

I thing it doesnt work ! i tried it but there is no match. It should check a read and take eveything that starts from AGATCT to GTAC or from GTAC to AGATCT and if it doesnt exist one of those two in each case take until the end of the read (AGATCT to the end or GTAC to the end )

ADD REPLY
0
Entering edit mode

i tried it but there is no match.

So it tells you None of the sites were found in read...?

ADD REPLY
0
Entering edit mode

yes i always have the exit message

ADD REPLY
0
Entering edit mode

Can you show a read which gives the exit message?

ADD REPLY
0
Entering edit mode

None of the sites were found in read HISEQ:267:CAAV9ANXX:4:1101:10050:2218. this is the message i have and i checked the produced file which is only with the header of the read and emply !

ADD REPLY
0
Entering edit mode

Can you show me the actual sequence of that read?

grep -A 3 "HISEQ:267:CAAV9ANXX:4:1101:10050:2218" yourfile.fastq
ADD REPLY
0
Entering edit mode

@HISEQ:267:CAAV9ANXX:4:1101:10050:2218 1:N:0:AGTCAA GTGCGGTCGATATTTTGTATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAACAA
+

BBBBB/B<f< <="" <bf<fffffffffffffffffffffffffffff7ffffbfffffffffffffbfbf<ffffffffffffff<bfff="" bf="" fbf<b&lt;<f="" 7bfbbffff="" b="" bf<="" p="">

ADD REPLY
1
Entering edit mode

As I expected, that read DOESN'T contain AGATCT or GTAC so the error message you got is completely accurate.

ADD REPLY
0
Entering edit mode

No because this is just a read from a fastq file. I run the script for a normal fastq file with a lot of reads and the output was the error message and one empty file or a file with 5 headers of the read but nothing else !

ADD REPLY
0
Entering edit mode

Can we modify the python script to search for : Everything that starts from AGATCT to GTAC, GTAC to AGATCT , AGATCT to the end of the sequence (if there is no GTAC) and GTAC to the end of the sequence (if there is not AGATCT) ?? Thank you in advance.

That's what you asked for. That's what you got from me too. I'm really bad at reading your mind.

This thread is getting confusing and you keep changing what you have and what you need. I'm hesitating to just close this down and let you think a while about your question and the solution that would help you. I'm also in doubt if all this makes biological sense and wonder if you know what you are doing. In addition, I can read your reactions perfectly fine without adding exclamation marks, thank you very much.

So, let's try one more time: what should then happen to reads which contain none of both restriction sites?

ADD REPLY
0
Entering edit mode

I am sorry if i didnt explain well what i want. The read which contain none of both restriction sites will be discarded. we want to have a new fastq file from the original which contains only reads that starts with AGATCT until GTAC, if there is no GTAC(secondary rest site) it will take the rest of the read. Also reads that start from GTAC until AGATCT, and if there is no AGATCT(secondary rest site in this case) it will take until the end of the read. I hope this is clear. Thank you and i am so sorry for the confusion

ADD REPLY
1
Entering edit mode

So reads without the restriction sites get ignored, alright. So we can do without the final lines, I updated the code:

ADD REPLY
0
Entering edit mode

This seems ok but i have to test it for all files !I will come back after testing. Thank you for the reply and i am sorry for the misunderstanding.

ADD REPLY
0
Entering edit mode

Hey i finished with testing the script and it is working fine!I have one last question: how can we modify the script remove reads with second apperance of restriction site at each case ? Apparently i have some reads like that

GTACCCGGACCACCAATCAGCAATCTCAGTTATCGTACAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC

and in this case i want to remove the second apperance and keep only this

GTAC CCGGACCACCAATCAGCAATCTCAGTTATC

or with the other cutter

AGATCT CCGGACCACCAATCAGCAATCTCAGTTATCAATTTCTTTCGTTAGATCGGAAGAGCACACGTCT AGATCT

and keep this

AGATCT CCGGACCACCAATCAGCAATCTCAGTTATCAATTTCTTTCGTTAGATCGGAAGAGCACACGTCT

Thank you very much for your valuable help !!

ADD REPLY
0
Entering edit mode

Does this only happen when the "other" restriction site is not found? Or always, even with the second cutter?

ADD REPLY
0
Entering edit mode

i found it when the second cutter was missing but maybe can be a case in which in one read we have

GTACCCGGACCACCAATCAGCAATCTCAGTTATCAGATCTAATTTCTTTCGTTAGATCGGAAGGTACCACGTCTGAACTCCAGTC

In that case we want only

GTACCCGGACCACCAATCAGCAATCTCAGTTATC

so actually at any case reads will start with any restriction site and end with a restriction site as we did before OR start with a restriction site and go until the end of the line in which there is no other restriction site or second appearance of restriction site

ADD REPLY
1
Entering edit mode

I've updated the code to do:

  1. Look for all matches of both restriction sites
  2. If one match is found, take from that position to the end
  3. If more than one match is found, take from the first to the second match
  4. If no match is found, do nothing

Obviously untested.

ADD REPLY
0
Entering edit mode

I have an error : python/3.5.1 load complete. File "trim_data.py", line 17 print(record[hits[0]:hits[1]].format("fastq")) ^ IndentationError: expected an indented block

ADD REPLY
0
Entering edit mode

That means something got messed up in the tabs of the script, check if the indentation is as I made it on the github gist.

ADD REPLY
0
Entering edit mode

This is what i did but still there is error. Now its something like : Traceback (most recent call last): File "trim_data.py", line 13, in <module> hits = findSite(record.seq, pattern) File "trim_data.py", line 9, in findSite return [m.start() for m in re.finditer(pattern, read)] File "/opt/exp_soft/local/generic/python/3.5.1/lib/python3.5/re.py", line 220, in finditer return _compile(pattern, flags).finditer(string) TypeError: expected string or bytes-like object

ADD REPLY
1
Entering edit mode

I've made a minor modification, can you try again?

ADD REPLY
0
Entering edit mode

It looks OK now. Thanks i will try it for the rest of my data !

ADD REPLY
2
Entering edit mode
7.9 years ago

Could you try this:

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fastq"):
    try:    
        position = str(record.seq).index("GTAC") + 4
        print(record[:position].format("fastq"))
    except ValueError:
        pass #Errors shouldn't go silent but up to OP what to do with those

Needs biopython. Save as e.g. trimAfterGTAC.py
Execute as python trimAfterGTAC.py yourinput.fastq > youroutput.fastq

Now, when no occurrence of GTAC is found the error is silently ignored. Not good, but up to you to decide what to do with that.

ADD COMMENT
1
Entering edit mode
7.9 years ago

You can do this with BBDuk from the BBMap package:

bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive
ADD COMMENT
0
Entering edit mode

i am using a cluster and i dont have the authority to install packages thats why i am looking for a simpler way in terminal. Thanks a lot anyway !

ADD REPLY
1
Entering edit mode

No installation is needed for BBMap suite. All you need is Java 1.7 or above. Things can't be any simpler than this.

ADD REPLY
0
Entering edit mode

Do you have some scripting experience? This should be quite easy (for example using Biopython).
Did I understand correctly that everything after the occurrence of GTAC is to be removed?

ADD REPLY
0
Entering edit mode

i am trying with grep awk etc to do it but i cant exactly! yes this is correnct i want to remove everything after the occurence of GTAC

ADD REPLY
0
Entering edit mode
7.8 years ago
dzisis1986 ▴ 70

I tried BBDuk in cluster and finally it works but the result is not what i want. i use as input pair end fastq files .

The Original read :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT 

TTCTTAAGATAGTGACAATA**AGATCTGCTCTTGCTCACCTAGGTAC**CCGGACCACCAATCAGCAATCTCAGTTATCCTTTAATTTCTTTCGTTAGATCGGAAGAGCACACGTCTGAACTCCAGTC 
+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFF

What I want as output is everything from the primary restriction site trimmed at the second restriction site :

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAG

+

BBBBBFFFFFFFFFFFFFFFFFF

or in the reverse reads

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTGCGGTCGATATTGTGTACCTATACCCCTCAGTAGAATAGATCTTTAACGTTTAATGATTGTGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTG

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFFFFFFFFFFFFFFFFFFFF

and after trimming should give in trim2 file the entry

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTACCTATACCCCTCAGTAGAAT

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

So at the end i want to have fastq files only with all reads like that

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT

AGATCTGCTCTTGCTCACCTAG

+

BBBBBFFFFFFFFFFFFFFFFFF 

@HISEQ:267:CAAV9ANXX:3:1103:18012:14165 1:N:0:TGACCA

GTACCTATACCCCTCAGTAGAAT

+

BBBBBFFFFFFFFFFFFFFFFFFFFFFF

Is it possible to do this with BBDuk ???? I tried with command line and other ways and finally i didnt have the correnct results .

ADD COMMENT
0
Entering edit mode

Something does not jive here. You keep referring to "reverse" reads (which I assume refers to R2 read) but the examples you have shown above all have 1:N:0:tag. So are these reads in two separate files that are labeled R1?

Please use ADD COMMENT/ADD REPLY when responding to existing posts. This helps keep threads logically organized. This particular information could have gone in the original post since you are explaining more clearly as to what you want.

ADD REPLY
0
Entering edit mode

I am sorry for the wrong output of my post . No i have one fastq file with forward and reverse primer siquences. The forward primer seq is for one data set for example ACACAATCATTAAACGTTAAAGATCT and the reverse is GTGCGGTCGATATTGTGTAC. In this case i want to look in the fastq file first keep only reads which start with the primer sequences (i did this) and the i want to do trimming on the reads as it is described above.I tried using diifferent ways in terminal with grep cat and python but everything is slow and apparently the result is not what i want. Thats why i am asking about an alternative faster way which is with BBDuk. Can you help with that ?

ADD REPLY
0
Entering edit mode

This is hard to follow because it seems like you keep changing what you say you want. But on your example forward read, I used this command:

bbduk.sh in=sample.fq out=stdout.fq ktrim=l k=6 mm=f literal=AGATCT rcomp=f ktrimexclusive | bbduk.sh in=stdin.fq out=out.fq ktrim=r k=4 mm=f literal=GTAC rcomp=f ktrimexclusive=f

...which produced this:

@HISEQ:267:CAAV9ANXX:4:1101:17780:2335 1:N:0:CGATGT
AGATCTGCTCTTGCTCACCTAG
+
FFFFFFFFFFFFFFFFFFFFFF

However, that command won't work if half your reads are reversed.

ADD REPLY
0
Entering edit mode

This is hard to follow because it seems like you keep changing what you say you want.

^^ This ^^

If I am getting this right there is one pattern (which we have demo'ed) for one half of the reads and another for the second (reverse - not sure what OP means there) half.

ADD REPLY
0
Entering edit mode

This is the case haof my reads are reversed.

ADD REPLY

Login before adding your answer.

Traffic: 1540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6