bash command for trimming and filtering .fastq files
1
1
Entering edit mode
6.2 years ago

Hello everyone,

I need little help to work with the genomic DNA seq. I would like to trim the reads (.fastq files) until first 'TA' position. And then, want to remove all the reads having length <40bp. It would be helpful if someone could share some useful commands to do the same. Thanks very much!

Best,

Himanshu

For ex -

@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
GGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
FAAA<JAA7AAFJ<AJJ-AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
GATAAATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
JAFFFJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----)7<

edit: Removed ** that were not part of the sequence. @genomax

filtering fastq files • 5.3k views
ADD COMMENT
0
Entering edit mode

Hi Himanshu Bhusan Samal,

  • Please select a more descriptive title for your thread so it's clear what this is about. You are looking at trimming and filtering.
  • Please use appropriate tags so experts can easily find your question and help you. Just 'next-gen' is meaningless.

  • I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

  • Please put some effort in solving this yourself. We are more eager to help you if you show us that you tried to solve it.
ADD REPLY
0
Entering edit mode

Thanks for your suggestions.

ADD REPLY
0
Entering edit mode

do you want to trim the reads from the left side up until (or including) the 'TA' of trim them back from the right up to the 'TA' ?

If the former then the answer given by b.nota will indeed be what you're looking for

ADD REPLY
0
Entering edit mode

Hello, thanks very much for both of your response. Yes, I would like to trim from 5' end (from left side up-to/including TA). Ex- 'GGCGATGCGGCGGCGTTA' from the first read. Then want to remove if any reads having length <40.

Previously I have tried cutadapt, but it's not filling the purpose as the allowed error rate is 10% and here is only two base, TA. That's why, I am looking for bash command to the same. If you think, with cutadapt it is possible then it will be helpful if you could give the solution. Thanks!

ADD REPLY
0
Entering edit mode

Yes, I was trying to bold the rest sequence after TA, (sequence which I want to keep after trimming, just to visualize for you) that's why '**'. And, randomly I have copied pasted the .fastq sequences (from a huge .fastq file), that's why the length is not matching to the quality of sequence. In the original .fastq file, everything is fine. The problem is arising when I am trimming only two base, TA.

ADD REPLY
0
Entering edit mode

see my post: add the -O 2 as a parameter will fix this !

you get this error because the adapter length you are looking for is smaller then the allowed overlap- between adapters causing cutadapt to fail , set the min overlap to 'adapter' length (== -O 2) to lower or equal to adapter length is the solution!

ADD REPLY
0
Entering edit mode

Did you try what lieven.sterck is suggesting? With the -O 2 argument?

ADD REPLY
0
Entering edit mode

Thank you so very much Sterck! I was trying that only. It's working the way you suggested (with -O 2 parameter). Couldn't post my message because I have limitation of sending 5 message per 6 hours, so reached the limitation. Sorry for the delay. I am really thankful for both of your effort, thank you!

ADD REPLY
5
Entering edit mode
6.2 years ago
Benn 8.3k

I suggest not to reinvent the wheel but use something like cutadapt. If I understand your aim correctly you want to remove everything preceding the first 'TA', correct?

Simple cutadapt command would do this.

cutadapt -g TA -o output.fq input.fq

See cutadapt -h for more information, also how to use it for the <40 cut off.

ADD COMMENT
0
Entering edit mode

might work indeed.

but probably worth looking into if cutadapt uses the left most one or the right most one. In the latter case a few rounds of trimming might be required (if cutadapt does not handle multiple copies of the "adapter" well)

ADD REPLY
0
Entering edit mode

-g argument means 5' of the read, please read help page.

ADD REPLY
0
Entering edit mode

yes, but that says nothing about the multiple "adapter" presence

and I was under the impression he wants to trim from right to left. But better ask OP for clarification

ADD REPLY
0
Entering edit mode

Did you read the manual?

ADD REPLY
0
Entering edit mode

I certainly did, which raised another remark: should you not add the -O 2 as parameters (to set the min overlap equal to the adapter length) ?

and I stand corrected on the multiple adapters issue (empirical determined)

ADD REPLY
0
Entering edit mode

Hi I am not sure if you really understood the manual then correctly, but let's go thru it together. The -g argument means to trim everything from 5' end until (including) the adapter, TA in this case. By default it only trims 1 adapter per read, so no need to change that. Furthermore, the default allowed error rate is 10%, which is 0.2 for this two base adapter, hence no error allowed. Can you please add your better solution as a new answer? Thank you.

ADD REPLY
0
Entering edit mode

Hello, thanks very much for both of your response. Yes, I would like to trim from 5' end (from left side up-to/including TA). Ex- 'GGCGATGCGGCGGCGTTA' from the first read. Then want to remove if any reads having length <40.

Previously I have tried cutadapt, but it's not filling the purpose as the allowed error rate is 10% and here is only two base, TA. That's why, I am looking for bash command to the same. If you think, with cutadapt it is possible then it will be helpful if you could give the solution. Thanks!

ADD REPLY
0
Entering edit mode

Well it seems you did not read the manual well enough, there is an argument -m which filters for length. Like I said before the error rate of 10% is 0.2 so no errors are allowed by default (and if you are still afraid of this error rate set it to 0, but it is not necessary).

cat input.fq
@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
GGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
FAAA<JAA7AAFJ<AJJ-AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
GATAAATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
JAFFFJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----

cutadapt -g TA -m 40 -o output.fq input.fq

cat output.fq
@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
TTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
AATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
FJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----
ADD REPLY
0
Entering edit mode

Thanks for your reply. Of course, I read the manual and had tried the same that you mentioned. Since I was getting error, thought of trying in different way, if it is possible. Since it's working for you, maybe some problem from my side. I am trying to sort that, thanks!

Trimming 1 adapter(s) with at most 10.0% errors in single-end mode ...
Traceback (most recent call last):
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/bin/.cutadapt-real", line 10, in <module>
    cutadapt.main()
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 833, in main
    stats = process_single_reads(reader, modifiers, writers)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 254, in process_single_reads
    read = modifier(read)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 190, in __call__
    match = self._best_match(read)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/scripts/cutadapt.py", line 132, in _best_match
    match = adapter.match_to(read)
  File "/gnu/store/9rwgx4pbq4sd4d18x6x24hqab8mv3iys-cutadapt-1.8/lib/python3.4/site-packages/cutadapt/adapters.py", line 293, in match_to
    assert match.length >= self.min_overlap
AssertionError
ADD REPLY
0
Entering edit mode

Okay, I see. The error was not in the original question, so I suggest to post a new question where you describe your error (and your code). Good luck!

ADD REPLY
0
Entering edit mode

I am not getting this error when I am trying with any different/long adapter. Code is same only. I have used cutadapt for adapter trimming several times, but this error is for the first time when I am trying to trim only two base, TA. Thats why I thought here maybe it's not possible with cutadapt, though I have read the manual. Hope you understand! It will be a great help if you can give me the solution, if possible. Thanks!

ADD REPLY
0
Entering edit mode

Maybe this can help: Cutadapt Assertionerror In your example fastq file, if you remove the ** in the sequence, the length of the DNA sequence is not equal to the quality sequence.

ADD REPLY
0
Entering edit mode

see my post: add the -O 2 as a parameter will fix this !

you get this error because the adapter length you are looking for is smaller then the allowed overlap- between adapters causing cutadapt to fail , set the min overlap to 'adapter' length (== -O 2) to lower or equal to adapter length is the solution!

and I assume he added the ** to indicate where he want is to be trimmed (or indicate the part he wants to keep)?

ADD REPLY
0
Entering edit mode

Okay, in my version of cutadapt (1.13) this seems not a problem. I assume some things have changed in the newer versions. If you remove the ** from the second read, the quality of that reads is longer in length. I don't know the history of that fastq file, but something went wrong.

ADD REPLY
0
Entering edit mode

strange, I did not find anything in the change log that this has been updated, moreover it should still be default 3 , anyway ...

+1 for the length diff, I noticed that as well (and for the top one removing the ** gives the corresponding quality length)

ADD REPLY
0
Entering edit mode

Yes, it's strange. In version 1.13 it is also default 3 but still seems to run with no errors.

ADD REPLY
1
Entering edit mode

Thank you so very much Sterck! I was trying that only. It's working the way you suggested (with -O 2 parameter). Couldn't post my message because I have limitation of sending 5 message per 6 hours, so reached the limitation. Sorry for the delay. I am really thankful for both of your effort, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1491 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6