Hello everyone,
I need a little help working with genomic DNA sequencing reads. I would like to trim the reads (.fastq files) up to the first 'TA' position, and then remove all reads shorter than 40 bp. It would be helpful if someone could share some useful commands for this. Thanks very much!
Best,
Himanshu
For example:
@K00302:80:HLTWCBBXX:3:1101:3478:1402 1:N:0:NTAGGC
GGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA
+
FAAA<JAA7AAFJ<AJJ-AJAFAFJFJ<-A<7<AAA-7AA-A7<F7-AJF7AA-77FFAJFFFFFJA<JAFJJ-A77AJFJFJF7F<FA7FJ<JJ7<J-A-
@K00302:80:HLTWCBBXX:3:1101:2199:1402 1:N:0:NTAGGC
GATAAATGCATTGTCCACTAAGAAGTTCTGAGCTGGAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCTTCACGTATTCCGT
+
JAFFFJF-<FFF-<-A--FFF-7JJFJJ--<A<<J-7FFFJJFFJF<JF7A7<--77J-AA-7AA-AAJ-7FFFA7<-7-7--7J--<---<)---7-7-----)7<
edit: Removed the ** characters that were not part of the sequence. @genomax
Hi Himanshu Bhusan Samal,
Please use appropriate tags so experts can easily find your question and help you. Just 'next-gen' is meaningless.
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.
Thanks for your suggestions.
Do you want to trim the reads from the left side up to (or including) the 'TA', or trim them back from the right up to the 'TA'?
If the former, then the answer given by b.nota will indeed be what you're looking for.
Hello, thanks very much for both of your responses. Yes, I would like to trim from the 5' end (from the left side up to and including the TA), e.g. 'GGCGATGCGGCGGCGTTA' would be removed from the first read. Then I want to remove any reads shorter than 40 bp.
I previously tried cutadapt, but it does not serve the purpose, as the allowed error rate is 10% and the adapter here is only two bases, TA. That's why I am looking for a bash command to do the same. If you think it is possible with cutadapt, it would be helpful if you could give the solution. Thanks!
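For anyone who wants to see the requested behaviour spelled out, here is a minimal Python sketch (not from the thread, and not a replacement for cutadapt). It drops everything up to and including the first 'TA' and keeps only reads of at least 40 bp afterwards. The function names are made up for illustration, and discarding reads that contain no 'TA' at all is my assumption, since the question does not say what should happen to them. The placeholder quality string is used because the quality lines posted above do not match the sequence lengths.

```python
from typing import Iterable, Iterator, Optional, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def trim_through_motif(seq: str, qual: str, motif: str = "TA") -> Optional[Tuple[str, str]]:
    """Remove everything up to and including the first occurrence of motif.

    Returns None when the motif is absent (assumption: such reads are dropped).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    pos = seq.find(motif)  # leftmost exact match, no mismatches allowed
    if pos == -1:
        return None
    cut = pos + len(motif)
    # Trim sequence and quality at the same position so they stay in sync.
    return seq[cut:], qual[cut:]


def filter_records(records: Iterable[Record], motif: str = "TA",
                   min_len: int = 40) -> Iterator[Record]:
    """Yield trimmed records whose remaining length is at least min_len."""
    for header, seq, qual in records:
        trimmed = trim_through_motif(seq, qual, motif)
        if trimmed is None:
            continue
        new_seq, new_qual = trimmed
        if len(new_seq) >= min_len:
            yield header, new_seq, new_qual


# First read from the question, with a placeholder quality string.
read1 = ("GGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGG"
         "TTCCGGGGGGAGTATGGTTGCAAAGCTGAAACAAAAA")
for header, seq, qual in filter_records([("@read1", read1, "I" * len(read1))]):
    print(header)
    print(seq)
    print("+")
    print(qual)
```

For a real file you would feed `filter_records` with records read four lines at a time from the .fastq file.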
Yes, I was trying to bold the rest of the sequence after TA (the sequence I want to keep after trimming, just to visualize it for you); that's why the '**' were there. Also, I copied and pasted the .fastq sequences at random from a huge .fastq file, which is why the sequence lengths do not match the quality strings. In the original .fastq file, everything is fine. The problem arises when I trim with only two bases, TA.
see my post: adding -O 2 as a parameter will fix this! This happens because the adapter you are looking for is shorter than the minimum overlap cutadapt requires between read and adapter, so cutadapt never reports a match. Setting the minimum overlap to the adapter length (-O 2), or to anything lower than or equal to the adapter length, is the solution!
Did you try what lieven.sterck is suggesting? With the -O 2 argument?
Thank you so very much, Sterck! That is exactly what I was trying. It works the way you suggested (with the -O 2 parameter). I couldn't post my message earlier because I am limited to 5 messages per 6 hours and had reached that limit. Sorry for the delay. I am really thankful for both of your efforts, thank you!
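To make the accepted fix concrete, here is a hedged sketch of how the cutadapt call might be assembled from Python. Only -O 2 comes from this thread. The other options are my assumptions about what fits the question: -g for a 5' adapter, -e 0 to forbid mismatches, -m 40 for the length cutoff, and -o for the output file. The file names are hypothetical. As far as I know cutadapt keeps reads in which no adapter is found unless told otherwise, and I have not verified which 'TA' it picks when several occur, so please check the result on a few reads.

```python
from typing import List


def build_cutadapt_cmd(fastq_in: str, fastq_out: str,
                       motif: str = "TA", min_len: int = 40) -> List[str]:
    """Assemble a cutadapt argument list (a sketch; options other than -O are assumptions)."""
    return [
        "cutadapt",
        "-g", motif,            # treat the motif as a 5' adapter
        "-O", str(len(motif)),  # minimum overlap = adapter length (the fix from this thread)
        "-e", "0",              # assumption: allow no mismatches in a 2-base adapter
        "-m", str(min_len),     # assumption: discard reads shorter than min_len after trimming
        "-o", fastq_out,
        fastq_in,
    ]


# Hypothetical file names, for illustration only.
cmd = build_cutadapt_cmd("reads.fastq", "trimmed.fastq")
print(" ".join(cmd))
# To actually run it (requires cutadapt to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to print and inspect the command before anything touches the data.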