Trimming A Fastq File Representing Rna-Seq Data
4
5
Entering edit mode
12.6 years ago
Varun Gupta ★ 1.3k

Hi Everyone

So I am using RNA-seq data. I have raw fastq files which has read length of around 100bp.

I quickly ran the fastqc to check for the quality of my fastq file, and I inferred that I need to remove first 10 bases(even though quality scores are high for them , per base % gc content is higher for first 10 bp). Since my read length is 100 bp and as we go along the read the quality decreases, I would like to generate a fastq file which has reads starting from base position 10 till base position 80.

How can I do that??

Any command would be useful??

Hope to hear back soon

Regards

fastq RNA-seq • 29k views
ADD COMMENT
0
Entering edit mode

I have got the same case here, high quality score and strangely biased GC content and base frequencies. Do you have any idea how/why does this problem emerge?

ADD REPLY
0
Entering edit mode

Hi this problem i think occurs becoz of the rna seq done during library preparation since we use random hexamer. I think there is a paper on this which explains it. I will post it here as soon as i find it.

Varun

ADD REPLY
0
Entering edit mode

Thank you. Will be looking forward to it.

ADD REPLY
2
Entering edit mode

Hi Here's the link

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/

Hope this solves some of your doubts..

ADD REPLY
19
Entering edit mode
12.6 years ago
lh3 33k

Seqtk (input files can be optionally gzip'ed):

seqtk trimfq -b 10 -e 20 reads.fq.gz > out.fq

This trims 10bp from the left end and 20bp from the right end. It is over 10X faster than fastx_toolkit/fastx_trimmer:

hengli@xxxx c_elegans$ time seqtk trimfq -b 10 -e 10 /dev/shm/1.fq > /dev/null

real    0m1.751s
user    0m0.606s
sys     0m1.137s

hengli@xxxx c_elegans$ time fastx_trimmer -f 11 -l 90 -Q33 -i /dev/shm/1.fq -o /dev/null

real    0m26.960s
user    0m10.179s
sys     0m16.722s

hengli@xxxx c_elegans$ time gawk 'NR%2==0{print substr($1,11,80)}NR%2==1' /dev/shm/1.fq > /dev/null

real    0m5.316s
user    0m1.803s
sys     0m3.490s

hengli@xxxx c_elegans$ time perl prinseq-lite.pl -fastq /dev/shm/1.fq -trim_left 10 -trim_right 10

real    0m36.002s
user    0m38.517s
sys     0m9.374s

Seqtk also implements the phred quality trimming algorithm, which is the most sophisticated algorithm. It is still much faster than fastx_toolkit/fastq_quality_trimmer:

hengli@xxxx c_elegans$ time seqtk trimfq /dev/shm/1.fq > /dev/null

real    0m2.631s
user    0m0.746s
sys     0m1.885s

hengli@xxxx c_elegans$ time fastq_quality_trimmer -Q33 -l 30 -t 13 -i /dev/shm/1.fq -o /dev/null

real    0m21.065s
user    0m14.494s
sys     0m6.557s

If you have huge fastq files, seqtk is preferred. It is fast, sophisticated, easy to install and feature rich.

ADD COMMENT
0
Entering edit mode

Looks lightening fast, may the fastx people want to restructure their code.

ADD REPLY
0
Entering edit mode

This looks great, but I have a couple of questions. 1) What are the specs for the file used for benchmarking? I'd like to know about the file used so I can have an idea how other trimming methods compare. 2) When you say "phred quality trimming," do you mean seqtk will quality trim files with numerical Phred values or will it handle Sanger/Illumina encoded Fastq files? I assume you mean the latter, but wanted to make sure.

ADD REPLY
1
Entering edit mode

1) It is the first 1 million read1 from SRR065390. Nonetheless, for fixed-length trimming, which file to use does not matter. 2) The phred trimming algorithm is described in the phred (or phrap) documentation. The BWA trimming is derived from that somehow. The phred/bwa trimming algorithms are more robust and theoretically correct than trimming all the bases with quality below a threshold (e.g. the 90th base happen to have higher quality, but bases 80-100bp are generally very poor; we should trim off 20bp instead of 10).

ADD REPLY
0
Entering edit mode

seqtk is fantastic, but there is a bug with seqtrim when the trimmed-off length is longer than the reads. The output becomes huge with broken format, which seems from the @@@@@@@@\n string. Could not figure out the fix by myself yet.

ADD REPLY
0
Entering edit mode

Do you have any plans to add colorspace support?

ADD REPLY
0
Entering edit mode

Fixed-length trimming should work for color-space. Quality-based trimming does not as it may trim from the 5'-end.

ADD REPLY
12
Entering edit mode
12.6 years ago
brentp 24k

Something like this should work

awk 'NR % 2 == 0 { print substr($1, 11, 80 - 10) } NR % 2 == 1' input.fastq > output.trimmed.fastq

commented version

(NR % 2 == 0) { # this is the pattern, match 2nd and 4th lines (sequence and quality)
     print substr($1, 11, 80 - 10) # print the substring starting at 11 (1-based) and length 70
}
(NR % 2 == 1) # these are the name and "+" lines, just print them as-is. in awk,
              # print is implicit if there is no {} block

that said, you likely should use a trimmer that takes quality into account.

ADD COMMENT
0
Entering edit mode

Thanks brent this worked. If possible could you explain what you did, i have a liittle bit idea of awk but not much. Still learning awk.

ADD REPLY
0
Entering edit mode

@Varun, I added some comments.

ADD REPLY
0
Entering edit mode

Thanks brent . It was really helpful. Yes i have taken quality into account. I already ran fastqc on my fastq file and came to conclusion that first 10 bases and bases after 80 are not so good for my analysis. I hope that is ok. Thanks a lot

ADD REPLY
0
Entering edit mode

I also trim the fastq file use this kind of perl script!

ADD REPLY
0
Entering edit mode

Hi

Can you share your perl script. Regards

ADD REPLY
0
Entering edit mode

Hello Varun:

The perl script is much alike with the version of Brentp provided.

I think they will have the same function

Do you still needed?

Best Regards,

ADD REPLY
0
Entering edit mode

Hi Callsobing

If you could share it, it would be wonderful. Good to learn new things.

Thanks

Varun

ADD REPLY
6
Entering edit mode
12.6 years ago

You can use FASTX toolkit: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

It has a trimmer tool. Also you might want to look at their quality filter tool.

ADD COMMENT
0
Entering edit mode

Thanks Dk, just saw that, works smoothly.

ADD REPLY
0
Entering edit mode

I think you should try FASTQ Quality Trimmer insted of FASTQ/A Trimmer, because this script only remove the bad quality tails of the reads.... When I use this script I remove all the tail that have Q score under 10 and keeps only the reads that remains 50 pb o longer after trimming.

ADD REPLY
4
Entering edit mode
12.6 years ago

From FASTX toolkit :

fastx_trimmer -f 11 -l 80 -i file.fastq -o file_trimmed.fastq

If compressed;

gunzip -c file.fastq.gz | fastx_trimmer -f 11 -l 80 -g -o file_trimmed.fastq.gz

You just need to take care about the encoding of the quality score, if its sanger, then you need to add a '-Q33' parameter to the command. Unlike fastqc, it doesnt picks the encoding format automatically. There is another tools called PrinSeq, written in Perl, if you are eager to explore something new.

Cheers

ADD COMMENT
0
Entering edit mode

Thanks for above information. As I am using fastq_quality_trimmer with below command for single end(single file) by trimming 7 bp from end of sequence and impoving quality of it. So i need to know is it right or not?? fastq_quality_trimmer -t 7 -Q33 -i file1.fastq -o file1trim.fastq

Thanks

ADD REPLY
0
Entering edit mode

What is right or not, the command or the way?? Check the output of fastqc and then remove!!

ADD REPLY
0
Entering edit mode

Thanks a lot for replying. Actually command is right but it's showing peculiar result like difference in no of reads after using fastq _qualty _trimmer command. For raw fastq file reads were 16573994 and after it was 16356445. However I am not sure reads should remain same or it can vary???

ADD REPLY
0
Entering edit mode

You are using wrong parameter, read the help carefully.

usage: fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.13 by A. Gordon (gordon@cshl.edu)

  [-h]         = This helpful help screen.
  [-t N]       = Quality threshold - nucleotides with lower 
                 quality will be trimmed (from the end of the sequence).
  [-l N]       = Minimum length - sequences shorter than this (after trimming)
                 will be discarded. Default = 0 = no minimum length. 
  [-z]         = Compress output with GZIP.
  [-i INFILE]  = FASTQ input file. default is STDIN.
  [-o OUTFILE] = FASTQ output file. default is STDOUT.
  [-v]         = Verbose - report number of sequences.
                 If [-o] is specified,  report will be printed to STDOUT.
                 If [-o] is not specified (and output goes to STDOUT),
                 report will be printed to STDERR.
ADD REPLY
0
Entering edit mode

Thanks! As after correcting parameter and trying every possibility also its showing difference in no of reads in case for fastq_quality_trimmer but not in case for fastx_trimmer. So my question IS "Is it possible to have difference in no of reads in case for fastq quality trimmer, before and after trimming"?

ADD REPLY

Login before adding your answer.

Traffic: 2234 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6