Hello everyone,
I am learning RNA-seq analysis. As a first step, I am using PRINSEQ lite to preprocess my data.
I used this command:
perl prinseq-lite.pl -fastq read_1.fastq -fastq2 read_2.fastq -out_format 5 -min_len 50 -min_qual_mean 25
I got three kinds of output files in the same folder for each data file: _prinseq_good_singletons_, _prinseq_good_, and _prinseq_bad_.
Also, the _prinseq_good_ file is larger than the input data file. Is that OK?
Could you please suggest which file I should use for downstream analysis?
File sizes are not a good measure of anything by themselves. Does prinseq print a log file or a stats file of some sort? That would be useful in understanding what happens in the run. Also, read the manual - that should describe each output file.
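If there is no stats file to hand, a rough way to compare the files on something other than size is to count records and characters directly. Below is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record; the function name `fastq_stats` is made up for illustration and is not part of prinseq. It also tallies how many characters sit on the header lines and the `+` lines, since those can change the file size without any change in the reads themselves.

```python
import sys


def fastq_stats(path):
    """Summarise an uncompressed FASTQ file that uses 4 lines per record.

    Returns a dict with the number of reads, the total number of bases,
    and the number of characters on header ("@...") and "+" lines.
    """
    stats = {"reads": 0, "bases": 0, "header_chars": 0, "plus_chars": 0}
    with open(path) as handle:
        for i, line in enumerate(handle):
            text = line.rstrip("\n")
            field = i % 4  # 0 = header, 1 = sequence, 2 = "+" line, 3 = quality
            if field == 0:
                stats["reads"] += 1
                stats["header_chars"] += len(text)
            elif field == 1:
                stats["bases"] += len(text)
            elif field == 2:
                stats["plus_chars"] += len(text)
    return stats


if __name__ == "__main__":
    # Pass any FASTQ paths on the command line, e.g. an input and an output file.
    for fastq_path in sys.argv[1:]:
        print(fastq_path, fastq_stats(fastq_path))
```

Running it on an input file and the matching output file should show whether the output really has fewer reads and bases, and whether any growth comes from the header or `+` lines rather than from the sequences.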
Hello there Adarsh,
I am also using prinseq-lite to process my RNA data prior to analysis, and have also noticed that many of my output fastq files are larger than the input fastq files. Did you happen to obtain a satisfactory answer as to why this might be?
Tom
Are these files gzipped? If so, the compression ratio depends on how the content is sorted.
Hi there, thank you for your reply.
They are not zipped, no. prinseq-lite doesn't accept compressed fastq files as input. Having gone through my files, it seems an 18 GB input fq file will usually result in a 21 GB output fq file, and the larger the input fq file, the larger the increase in size.
I'm wondering if maybe prinseq-lite doesn't actually remove filtered reads, but rather masks them in some way? That's the only explanation I could think of for why this would happen.
That might actually be the case. Take a look at the first 100 or so headers in the input and output FQs:
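One way to pull out those headers is sketched below. This assumes uncompressed FASTQ with four lines per record; the helper name `first_headers` is hypothetical and only for illustration.

```python
import sys
from itertools import islice


def first_headers(path, n=100):
    """Return the header lines of the first n records of a 4-line FASTQ file."""
    with open(path) as handle:
        # Every 4th line, starting at line 0, is a record header.
        return [line.rstrip("\n") for line in islice(handle, 0, 4 * n, 4)]


if __name__ == "__main__":
    # Pass the input and output FASTQ paths on the command line.
    for fastq_path in sys.argv[1:]:
        print("==", fastq_path)
        for header in first_headers(fastq_path):
            print(header)
```

Comparing the two listings side by side should make it clear whether the headers were rewritten or lengthened in the output.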