Prinseq lite data preprocessing
3.9 years ago
Adarsh Kuamr ▴ 60

Hello everyone..

I am learning RNA-seq analysis. As a first step, I am using prinseq-lite to preprocess the data.

I used this command:

perl prinseq-lite.pl -fastq read_1.fastq -fastq2 read_2.fastq -out_format 5 -min_len 50 -min_qual_mean 25

For each input file I got three output files in the same folder: _prinseq_good_singletons_, _prinseq_good_, and _prinseq_bad_.

Also, the _prinseq_good_ file is larger than the input file. Is that OK?

Which of these files should I use for downstream analysis?

Prinseq-lite RNA-seq • 2.6k views

File sizes are not a good measure of anything by themselves. Does prinseq print a log file or a stats file of some sort? That would be useful in understanding what happens in the run. Also, read the manual - that should describe each output file.


Hello there Adarsh,

I am also using prinseq-lite to process my RNA data prior to analysis, and have also noticed that many of my output fastq files are larger than the input fastq files. Did you happen to obtain a satisfactory answer as to why this might be?

Tom


Are these files gzipped? If so, the compression ratio depends on the content and on the order of the reads, so compressed sizes are not directly comparable.


Hi there, thank you for your reply.

No, they are not zipped. prinseq-lite doesn't accept compressed fastq files as input. Having gone through my files, it seems an 18GB input fq file will usually result in a 21GB output fq file, and the larger the input fq file, the larger the increase in size.

I am wondering whether prinseq-lite doesn't actually remove filtered reads, but rather masks them in some way. That's the only explanation I could think of for why this would happen.


That might actually be the case. Take a look at the first 100 or so headers in the input and output FQs:

sed -n '1~4p;402q' file.fq #command not tested but should work
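
If the headers alone don't explain it, a per-line-type byte count may help narrow down where the extra gigabytes come from. Below is a minimal Python sketch, assuming plain 4-line FASTQ records (no wrapped sequences); the function name `fastq_byte_breakdown` and the `max_records` parameter are made up for illustration and are not part of prinseq-lite.

```python
from collections import Counter
from itertools import islice

# The four line types of a FASTQ record, in order.
FIELDS = ("header", "sequence", "separator", "quality")


def fastq_byte_breakdown(path, max_records=None):
    """Return (record_count, Counter of bytes per FASTQ line type).

    Assumption: every record is exactly four lines long.
    max_records limits the scan to the first N records (None = whole file).
    """
    totals = Counter()
    records = 0
    stop = None if max_records is None else 4 * max_records
    with open(path, "rb") as handle:
        for index, line in enumerate(islice(handle, stop)):
            totals[FIELDS[index % 4]] += len(line)
            if index % 4 == 3:
                records += 1
    return records, totals


if __name__ == "__main__":
    import sys

    # Usage: python breakdown.py input.fq output.fq
    for fastq_path in sys.argv[1:]:
        count, sizes = fastq_byte_breakdown(fastq_path, max_records=100000)
        print(fastq_path, count, dict(sizes))
```

Run it on the input and the output file and compare. If the record count drops but the "header" or "separator" bytes per record go up, the size increase comes from longer header or "+" lines rather than from extra reads. I have not checked which of these, if either, applies to prinseq-lite.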
3.9 years ago
GenoMax 147k

I am not a prinseq user, but based on the names, _prinseq_good_ and _prinseq_good_singletons_ would be the files you want. The _prinseq_good_ files should contain the pairs where both reads (R1 and R2) survived the filtering. Be cautious about using the singleton file: most aligners will not allow you to mix paired and singleton reads in the same alignment.

File sizes are never a good metric for anything (unless you are just making sure the file produced is not empty). Since your files don't appear to be compressed, hopefully the size difference is negligible. With compressed files, sizes can shift simply because trimming and filtering change how compressible the data is.
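
On the paired versus singleton point: before handing the two _prinseq_good_ files to an aligner, it can be worth confirming that they list the same reads in the same order. Here is a rough Python sketch of such a check. It assumes 4-line FASTQ records and read names that differ only by a trailing /1 or /2 or by a description after the first whitespace; the function names are hypothetical and your headers may need different normalization.

```python
from itertools import zip_longest


def normalize_id(header):
    """Strip '@', any description after whitespace, and a trailing /1 or /2."""
    fields = header.strip().lstrip("@").split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_ids(path):
    """Yield the normalized read ID of every record in a 4-line FASTQ file."""
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 0:
                yield normalize_id(line)


def first_pair_mismatch(path_1, path_2):
    """Return (record_number, id_1, id_2) for the first mismatch, or None.

    A missing record in the shorter file shows up as None in the tuple.
    """
    pairs = zip_longest(read_ids(path_1), read_ids(path_2))
    for number, (id_1, id_2) in enumerate(pairs, start=1):
        if id_1 != id_2:
            return number, id_1, id_2
    return None
```

A result of None means the two files are in sync under the naming assumption above. Any other result points at the first record where the pairing breaks.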


Thank you for your response

3.6 years ago
nzulapa • 0

_prinseq_good_singletons_: contains the reads that passed the filters but whose mate was removed

_prinseq_good_: contains the pairs that remain after removing duplicates, low-complexity reads,...
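
To make the split between the three outputs concrete for the command in the question (-min_len 50 -min_qual_mean 25), here is a conceptual Python sketch of how a read pair could be routed. This is not prinseq-lite's actual code: the function names are invented, only the two filters from the question are modeled, and Phred+33 quality encoding is assumed.

```python
def mean_quality(quality_line, offset=33):
    """Mean Phred score of a FASTQ quality string (assumes Phred+33)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def passes(seq, qual, min_len=50, min_qual_mean=25):
    """True if the read is long enough and its mean quality is high enough."""
    return len(seq) >= min_len and mean_quality(qual) >= min_qual_mean


def classify_pair(read_1, read_2, **thresholds):
    """Return the output each mate would go to, as (dest_1, dest_2).

    read_1 and read_2 are (sequence, quality) tuples.
    """
    ok_1 = passes(*read_1, **thresholds)
    ok_2 = passes(*read_2, **thresholds)
    if ok_1 and ok_2:
        return "good", "good"
    if ok_1:
        return "good_singletons", "bad"
    if ok_2:
        return "bad", "good_singletons"
    return "bad", "bad"
```

Under this model a read only lands in _prinseq_good_ when its mate also passes, which is why those two files can be used together as a pair.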
