I'm analyzing an siRNA library, and after receiving the data from the sequencer I applied Cutadapt to trim the adapter sequences and filter the reads by quality.
Now the next step before mapping is to collapse the duplicated reads. To do so, I tried fastx_collapser from the FASTX-Toolkit, but it doesn't work, apparently because the file I want to collapse is too large (6.9GB).
Does anyone know of another program able to collapse big files?
updated 13.3 years ago by lh3 (33k) • written 13.3 years ago by Cdiez (150)
I would run your analysis both ways, collapsed and uncollapsed. The chances of seeing two identical short RNAs that are biologically real are much higher than the chances of seeing two identical reads in a coding RNA-Seq run.
I used to do similar things for a >100GB file; 6.9GB should be done in minutes, if I am right. In fact, for a file as small as 6.9GB, the fastest way is to load all the data into memory and do a quicksort. That should take no more than 4GB of memory, and if you do not care about reads containing "N", 2GB is enough.
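The command itself is not shown in the thread, but judging by the fragments quoted in the follow-up question ('NR%4==2' and -S1900G), it was presumably an awk/sort/uniq pipeline of roughly this shape (file names are placeholders, and the -S buffer size should match your machine's RAM):

```shell
# NR%4==2 keeps only the 2nd line of every 4-line FASTQ record, i.e. the sequence.
# sort -S sets GNU sort's in-memory buffer; uniq -c counts identical sequences.
awk 'NR%4==2' reads.fq | sort -S 4G | uniq -c > output.txt
```

Each line of output.txt is then a left-padded count followed by one unique sequence.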
Sorry but I'm not familiar with UNIX (I'm just starting to analyze this kind of data) and I don't get what is the meaning of 'NR%4==2' and -S1900G in the command line. Although probably it's a silly question I'd be very grateful if you could add some comments about it. Thanks in advance!
At this point output.txt contains each unique sequence and its count of occurrences; you'll still need some kind of lookup to pull out the full read with the header and quality info.
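If you do need the full records back, one way is a single pass over the FASTQ that keeps the first record seen for each unique sequence (a sketch; the function name and file layout assumptions are mine, and it assumes plain 4-line FASTQ records):

```python
def first_record_per_sequence(fastq_path):
    """Return {sequence: (header, sequence, plus, quality)} keeping the
    first full record observed for each unique sequence."""
    seen = {}
    with open(fastq_path) as fq:
        while True:
            header = fq.readline().rstrip("\n")
            if not header:          # end of file
                break
            seq = fq.readline().rstrip("\n")
            plus = fq.readline().rstrip("\n")
            qual = fq.readline().rstrip("\n")
            # keep only the first record observed for each sequence
            seen.setdefault(seq, (header, seq, plus, qual))
    return seen
```

You can then join these records against the counts in output.txt by sequence.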
My guess is that when we choose to collapse reads, we disregard quality and read names; otherwise we would not want to do this in the first place. That said, I don't do ChIP/miRNA-seq myself, so I could be wrong.
1. Iterate over the reads and put each one in a Bloom filter.
2. If a subsequent read already appears in the filter, put it in a Python set.
3. That set holds the putative duplicates (a Bloom filter can give false positives).
4. Iterate over the reads again and check against the set.
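The scheme above can be sketched like this (a toy Bloom filter written from scratch, not the package the answer refers to; the sizes, hash construction, and helper names are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: num_hashes hash positions over a fixed bit array."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive independent positions by salting the item with an index
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def putative_duplicates(seqs):
    """First pass: collect sequences the filter has (probably) seen
    before into an exact Python set."""
    bf = BloomFilter()
    candidates = set()
    for s in seqs:
        if s in bf:
            candidates.add(s)   # may include rare false positives
        else:
            bf.add(s)
    return candidates
```

A second pass over the reads, checking membership in that set, then gives exact duplicate counts while keeping memory far below a full hash of all reads.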
The python package I used for that is here.
You can get a tarball of it here. After you untar it, you can run

$ sudo python setup.py install

and it will put bloomfilter on your Python path. (You may need to have Cython installed.)
That will allow you to use a lot less memory than storing the reads in a hash.
You could also have a look at khmer, which has a few scripts that might get you started.
Thanks a lot for your answer! I've tried to run your script, but it seems I need some "extra" Python packages (bloomfaster) installed to run it. Could you please indicate briefly which packages these are and how I can install them to make the script work?
Thanks in advance!