Hello everyone,
I am working on a few RNA-seq libraries sequenced with Illumina Genome Analyzer technology, each consisting of millions of reads (approx. 2-10 million reads, 20-100 nt in length). I want to group sequences that are 97% identical into one cluster, to reduce the redundancy of my data library before it is used for further analysis.
For this, I tried both CD-HIT-EST and UCLUST, clustering tools that can meet my criteria: 97% identity and a minimum alignment coverage of >= 40 for both the longer and the shorter sequence.
With CD-HIT-EST, I couldn't get the full ID description from the input file into the output "cluster file", although there is a "-d" option for the description length. The ID is cut off at the first space (tab-delimited), but I need the entire ID description up to the end of the line. As I am not a good programmer, I couldn't change the code (written in C++).
For example, sample input:
HWI-ST365:262:C0RY7ACXX:6:2312:6978:74690 1:N:0:GTGAAA size|1
CCAACCAATGAACAGGGCTTTGGCGACGACGAACTCACTCCTCTCTGTTGACGAT
HWI-ST365:262:C0RY7ACXX:6:1305:5522:5869 1:N:0:GTGAAA size|5
TGAAATGCTGCGCGGTAGAGGAGCGTTCTGTAAGTCGCTGAAGCTGAGTCGCGAGGCTTGGTGGAGACATCAGAAGTGCGAATGCTGACATGAGCAACGA
Sample output:
Cluster 0
0 100nt, >HWI-ST365:262:C0RY7ACXX:6:1305:5522:5869... *
Cluster 1
0 100nt, >HWI-ST365:262:C0RY7ACXX:6:1208:7633:77605... *
To get around this problem I tried UCLUST, which does print the entire ID description, but unlike what is described in the published papers it runs very slowly on data of this size. Is there an option to set the number of threads, or is there an alternative solution?
It would be very helpful if anyone could help me solve this problem.
Thank you in advance.
Couldn't you circumvent this issue by replacing spaces in the header with something (e.g. _)?
Hi Manu,
That was really a good and simple idea, and I am very thankful to you.
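For anyone who wants to try the header-rewriting idea, here is a minimal Python sketch. It assumes a FASTA file whose header lines start with ">" and replaces every run of spaces or tabs in a header with an underscore, leaving sequence lines untouched. The function names and the `placeholder` parameter are made up for illustration. As far as I know, running CD-HIT-EST with "-d 0" keeps the header up to the first space, so a header without spaces should come through whole, but please check that against your version.

```python
import re


def sanitize_header(header, placeholder="_"):
    """Replace each run of whitespace in a header with `placeholder` (hypothetical helper)."""
    return re.sub(r"\s+", placeholder, header.strip())


def sanitize_fasta(lines, placeholder="_"):
    """Yield FASTA lines with whitespace removed from header lines only."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Header line: keep the leading '>' and rewrite the rest.
            yield ">" + sanitize_header(line[1:], placeholder)
        else:
            # Sequence line: pass through unchanged.
            yield line


# Small demo using one record from the question (with the '>' FASTA marker).
sample = [
    ">HWI-ST365:262:C0RY7ACXX:6:2312:6978:74690 1:N:0:GTGAAA size|1",
    "CCAACCAATGAACAGGGCTTTGGCGACGACGAACTCACTCCTCTCTGTTGACGAT",
]
cleaned = list(sanitize_fasta(sample))
print(cleaned[0])
# prints: >HWI-ST365:262:C0RY7ACXX:6:2312:6978:74690_1:N:0:GTGAAA_size|1
```

For a real file you could pass an open file handle to `sanitize_fasta` and write the yielded lines to a new file. If the original headers already contain underscores, pick a different placeholder so the replacement can be reversed later.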
If reducing the redundancy of the data library is your goal, I would also suggest tools like Prinseq.