Question

How To Cluster Nucleotide Sequences

11.7 years ago

bambus0725 ▴ 50

Hello everyone,

I am working with a few RNA-seq libraries sequenced on the Illumina Genome Analyzer, each containing millions of reads (approx. 2-10 million reads, 20-100 nt long). I want to group sequences that are 97% identical into one cluster, to reduce the redundancy of my data library, which will be used for further analysis.

For this, I tried both CD-HIT-EST and UCLUST, clustering tools that can meet my criteria: 97% identity and a minimum alignment coverage of >= 40 for both the longer and the shorter sequence.

With CD-HIT-EST, I couldn't get the full description of the IDs in the output cluster file as given in the input file, although there is an option "-d" for the description length. The description ends at the first space (tab-delimited), but I need the entire ID description up to the end of the line. As I am not a good programmer, I couldn't make changes to the code (written in C++).

For example, sample input

HWI-ST365:262:C0RY7ACXX:6:2312:6978:74690 1:N:0:GTGAAA size|1
CCAACCAATGAACAGGGCTTTGGCGACGACGAACTCACTCCTCTCTGTTGACGAT

HWI-ST365:262:C0RY7ACXX:6:1305:5522:5869 1:N:0:GTGAAA size|5
TGAAATGCTGCGCGGTAGAGGAGCGTTCTGTAAGTCGCTGAAGCTGAGTCGCGAGGCTTGGTGGAGACATCAGAAGTGCGAATGCTGACATGAGCAACGA

sample output

Cluster 0
0 100nt, >HWI-ST365:262:C0RY7ACXX:6:1305:5522:5869... *

Cluster 1
0 100nt, >HWI-ST365:262:C0RY7ACXX:6:1208:7633:77605... *

To overcome this problem I found a solution with UCLUST, which prints the entire ID description. However, unlike what is described in the published papers, it runs very slowly on large data sets. Is there an option to set the number of threads, or an alternative solution?

It would be very helpful if anyone could help me solve this problem.

Thank you in advance.

clustering • 3.7k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 11.7 years ago by bambus0725 ▴ 50

Couldn't you circumvent this issue by replacing spaces in the header with something (e.g. _)?

sed 's/ /_/g' seq.fa > seq2.fa
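
The sed one-liner above replaces spaces on every line of the file. If you prefer to touch only the header lines (and to handle tabs as well), a minimal Python sketch could look like the following. It assumes the headers start with ">", as in standard FASTA; the function name `underscore_headers` and the demo record are only illustrative.

```python
def underscore_headers(lines):
    """Yield FASTA lines, replacing whitespace runs in header lines with '_'.

    Assumption: header lines start with '>' (standard FASTA).
    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            # split() breaks on any run of spaces or tabs, so
            # "a b\tc" becomes "a_b_c".
            yield "_".join(line.split()) + "\n"
        else:
            yield line


# Small demo on an in-memory record (header taken from the question,
# with an assumed leading '>').
demo = [
    ">HWI-ST365:262:C0RY7ACXX:6:2312:6978:74690 1:N:0:GTGAAA size|1\n",
    "CCAACCAATGAACAGGGCTTTGGCGACGACGAACTCACTCCTCTCTGTTGACGAT\n",
]
print("".join(underscore_headers(demo)), end="")
```

To run it on a real file, you would pass an open file handle instead of the `demo` list and write the yielded lines to a new file.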

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 11.7 years ago by Manu Prestat 4.1k

Hi Manu,

that was a really good and simple idea, and I am very thankful to you.

ADD REPLY • link 11.7 years ago by bambus0725 ▴ 50

If reducing the redundancy of the data library is your goal, I would also suggest tools like PRINSEQ.
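
To illustrate the simplest kind of redundancy reduction, here is a rough Python sketch that collapses exact duplicate sequences and counts them. This is not how PRINSEQ is implemented, and it does not do 97%-identity clustering. It only shows the idea of dereplication. The function names are hypothetical, and any existing "size|N" annotations in the headers are not summed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_exact_duplicates(records):
    """Map each distinct sequence to (first header seen, number of copies)."""
    seen = {}
    for header, seq in records:
        if seq in seen:
            first_header, count = seen[seq]
            seen[seq] = (first_header, count + 1)
        else:
            seen[seq] = (header, 1)
    return seen


# Tiny made-up example: r1 and r2 carry the same sequence.
example = [">r1 lane6\n", "ACGT\n", ">r2\n", "ACGT\n", ">r3\n", "TTTT\n"]
for seq, (header, count) in collapse_exact_duplicates(read_fasta(example)).items():
    print(f">{header} copies={count}\n{seq}")
```

Because the sequences are kept in a dictionary, memory use grows with the number of distinct reads, so for libraries with millions of reads a dedicated tool is likely the better choice.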

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.2 years ago by Prakki Rama ★ 2.7k