Question

How to find the total number of reads using CD-HIT

2.4 years ago

khq5801 ▴ 10

I have created clusters using CD-HIT for miRNA NGS data. The miRNAs are 16-40 nt long, and I would like to find the total number of reads and the number of distinct reads for each miRNA length. Kindly provide your valuable suggestions or any command that can help me. Thanks.

perl NGS miRNA CD-HIT • 813 views

ADD COMMENT • link updated 2.4 years ago by Mensur Dlakic ★ 28k • written 2.4 years ago by khq5801 ▴ 10

score 0 · Answer 1 · 2022-07-13

2.4 years ago

Mensur Dlakic ★ 28k

Presumably this is related to your earlier inquiry about cd-hit-dup. If so, the program prints the total number of sequences at the start and the total number of clusters at the end. For example, in your previous screenshot you had 200000 sequences and 199988 clusters, meaning you had 12 duplicates.

As for the clusters themselves, they are written to a file ending in .clstr. Assuming this was your command:

cd-hit-dup -i sequences.fas -o sequences_nodup.fas

The clusters will be in sequences_nodup.fas.clstr. Plain cd-hit also writes a .clstr file, named after its output file in the same way.
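To get counts out of the .clstr file rather than reading them off the screen, a short parser is enough. The following is a minimal sketch, not part of CD-HIT itself. It assumes the usual cluster-file layout: a ">Cluster N" header followed by member lines such as "0<TAB>22nt, >read_1... *", where "*" marks the representative sequence. The function name parse_clstr and the example read names are made up for illustration.

```python
import re

# Assumed member-line layout: "<index>\t<length>nt, ><name>... <'*' or 'at ...'>"
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa),\s+>(.+?)\.\.\.\s*(.*)$")


def parse_clstr(lines):
    """Return {cluster_header: [(name, length, is_representative), ...]}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
        elif line and current is not None:
            m = MEMBER_RE.match(line)
            if m is None:
                continue  # skip lines that do not follow the assumed layout
            length, name, tail = m.groups()
            clusters[current].append((name, int(length), tail.strip() == "*"))
    return clusters


# Tiny hand-made example in the assumed format.
example = [
    ">Cluster 0",
    "0\t22nt, >read_1... *",
    "1\t22nt, >read_2... at +/100.00%",
    ">Cluster 1",
    "0\t16nt, >read_3... *",
]

clusters = parse_clstr(example)
total_reads = sum(len(members) for members in clusters.values())
print("clusters:", len(clusters), "sequences:", total_reads)
```

For a real run, pass an open file handle instead of the list, for example parse_clstr(open("sequences_nodup.fas.clstr")). The difference between the number of sequences and the number of clusters is the number of redundant sequences that were collapsed.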

No, I was not able to use CD-HIT-DUP successfully. However, I followed the suggestion in "cd hit for removing sequence redundancy" to generate non-redundant data. I now have two files: one with the sequences and one with the clusters (.clstr). I would like to break down the sequences in the cluster file by miRNA length. For instance, miRNAs of length 16 have 5486 total reads and 4586 distinct reads. I would like to generate these numbers for every length up to 40.

ADD REPLY • link 2.4 years ago by khq5801 ▴ 10
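One possible way to produce such a per-length table from the .clstr file is sketched below. It rests on assumptions that may not match the actual data: member lines look like "0<TAB>22nt, >read_1... *", every member line counts as one read of the stated length, and every representative line (ending in "*") counts as one distinct read of that length. If the input reads were already collapsed, or if shorter reads were clustered under longer representatives, the totals would need to be computed differently. The function name tally_by_length is hypothetical.

```python
import re
from collections import Counter

# Assumed start of a member line: "<index>\t<length>nt,"
LENGTH_RE = re.compile(r"^\d+\s+(\d+)nt,")


def tally_by_length(lines, min_len=16, max_len=40):
    """Return {length: (total_reads, distinct_reads)} for min_len..max_len."""
    total = Counter()
    distinct = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">Cluster"):
            continue
        m = LENGTH_RE.match(line)
        if m is None:
            continue  # line does not follow the assumed layout
        length = int(m.group(1))
        total[length] += 1            # every cluster member is one read
        if line.endswith("*"):
            distinct[length] += 1     # representative = one distinct read
    return {n: (total[n], distinct[n]) for n in range(min_len, max_len + 1)}


# Tiny hand-made example in the assumed format.
example = [
    ">Cluster 0",
    "0\t22nt, >read_1... *",
    "1\t22nt, >read_2... at +/100.00%",
    ">Cluster 1",
    "0\t16nt, >read_3... *",
]

table = tally_by_length(example)
for length, (n_total, n_distinct) in table.items():
    if n_total:
        print(length, n_total, n_distinct)
```

Passing an open handle on the real .clstr file in place of the example list would give one row per length from 16 to 40; lengths with no reads come back as (0, 0).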

The less information you provide initially, the less useful suggestions you get.

ADD REPLY • link 2.4 years ago by Mensur Dlakic ★ 28k