Hi,
I am trying to use cd-hit-est to cluster merged contig files (containing contigs from 22 different metagenome samples) and remove any contigs that are 99% similar to others, so that I am left with a contig.fasta file containing no duplicates.
I am running cd-hit-est on 4 merged contig files:
1. reads filtered for human sequences using bbmap, then assembled with metaspades
2. reads filtered for human sequences using bbmap, then assembled with spades
3. reads filtered for human sequences using bowtie & samtools, then assembled with metaspades
4. reads filtered for human sequences using bowtie & samtools, then assembled with spades
This is the command I used:
cd-hit-est -M 100000 -i mergedcontigs.fasta -o merged_cd.fasta -c 0.99 -n 8 -A 0.90
When I run this command on files 1 & 2, everything seems to work fine. But when I run it on files 3 & 4, the total seq value and the number of sequences being compared are different, with the comparison capped at 40000, whereas these numbers were the same for files 1 & 2.
Output
----------------------------------------------------------------
total seq: 502084
longest and shortest : 898003 and 11
Total letters: 418459779
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 485M
Buffer : 1 X 358M = 358M
Table : 1 X 9M = 9M
Miscellaneous : 6M
Total : 860M
Table limit with the given memory limit:
Max number of representatives: 40000
Max number of word counting entries: 12392460836
comparing sequences from 0 to 40000
If anyone has any idea why this might be, I would be grateful!
Thanks in advance!
Have you considered the -s parameter (length difference cutoff)?
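For illustration, here is the original command with -s added. This is a sketch, not a tested recommendation: -s sets the length difference cutoff, so with -s 0.9 a sequence can only join a cluster if it is at least 90% of the length of that cluster's representative; the 0.9 value here is an example, and the input/output filenames are just those from the question above.

```shell
# Hypothetical: original command plus -s 0.9 (length difference cutoff).
# Shorter sequences must be >= 90% of the representative's length to cluster.
cd-hit-est -M 100000 -i mergedcontigs.fasta -o merged_cd.fasta \
    -c 0.99 -n 8 -A 0.90 -s 0.9
```

Whether this helps depends on whether the behaviour you are seeing is driven by the very wide length range in files 3 & 4 (shortest 11 bp, longest 898003 bp in your log).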