Transcriptome Data Analysis Using CD-HIT
13.1 years ago
Kiriya ▴ 100

We have sequenced the transcriptome of a species that does not have a sequenced genome, using 454, and our initial goal is to find a set of ESTs that represent genes. The 454 reads were assembled using Newbler 2.5, and the initial assembly gave ~26,000 isotigs and 18,000 contigs. After talking to several people, I used the CD-HIT program to combine the isotigs, the contigs and the singletons that were not assembled. After combining these sequences, I got ~4,000 isotigs, ~17,000 contigs and ~30,000 singletons that were not assembled. Is this the correct way to do this? I couldn't find any publication that mentions this method.

transcriptome gene • 5.2k views

Which identity thresholds did you use?


The algorithms behind CD-HIT are described in four papers published in Bioinformatics:

  1. Clustering of highly homologous sequences to reduce the size of large protein databases. Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2001) 17:282-283.
  2. Tolerating some redundancy significantly speeds up clustering of large protein databases. Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics (2002) 18:77-82.
  3. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Weizhong Li & Adam Godzik. Bioinformatics (2006) 22:1658-1659.
  4. CD-HIT: accelerated for clustering the next generation sequencing data. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics (2012) 28:3150-3152, doi: 10.1093/bioinformatics/bts565.

Please check these papers for details about CD-HIT.

13.1 years ago

Why do you want to cluster the reads further? Newbler already groups isotigs into isogroups. Think of an isogroup as a gene and its isotigs as the alternative splice forms. Newbler builds the isotigs from contigs, so you don't need to cluster the isotigs with the contigs.

If you feel you can get more data out of the unassembled reads, you can try running CD-HIT followed by CAP3 on the unassembled reads together with the isotigs.
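
The CD-HIT + CAP3 suggestion could be sketched roughly as follows. This is a minimal sketch that only builds the command lines rather than running them. It assumes `cd-hit-est` and `cap3` are installed; the file names are hypothetical, the 0.95 identity threshold is an illustrative choice not taken from this thread, and running the clustering before the assembly is only one reading of the advice. The word-size table follows the recommendations in the CD-HIT user guide as far as I recall them, so check it against your installed version.

```python
def word_size_for(identity):
    """Pick a cd-hit-est word size (-n) for an identity threshold (-c).

    Assumed ranges, taken from the CD-HIT user guide's recommendations
    for cd-hit-est; verify them for your version.
    """
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    if identity >= 0.75:
        return 4
    raise ValueError("identity threshold too low for cd-hit-est")


def cd_hit_est_command(fasta_in, fasta_out, identity=0.95):
    """Build the argument list for clustering nucleotide sequences."""
    return [
        "cd-hit-est",
        "-i", fasta_in,    # input FASTA (isotigs plus unassembled reads)
        "-o", fasta_out,   # representative sequences; clusters go to <fasta_out>.clstr
        "-c", str(identity),
        "-n", str(word_size_for(identity)),
    ]


def cap3_command(fasta_in):
    """Build the argument list for assembling the non-redundant sequences."""
    return ["cap3", fasta_in]


# Hypothetical file names, for illustration only.
print(" ".join(cd_hit_est_command("isotigs_plus_singletons.fasta", "nr.fasta")))
print(" ".join(cap3_command("nr.fasta")))
```

To actually run the steps, each list can be passed to `subprocess.run(..., check=True)`.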


Thanks, DK, for your answer. For the annotation, should I keep only the longest isotig from each isogroup?


That's trickier. The longest isotig isn't always the most inclusive transcript. I would just report all the possible isoforms.
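
As a small illustration of reporting every isoform rather than only the longest one, the sketch below groups isotigs by isogroup and sorts each group by length, so the longest is easy to spot without the others being discarded. The record layout (isotig name, isogroup name, length) and all names in it are hypothetical; how you extract them from your Newbler output is up to you.

```python
from collections import defaultdict


def isoforms_by_isogroup(records):
    """Group (isotig, isogroup, length) records by isogroup.

    Every isotig is kept; each group is sorted longest first.
    """
    groups = defaultdict(list)
    for isotig, isogroup, length in records:
        groups[isogroup].append((isotig, length))
    for members in groups.values():
        members.sort(key=lambda member: member[1], reverse=True)
    return dict(groups)


# Hypothetical records, for illustration only.
records = [
    ("isotig00001", "isogroup00001", 1800),
    ("isotig00002", "isogroup00001", 2100),
    ("isotig00003", "isogroup00002", 950),
]

for isogroup, members in isoforms_by_isogroup(records).items():
    print(isogroup, members)
```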

13.1 years ago
lexnederbragt ★ 1.3k

454 recommended the same method to me. Some isotigs from the same isogroup are very similar, differing by just a few bases because of heterozygosity in the sequenced sample(s). CD-HIT will cluster these near-identical transcripts (isotigs). After clustering, you should check again how many transcripts (isotigs) remain for each isogroup. In principle, isotigs should cluster only with isotigs from the same isogroup. Isotigs from the same isogroup that remain after clustering could very well represent real isoforms.
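
One way to check that isotigs only cluster within their own isogroup is to read the `.clstr` file that CD-HIT writes next to its output and flag any cluster whose members come from more than one isogroup. The sketch below assumes the usual `.clstr` layout (a `>Cluster N` header followed by one line per member, with the sequence name between `>` and `...`) and a hypothetical isotig-to-isogroup mapping that you would build from your own Newbler output.

```python
import re
from collections import defaultdict

# A member line looks like: "0<TAB>2799nt, >isotig00001... *"
MEMBER_NAME = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster label: [member names]} from .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            continue
        match = MEMBER_NAME.search(line)
        if match and current is not None:
            clusters[current].append(match.group(1))
    return dict(clusters)


def mixed_clusters(clusters, isogroup_of):
    """Return clusters whose members belong to more than one isogroup."""
    mixed = {}
    for label, members in clusters.items():
        groups = {isogroup_of[m] for m in members if m in isogroup_of}
        if len(groups) > 1:
            mixed[label] = sorted(groups)
    return mixed


# Hypothetical .clstr content and mapping, for illustration only.
example = """>Cluster 0
0\t2799nt, >isotig00001... *
1\t2790nt, >isotig00002... at +/99.50%
>Cluster 1
0\t1500nt, >isotig00003... *
1\t1400nt, >isotig00010... at +/98.00%
""".splitlines()

isogroup_of = {
    "isotig00001": "isogroup00001",
    "isotig00002": "isogroup00001",
    "isotig00003": "isogroup00002",
    "isotig00010": "isogroup00005",
}

print(mixed_clusters(parse_clstr(example), isogroup_of))
```

Clusters that come back from `mixed_clusters` are the ones worth inspecting by hand, since they contradict the expectation that isotigs cluster only within an isogroup.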

7.7 years ago
njtulsani ▴ 60

Applications can be found in the papers that cite CD-HIT:

  1. Li et al (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
  2. Li et al (2001) Clustering of highly homologous sequences to reduce the size of large protein databases.
  3. Li et al (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases.
  4. Huang et al (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences.
  5. Niu et al (2009) Artificial and natural duplicates in pyrosequencing reads of metagenomic data.
