how to use CD_HIT to remove the redundant sequence from trinity output file
2
0
Entering edit mode
7.7 years ago

Dear all, I am trying to use CD-hit to remove the duplicates from the file that is the output from trinity (RNA seq assembly).

I used the following parameters:

cd-hit-est -i in.fasta -o out_cdhit90.fasta -c 0.90 -n 9 -d 0 -M 0 -T 0

But the output file still contains lots of small or fragmented sequence plus the best one. How can I remove those small or fragmented duplicates by changing the parameters?

thanks

ZQ

rna-seq Assembly • 5.7k views
ADD COMMENT
0
Entering edit mode

Are you trying to get the longest isoform per gene cluster?

ADD REPLY
0
Entering edit mode

yes, I just want to have one longest isoform. How can I realize that? ZQ

ADD REPLY
0
Entering edit mode
7.7 years ago
Sej Modha 5.3k

You can change -c and -t parameters on cd-hit that control the identity threshold and redundancy tolerance.

More info: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide

ADD COMMENT
0
Entering edit mode

you mean -c and -t right?

ADD REPLY
0
Entering edit mode
7.7 years ago
st.ph.n ★ 2.7k

Trinity FASTA entry example:

>TRINITY_DN1000|c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
  

AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA

Description of Trinity headers: 'TRINITY_DN1000|c115_g5' corresponds to

'gene id: TRINITY_DN1000|c115_g5' encoding 'isoform id: TRINITY_DN1000|c115_g5_i1'.

I prefer single line FASTA, borrowed from answer on this post:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < Trinity.fasta > Trinity_single.fasta

Here's a python script to get the longest isoform per gene:

#!/usr/bin/env python

import sys
from collections import defaultdict

transcripts = defaultdict(list)
with open(sys.argv[1], 'r') as f:
     for line in f:
        if line.startswith(">"):
            transcripts['_'.join(line.strip().split(' ')[0].split('|')[1].split('_')[:2])].append(next(f).strip())  # Groups isoforms by gene id

with open(sys.argv[1].split(".fasta")[1], 'w') as out:
    for t in transcripts:
        print >> out, '>' + t, max(transcripts[t], key=len)

Save as get_longest_iso.py, run as python get_longest_iso.py Trinity_single.fasta

ADD COMMENT
0
Entering edit mode

it would be helpful to keep the longest isoforms. but I still could not remove the duplicates from other contrig. right? thanks

ADD REPLY
0
Entering edit mode

Trinity should have done the gene clustering for you. This script will take the longest isoform that exists within a gene cluster based on gene ids.

ADD REPLY
0
Entering edit mode

thanks。 I need to read Trinity again.

ADD REPLY
0
Entering edit mode

Hi st.ph.n I tried your script, it shown error as

Traceback (most recent call last):

File "get_longest_iso.py", line 8, in <module>

if line.startswith(">"):

NameError: name 'line' is not defined

could you help me to fix this? I do not know the python. thanks

My trinity out start as:

TRINITY_DN62966_c0_g1_i1 len=352 path=[571:0-44 572:45-65 573:66-351] [-1, 571, 572, 573, -2]

ADD REPLY
0
Entering edit mode

Sorry forgot a for statement, wrote it a little too quickly.

ADD REPLY

Login before adding your answer.

Traffic: 1947 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6