Question

how to use CD_HIT to remove the redundant sequence from trinity output file

0

Entering edit mode

8.0 years ago

wu.zhiqiang.1020 ▴ 50

Dear all, I am trying to use CD-hit to remove the duplicates from the file that is the output from trinity (RNA seq assembly).

I used the following parameters:

cd-hit-est -i in.fasta -o out_cdhit90.fasta -c 0.90 -n 9 -d 0 -M 0 -T 0

But the output file still contains lots of small or fragmented sequence plus the best one. How can I remove those small or fragmented duplicates by changing the parameters?

thanks

ZQ

rna-seq Assembly • 6.0k views

ADD COMMENT • link updated 8.0 years ago by st.ph.n ★ 2.7k • written 8.0 years ago by wu.zhiqiang.1020 ▴ 50

0

Entering edit mode

Are you trying to get the longest isoform per gene cluster?

ADD REPLY • link 8.0 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

yes, I just want to have one longest isoform. How can I realize that? ZQ

ADD REPLY • link 8.0 years ago by wu.zhiqiang.1020 ▴ 50

score 0 · Answer 1 · 2017-03-20

0

Entering edit mode

8.0 years ago

Sej Modha 5.3k

You can change -c and -t parameters on cd-hit that control the identity threshold and redundancy tolerance.

More info: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide

ADD COMMENT • link 8.0 years ago by Sej Modha 5.3k

0

Entering edit mode

you mean -c and -t right?

ADD REPLY • link 8.0 years ago by wu.zhiqiang.1020 ▴ 50

score 0 · Answer 2 · 2017-03-20

0

Entering edit mode

8.0 years ago

st.ph.n ★ 2.7k

Trinity FASTA entry example:

>TRINITY_DN1000|c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
  
AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA

Description of Trinity headers: 'TRINITY_DN1000|c115_g5' corresponds to

'gene id: TRINITY_DN1000|c115_g5' encoding 'isoform id: TRINITY_DN1000|c115_g5_i1'.

I prefer single line FASTA, borrowed from answer on this post:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < Trinity.fasta > Trinity_single.fasta

Here's a python script to get the longest isoform per gene:

#!/usr/bin/env python

import sys
from collections import defaultdict

transcripts = defaultdict(list)
with open(sys.argv[1], 'r') as f:
     for line in f:
        if line.startswith(">"):
            transcripts['_'.join(line.strip().split(' ')[0].split('|')[1].split('_')[:2])].append(next(f).strip())  # Groups isoforms by gene id

with open(sys.argv[1].split(".fasta")[1], 'w') as out:
    for t in transcripts:
        print >> out, '>' + t, max(transcripts[t], key=len)

Save as get_longest_iso.py, run as python get_longest_iso.py Trinity_single.fasta

ADD COMMENT • link 8.0 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

it would be helpful to keep the longest isoforms. but I still could not remove the duplicates from other contrig. right? thanks

ADD REPLY • link 8.0 years ago by wu.zhiqiang.1020 ▴ 50

0

Entering edit mode

Trinity should have done the gene clustering for you. This script will take the longest isoform that exists within a gene cluster based on gene ids.

ADD REPLY • link 8.0 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

thanks。 I need to read Trinity again.

ADD REPLY • link 8.0 years ago by wu.zhiqiang.1020 ▴ 50

0

Entering edit mode

Hi st.ph.n I tried your script, it shown error as

Traceback (most recent call last):

File "get_longest_iso.py", line 8, in <module>

if line.startswith(">"):

NameError: name 'line' is not defined

could you help me to fix this? I do not know the python. thanks

My trinity out start as:

TRINITY_DN62966_c0_g1_i1 len=352 path=[571:0-44 572:45-65 573:66-351] [-1, 571, 572, 573, -2]

ADD REPLY • link 8.0 years ago by wu.zhiqiang.1020 ▴ 50

0

Entering edit mode

Sorry forgot a for statement, wrote it a little too quickly.

ADD REPLY • link 8.0 years ago by st.ph.n ★ 2.7k