Question

How to parse cd-hit output (given in the description) to create a mapping file of clusters and representative sequence names?

1

Entering edit mode

3.8 years ago

mrj ▴ 180

cd-hit has given me an output that looks just like the example below.

>Cluster 0
0   496aa, >SRR5892231.2396932... *
>Cluster 1
0   496aa, >SRR5892231.3763255... *
1   390aa, >SRR5892231.1558909... at 91.03%
>Cluster 2
0   496aa, >SRR5892231.1710795... *
>Cluster 3
0   496aa, >SRR5892231.2083014... *
1   464aa, >SRR5892231.14158... at 91.59%
2   423aa, >SRR5892231.1116524... at 94.56%
3   314aa, >SRR5892231.1717279... at 95.86%
4   268aa, >SRR5892231.2309241... at 99.63%
5   371aa, >SRR5892231.480233... at 99.46%
>Cluster 4
0   496aa, >SRR5892231.3954388... *
1   319aa, >SRR5892231.1752373... at 99.69%
>Cluster 5
0   496aa, >SRR5892231.14746... *
>Cluster 6
0   496aa, >SRR5892231.2340653... *
1   407aa, >SRR5892231.2608197... at 100.00%
2   340aa, >SRR5892231.1216749... at 100.00%
3   345aa, >SRR5892231.3205930... at 92.46%

Each line that starts with a > shows the cluster label. The next line that ends with a ... * shows the representative sequence for the particular cluster.

How do I create a csv file containing the cluster label and its representative sequence?

(note: a cluster that appeared once in this file can re-appear somewhere in the middle of the file even though it is not shown in this case)

Thank you in advance for the help.

clusters sequences mapping parse cd-hit • 6.1k views

ADD COMMENT • link 3.8 years ago by mrj ▴ 180

2

Entering edit mode

$ awk -v OFS="," '/^>/ {getline seq; print $0, seq}' test.txt | awk -v OFS="," -F ">|,|\.\." '{print $2,$5}'


Cluster 0,SRR5892231.2396932
Cluster 1,SRR5892231.3763255
Cluster 2,SRR5892231.1710795
Cluster 3,SRR5892231.2083014
Cluster 4,SRR5892231.3954388
Cluster 5,SRR5892231.14746
Cluster 6,SRR5892231.2340653

You can try using scripts provided by CD-HIT devs: https://github.com/weizhongli/cdhit/blob/master/clstr2txt.pl and https://github.com/weizhongli/cdhit/blob/master/clstr_select_rep.pl.

ADD REPLY • link 3.8 years ago by cpad0112 21k

0

Entering edit mode

Thanks for this solution. This solution works if the representative sequence is exactly the next line after line starting with >Cluster. I noticed that some lines may be present as the third or fifth etc line. Therefore, it is better if we capture the line ending with *. I created my own solution (posted here) which does this using python.

If you can modify your solution using awk, that would be great.

ADD REPLY • link 3.8 years ago by mrj ▴ 180

1

Entering edit mode

3.8 years ago

Mensur Dlakic ★ 29k

There is a set of tools by Joe that may help you here.

https://github.com/jrjhealey/bioinfo-tools

Look for ParseCDHIT.py. It doesn't do exactly what you want, but close enough. You should be able to modify the script with minimal Python knowledge.

ADD COMMENT • link 3.8 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

Thank you for these resources.

ADD REPLY • link 3.8 years ago by mrj ▴ 180

score 2 · Accepted Answer · 2021-10-02

I created my own answer for this question.

# this program takes in cltr file and create a mapping file
#cluster,representative sequence name

import re
# Using readlines()
file1 = open('sub_SRR5892231_90.clstr', 'r')
file2 = open('map.sub_SRR5892231_90.clstr.csv', 'w')
Lines = file1.readlines()

# since I am using a while loop I need the following index
LineIndex = 0
while LineIndex < len(Lines):
    # check if >Cluster lines are there
    FF = re.search(r'^(>Cluster.*)', Lines[LineIndex])
    # if the lines re found, then prep it for first column
    if FF != None:
        #print(FF.group())
        firstcol = FF.group().replace('>','')
        file2.write(firstcol+',')
        print(firstcol, end =",")
        LineIndex = LineIndex + 1
        while LineIndex < len(Lines):
            KK = re.search(r'^(.*\*)', Lines[LineIndex])
            if KK != None:

                seccol = KK.group()
                seccol = re.sub(r'.*>','',seccol)
                seccol = re.sub(r'\.\.\. \*','',seccol)
                print(seccol)
                file2.write(seccol+'\n')
                break
            LineIndex = LineIndex + 1
    LineIndex = LineIndex + 1

file1.close()
file2.close()