Sort a fasta file with ~5k sequences and match to a cluster txt file with sequence identifiers
2
1
Entering edit mode
8.2 years ago
hridindu.0 • 0

Hey guys. I have a fasta file with about 5k sequences that looks something like this:

>gi|944203132|gb|ALL55193.1| TetR [synthetic construct]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE

>gi|307752607|gb|ADN93291.1| rTetR [synthetic construct] >gi|749446379|gb|AJF22769.1| rTetR [synthetic construct]
MSRLDKSKVINGALELLNGVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALPIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEEQEHQVAKEERET
PTTDSMPPLLRQAIELFDRQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE

>gi|409127751|gb|AFV15310.1| TG [Cloning vector pRS416 pRS-TG]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSGGGGGGGGGVSKGEELFTGVVPILVELDGDVNG
HKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDG
NYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQ
NTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLEGGRA

>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEILEAVADSVSAQVDLSMFEDGR
DWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRLADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGG
FVDDASAYDPADYPHLGQAHLLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV

and a file made through CD HIT that clusters them like this:

>Cluster 0
>0  214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1  211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2  211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3  213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4  213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5  211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6  210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7  217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%
>Cluster 1
>0  210aa, >gi|982918533|ref|WP_060397058.1|... at 98.10%
>1  210aa, >gi|981264640|ref|WP_059483404.1|... at 98.10%
>2  210aa, >gi|1061966479|ref|WP_069272452.1|... at 91.90%
>3  210aa, >gi|981738349|ref|WP_059932066.1|... at 98.57%
>4  210aa, >gi|981907955|ref|WP_060092158.1|... at 99.05%
>Cluster 2
>0  226aa, >gi|490297694|ref|WP_004193135.1|... at 98.67%
>1  225aa, >gi|446392873|ref|WP_000470728.1|... at 99.11%
>2  225aa, >gi|661250759|dbj|BAP10656.1|... at 98.67%
>3  225aa, >gi|254967139|gb|ACT97616.1|... at 98.67%
>4  225aa, >gi|545237245|ref|WP_021536379.1|... at 98.67%
>5  221aa, >gi|14794587|gb|AAK73397.1|AF327714_2... at 95.93%

How do I sort the fasta file using the cluster file which only has gi identifiers? Or in other words, how would I concatenate the respective peptide sequences from the fasta to their identifier in the cluster file? I'm using biopython, but I'm very very green to bioinformatics so I apologize if this is a naive question.

alignment sequence blast • 4.2k views
ADD COMMENT
0
Entering edit mode

Have you looked at a bunch of cd-hit output manipulation scripts provided with the package?

ADD REPLY
0
Entering edit mode

Yeah I've been sort of looking through the package for an appropriate script. It's unfortunately not very well documented so there's a little bit of guess work as to what each perl script provided actually does.

ADD REPLY
0
Entering edit mode

I am sorry but it's not clear to me what you are trying to do. Could you please explain what is your goal?

ADD REPLY
0
Entering edit mode

I am trying to create a phylogenetically unbiased PSSM for the TetR family of transcription repressors. To do so, I input ~10000 aa sequences from the TetR family (from BLAST) into the CD HIT webserver which uses some algorithm to cluster phylogentically similar sequences. The output file(s) contains only the sequence identifiers >gi|1239392| like shown above assigned to individual clusters, but not the actual sequences. I'm trying to sort either the original input fasta file by the clusters assigned by CD HIT, or sort each individual sequence to it's respective id in the cluster file so that I can create a multiple sequence alignment/PSSM for each cluster separately.

ADD REPLY
0
Entering edit mode

Since no one liner enthusiasts commented. I thought i try with awk :). Definitely the python solution by steve is much much better and recommended but I just couldn't help :) . This will pipe each sequence into a separate file which has a filename of cluster id. For example sequences in fasta file which fall in cluster 1 will be piped to file named 1. Surely not a robust solution and fill fail even the slightest formatting difference in the input files but it is just for fun! The input filenames being sequences.clusters for the cluster file and sequences.fasta for the fastafile.

awk 'BEGIN{lock=0}{if(FILENAME == ARGV[1]){if($1 ~ /^>/){if(substr($1,1,4)==">Clu"){x=$2}else{a[$3]=x} }}else{if(substr($1,1,1)==">"){if($1 in a){lock=1;file=a[$1]}else{lock=0}};if(lock == 1){print $0 > file  }  }}'  <(sed -e   's/\s\+/\t/g;s/\.\.\./\t/'  sequences.clusters) sequences.fasta
ADD REPLY
2
Entering edit mode

you are very brave. One of the reasons I started using Python was so I didn't have to use awk anymore :)

ADD REPLY
0
Entering edit mode

You are hurting my feelings :) :) as awk is just a diminutive of the word awesome :)

ADD REPLY
2
Entering edit mode
8.2 years ago
Sej Modha 5.3k

cd-hit's make_multi_seq.pl would be helpful for generating individual cluster multi-fasta file.

cd-hit -i db.fa -o dbout -d 0 -g 1
make_multi_seq.pl db.fa dbout.clstr multi-seq 1
ADD COMMENT
0
Entering edit mode

My god you made my life so much better with this. Beers are on me if we ever run into eachother.

ADD REPLY
1
Entering edit mode

Beers can wait but for now you can "accept" this answer (if it does what you needed) by using the check mark against @Sej's post.

ADD REPLY
1
Entering edit mode
8.2 years ago
steve ★ 3.5k

Going off of only the items you have posted, I would try the following:

First, adjust that Cluster file so that it can be read in by Biopython. In the terminal (bash), I would run something like this:

# the cluster file that you posted, saved to a txt:
$ head cluster.txt
>Cluster 0
>0  214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1  211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2  211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3  213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4  213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5  211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6  210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7  217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%

# get rid of everything except the gi ID's
$ cat cluster.txt | sed -e 's/^>.*\(>.*\)$/\1/g' -e 's/^\(.*\)\(\.\.\..*\)$/\1/'| grep -E -v '^>Cluster.*' > cluster_new.txt

explanation:
sed
-e 's/^>.*\(>.*\)$/\1/g' : match the second '>' character in the string and keep it and all characters that follow it
-e 's/^\(.*\)\(\.\.\..*\)$/\1/' : (after previous operation) match the '...' string and keep only the preceding characters
grep -E -v '^>Cluster.*': return all lines that do not start with '>Cluster'

# preview the output
$ cat cluster_new.txt
>gi|953781338|gb|ALO92858.1|
>gi|504484658|ref|WP_014671760.1|
>gi|664069898|ref|WP_030608495.1|
>gi|927897990|ref|WP_053850251.1|
>gi|973382757|ref|WP_059125750.1|
>gi|664112575|ref|WP_030650029.1|
>gi|518972364|ref|WP_020128239.1|
>gi|822881081|emb|CQR64871.1|

Some details about the sed, bash, and grep used here:

$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

$ sed --version
GNU sed version 4.2.1

$ grep --version
GNU grep 2.6.3

Now, in Python, I would try something like this:

#!/usr/bin/env python
# python 2.7

# Sort a fasta file with ~5k sequences and match to a cluster txt file with sequence identifiers
# http://biopython.org/wiki/SeqIO
# http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16

from Bio import SeqIO

# the paths to the input and output files
fasta_file = "/ifs/home/steve/biopython_tests/test.fa" # your unsorted fasta file
cluster_file = "/ifs/home/steve/biopython_tests/cluster_new.txt" # your file with the modified cluster IDs
output_file = "/ifs/home/steve/biopython_tests/sorted.fa" # the output sorted sequences


# start an empty list for the ID
ID_list = []

# make a list of the ID's
for record in SeqIO.parse(open(cluster_file, "rU"), "fasta"):
    # add each record ID to the list
    ID_list.appendrecord.id)

print ID_list


# iterate over the ID's, save the matching fasta entries
for my_ID in ID_list:
    # print my_ID
    for record in SeqIO.parse(open(fasta_file, "rU"), "fasta"):
        if my_ID in record.id:
            # for appending to the output file
            output_handle = open(output_file, "a")
            SeqIO.write(record, output_handle, "fasta")
            output_handle.close()

If you had some more thorough examples it would be easier to test this to make sure it works, and I am sure there are other methods for this that might be better. Note that using only the examples you posted there, this does not actually output anything because none of the Cluster ID's you listed match your example Fasta records.

EDIT: small change to the sed command based on more information included in OP, here is sample output of the script (only one of the given sequences matched the given ID's):

$ cat sorted.fa
>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEIL
EAVADSVSAQVDLSMFEDGRDWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRL
ADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGGFVDDASAYDPADYPHLGQAH
LLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV

Python version used here:

$ python --version
Python 2.7.3
ADD COMMENT
0
Entering edit mode

Hey Steve, thank's a lot for your reply. I'll edit the post to include some more information. Unfortunately, the grep command simply deleted the "cluster n" headings and left everything else. I'm pretty unfamiliar with regular expressions in UNIX, so I'm trying to understand that line still, maybe so I can modify it to leave the cluster headings and excise everything else but the gi IDs.

ADD REPLY
0
Entering edit mode

yes that was what the grep command was supposed to do. The sed commands are operating on the rest of the entries. I can update the post with a little more explanation of them. Keep in mind that the arguments might be slightly different for sed depending on which version you are using; sed on OS X bash has some small differences from the version that usually ships with Linux which I used here.

ADD REPLY
0
Entering edit mode

also keep in mind that you could also probably skip the bash steps if you find a way to do something similar in Python, which may or may not be easier. Since you mentioned Biopython I wrote this in the context of making it work with Biopython.

ADD REPLY
0
Entering edit mode

updated the post with more details

ADD REPLY
0
Entering edit mode

Thank you so much for your help. I think I can take it from here and modify it to my needs as they arise. Really appreciate the level of detail.

ADD REPLY

Login before adding your answer.

Traffic: 1877 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6