Question

Sort a fasta file with ~5k sequences and match to a cluster txt file with sequence identifiers

1

Entering edit mode

8.2 years ago

hridindu.0 • 0

Hey guys. I have a fasta file with about 5k sequences that looks something like this:

>gi|944203132|gb|ALL55193.1| TetR [synthetic construct]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE

>gi|307752607|gb|ADN93291.1| rTetR [synthetic construct] >gi|749446379|gb|AJF22769.1| rTetR [synthetic construct]
MSRLDKSKVINGALELLNGVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALPIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEEQEHQVAKEERET
PTTDSMPPLLRQAIELFDRQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE

>gi|409127751|gb|AFV15310.1| TG [Cloning vector pRS416 pRS-TG]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSGGGGGGGGGVSKGEELFTGVVPILVELDGDVNG
HKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDG
NYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQ
NTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLEGGRA

>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEILEAVADSVSAQVDLSMFEDGR
DWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRLADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGG
FVDDASAYDPADYPHLGQAHLLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV

and a file made through CD HIT that clusters them like this:

>Cluster 0
>0  214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1  211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2  211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3  213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4  213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5  211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6  210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7  217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%
>Cluster 1
>0  210aa, >gi|982918533|ref|WP_060397058.1|... at 98.10%
>1  210aa, >gi|981264640|ref|WP_059483404.1|... at 98.10%
>2  210aa, >gi|1061966479|ref|WP_069272452.1|... at 91.90%
>3  210aa, >gi|981738349|ref|WP_059932066.1|... at 98.57%
>4  210aa, >gi|981907955|ref|WP_060092158.1|... at 99.05%
>Cluster 2
>0  226aa, >gi|490297694|ref|WP_004193135.1|... at 98.67%
>1  225aa, >gi|446392873|ref|WP_000470728.1|... at 99.11%
>2  225aa, >gi|661250759|dbj|BAP10656.1|... at 98.67%
>3  225aa, >gi|254967139|gb|ACT97616.1|... at 98.67%
>4  225aa, >gi|545237245|ref|WP_021536379.1|... at 98.67%
>5  221aa, >gi|14794587|gb|AAK73397.1|AF327714_2... at 95.93%

How do I sort the fasta file using the cluster file which only has gi identifiers? Or in other words, how would I concatenate the respective peptide sequences from the fasta to their identifier in the cluster file? I'm using biopython, but I'm very very green to bioinformatics so I apologize if this is a naive question.

alignment sequence blast • 4.2k views

ADD COMMENT • link updated 8.2 years ago by Sej Modha 5.3k • written 8.2 years ago by hridindu.0 • 0

0

Entering edit mode

Have you looked at a bunch of cd-hit output manipulation scripts provided with the package?

ADD REPLY • link 8.2 years ago by Sej Modha 5.3k

0

Entering edit mode

Yeah I've been sort of looking through the package for an appropriate script. It's unfortunately not very well documented so there's a little bit of guess work as to what each perl script provided actually does.

ADD REPLY • link 8.2 years ago by hridindu.0 • 0

0

Entering edit mode

I am sorry but it's not clear to me what you are trying to do. Could you please explain what is your goal?

ADD REPLY • link 8.2 years ago by Sej Modha 5.3k

0

Entering edit mode

I am trying to create a phylogenetically unbiased PSSM for the TetR family of transcription repressors. To do so, I input ~10000 aa sequences from the TetR family (from BLAST) into the CD HIT webserver which uses some algorithm to cluster phylogentically similar sequences. The output file(s) contains only the sequence identifiers >gi|1239392| like shown above assigned to individual clusters, but not the actual sequences. I'm trying to sort either the original input fasta file by the clusters assigned by CD HIT, or sort each individual sequence to it's respective id in the cluster file so that I can create a multiple sequence alignment/PSSM for each cluster separately.

ADD REPLY • link 8.2 years ago by hridindu.0 • 0

0

Entering edit mode

Since no one liner enthusiasts commented. I thought i try with awk :). Definitely the python solution by steve is much much better and recommended but I just couldn't help :) . This will pipe each sequence into a separate file which has a filename of cluster id. For example sequences in fasta file which fall in cluster 1 will be piped to file named 1. Surely not a robust solution and fill fail even the slightest formatting difference in the input files but it is just for fun! The input filenames being sequences.clusters for the cluster file and sequences.fasta for the fastafile.

awk 'BEGIN{lock=0}{if(FILENAME == ARGV[1]){if($1 ~ /^>/){if(substr($1,1,4)==">Clu"){x=$2}else{a[$3]=x} }}else{if(substr($1,1,1)==">"){if($1 in a){lock=1;file=a[$1]}else{lock=0}};if(lock == 1){print $0 > file  }  }}'  <(sed -e   's/\s\+/\t/g;s/\.\.\./\t/'  sequences.clusters) sequences.fasta

ADD REPLY • link updated 8.2 years ago by GenoMax 147k • written 8.2 years ago by microfuge ★ 1.9k

2

Entering edit mode

you are very brave. One of the reasons I started using Python was so I didn't have to use awk anymore :)

ADD REPLY • link 8.2 years ago by steve ★ 3.5k

0

Entering edit mode

You are hurting my feelings :) :) as awk is just a diminutive of the word awesome :)

ADD REPLY • link 8.2 years ago by microfuge ★ 1.9k

score 2 · Answer 1 · 2016-09-23

2

Entering edit mode

8.2 years ago

Sej Modha 5.3k

cd-hit's make_multi_seq.pl would be helpful for generating individual cluster multi-fasta file.

cd-hit -i db.fa -o dbout -d 0 -g 1
make_multi_seq.pl db.fa dbout.clstr multi-seq 1

ADD COMMENT • link 8.2 years ago by Sej Modha 5.3k

0

Entering edit mode

My god you made my life so much better with this. Beers are on me if we ever run into eachother.

ADD REPLY • link 8.2 years ago by hridindu.0 • 0

1

Entering edit mode

Beers can wait but for now you can "accept" this answer (if it does what you needed) by using the check mark against @Sej's post.

ADD REPLY • link 8.2 years ago by GenoMax 147k

score 1 · Answer 2 · 2016-09-22

Going off of only the items you have posted, I would try the following:

First, adjust that Cluster file so that it can be read in by Biopython. In the terminal (bash), I would run something like this:

# the cluster file that you posted, saved to a txt:
$ head cluster.txt
>Cluster 0
>0  214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1  211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2  211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3  213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4  213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5  211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6  210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7  217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%

# get rid of everything except the gi ID's
$ cat cluster.txt | sed -e 's/^>.*\(>.*\)$/\1/g' -e 's/^\(.*\)\(\.\.\..*\)$/\1/'| grep -E -v '^>Cluster.*' > cluster_new.txt

explanation:
sed
-e 's/^>.*$>.*$$/\1/g' : match the second '>' character in the string and keep it and all characters that follow it
-e 's/^$.*$$\.\.\..*$$/\1/' : (after previous operation) match the '...' string and keep only the preceding characters
grep -E -v '^>Cluster.*': return all lines that do not start with '>Cluster'

# preview the output
$ cat cluster_new.txt
>gi|953781338|gb|ALO92858.1|
>gi|504484658|ref|WP_014671760.1|
>gi|664069898|ref|WP_030608495.1|
>gi|927897990|ref|WP_053850251.1|
>gi|973382757|ref|WP_059125750.1|
>gi|664112575|ref|WP_030650029.1|
>gi|518972364|ref|WP_020128239.1|
>gi|822881081|emb|CQR64871.1|

Some details about the sed, bash, and grep used here:

$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

$ sed --version
GNU sed version 4.2.1

$ grep --version
GNU grep 2.6.3

Now, in Python, I would try something like this:

#!/usr/bin/env python
# python 2.7

# Sort a fasta file with ~5k sequences and match to a cluster txt file with sequence identifiers
# http://biopython.org/wiki/SeqIO
# http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16

from Bio import SeqIO

# the paths to the input and output files
fasta_file = "/ifs/home/steve/biopython_tests/test.fa" # your unsorted fasta file
cluster_file = "/ifs/home/steve/biopython_tests/cluster_new.txt" # your file with the modified cluster IDs
output_file = "/ifs/home/steve/biopython_tests/sorted.fa" # the output sorted sequences


# start an empty list for the ID
ID_list = []

# make a list of the ID's
for record in SeqIO.parse(open(cluster_file, "rU"), "fasta"):
    # add each record ID to the list
    ID_list.appendrecord.id)

print ID_list


# iterate over the ID's, save the matching fasta entries
for my_ID in ID_list:
    # print my_ID
    for record in SeqIO.parse(open(fasta_file, "rU"), "fasta"):
        if my_ID in record.id:
            # for appending to the output file
            output_handle = open(output_file, "a")
            SeqIO.write(record, output_handle, "fasta")
            output_handle.close()

If you had some more thorough examples it would be easier to test this to make sure it works, and I am sure there are other methods for this that might be better. Note that using only the examples you posted there, this does not actually output anything because none of the Cluster ID's you listed match your example Fasta records.

EDIT: small change to the sed command based on more information included in OP, here is sample output of the script (only one of the given sequences matched the given ID's):

$ cat sorted.fa
>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEIL
EAVADSVSAQVDLSMFEDGRDWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRL
ADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGGFVDDASAYDPADYPHLGQAH
LLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV

Python version used here:

$ python --version
Python 2.7.3