Hey guys. I have a fasta file with about 5k sequences that looks something like this:
>gi|944203132|gb|ALL55193.1| TetR [synthetic construct]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE
>gi|307752607|gb|ADN93291.1| rTetR [synthetic construct] >gi|749446379|gb|AJF22769.1| rTetR [synthetic construct]
MSRLDKSKVINGALELLNGVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALPIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEEQEHQVAKEERET
PTTDSMPPLLRQAIELFDRQGAEPAFLFGLELIICGLEKQLKCESGSSTLFGGANFNQSGNIADSSLSFTFTNSSNGPNL
ITTQTNSQALSQPIASSNVHDNFMNNEITASKIDDGNNSKPLSPGWTDQTAYNAFGITTGMFNTTTMDDVYNYLFDDEDT
PPNPKKE
>gi|409127751|gb|AFV15310.1| TG [Cloning vector pRS416 pRS-TG]
MSRLDKSKVINSALELLNEVGIEGLTTRKLAQKLGVEQPTLYWHVKNKRALLDALAIEMLDRHHTHFCPLEGESWQDFLR
NNAKSFRCALLSHRDGAKVHLGTRPTEKQYETLENQLAFLCQQGFSLENALYALSAVGHFTLGCVLEDQEHQVAKEERET
PTTDSMPPLLRQAIELFDHQGAEPAFLFGLELIICGLEKQLKCESGSGGGGGGGGGVSKGEELFTGVVPILVELDGDVNG
HKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDG
NYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQ
NTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLEGGRA
>gi|953781338|gb|ALO92858.1| TetR-family transcriptional regulator [Streptomyces hygroscopicus subsp. limoneus]
MCAMARPRKPLLSTDRIVETARDLVDREGLAAVSTRRLAAELGVSGPSLYNHFRTKDEILEAVADSVSAQVDLSMFEDGR
DWRTALHDWAVSYRAALRDHPNIVPVLARGPGRRPAALRLADAVYGAMVDAGWPPAQATSIGALMRYFIMGSALGSFAGG
FVDDASAYDPADYPHLGQAHLLAEQQEKIDERAFEAGLAALLDGLAQQYRQVRV
and a file made through CD HIT that clusters them like this:
>Cluster 0
>0 214aa, >gi|953781338|gb|ALO92858.1|... at 93.93%
>1 211aa, >gi|504484658|ref|WP_014671760.1|... at 95.26%
>2 211aa, >gi|664069898|ref|WP_030608495.1|... at 94.79%
>3 213aa, >gi|927897990|ref|WP_053850251.1|... at 92.49%
>4 213aa, >gi|973382757|ref|WP_059125750.1|... at 90.61%
>5 211aa, >gi|664112575|ref|WP_030650029.1|... at 94.79%
>6 210aa, >gi|518972364|ref|WP_020128239.1|... at 91.43%
>7 217aa, >gi|822881081|emb|CQR64871.1|... at 91.71%
>Cluster 1
>0 210aa, >gi|982918533|ref|WP_060397058.1|... at 98.10%
>1 210aa, >gi|981264640|ref|WP_059483404.1|... at 98.10%
>2 210aa, >gi|1061966479|ref|WP_069272452.1|... at 91.90%
>3 210aa, >gi|981738349|ref|WP_059932066.1|... at 98.57%
>4 210aa, >gi|981907955|ref|WP_060092158.1|... at 99.05%
>Cluster 2
>0 226aa, >gi|490297694|ref|WP_004193135.1|... at 98.67%
>1 225aa, >gi|446392873|ref|WP_000470728.1|... at 99.11%
>2 225aa, >gi|661250759|dbj|BAP10656.1|... at 98.67%
>3 225aa, >gi|254967139|gb|ACT97616.1|... at 98.67%
>4 225aa, >gi|545237245|ref|WP_021536379.1|... at 98.67%
>5 221aa, >gi|14794587|gb|AAK73397.1|AF327714_2... at 95.93%
How do I sort the fasta file using the cluster file which only has gi identifiers? Or in other words, how would I concatenate the respective peptide sequences from the fasta to their identifier in the cluster file? I'm using biopython, but I'm very very green to bioinformatics so I apologize if this is a naive question.
Have you looked at a bunch of cd-hit output manipulation scripts provided with the package?
Yeah I've been sort of looking through the package for an appropriate script. It's unfortunately not very well documented so there's a little bit of guess work as to what each perl script provided actually does.
I am sorry but it's not clear to me what you are trying to do. Could you please explain what is your goal?
I am trying to create a phylogenetically unbiased PSSM for the TetR family of transcription repressors. To do so, I input ~10000 aa sequences from the TetR family (from BLAST) into the CD HIT webserver which uses some algorithm to cluster phylogentically similar sequences. The output file(s) contains only the sequence identifiers >gi|1239392| like shown above assigned to individual clusters, but not the actual sequences. I'm trying to sort either the original input fasta file by the clusters assigned by CD HIT, or sort each individual sequence to it's respective id in the cluster file so that I can create a multiple sequence alignment/PSSM for each cluster separately.
Since no one liner enthusiasts commented. I thought i try with awk :). Definitely the python solution by steve is much much better and recommended but I just couldn't help :) . This will pipe each sequence into a separate file which has a filename of cluster id. For example sequences in fasta file which fall in cluster 1 will be piped to file named 1. Surely not a robust solution and fill fail even the slightest formatting difference in the input files but it is just for fun! The input filenames being sequences.clusters for the cluster file and sequences.fasta for the fastafile.
you are very brave. One of the reasons I started using Python was so I didn't have to use awk anymore :)
You are hurting my feelings :) :) as awk is just a diminutive of the word awesome :)