I have downloaded the local pfam data base from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release. There are two databases 1) Pfam-A.fasta.gz 2) Pfam-A.full.gz . The first one consist of sequences in fasta format with PFAM accession id in its header, but it does not contain all the sequences of a family. The second one consist of sequences in stockholm format. I have converted them to fasta, these contain all the sequences related to a family but the sequence header does not contain the PFAM accession id. I've more than 1500 pfam id's, I want to extract all the sequences that fall under a family (or accession id). Every stockholm alignment in (2) is having the pfam id at the top as, for example "#=GF AC PF00406". How can I get over this..any help will be greatly appreciated.
Pretty sure esl-afetch would work here (ships with HMMER). Or perhaps esl-sfetch. Or maybe if you convert with esl-reformat you will not lose info.. so many options.
# esl-afetch :: retrieve multiple sequence alignment(s) from a file
# Easel h3.1b1 (May 2013)
# Copyright (C) 2013 Howard Hughes Medical Institute.
# Freely distributed under the Janelia Farm Software License.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: esl-afetch [options] <msafile> <name> (retrieves one alignment named <name>)
Usage: esl-afetch [options] -f <msafile> <namefile> (retrieves all alignments named in <namefile>)
Usage: esl-afetch [options] --index <msafile> (indexes <msafile>)
where options are:
-h : help; show brief info on version and usage
-f : second cmdline arg is a file of names to retrieve
-o <f> : output alignments to file <f> instead of stdout
-O : output alignment to file named <key>
--informat <s> : specify that <msafile> is in format <s>
--outformat <s> : output fetched alignment(s) in format <s> [Stockholm]
--index : index the <msafile>, creating <msafile>.ssi
I tried it already, when I used it with the id (for ex: PF00013), it is saying Every alignment in file must have a name to be retrievable. Failed to find name of alignment #1. Hope here name means --> "#=GF ID name" in stockholm format. But I only have ID's I don't have the names, how can I find all the names in 30GB file.
ADD REPLY
• link
updated 4.0 years ago by
Ram
44k
•
written 9.8 years ago by
venu
7.1k
0
Entering edit mode
Are the seqs in Pfam-A.full aligned or not? Maybe you should have used esl-sfetch. What happens when you just e.g. grep "PF00406" Pfam-A.full
Pfam-A.full contains alignments in stockholm format. I guess esl-sfetch will not work on stockholm format. I can retrieve the accession names with ids but the problem here is, after retrieving with esl-afetch the file name of a family of alignments will be the accession name, that kicks me into more complex situation, If I get ids as file names, then further analysis will become easier. Even more when I used esl-afetch with multiple acc names in a file it is showing Error in configuration: Option -O is incompatible with option(s) -o,-f,--index. How to overcome this?
ADD REPLY
• link
updated 2.6 years ago by
Ram
44k
•
written 9.8 years ago by
venu
7.1k
0
Entering edit mode
How about looping through your accession list with a simple
while read -r accessionNumber; do something with $accessionNumber; done <accessionList
of-course all the process is through looping, but when I extract HMM profiles from database it asks for pfam id not the name..I believe there is some way to rename those names to acc. ids.
ADD REPLY
• link
updated 4.0 years ago by
Ram
44k
•
written 9.8 years ago by
venu
7.1k
1
Entering edit mode
while read -r accessionNumber pfamID; do esl-afetch -o $accessionNumber Pfam-A.full $pfamID; done <accessionNumber-pfamID.map
and accessionNumber-pfamID.map has to be in format:
accessionNumber<TAB>pfamId
Also, just for clarification, I didn't really understand your ID conundrum. You have more than 1,500 pfam ids and you want to extract all seqs that match those but you're not happy with pfam ids.. confusing..
I tried it already, when I used it with the id (for ex: PF00013), it is saying Every alignment in file must have a name to be retrievable. Failed to find name of alignment #1. Hope here name means --> "#=GF ID name" in stockholm format. But I only have ID's I don't have the names, how can I find all the names in 30GB file.
Are the seqs in Pfam-A.full aligned or not? Maybe you should have used esl-sfetch. What happens when you just e.g.
grep "PF00406" Pfam-A.full
Pfam-A.full contains alignments in stockholm format. I guess esl-sfetch will not work on stockholm format. I can retrieve the accession names with ids but the problem here is, after retrieving with esl-afetch the file name of a family of alignments will be the accession name, that kicks me into more complex situation, If I get ids as file names, then further analysis will become easier. Even more when I used esl-afetch with multiple acc names in a file it is showing
Error in configuration: Option -O is incompatible with option(s) -o,-f,--index
. How to overcome this?How about looping through your accession list with a simple
of-course all the process is through looping, but when I extract HMM profiles from database it asks for pfam id not the name..I believe there is some way to rename those names to acc. ids.
and
accessionNumber-pfamID.map
has to be in format:Also, just for clarification, I didn't really understand your ID conundrum. You have more than 1,500 pfam ids and you want to extract all seqs that match those but you're not happy with pfam ids.. confusing..