Hello,
I have downloaded the local pfam data base from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release. There are two databases 1) Pfam-A.fasta.gz 2) Pfam-A.full.gz . The first one consist of sequences in fasta format with PFAM accession id in its header, but it does not contain all the sequences of a family. The second one consist of sequences in stockholm format. I have converted them to fasta, these contain all the sequences related to a family but the sequence header does not contain the PFAM accession id. I've more than 1500 pfam id's, I want to extract all the sequences that fall under a family (or accession id). Every stockholm alignment in (2) is having the pfam id at the top as, for example "#=GF AC PF00406". How can I get over this..any help will be greatly appreciated.
I tried it already, when I used it with the id (for ex: PF00013), it is saying Every alignment in file must have a name to be retrievable. Failed to find name of alignment #1. Hope here name means --> "#=GF ID name" in stockholm format. But I only have ID's I don't have the names, how can I find all the names in 30GB file.
Are the seqs in Pfam-A.full aligned or not? Maybe you should have used esl-sfetch. What happens when you just e.g.
grep "PF00406" Pfam-A.full
Pfam-A.full contains alignments in stockholm format. I guess esl-sfetch will not work on stockholm format. I can retrieve the accession names with ids but the problem here is, after retrieving with esl-afetch the file name of a family of alignments will be the accession name, that kicks me into more complex situation, If I get ids as file names, then further analysis will become easier. Even more when I used esl-afetch with multiple acc names in a file it is showing
Error in configuration: Option -O is incompatible with option(s) -o,-f,--index
. How to overcome this?How about looping through your accession list with a simple
of-course all the process is through looping, but when I extract HMM profiles from database it asks for pfam id not the name..I believe there is some way to rename those names to acc. ids.
and
accessionNumber-pfamID.map
has to be in format:Also, just for clarification, I didn't really understand your ID conundrum. You have more than 1,500 pfam ids and you want to extract all seqs that match those but you're not happy with pfam ids.. confusing..