I have a set of Homo sapiens protein sequences (from BioMart, Ensembl release 62). I have run Pfam HMMER searches on these sequences to find domains. I also have coordinates (coding sequence as well as genomic) for the exons that make up these proteins. I want to map the protein domains to these exons. How do I accomplish this? Solutions in any of Perl, R, Bioconductor, Python, or BioMart would be very helpful. Thanks in advance.
But the Pfam information is on a protein sequence, i.e. in amino acids, so how do you import it into GenomicRanges and then overlap it with the exons? What you are trying to get is the genomic coordinates of these domains.
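One way to bridge the two coordinate systems is to convert the amino-acid range of each domain to coding-sequence (CDS) nucleotide positions first, and only then intersect with the exons. The following is a minimal sketch, assuming 1-based inclusive coordinates, a domain on the full-length translated CDS, and exon boundaries already expressed in CDS coordinates. The function names, exon IDs, and numbers are hypothetical and only for illustration.

```python
def aa_to_cds(aa_start, aa_end):
    """Convert a 1-based inclusive amino-acid range to a
    1-based inclusive CDS nucleotide range (3 nt per residue)."""
    return 3 * (aa_start - 1) + 1, 3 * aa_end


def domain_exons(aa_start, aa_end, exons):
    """Return the exons a domain overlaps.

    exons: list of (exon_id, cds_start, cds_end) tuples in
    1-based inclusive CDS coordinates (hypothetical layout).
    Result: list of (exon_id, overlap_start, overlap_end) in CDS coordinates.
    """
    nt_start, nt_end = aa_to_cds(aa_start, aa_end)
    overlaps = []
    for exon_id, ex_start, ex_end in exons:
        lo = max(nt_start, ex_start)
        hi = min(nt_end, ex_end)
        if lo <= hi:  # the two intervals share at least one nucleotide
            overlaps.append((exon_id, lo, hi))
    return overlaps


# Hypothetical transcript with three coding exons
exons = [("E1", 1, 90), ("E2", 91, 240), ("E3", 241, 600)]

# A domain spanning residues 25-100 covers CDS positions 73-300
print(domain_exons(25, 100, exons))
```

The same intervals could then be loaded into an interval library such as GenomicRanges once they are in nucleotide coordinates.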
I don't have a solution but wish to point out a pitfall. Small exons or exons that encode only the extreme amino- or carboxy-terminal portion of a Pfam domain may score too low to be retained for your downstream analysis. One solution to this is to map your Pfam hits against the model and note where there are missing residues. For example, an exon of 60 nt (20 aa) may encode Pfam model residues 1 to 20, but you don't see that hit (high E-value, low score, etc). At the same time, your first real, high-scoring hit begins with Pfam residue 21. Hmm, where are those first 20 aa? Are they in the upstream exon?
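The check described above can be sketched in a few lines: for each retained hit, compare the model coordinates of the alignment with the model length and flag hits that start or end well inside the model. This assumes you have already extracted the model start, model end, and model length for each hit from your HMMER output. The dictionary keys, the `min_missing` threshold, and the example records are hypothetical.

```python
def missing_model_residues(hmm_from, hmm_to, model_len):
    """Return (missing at model N-terminus, missing at model C-terminus)."""
    return hmm_from - 1, model_len - hmm_to


def flag_partial_hits(hits, min_missing=10):
    """Flag hits that leave at least `min_missing` model residues
    unaligned at either end; these may continue in a neighbouring exon."""
    flagged = []
    for hit in hits:
        n_miss, c_miss = missing_model_residues(
            hit["hmm_from"], hit["hmm_to"], hit["model_len"]
        )
        if n_miss >= min_missing or c_miss >= min_missing:
            flagged.append((hit["protein"], hit["domain"], n_miss, c_miss))
    return flagged


# Hypothetical hits: the first starts at model residue 21, as in the example above
hits = [
    {"protein": "P1", "domain": "DomA", "hmm_from": 21, "hmm_to": 150, "model_len": 150},
    {"protein": "P2", "domain": "DomA", "hmm_from": 1, "hmm_to": 148, "model_len": 150},
]

print(flag_partial_hits(hits))
```

For a flagged hit, the upstream or downstream exon is the obvious place to look for the missing residues.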
I have protein domain start and end positions as well as the .gff3 files for several plants. Can you help me get the positions of the domain start and end on the CDS?
I have seen your page on "backlocate", but it works only with UCSC, and my data are not available at UCSC or NCBI.
Thanks in advance! I hope you can help me out.
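Without UCSC, the mapping can be done directly from the CDS features of a GFF3 file. The sketch below assumes you have already collected the CDS intervals of one transcript as (start, end) pairs (GFF3 coordinates are 1-based and inclusive, with start <= end) together with its strand, and that the protein is the translation of the full CDS starting in frame at its first base. The function name and the example intervals are hypothetical.

```python
def cds_to_genomic(cds_pos, segments, strand):
    """Map a 1-based CDS position to a 1-based genomic coordinate.

    segments: list of (start, end) genomic intervals taken from the
    GFF3 CDS features of a single transcript.
    strand: '+' or '-'.
    """
    # Walk the segments in transcription order (reversed on the minus strand)
    ordered = sorted(segments, reverse=(strand == "-"))
    remaining = cds_pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the CDS")


# Hypothetical transcript with two CDS segments
segments = [(100, 159), (300, 389)]

# A domain at residues 21-40 covers CDS positions 61-120
cds_start, cds_end = 3 * (21 - 1) + 1, 3 * 40
print(cds_to_genomic(cds_start, segments, "+"),
      cds_to_genomic(cds_end, segments, "+"))
```

On the minus strand the returned genomic coordinate of the domain start will be larger than that of the domain end, so swap them if your downstream tool expects start <= end.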
In the future, please add a note when you cross-post to other forums, such as the Bioconductor mailing list.