Do FASTA files downloaded from PDB contain the full sequence, or only the sequence of the resolved regions? Would it be better to download the FASTA from UniProt instead (I want to check whether a structure covers the full protein)?
In general, I'm trying to download some 100% resolved structures from PDB (protein only, with resolution better than x, length greater than y, and a sequence identity cutoff of z). Right now I'm querying PDB with this XML:
<orgPdbCompositeQuery version="1.0">
<queryRefinement>
<queryRefinementLevel>0</queryRefinementLevel>
<orgPdbQuery>
<version>head</version>
<queryType>org.pdb.query.simple.ResolutionQuery</queryType>
<description>Resolution is x or less</description>
<refine.ls_d_res_high.comparator>between</refine.ls_d_res_high.comparator>
<refine.ls_d_res_high.max>%d</refine.ls_d_res_high.max>
</orgPdbQuery>
</queryRefinement>
<queryRefinement>
<queryRefinementLevel>1</queryRefinementLevel>
<conjunctionType>and</conjunctionType>
<orgPdbQuery>
<version>head</version>
<queryType>org.pdb.query.simple.SequenceLengthQuery</queryType>
<description>Sequence Length is x and more</description>
<v_sequence.chainLength.min>%d</v_sequence.chainLength.min>
</orgPdbQuery>
</queryRefinement>
<queryRefinement>
<queryRefinementLevel>2</queryRefinementLevel>
<conjunctionType>and</conjunctionType>
<orgPdbQuery>
<version>head</version>
<queryType>org.pdb.query.simple.ChainTypeQuery</queryType>
<description>Chain Type: there is a Protein chain but not any DNA or RNA or Hybrid</description>
<containsProtein>Y</containsProtein>
<containsDna>N</containsDna>
<containsRna>N</containsRna>
<containsHybrid>N</containsHybrid>
</orgPdbQuery>
</queryRefinement>
<queryRefinement>
<queryRefinementLevel>3</queryRefinementLevel>
<conjunctionType>and</conjunctionType>
<orgPdbQuery>
<version>head</version>
<queryType>org.pdb.query.simple.HomologueEntityReductionQuery</queryType>
<description>Representative Structures at x Sequence Identity</description>
<identityCutoff>%d</identityCutoff>
</orgPdbQuery>
</queryRefinement>
</orgPdbCompositeQuery>
Then I download the PDB and FASTA files and compare the sequence lengths. It seems to work fine (my Python script logs the proteins whose length differs between FASTA and PDB), but then I found 5BU8 chain A. PDB says there are two unique chains with the same UniProt ID but different lengths: the FASTA file from PDB has 199 residues for chain A and 233 for chain B.
I really don't know what I should do now...
Sequences in PDB.
If you open your PDB file in any text editor, you will find a "REMARK 465" section that lists the residues missing from each chain. That will help you understand why a region is not visible in the 3D structure. For more information, read that article.
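Since the question mentions a Python script, here is a minimal sketch of how the REMARK 465 records could be collected per chain. The function name `missing_residues` and the regular expression are my own illustration, not part of any PDB tool, and the parser assumes the usual whitespace-separated layout (optional model number, residue name, chain ID, sequence number with optional insertion code). Note that REMARK 465 only covers residues that appear in SEQRES but have no coordinates, so an empty result does not show that the chain matches the full UniProt sequence.

```python
import re
from collections import defaultdict

# One data line of REMARK 465: optional model number, residue name,
# chain ID, sequence number, optional insertion code.
# Header lines of the remark do not match this pattern.
_MISSING = re.compile(
    r"^REMARK 465\s+(?:(\d+)\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)([A-Za-z]?)\s*$"
)


def missing_residues(pdb_text):
    """Return {chain_id: [(res_name, seq_num, insertion_code), ...]}."""
    missing = defaultdict(list)
    for line in pdb_text.splitlines():
        match = _MISSING.match(line)
        if not match:
            continue
        model, res_name, chain, seq_num, icode = match.groups()
        # For multi-model entries, count each residue once (first model only).
        if model in (None, "1"):
            missing[chain].append((res_name, int(seq_num), icode))
    return dict(missing)
```

Calling `missing_residues(open("5bu8.pdb").read())` (the file name is just an example) would give an empty dict for an entry that has no REMARK 465 section.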
Ok, but for 5BU8 chain A there is no REMARK 465
Take a look at Protein Feature View to understand better how UniProt and PDB data are related.
http://www.rcsb.org/pdb/protein/Q9AYZ3?addPDB=5BU8
Both chains are missing the first 56 residues in the ATOM section. Besides this, the SEQRES records are of different lengths for the two chains.
You could also take a look at the 'Wild Type Protein' search to identify PDB entries that cover a certain % of a UniProt sequence.
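To see the SEQRES versus ATOM difference described above for each chain, a rough Python sketch could look like the following. The function name `seqres_vs_observed` is hypothetical, and the code makes several assumptions: standard fixed-column PDB format, only the first model is counted, and HETATM records before a chain's TER record (for example modified residues) are treated as part of the polymer, while anything after TER (waters, ligands) is ignored.

```python
from collections import defaultdict


def seqres_vs_observed(pdb_text):
    """Return {chain_id: (seqres_length, observed_residue_count)}."""
    seqres_len = {}
    observed = defaultdict(set)
    closed = set()      # chains whose polymer section ended with TER
    last_chain = None
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "SEQRES":
            # Column 12 holds the chain ID, columns 14-17 the residue count.
            seqres_len[line[11]] = int(line[13:17])
        elif record in ("ATOM", "HETATM"):
            chain = line[21]
            if chain not in closed:
                # Residue number (columns 23-26) plus insertion code (column 27).
                observed[chain].add((int(line[22:26]), line[26:27].strip()))
                last_chain = chain
        elif record == "TER" and last_chain is not None:
            closed.add(last_chain)
        elif record == "ENDMDL":
            break       # only the first model is counted
    return {c: (n, len(observed.get(c, ()))) for c, n in seqres_len.items()}
```

A chain is fully resolved with respect to SEQRES when both numbers are equal. Comparing against the full UniProt sequence is a separate step, because SEQRES describes the crystallized construct, not necessarily the whole protein.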