I wanted to extract the SNPs called for the HBA1 and HBA2 genes in the 1000 Genomes Project. However, these two genes appear to have no SNPs at all: neither missense nor synonymous (samesense).
I cross-checked on Ensembl (the 1000 Genomes browser), and dbSNP reports about 500 non-synonymous SNPs in the exonic region of HBA1.
What are the odds that 629 people in the 1000 Genomes Project happen to have exactly the same coding sequence for HBA1? Very low, I would guess.
I must say that I first tried to reproduce your problem in the 1000 Genomes browser by searching for HBA1, and indeed I didn't see the expected variation. Then I realized that the track I was looking at corresponded to the 20100804 data, which is what Laura described, and not to the latest release. In fact, this is the note on the welcome page of the browser:
The 1000 Genomes Browser
Ensembl-based browser provides early access to 1000genomes data
In order to facilitate immediate analysis of the 1000genomes data by the whole scientific community, this browser (based on Ensembl) integrates the SNP calls from the August 2010 release. This data will be submitted to dbSNP, and once rsid's have been allocated, will be absorbed into the UCSC and Ensembl browsers according to their respective release cycles. Until that point any non rs SNP id's on this site are temporary and will NOT be maintained.
I can't offer any better advice than to look on the 1000 Genomes website. Since I haven't found a way to browse this information there, I can only suggest digesting their raw data as we did. If you want to save time, you may want to have a look at the raw genotypes we processed from this latest release (an interesting note for any BioStar reader: there are only bi-allelic markers because their genotype caller limits it; we have asked the project to add a note to the readme file to clarify this). If you go to our ENGINES tool, search for HBA1 and HBA2, and select all 14 available populations, you will end up with 26 variants: 20 of them are also in dbSNP132 and 6 are new, and most have very low MAF values (19 of them are below 0.1). Although this is far fewer than the 500 sites you were expecting, I really hope this result helps in some way.
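For anyone who wants to digest the raw VCF files themselves, here is a minimal sketch of the kind of filtering described above: keep the records that fall in a gene region, then count how many have no dbSNP ID and how many have a low MAF. It is not the ENGINES code. The function names, the toy records and the coordinates are made up for illustration, and it assumes the INFO column carries an `AF` field with a single value (bi-allelic sites), which you should check against the header of the file you actually download.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'AF=0.02;DP=100' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def variants_in_region(vcf_lines, chrom, start, end):
    """Yield records on `chrom` with start <= POS <= end (1-based, inclusive)."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if cols[0] != chrom:
            continue
        pos = int(cols[1])
        if start <= pos <= end:
            info = parse_info(cols[7])
            # Assumption: AF is present and holds one value per record.
            af = float(info["AF"]) if "AF" in info else None
            yield {"pos": pos, "id": cols[2], "ref": cols[3],
                   "alt": cols[4], "af": af}


def minor_allele_frequency(af):
    """MAF is the frequency of the rarer of the two alleles."""
    return min(af, 1.0 - af)


# Toy records with made-up positions; NOT real HBA1/HBA2 data.
DEMO_VCF = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "16\t100\trs_demo1\tA\tG\t.\tPASS\tAF=0.02",
    "16\t150\t.\tC\tT\t.\tPASS\tAF=0.95",
    "16\t900\trs_demo2\tG\tA\t.\tPASS\tAF=0.30",
    "17\t120\t.\tT\tC\t.\tPASS\tAF=0.10",
]

if __name__ == "__main__":
    hits = list(variants_in_region(DEMO_VCF, "16", 90, 200))
    novel = [h for h in hits if h["id"] == "."]  # no rs ID assigned
    rare = [h for h in hits if minor_allele_frequency(h["af"]) < 0.1]
    print(len(hits), "variants,", len(novel), "without an rs ID,",
          len(rare), "with MAF < 0.1")
```

To run it on a real file you would replace `DEMO_VCF` with the lines of the decompressed VCF (for example via `gzip.open(path, "rt")`) and pass the gene coordinates for the assembly the release is based on.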
Hey Jorge,
Thanks for your reply, it was very helpful - (I did not know about bi-allelic markers.)
I parsed the following file: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat/ds_flat_ch16.flat.gz (HBA1 resides on chr16)
... and did not hit any variation at genomic regions corresponding to exons for HBA1.
How exactly did you recover those 26 variants? Did you parse a different file?
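For reference, the exon check described above amounts to something like the sketch below: given variant positions and a list of exon intervals, report which positions land inside an exon. The function name and the coordinates are placeholders, not the real HBA1 exons, and it assumes non-overlapping exons with 1-based inclusive coordinates. If a check like this returns nothing, it may be worth confirming that the variant positions and the exon coordinates come from the same assembly and use the same coordinate convention.

```python
from bisect import bisect_right


def positions_in_exons(positions, exons):
    """Return the positions that fall inside any exon.

    `exons` is a list of (start, end) tuples, 1-based and inclusive,
    assumed not to overlap each other.
    """
    exons = sorted(exons)
    starts = [start for start, _ in exons]
    hits = []
    for pos in positions:
        # Index of the last exon whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= exons[i][1]:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    # Placeholder coordinates for illustration only.
    demo_exons = [(100, 200), (300, 350)]
    demo_positions = [99, 100, 200, 201, 320, 351]
    print(positions_in_exons(demo_positions, demo_exons))
```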
Indeed we did, Tarbem. I thought you were referring to the 1000 Genomes data, so the files I understood you to be interested in were those at the project's FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/ By the way, although they have been placed in the 20101123 folder, they are from the May 2011 release. A little bit confusing, I guess.
The release directories are named for the sequence release the data is based on, rather than the date they are released on.
You can see SNP tracks coloured by consequence from VCF files using the "attach remote file" option under "manage your data", so you can attach the VCF files from the 20101123 release.