1000 GENOME RETRIEVE DATA
2
0
Entering edit mode
8.7 years ago

Hello Everyone, I am a computer science and working on a bioinformatics project. I want to download SCN9a Gene sequence for all the 1000 individuals from 1000 genomes project. I have been struggling a lot . I tried using SRA toolkit and other stuff but it didn't work out . Requesting you guys to hep me out. The location for the SCN9a gene is Molecular Location on chromosome 2: base pairs 166,195,185 to 166,375,987

gene RNA-Seq • 3.6k views
ADD COMMENT
0
Entering edit mode

please, define "download SCN9a Gene sequence for all the 1k individuals"

ADD REPLY
0
Entering edit mode

I want each of the 1000 individual unique coding sequences or mRNA sequences for the human SCN9a gene

ADD REPLY
3
Entering edit mode
8.6 years ago

if only fasta sequences are to be obtained, I would download all variants for that region

bcftools view -Oz -r 2:166195185-166375987 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > SCN9a.1000g.vcf.gz
tabix -p vcf SCN9a.1000g.vcf.gz

then the reference must be downloaded to be indexed locally (pity that 1000g doesn't have that index remotely, because the reference could be queried directly rather than downloaded)

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
samtools faidx human_g1k_v37.fasta.gz

and then build each sample's sequence by changing the reference with those variants

for sample in `bcftools view -h 1000g.SCN9a.vcf.gz | grep "^#CHROM" | cut -f10-`; do
  bcftools view -c1 -Oz -s $sample -o 1000g.$sample.vcf.gz SCN9a.1000g.vcf.gz
  tabix -p vcf 1000g.$sample.vcf.gz
  samtools faidx human_g1k_v37.fasta.gz 2:166195185-166375987 \
  | bcftools consensus 1000g.$sample.vcf.gz -o 1000g.SCN9a.$sample.fa
done

to do all this you will "only" need samtools and bcftools in an unix-like environment connected to internet.

ADD COMMENT
0
Entering edit mode

I tried your solution and also changed the 1000g.SCN9a.vcf.gz to SCN9a.1000g.vcf.gz while building it but still I am unable to get the fasta file.

Error:
Failed to open 1000g.HG00096.vcf.gz: could not load index
[download_from_remote] fail to open remote file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
[fai_load] failed to open remote FASTA index ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
Could not load fai index of ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
Failed to open 1000g.HG00097.vcf.gz: could not load index
[download_from_remote] fail to open remote file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
[fai_load] failed to open remote FASTA index ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai

I even tried after downloading the file manually to my system but its giving me this error. After generatiing the fai file as well. My code:

for sample in `./bcftools view -h SCN9a.1000g.vcf.gz | grep "^#CHROM" | cut -f10-`; do
  ./bcftools view -c1 -Oz -s "$sample" -o "vcf/1000g.$sample.vcf.gz" 'SCN9a.1000g.vcf.gz'
  samtools faidx human_g1k_v37.fasta 2:166195185-166375987 | ./bcftools consensus vcf/1000g.$sample.vcf.gz -o output/1000g.SCN9a.$sample.fa
done

Error :Failed to open vcf/1000g.HG00096.vcf.gz: could not load index
Failed to open vcf/1000g.HG00097.vcf.gz: could not load index
Failed to open vcf/1000g.HG00099.vcf.gz: could not load index
Failed to open vcf/1000g.HG00100.vcf.gz: could not load index
Failed to open vcf/1000g.HG00101.vcf.gz: could not load index
Failed to open vcf/1000g.HG00102.vcf.gz: could not load index
Failed to open vcf/1000g.HG00103.vcf.gz: could not load index
Failed to open vcf/1000g.HG00105.vcf.gz: could not load index
Failed to open vcf/1000g.HG00106.vcf.gz: could not load index
Failed to open vcf/1000g.HG00107.vcf.gz: could not load index
Failed to open vcf/1000g.HG00108.vcf.gz: could not load index
Failed to open vcf/1000g.HG00109.vcf.gz: could not load index
Failed to open vcf/1000g.HG00110.vcf.gz: could not load index
Failed to open vcf/1000g.HG00111.vcf.gz: could not load index
Failed to open vcf/1000g.HG00112.vcf.gz: could not load index
Failed to open vcf/1000g.HG00113.vcf.gz: could not load index
ADD REPLY
0
Entering edit mode

it seems like the 1000g reference must indeed be downloaded locally in order to generate the missing fai index. pity they don't have that fai file remotely too, since the reference could be queried as stated rather than downloaded. so you did right.

but regarding the next error, there's a step missing in my code right after the bcftools command, just before the samtools command: the vcf.gz index must be generated. I've updated my answer to reflect this.

ADD REPLY
0
Entering edit mode

did you try to use the http protocol instead of ftp ?

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

ADD REPLY
1
Entering edit mode

yes Pierre. both http and ftp protocols give the same error, since the .fasta.gz.fai is not there (although there is a .fasta.fai indeed)

$ samtools faidx http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai 2:166195185-166375987
[download_from_remote] fail to open remote file http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai.fai
[fai_load] failed to open remote FASTA index http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai.fai
Could not load fai index of http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai

I've reported it as a bug, since the GRCh38 reference is perfectly described in that folder. either they could index the compressed .fasta.gz reference, or they could uncompress it. I would go for the first one, since the download time would be much lower, but I'm just giving some ideas out.

ADD REPLY
0
Entering edit mode

Thanks a lot Jorge !!!! Now that we have the raw DNA sequences for the SCN9A gene from all available individuals, we intend to extract the coding sequence from them using information available on NCBI as a guide. We would ultimately like to align these coding sequences to identify SNPs within the exons. After testing the process by extracting and comparing the CDS from two individuals we found that the base pairs did not line up and were not at all similar to the reference gene available from NCBI and we are not sure why this is the case.

ADD REPLY
0
Entering edit mode

file:///home/ahmed/Pictures/Desktop_005.png

ADD REPLY
0
Entering edit mode
8.6 years ago

Dear you can download whole genome using SRA tool kit and and you are saying you are knowing the position; so can extract the sequences from that position. You can try some perl or other script for that. Hope it will help you.

ADD COMMENT
0
Entering edit mode

Can you give me the steos as to how I can do that ??

ADD REPLY

Login before adding your answer.

Traffic: 2331 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6