Hello Everyone,
I am a computer science and working on a bioinformatics project. I want to download SCN9a Gene sequence for all the 1000 individuals from 1000 genomes project. I have been struggling a lot . I tried using SRA toolkit and other stuff but it didn't work out . Requesting you guys to hep me out. The location for the SCN9a gene is Molecular Location on chromosome 2: base pairs 166,195,185 to 166,375,987
then the reference must be downloaded to be indexed locally (pity that 1000g doesn't have that index remotely, because the reference could be queried directly rather than downloaded)
I tried your solution and also changed the 1000g.SCN9a.vcf.gz to SCN9a.1000g.vcf.gz while building it but still I am unable to get the fasta file.
Error:
Failed to open 1000g.HG00096.vcf.gz: could not load index
[download_from_remote] fail to open remote file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
[fai_load] failed to open remote FASTA index ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
Could not load fai index of ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
Failed to open 1000g.HG00097.vcf.gz: could not load index
[download_from_remote] fail to open remote file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
[fai_load] failed to open remote FASTA index ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
I even tried after downloading the file manually to my system but its giving me this error. After generatiing the fai file as well. My code:
for sample in`./bcftools view -h SCN9a.1000g.vcf.gz |grep"^#CHROM"|cut -f10-`;do
./bcftools view -c1 -Oz -s "$sample" -o "vcf/1000g.$sample.vcf.gz"'SCN9a.1000g.vcf.gz'
samtools faidx human_g1k_v37.fasta 2:166195185-166375987 | ./bcftools consensus vcf/1000g.$sample.vcf.gz -o output/1000g.SCN9a.$sample.fa
done
Error :Failed to open vcf/1000g.HG00096.vcf.gz: could not load index
Failed to open vcf/1000g.HG00097.vcf.gz: could not load index
Failed to open vcf/1000g.HG00099.vcf.gz: could not load index
Failed to open vcf/1000g.HG00100.vcf.gz: could not load index
Failed to open vcf/1000g.HG00101.vcf.gz: could not load index
Failed to open vcf/1000g.HG00102.vcf.gz: could not load index
Failed to open vcf/1000g.HG00103.vcf.gz: could not load index
Failed to open vcf/1000g.HG00105.vcf.gz: could not load index
Failed to open vcf/1000g.HG00106.vcf.gz: could not load index
Failed to open vcf/1000g.HG00107.vcf.gz: could not load index
Failed to open vcf/1000g.HG00108.vcf.gz: could not load index
Failed to open vcf/1000g.HG00109.vcf.gz: could not load index
Failed to open vcf/1000g.HG00110.vcf.gz: could not load index
Failed to open vcf/1000g.HG00111.vcf.gz: could not load index
Failed to open vcf/1000g.HG00112.vcf.gz: could not load index
Failed to open vcf/1000g.HG00113.vcf.gz: could not load index
it seems like the 1000g reference must indeed be downloaded locally in order to generate the missing fai index. pity they don't have that fai file remotely too, since the reference could be queried as stated rather than downloaded. so you did right.
but regarding the next error, there's a step missing in my code right after the bcftools command, just before the samtools command: the vcf.gz index must be generated. I've updated my answer to reflect this.
yes Pierre. both http and ftp protocols give the same error, since the .fasta.gz.fai is not there (although there is a .fasta.fai indeed)
$ samtools faidx http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai 2:166195185-166375987
[download_from_remote] fail to open remote file http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai.fai
[fai_load] failed to open remote FASTA index http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai.fai
Could not load fai index of http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz.fai
I've reported it as a bug, since the GRCh38 reference is perfectly described in that folder. either they could index the compressed .fasta.gz reference, or they could uncompress it. I would go for the first one, since the download time would be much lower, but I'm just giving some ideas out.
Thanks a lot Jorge !!!! Now that we have the raw DNA sequences for the SCN9A gene from all available individuals, we intend to extract the coding sequence from them using information available on NCBI as a guide. We would ultimately like to align these coding sequences to identify SNPs within the exons. After testing the process by extracting and comparing the CDS from two individuals we found that the base pairs did not line up and were not at all similar to the reference gene available from NCBI and we are not sure why this is the case.
Dear you can download whole genome using SRA tool kit and and you are saying you are knowing the position; so can extract the sequences from that position. You can try some perl or other script for that. Hope it will help you.
please, define "download SCN9a Gene sequence for all the 1k individuals"
I want each of the 1000 individual unique coding sequences or mRNA sequences for the human SCN9a gene