Hello, I want to download the Aquaporin 1 gene sequence for all the individuals from the 1000 Genomes Project. I have tried using bcftools and vcftools, but I keep getting errors. The location of the Aquaporin 1 gene is chromosome 7: 30911853-30925516. I first downloaded the VCF file for this region as follows:
bcftools view -Oz -r 7:30911853-30925516 "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr7.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz">aqp1.1000g.vcf.gz
tabix -p vcf aqp1.1000g.vcf.gz
Then I downloaded the reference FASTA sequence from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/ and named it human_ref.fa.gz.
Then I indexed the FASTA file as:
samtools faidx human_ref.fa.gz
Then I built each sample's sequence by applying that sample's variants to the reference:
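#!/bin/bash
# Loop over the sample names in the #CHROM header line (columns 10 onward).
for sample in $(bcftools view -h aqp1.1000g.vcf.gz | grep "^#CHROM" | cut -f10-); do
    # Keep only sites where this sample carries at least one non-reference allele.
    bcftools view -c1 -Oz -s "$sample" -o "1000g.$sample.vcf.gz" aqp1.1000g.vcf.gz
    tabix -p vcf "1000g.$sample.vcf.gz"
    # Extract the region from the reference and apply the sample's variants to it.
    samtools faidx human_ref.fa.gz 7:30911853-30925516 | bcftools consensus "1000g.$sample.vcf.gz" -o "1000g.aqp1.$sample.fa"
done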
But this gives me the following error:
Note: the --sample option not given, applying all records regardless of the genotype
[W::fai_get_val] Reference 7:30911853-30925516 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in 7:30911853-30925516
Applied 0 variants
It may be important for the reference sequence names to match exactly, e.g. both should say either chr7 or just 7.
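One way to check whether the names match is to list the sequence names on both sides and compare them. The sketch below is only an illustration under some assumptions: the helper functions fasta_contigs, vcf_contigs and missing_in_fasta are hypothetical names (they are not part of samtools or bcftools), the FASTA index is assumed to be the .fai file written by samtools faidx, and the VCF is assumed to be bgzip or gzip compressed.

```shell
#!/bin/bash
# Hypothetical helpers (not part of samtools or bcftools) for comparing
# sequence names between a FASTA index and a compressed VCF.

# Print the sequence names stored in a .fai index (first column).
fasta_contigs() {
    cut -f1 "$1" | sort -u
}

# Print the distinct CHROM values used by the records of a compressed VCF.
# bgzip output is gzip-compatible, so plain gzip can decompress it.
vcf_contigs() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

# Print the names that appear in the VCF records but not in the FASTA index.
# Arguments: <fasta index (.fai)> <compressed VCF>
missing_in_fasta() {
    vcf_contigs "$2" | awk -F'\t' 'NR == FNR { seen[$1]; next } !($1 in seen)' "$1" -
}

# Example usage, assuming the file names from the post:
#   missing_in_fasta human_ref.fa.gz.fai aqp1.1000g.vcf.gz
```

If missing_in_fasta prints anything (for example 7 when the FASTA only has chr7), the two files use different naming, and the region string passed to samtools faidx has to use the FASTA's spelling.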
I used chr7 for both files, and the error now comes as:
The error occurs for all samples.