Question

Extracting information from VCF file for many specific positions in specific chromosomes

0

18 months ago

mohsamir2016 ▴ 30

Dear all, I have an Excel file that I created from a VCF file of SNPs common to 6 samples. This Excel file contains only the chromosomes and positions of the SNPs (see example Table 1). Now I would like to obtain the other information (e.g. alleles, genotype, depth, etc.) from the VCF files of the 6 samples (i.e. the ones that contain these positions).
I tried using an awk command like the following, here for position 23432 on chromosome 1, for each of the 6 files:

awk -F " " '$1=="1" && $2=="23432"' file1.vcf
awk -F " " '$1=="1" && $2=="23432"' file2.vcf
awk -F " " '$1=="1" && $2=="23432"' file3.vcf
awk -F " " '$1=="1" && $2=="23432"' file4.vcf
awk -F " " '$1=="1" && $2=="23432"' file5.vcf
awk -F " " '$1=="1" && $2=="23432"' file6.vcf

The issue is that I have thousands of SNP positions, so I need an automated way to do this.

Could you advise on that?

Thanks

SNP RNA GATK seq • 893 views

ADD COMMENT • link updated 18 months ago by Pierre Lindenbaum 164k • written 18 months ago by mohsamir2016 ▴ 30
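
One way to automate the per-position awk commands above is to let awk read the whole position list first and then filter each VCF against it. The following is only a sketch under stated assumptions: `positions.tsv` is a hypothetical file name for the spreadsheet exported as tab-delimited text with CHROM in column 1 and POS in column 2, and `extract_positions` is an illustrative helper, not part of any tool.

```shell
#!/bin/sh
# Sketch: print every VCF record whose CHROM and POS appear in a position list.
# Assumption: the list is tab-delimited, CHROM in column 1, POS in column 2.
extract_positions() {
    positions=$1   # hypothetical name, e.g. positions.tsv
    vcf=$2         # uncompressed VCF, e.g. file1.vcf
    awk -F '\t' '
        { sub(/\r$/, "") }                     # drop Windows line endings from Excel exports
        NR == FNR { want[$1 FS $2]; next }     # first file: remember each CHROM/POS pair
        /^#/      { next }                     # second file: skip VCF header lines
        ($1 FS $2) in want                     # print records whose CHROM/POS is wanted
    ' "$positions" "$vcf"
}
```

It could then be called once per sample, for example `for f in file1.vcf file2.vcf; do extract_positions positions.tsv "$f" > "${f%.vcf}.selected.txt"; done`. Chromosome names must be written the same way in both files ("1" versus "chr1").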

score 0 · Answer 1 · 2023-05-24

0

18 months ago

Pierre Lindenbaum 164k

See the option --regions-file of bcftools view.

ADD COMMENT • link 18 months ago by Pierre Lindenbaum 164k

0

I looked into bcftools view -R, but I could not understand it from the documentation. Could you please give me example code that I can run to test the results?

Thanks

ADD REPLY • link 18 months ago by mohsamir2016 ▴ 30

0

What don't you understand from the documentation?

Regions can be specified either on command line or in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file can contain either positions (two-column format: CHROM, POS) or intervals (three-column format: CHROM, BEG, END), but not both. Positions are 1-based and inclusive. The columns of the tab-delimited BED file are also CHROM, POS and END (trailing columns are ignored), but coordinates are 0-based, half-open. To indicate that a file be treated as BED rather than the 1-based tab-delimited file, the file must have the ".bed" or ".bed.gz" suffix (case-insensitive). Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly, "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected, the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can result in duplicated out of order positions in the output. This option requires indexed VCF/BCF files.

ADD REPLY • link 18 months ago by Pierre Lindenbaum 164k
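
For readers looking for a runnable starting point, the documentation quoted above might translate into something like the sketch below. It assumes the position list is a tab-delimited, two-column file (CHROM, POS, 1-based) and that the input VCF is uncompressed; the function name `extract_with_bcftools` and the file names are illustrative only. Because `-R` requires an indexed file, the VCF is bgzip-compressed and indexed first.

```shell
#!/bin/sh
# Sketch: extract listed positions from one VCF with bcftools view -R.
# Assumptions: bgzip and bcftools are installed; $1 is a tab-delimited
# CHROM<TAB>POS list (1-based); $2 is an uncompressed VCF.
extract_with_bcftools() {
    regions=$1   # hypothetical name, e.g. positions.tsv
    vcf=$2       # e.g. file1.vcf
    bgzip -c "$vcf" > "$vcf.gz" &&             # -R needs a bgzip-compressed file
    bcftools index -f "$vcf.gz" &&             # ...and an index next to it
    bcftools view -R "$regions" "$vcf.gz"      # keep only records at the listed positions
}
```

A call such as `extract_with_bcftools positions.tsv file1.vcf > file1.selected.vcf` would be repeated for each of the six files. As the documentation notes, chromosome names in the list must match those in the VCF exactly.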