Question

How to subset the variations in a VCF on a specific chromosome and between two positions?

0

6.2 years ago

NT ▴ 20

Hi,

I'm a complete beginner with bash, so my question may seem stupid to some of you. I have an annotated VCF file with a large number of samples. I want to extract from it all the variations of a gene located on chromosome 9 ($1 = chr9), between positions ($2 = POS) 81583683 and 81689305. After some modifications I used the awk command awk '{$1== "chr9" && 81583683 <$2< 81689305}' VCF1 > VCF2, but I always got an error message.

Can anyone please tell me whether this awk command is correct for selecting on two conditions, or whether I should use another command?

Thank you

awk bash vcf subset • 1.6k views

0

Thank you for the help! I used the bcftools command after indexing the VCF file. My command line looks like this: bcftools view file1.vcf.gz "chr9:81583683-81689305" -O v file2.vcf. It works, but it returns only some of the variations, while I want to get all the variations, even the duplicated ones.

6.2 years ago by NT ▴ 20

1

while I want to get all the variations, even the duplicated ones

show us the variants ignored by the command above

6.2 years ago by Pierre Lindenbaum 166k

0

A huge number of variations are ignored (I have a file with 800 samples and I want to find the variations for all the samples in this region). The command returns only some of the variations, and each one only once. For example, if a variation appears in 5 samples, I want to find 5 lines with this variation in the generated file; with this command, however, I either don't find it in the generated file at all or I find it just once (on one line).

6.2 years ago by NT ▴ 20

1

that's still not clear to me

6.2 years ago by Pierre Lindenbaum 166k

score 2 · Answer 1 · 2019-06-11

2

6.2 years ago

Pierre Lindenbaum 166k

you want:

 awk -F '\t' '($0 ~ /^#/ || ($1 == "chr9" && 81583683 < $2 && $2 < 81689305))' VCF1 > VCF2

or, better, after indexing the VCF1:

bcftools view vcf1.vcf.gz "chr9:81583683-81689305"
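
To see why the awk filter has to be written this way, here is a minimal sketch on a made-up four-record VCF. The file names, positions and alleles are invented for illustration only. Two things differ from the attempt in the question: the condition sits outside the braces, so awk treats it as a pattern and prints matching lines, and each bound gets its own comparison joined with &&, because awk cannot express a chained test such as 81583683 < $2 < 81689305.

```shell
# Hypothetical toy data: a temporary directory holding a tiny tab-separated VCF.
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr9\t81583683\t.\tA\tG\t.\t.\t.\n'   # exactly on the lower bound: dropped by the strict <
  printf 'chr9\t81600000\t.\tC\tT\t.\t.\t.\n'   # inside the interval: kept
  printf 'chr9\t81700000\t.\tG\tA\t.\t.\t.\n'   # past the upper bound: dropped
  printf 'chr1\t81600000\t.\tT\tC\t.\t.\t.\n'   # right position, wrong chromosome: dropped
} > "$tmpdir/toy.vcf"

# Keep every header line, plus records on chr9 strictly between the two positions.
awk -F '\t' '/^#/ || ($1 == "chr9" && $2 > 81583683 && $2 < 81689305)' \
    "$tmpdir/toy.vcf" > "$tmpdir/subset.vcf"

# Count the records (non-header lines) that survived the filter.
kept=$(grep -vc '^#' "$tmpdir/subset.vcf")
echo "$kept"
```

Note that the awk bounds here are strict, whereas a bcftools region such as chr9:81583683-81689305 includes both end positions, so a variant sitting exactly on a boundary can appear in one output and not the other. Use >= and <= in the awk pattern if the boundaries should be included.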
