pulling out a single pseudochromosome from my vcf file
1
1
Entering edit mode
5.5 years ago

Hi everyone,

I am trying to extract a single chromosome from my dataset so that I can run analyses using only this sex chromosome and compare to results of similar analyses on the entire genome. However, all of my chromosomes are labeled as "Pseudochromosomes", i.e., the chromosome I want to extract and make it's own vcf file is "Pseudochromosome_Z". I read that the following command should do the trick for regularly named chromosomes:

 grep -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

However, how do I do this for pseudochromosomes? Thanks for bearing with my ignorance.

genome • 1.7k views
ADD COMMENT
0
Entering edit mode

grep -v will work to exclude things specified. So try adding -v to your command above.

ADD REPLY
0
Entering edit mode

Thanks for the reply. Unfortunately, that is not working for me. When I add "-v", the new file is the same size as the original vcf. I want a new vcf file that will contain only the Z chromosome. When I try without "v", the new file is only 10kb, tiny and containing no information besides a chromosome list.

ADD REPLY
0
Entering edit mode

Is your regular expression correct?

ADD REPLY
0
Entering edit mode

I'm not entirely sure. I don't have a vcf file with chromosomes labeled that would fit as "chr". Only the vcf file with Pseudochromosomes, so I can't really test whether the original line works. I got the line from here: How to extract specific chromosome from vcf file

I just tried this: grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Any idea what I'm doing wrong? I'm guessing it probably has no idea what "Z" is since I believe it's looking for just a chromosome number and not a pseudochromosome. Thanks!

ADD REPLY
0
Entering edit mode

You have to use the actual names that are in your file in expression. Show the header of your vcf file and a few lines. Someone can correct the expression.

ADD REPLY
0
Entering edit mode

I've supplied what I think is the correct information below. Would you mind helping me correcting my expression? I believe it is something close to the following:

 grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Thanks!

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.8+htslib-1.8
##bcftoolsCommand=mpileup -Ou -q 20 -a DP -r Pseudochromosome_33 -f /rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta -b bamlist.txt
##reference=file:///rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta
##contig=<ID=Pseudochromosome_1,length=197551010>
##contig=<ID=Pseudochromosome_2,length=151342139>
##contig=<ID=Pseudochromosome_3,length=114810999>
##contig=<ID=Pseudochromosome_4,length=18597117>
##contig=<ID=Pseudochromosome_4A,length=44745344>
##contig=<ID=Pseudochromosome_4B,length=23291945>
##contig=<ID=Pseudochromosome_5,length=16645885>
##contig=<ID=Pseudochromosome_5A,length=43880846>
##contig=<ID=Pseudochromosome_6,length=35401958>
##contig=<ID=Pseudochromosome_7,length=39139214>
##contig=<ID=Pseudochromosome_8,length=31090148>
##contig=<ID=Pseudochromosome_9,length=25686456>
##contig=<ID=Pseudochromosome_10,length=22664390>
##contig=<ID=Pseudochromosome_11,length=20302349>
##contig=<ID=Pseudochromosome_12,length=21352500>
##contig=<ID=Pseudochromosome_13,length=17696115>
##contig=<ID=Pseudochromosome_14,length=15497061>
##contig=<ID=Pseudochromosome_15,length=13887164>
##contig=<ID=Pseudochromosome_17,length=10473655>
##contig=<ID=Pseudochromosome_18,length=11617720>
##contig=<ID=Pseudochromosome_19,length=11166003>
##contig=<ID=Pseudochromosome_20,length=15116875>
##contig=<ID=Pseudochromosome_21,length=7710055>
##contig=<ID=Pseudochromosome_22,length=5198135>
##contig=<ID=Pseudochromosome_23,length=6423903>
##contig=<ID=Pseudochromosome_24,length=6461084>
##contig=<ID=Pseudochromosome_25,length=2161911>
##contig=<ID=Pseudochromosome_26,length=6306732>
##contig=<ID=Pseudochromosome_27,length=5749075>
##contig=<ID=Pseudochromosome_28,length=5713987>
##contig=<ID=Pseudochromosome_33,length=1888140>
##contig=<ID=Pseudochromosome_Z,length=74081004>
ADD REPLY
3
Entering edit mode
5.5 years ago
GenoMax 147k

While following may work,

grep -w 'Pseudochromosome_Z' your.vcf > Z.vcf

you should ideally use bcftools for this since it will preserve the headers etc. You will need to add these to VCF yourself with method above.

bcftools view input.vcf.gz --regions Pseudochromosome_Z

You will need to sort and tabix compress the vcf file. (Ref: How to extract specific chromosome from vcf file )

ADD COMMENT
0
Entering edit mode

Thanks! I've made some progress now! I was able to compress and index the vcf file, and this command (bcftools view input.vcf.gz --regions Pseudochromosome_Z) worked, allowing me to view the Z chromosome. How can I save a separate vcf file with only the information from the Z chromosome?

ADD REPLY
0
Entering edit mode
bcftools view input.vcf.gz --regions Pseudochromosome_Z > Z.vcf
ADD REPLY
0
Entering edit mode

Thank you for your help! This resolved the issue.

ADD REPLY
0
Entering edit mode

You can accept the parent answer to provide closure to this thread.

ADD REPLY

Login before adding your answer.

Traffic: 1638 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6