Question

Slicing a vcf file to just a few thousand SNPs with PyVCF.

0

Entering edit mode

8.3 years ago

Pedro Morell ▴ 10

Hi, I'm trying to reduce a vcf file from 1000 Genomes to just a set of SNPs (around 40k). I have the list of dbSNPs IDs stored on a pandas Series, and I'm trying to retrieve just those genes with a "if record in series:" , but it's not working. Any suggestion on how to call for just certaing SNPs?

Thank you.

Python VCF 1000Genomes • 3.1k views

ADD COMMENT • link updated 8.3 years ago by Pierre Lindenbaum 166k • written 8.3 years ago by Pedro Morell ▴ 10

Pierre Lindenbaum · Accepted Answer · 2017-05-03

2

Entering edit mode

8.3 years ago

Pierre Lindenbaum 166k

using gatk : https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

--keepIDs / -IDs

List of variant IDs to select
If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line

ADD COMMENT • link 8.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks! I'm trying to use this method, but I'm having an Error with the input. My file looks like this:

rs369720564
rs369711657
rs369708999
rs369700777
rs369687712

So, according to the description (plain text, a single ID per row) it should be working. Any advices?

ADD REPLY • link updated 8.3 years ago by Pierre Lindenbaum 166k • written 8.3 years ago by Pedro Morell ▴ 10

0

Entering edit mode

I'm having an Error with the input

what is the error ?

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

ERROR MESSAGE: Invalid argument value 'HCL.txt' at position 4.

Also tryed with a toy list copied from the IDs in the vcf and got the same message, so the format is the issue.

ADD REPLY • link 8.3 years ago by Pedro Morell ▴ 10

0

Entering edit mode

can you send the full cmd line please.

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

java -jar GenomeAnalysisTK.jar -T SelectVariants --variant input.vcf --keepIDs snps.txt --out output.vcf

ADD REPLY • link 8.3 years ago by Pedro Morell ▴ 10

0

Entering edit mode

first, GATK cannot work without the option -R ref.fa

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

second, this is not the cmd line you used as there is no such argument value 'HCL.txt' at position 4.

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I actually copy/pasted it, but decided to change the names in favor of clarity. I tryed with the reference genome, and now I'm getting this error instead:

ERROR MESSAGE: Input files /home/hjorvik/Escritorio/TFM/Datos/1KG/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: No overlapping contigs found.

This is the cmd line I used, this time with its original names:

java -jar GenomeAnalysisTK.jar -R GRCh38_full_analysis_set_plus_decoy_hla.fa -T SelectVariants --variant ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf -IDs HCL.txt --out chrm6_trimed.vcf

ADD REPLY • link 8.3 years ago by Pedro Morell ▴ 10

1

Entering edit mode

http://gatkforums.broadinstitute.org/gatk/discussion/2396/input-files-known-and-reference-have-incompatible-contigs

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

check #chrom column in your VCF and match with that from fasta file headers (chromosome/contig names)

ADD REPLY • link 8.3 years ago by cpad0112 21k

0

Entering edit mode

I done this, but still have the same problem. I've realized that, even if I changed my chrom column, the contigs in the header are still labeled as 1,2,3,4. How can I change them using gawk?

ADD REPLY • link 8.3 years ago by Pedro Morell ▴ 10