Slicing a vcf file to just a few thousand SNPs with PyVCF.
1
0
Entering edit mode
7.6 years ago
Pedro Morell ▴ 10

Hi, I'm trying to reduce a vcf file from 1000 Genomes to just a set of SNPs (around 40k). I have the list of dbSNPs IDs stored on a pandas Series, and I'm trying to retrieve just those genes with a "if record in series:" , but it's not working. Any suggestion on how to call for just certaing SNPs?

Thank you.

Python VCF 1000Genomes • 2.8k views
ADD COMMENT
2
Entering edit mode
7.6 years ago

using gatk : https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

--keepIDs / -IDs

List of variant IDs to select
If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line
ADD COMMENT
0
Entering edit mode

Thanks! I'm trying to use this method, but I'm having an Error with the input. My file looks like this:

rs369720564
rs369711657
rs369708999
rs369700777
rs369687712

So, according to the description (plain text, a single ID per row) it should be working. Any advices?

ADD REPLY
0
Entering edit mode

I'm having an Error with the input

what is the error ?

ADD REPLY
0
Entering edit mode

ERROR MESSAGE: Invalid argument value 'HCL.txt' at position 4.

Also tryed with a toy list copied from the IDs in the vcf and got the same message, so the format is the issue.

ADD REPLY
0
Entering edit mode

can you send the full cmd line please.

ADD REPLY
0
Entering edit mode
java -jar GenomeAnalysisTK.jar -T SelectVariants --variant input.vcf --keepIDs snps.txt --out output.vcf
ADD REPLY
0
Entering edit mode

first, GATK cannot work without the option -R ref.fa

ADD REPLY
0
Entering edit mode

second, this is not the cmd line you used as there is no such argument value 'HCL.txt' at position 4.

ADD REPLY
0
Entering edit mode

I actually copy/pasted it, but decided to change the names in favor of clarity. I tryed with the reference genome, and now I'm getting this error instead:

ERROR MESSAGE: Input files /home/hjorvik/Escritorio/TFM/Datos/1KG/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: No overlapping contigs found.

This is the cmd line I used, this time with its original names:

java -jar GenomeAnalysisTK.jar -R GRCh38_full_analysis_set_plus_decoy_hla.fa -T SelectVariants --variant ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf -IDs HCL.txt --out chrm6_trimed.vcf
ADD REPLY
0
Entering edit mode

check #chrom column in your VCF and match with that from fasta file headers (chromosome/contig names)

ADD REPLY
0
Entering edit mode

I done this, but still have the same problem. I've realized that, even if I changed my chrom column, the contigs in the header are still labeled as 1,2,3,4. How can I change them using gawk?

ADD REPLY

Login before adding your answer.

Traffic: 2153 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6