Restricting vcf entries based on ID length
14 months ago

I have an issue working with VCF files: when I rename entries that have no ID to CHR:POS:REF:ALT, the resulting IDs are sometimes too long when the file is converted to plink .bed format. I do not want to change the naming convention to something other than CHR:POS:REF:ALT, and I do not want to remove them. Is there a way to filter a VCF file so that entries whose IDs are longer than, say, 15 characters get removed?

For example, entries with the following IDs would be kept:

rs145699
rs343930204
chr6:10550:A:T
chr6:54032:G:C

The following would be removed:

chr6:38458939:A:TTTTCCT
chr6:35908:CCCCCCG:G

Ideally, I am looking for a command that uses bcftools to do this. Thank you!

bcftools vcf vcftools • 669 views
  • Note that 15 is always too low of a length limit. Even without the "chr" prefix, there are lots of SNPs on e.g. chr10 with POS > 99999999, which would be filtered out by this rule. The lowest limit I'd ever recommend imposing is 39; this corresponds to EIGENSOFT's capacity.
  • Alternatively, you can just filter out indels, if there are any other reasons your analysis would have problems with them. "plink2 --snps-only" is one easy way to do that; there are many others.
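To make the length arithmetic in the comment above concrete, here is a small sketch that prints the character count of a few CHR:POS:REF:ALT IDs. The helper name `id_length` is hypothetical, and the chr10 ID is an illustrative example rather than one taken from a real dataset.

```shell
# Hypothetical helper: print the number of characters in a variant ID.
id_length() {
  printf '%s\n' "${#1}"
}

id_length "chr6:10550:A:T"            # 14, within a 15-character limit
id_length "chr6:38458939:A:TTTTCCT"   # 23, an indel with a long ALT allele
id_length "10:100000000:A:T"          # 16, a plain SNP with no "chr" prefix
```

The last case is an ordinary SNP that a 15-character limit would still discard, which is why a higher limit such as 39 is suggested.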
14 months ago
 awk -F '\t' '$0 ~ /^#/ || length($3)<=15' < in.vcf
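The same awk filter can be wrapped so the length limit is a parameter, which makes it easy to try 39 instead of 15. This is only a sketch: the function name `filter_vcf_by_id_length` is made up for illustration, and it assumes an uncompressed, tab-delimited VCF with the ID in column 3.

```shell
# Hypothetical wrapper around the awk one-liner above.
# $1 = maximum ID length to keep, $2 = path to an uncompressed VCF.
filter_vcf_by_id_length() {
  max_len="$1"
  in_vcf="$2"
  # Keep every header line (starting with '#') and every record whose
  # ID field (column 3) is at most max_len characters long.
  awk -F '\t' -v max="$max_len" '/^#/ || length($3) <= max' "$in_vcf"
}
```

For example, `filter_vcf_by_id_length 39 in.vcf > filtered.vcf` would write the retained records to a new file. For a bgzipped input, one option is to decompress it first (for instance with `bcftools view in.vcf.gz`, which writes uncompressed VCF to standard output by default) and pipe the result into the same awk expression.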