Issues filtering VCF file with bcftools/vcftools
1
0
Entering edit mode
8 months ago
ekirsch • 0

Hi all,

I have been trying to filter down my vcf file so that it only includes the first 131st scaffolds (because these are the ones over 1MB). However, I am running into issues where every time an output file is created, it still contains all 336 scaffolds that were in the original vcf file. I have been trying to use both vcftools and bcftools to accomplish this, as well as trying to do it by including the desired scaffolds or excluding the ones I don't want. Here are some of the commands I have tried.:

vcftools --gzvcf 41-Passerculus_final.vcf.gz --bed contigs.bed --recode --recode-INFO-all --out 1MBsubset_vcftools
vcftools --gzvcf 41-Passerculus_final.vcf.gz --exclude exclude_contigs.txt --recode --recode-INFO-all --out filtered_exclude_output
bcftools view -T contigs.bed 41-Passerculus_final.vcf.gz -Oz -o 1MB_bcftools_filtered_output.vcf.gz
bcftools view -Oz -R 1MBcontigIDs.txt -o 1MBfiltered.vcf.gz 41-Passerculus_final.vcf.gz
bcftools view -Oz -r $(grep '^##contig=<ID=JAKOOL0100000[0-9]\{2\}\.' 41-Passerculus_final.vcf.gz | awk -F'[=.]' '{if ($4 <= 131) print $3}' | tr '\n' ',' && echo 'JAKOOL0100000[0-9]\{2\}[0-3][0-9]\.,') -o 1MBfiltered.vcf.gz 41-Passerculus_final.vcf.gz

I think that the issues might be arising from the scaffolds being ordered weirdly in my vcf file. For example, the numbers don't always increment by one (e.g., JAKOOL010000010.1 following JAKOOL010000001.) and lower numbers appear after higher ones (e.g., JAKOOL010000002.1 appearing after JAKOOL010000019.). Here is a subset of what this ordering looks like:

JAKOOL010000001.1
JAKOOL010000010.1
JAKOOL010000100.1
JAKOOL010000101.1
JAKOOL010000102.1
JAKOOL010000103.1
JAKOOL010000104.1
JAKOOL010000105.1
JAKOOL010000106.1
JAKOOL010000107.1
JAKOOL010000108.1
JAKOOL010000109.1
JAKOOL010000011.1
JAKOOL010000110.1
JAKOOL010000111.1
JAKOOL010000112.1
JAKOOL010000113.1
JAKOOL010000114.1
JAKOOL010000115.1
JAKOOL010000116.1
JAKOOL010000117.1
JAKOOL010000118.1
JAKOOL010000119.1
JAKOOL010000012.1
JAKOOL010000120.1

I thought this would be fine as long as this ordering matches in my vcf file and the .txt or .bed file I am providing to it, which it does, but the commands still don't seem to be able to filter correctly.

Does anyone have any advice on how to proceed?

Thank you so much.

vcftools vcf bcftools • 751 views
ADD COMMENT
0
Entering edit mode
8 months ago

in

bcftools view -T contigs.bed 41-Passerculus_final.vcf.gz -Oz -o 1MB_bcftools_filtered_output.vcf.gz

please, show me the outputs of

cut -f 1 contigs.bed | uniq | sort | uniq | cat -n | tail
bcftools query -f '%CHROM\n' 1MB_bcftools_filtered_output.vcf.gz | uniq | sort | uniq | cat -n | tail
bcftools query -f '%CHROM\n' 41-Passerculus_final.vcf.gz | uniq | sort | uniq | cat -n | tail
ADD COMMENT
0
Entering edit mode

Hi Pierre,

So sorry for the delay in responding to this!

Here is the output of the commands.

bash-4.2$ cut -f 1 contigs.bed | uniq | sort | uniq | cat -n | tail     
   123  JAKOOL010000122.1
   124  JAKOOL010000123.1
   125  JAKOOL010000124.1
   126  JAKOOL010000125.1
   127  JAKOOL010000126.1
   128  JAKOOL010000127.1
   129  JAKOOL010000128.1
   130  JAKOOL010000129.1
   131  JAKOOL010000130.1
   132  JAKOOL010000131.1

bash-4.2$ bcftools query -f '%CHROM\n' 1MB_bcftools_filtered_output.vcf.gz | uniq | sort | uniq | cat -n | tail    
   122  JAKOOL010000122.1
   123  JAKOOL010000123.1
   124  JAKOOL010000124.1
   125  JAKOOL010000125.1
   126  JAKOOL010000126.1
   127  JAKOOL010000127.1
   128  JAKOOL010000128.1
   129  JAKOOL010000129.1
   130  JAKOOL010000130.1
   131  JAKOOL010000131.1

bash-4.2$ bcftools query -f '%CHROM\n' 41-Passerculus_final.vcf.gz | uniq | sort | uniq | cat -n | tail  
   328  JAKOOL010000328.1
   329  JAKOOL010000329.1
   330  JAKOOL010000330.1
   331  JAKOOL010000331.1
   332  JAKOOL010000332.1
   333  JAKOOL010000333.1
   334  JAKOOL010000334.1
   335  JAKOOL010000335.1
   336  JAKOOL010000336.1
   337  JAKOOL010000337.1
ADD REPLY
0
Entering edit mode

so where is the problem ? the output 1MB_bcftools_filtered_output.vcf.gz contains only 131 distinct chromosomes as expected.

ADD REPLY
0
Entering edit mode

I guess so, I was confused because when I do less 1MB_bcftools_filtered_output.vcf.gz I still see all of the scaffold names in the header. Is this just reflecting the formatting not changing but the actual content is changing? here's a sample of what I see: ##contig=<ID=JAKOOL010000013.1,length=21102707,assembly=unknown> ##contig=<ID=JAKOOL010000130.1,length=1009167,assembly=unknown> ##contig=<ID=JAKOOL010000131.1,length=1005555,assembly=unknown> ##contig=<ID=JAKOOL010000132.1,length=945838,assembly=unknown> ##contig=<ID=JAKOOL010000133.1,length=942350,assembly=unknown> ##contig=<ID=JAKOOL010000134.1,length=941482,assembly=unknown> ##contig=<ID=JAKOOL010000135.1,length=926186,assembly=unknown> ##contig=<ID=JAKOOL010000136.1,length=912072,assembly=unknown> ##contig=<ID=JAKOOL010000137.1,length=865439,assembly=unknown> ##contig=<ID=JAKOOL010000138.1,length=835501,assembly=unknown> ##contig=<ID=JAKOOL010000139.1,length=803828,assembly=unknown> ##contig=<ID=JAKOOL010000014.1,length=21052054,assembly=unknown>

ADD REPLY
0
Entering edit mode

the dictionary defines your reference genome and it should be immutable.

ADD REPLY

Login before adding your answer.

Traffic: 1920 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6