Hi all,
I have been trying to filter down my vcf file so that it only includes the first 131st scaffolds (because these are the ones over 1MB). However, I am running into issues where every time an output file is created, it still contains all 336 scaffolds that were in the original vcf file. I have been trying to use both vcftools and bcftools to accomplish this, as well as trying to do it by including the desired scaffolds or excluding the ones I don't want. Here are some of the commands I have tried.:
vcftools --gzvcf 41-Passerculus_final.vcf.gz --bed contigs.bed --recode --recode-INFO-all --out 1MBsubset_vcftools
vcftools --gzvcf 41-Passerculus_final.vcf.gz --exclude exclude_contigs.txt --recode --recode-INFO-all --out filtered_exclude_output
bcftools view -T contigs.bed 41-Passerculus_final.vcf.gz -Oz -o 1MB_bcftools_filtered_output.vcf.gz
bcftools view -Oz -R 1MBcontigIDs.txt -o 1MBfiltered.vcf.gz 41-Passerculus_final.vcf.gz
bcftools view -Oz -r $(grep '^##contig=<ID=JAKOOL0100000[0-9]\{2\}\.' 41-Passerculus_final.vcf.gz | awk -F'[=.]' '{if ($4 <= 131) print $3}' | tr '\n' ',' && echo 'JAKOOL0100000[0-9]\{2\}[0-3][0-9]\.,') -o 1MBfiltered.vcf.gz 41-Passerculus_final.vcf.gz
I think that the issues might be arising from the scaffolds being ordered weirdly in my vcf file. For example, the numbers don't always increment by one (e.g., JAKOOL010000010.1 following JAKOOL010000001.) and lower numbers appear after higher ones (e.g., JAKOOL010000002.1 appearing after JAKOOL010000019.). Here is a subset of what this ordering looks like:
JAKOOL010000001.1
JAKOOL010000010.1
JAKOOL010000100.1
JAKOOL010000101.1
JAKOOL010000102.1
JAKOOL010000103.1
JAKOOL010000104.1
JAKOOL010000105.1
JAKOOL010000106.1
JAKOOL010000107.1
JAKOOL010000108.1
JAKOOL010000109.1
JAKOOL010000011.1
JAKOOL010000110.1
JAKOOL010000111.1
JAKOOL010000112.1
JAKOOL010000113.1
JAKOOL010000114.1
JAKOOL010000115.1
JAKOOL010000116.1
JAKOOL010000117.1
JAKOOL010000118.1
JAKOOL010000119.1
JAKOOL010000012.1
JAKOOL010000120.1
I thought this would be fine as long as this ordering matches in my vcf file and the .txt or .bed file I am providing to it, which it does, but the commands still don't seem to be able to filter correctly.
Does anyone have any advice on how to proceed?
Thank you so much.
Hi Pierre,
So sorry for the delay in responding to this!
Here is the output of the commands.
so where is the problem ? the output
1MB_bcftools_filtered_output.vcf.gz
contains only 131 distinct chromosomes as expected.I guess so, I was confused because when I do
less 1MB_bcftools_filtered_output.vcf.gz
I still see all of the scaffold names in the header. Is this just reflecting the formatting not changing but the actual content is changing? here's a sample of what I see:##contig=<ID=JAKOOL010000013.1,length=21102707,assembly=unknown> ##contig=<ID=JAKOOL010000130.1,length=1009167,assembly=unknown> ##contig=<ID=JAKOOL010000131.1,length=1005555,assembly=unknown> ##contig=<ID=JAKOOL010000132.1,length=945838,assembly=unknown> ##contig=<ID=JAKOOL010000133.1,length=942350,assembly=unknown> ##contig=<ID=JAKOOL010000134.1,length=941482,assembly=unknown> ##contig=<ID=JAKOOL010000135.1,length=926186,assembly=unknown> ##contig=<ID=JAKOOL010000136.1,length=912072,assembly=unknown> ##contig=<ID=JAKOOL010000137.1,length=865439,assembly=unknown> ##contig=<ID=JAKOOL010000138.1,length=835501,assembly=unknown> ##contig=<ID=JAKOOL010000139.1,length=803828,assembly=unknown> ##contig=<ID=JAKOOL010000014.1,length=21052054,assembly=unknown>
the dictionary defines your reference genome and it should be immutable.