2.4 years ago
Hi,
I have a vcf.gz file which contains about 1000 samples, but I want to extract about 400 of them based on sample ID. I have a separate CSV file containing the sample IDs. My idea is to write a for loop that goes through the sample IDs in the VCF and matches them against the IDs from the CSV file.
How can I do this so that I create a new VCF file containing only the sample ID matches? Thank you for your help - I am a beginner in bioinformatics so this may be a very basic task.
Thank you very much. I am trying the bcftools method, but am a bit confused about how it works. What does "one sample per line" mean when doing:
bcftools view -S sample.txt
Each line in sample.txt contains one sample name, spelled exactly as it appears in the VCF header.
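As an illustration of the "one sample per line" format, here is a minimal sketch that builds sample.txt from a CSV and then subsets the VCF. The file names samples.csv, input.vcf.gz and subset.vcf.gz are hypothetical, and the sketch assumes the CSV has a header row with the sample IDs in the first column; adjust the column number if yours differs.

```shell
# Hypothetical CSV with a header row and sample IDs in column 1.
printf 'sample_id,group\nS001,case\nS002,control\nS003,case\n' > samples.csv

# Drop the header, keep column 1, and strip Windows carriage returns.
tail -n +2 samples.csv | cut -d, -f1 | tr -d '\r' > sample.txt

# sample.txt now holds one sample name per line.
cat sample.txt

# Subset the VCF; this runs only if bcftools and the (hypothetical) input exist.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf.gz ]; then
    bcftools view -S sample.txt -Oz -o subset.vcf.gz input.vcf.gz
fi
```

No loop over samples is needed: bcftools reads the whole list from the file given to -S and keeps only those columns.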
Thank you! Just an additional question - this is what I came up with, but it seems to run into errors. Do you know if I'm lacking something in my script?
set -x
{SCRIPT=${SCRATCH}/HLA
source ${SCRIPT}/myenv/bin/activate
module load NiaEnv/2019b && \
module load gcc/8.3.0 && \
module load bcftools\
TOTAL=/project/j/jle/Shared/joint-call/February2022/total.vcf.gz
sampleID=/scratch/j/jle/jasminl/HLA/participant_ID.txt
OUTPUTS=/scratch/j/jle/jasminl/HLA/Output
for ${TOTAL}; do
bcftools view -Oz -S ${sampleID} > ${OUTPUTS} sample.vcf.gz
done
}
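For comparison, a hedged sketch of what the script above appears to be aiming for. A few things in it would likely cause errors: `for ${TOTAL}; do` is not valid loop syntax (and no loop is needed for a single input file), the input VCF is never passed to bcftools, the trailing backslash after `module load bcftools` joins the next line onto that command, and the space in `${OUTPUTS} sample.vcf.gz` splits the output path in two. The paths are copied from the script above; the names SAMPLE_IDS, OUT_VCF and subset_vcf are introduced here for illustration only, and the virtualenv activation is omitted because bcftools does not need it.

```shell
# On the cluster, load bcftools first, as in the script above, e.g.:
#   module load NiaEnv/2019b gcc/8.3.0 bcftools

# Paths taken from the script above; each can be overridden from the environment.
TOTAL=${TOTAL:-/project/j/jle/Shared/joint-call/February2022/total.vcf.gz}
SAMPLE_IDS=${SAMPLE_IDS:-/scratch/j/jle/jasminl/HLA/participant_ID.txt}
OUTPUTS=${OUTPUTS:-/scratch/j/jle/jasminl/HLA/Output}
OUT_VCF="$OUTPUTS/sample.vcf.gz"

# Hypothetical helper: $1 = input VCF, $2 = sample list, $3 = output VCF.
subset_vcf() {
    bcftools view -S "$2" -Oz -o "$3" "$1"
}

# Run only where bcftools and both input files are available.
if command -v bcftools >/dev/null 2>&1 && [ -f "$TOTAL" ] && [ -f "$SAMPLE_IDS" ]; then
    mkdir -p "$OUTPUTS"
    subset_vcf "$TOTAL" "$SAMPLE_IDS" "$OUT_VCF"
    # List the samples in the new file and count them as a sanity check.
    bcftools query -l "$OUT_VCF" | wc -l
else
    echo "bcftools or input files not found; nothing was run" >&2
fi
```

If some IDs in the list are absent from the VCF, bcftools view stops with an error by default; the --force-samples option makes it warn and continue instead.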
This is unrelated to your original question, and "it seems to run into errors"
Oh! I see, I'll make a separate post for it. Thank you for your help!