Hi everyone,
This is more of a scripting question but hopefully someone can help.
I have used GATK's HaplotypeCaller to call SNPs for 150 samples, and Picard's GatherVcfs to merge each sample into a single GVCF file. I now want to import the 150 merged GVCFs into GenomicsDB to perform joint genotyping.
Somewhere along the way every sample has been renamed 'Sample1' and so GenomicsDBImport is throwing out a duplicated samples error.
Does anyone know an efficient way of replacing the sample names in all 150 files? I thought about doing a nested loop in Linux something like:
for F in $(cat $fileList)
do
    for G in $(cat $newNames)
    do
        bcftools reheader ${F}.g.vcf.gz -s $G
    done
done
But
a) I'm not sure if I have that loop set up correctly
b) I'm getting more confused by bcftools requiring a file rather than a string as input. Would I need to create 150 files, each containing a single name, and then provide a list of those files within newNames?
Any help would be much appreciated!
EDIT: Oops, I just realised it's given the last name in id_lookup.txt to each sample in the subset I tested it on, any ideas?? :/
Perhaps paste some of the IDs that you have? If there are special characters, this can cause a problem
Hi Kevin, thanks for getting back to me. The IDs are just alphanumeric, no symbols, e.g. P0811Y23, P0811Y24, etc.
There may be an issue with line-ending encodings in your samples file, but I'm not sure, as I cannot see how the names appear in your VCF or samples file. If I recall correctly, the delimiter is a space. Also, keep in mind that we are referring to a single multi-sample VCF file, right? You are saying that, e.g., if you supplied my file above, it would convert everything to newname4?
Ah no, this is where the confusion lies! I have 150 single-sample VCF files and each one has the same sample name. So I want to reheader the nth VCF file with the nth row from id_lookup.txt. This is why I think I'd need a nested loop, but I can't get it to work :/
In that case, you could still use a single loop. Your input to the loop would be the list of VCFs plus the list of new sample names, with both lists perfectly aligned by row. Here is a mix of code and pseudocode:
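As one possible way to make the single-loop idea concrete, here is a minimal sketch. It assumes two files with one entry per line and matching row order: a list of full GVCF paths and id_lookup.txt with the new names. The function name rename_samples, the list file name, and the ".renamed" output suffix are made up for illustration. Because bcftools reheader -s expects a file, the loop writes each name to a temporary one-line file first, and it strips carriage returns in case the lists were saved with Windows line endings.

```shell
# Hypothetical helper: reheader the nth VCF listed in $1 with the nth name in $2.
# Assumes both files have one entry per line, in matching order, and that each
# VCF holds a single sample (so a one-line names file is enough).
rename_samples() (
  vcf_list=$1
  id_lookup=$2
  tmp=$(mktemp) || exit 1

  # Read the two lists in lockstep: fd 3 = VCF paths, fd 4 = new sample names.
  while IFS= read -r vcf <&3 && IFS= read -r name <&4; do
    # Drop any Windows carriage returns left over from the list files.
    vcf=$(printf '%s' "$vcf" | tr -d '\r')
    name=$(printf '%s' "$name" | tr -d '\r')

    # bcftools reheader -s takes a file of names, so write a one-line file.
    printf '%s\n' "$name" > "$tmp"

    # Illustrative output naming: sampleA.g.vcf.gz -> sampleA.renamed.g.vcf.gz
    out="${vcf%.g.vcf.gz}.renamed.g.vcf.gz"
    bcftools reheader -s "$tmp" -o "$out" "$vcf" || { rm -f "$tmp"; exit 1; }
  done 3< "$vcf_list" 4< "$id_lookup"

  rm -f "$tmp"
)
```

It could be called as rename_samples vcf_list.txt id_lookup.txt. Unlike the nested loop, each VCF is visited exactly once with its own name, which avoids every file ending up with the last name in the list. The renamed files will likely need re-indexing (for example with bcftools index -t) before GenomicsDBImport will accept them.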