Changing sample name in multiple VCF files
3.5 years ago
USA_225478 • 0

Hi everyone,

This is more of a scripting question but hopefully someone can help.

I have used GATK's HaplotypeCaller to call SNPs for 150 samples, and Picard's GatherVcfs to merge each sample into a single GVCF file. I now want to import the 150 merged GVCFs into GenomicsDB to perform joint genotyping.

Somewhere along the way every sample has been renamed 'Sample1' and so GenomicsDBImport is throwing out a duplicated samples error.

Does anyone know an efficient way of replacing the sample names in all 150 files? I thought about doing a nested loop in Linux something like:

for F in $(cat $fileList)  
do
    for G in $(cat $newNames)  
    do
        bcftools reheader ${F}.g.vcf.gz -s $G 
    done
done

But

a) I'm not sure if I have that loop set up correctly

b) I'm getting more confused by bcftools requiring a file and not a string as input. Would I need to create 150 files with a single name in each and then provide a list of those files within newNames?

Any help would be much appreciated!

gatk bcftools linux vcf • 2.6k views
3.5 years ago

You could try this, assuming that your VCFs are in a directory called vcfs/:

find vcfs/ -name "*.vcf.gz" | while IFS= read -r vcf ;
do
  echo -e "--input file is:\t${vcf}" ;
  # build the output name by replacing the trailing .vcf.gz
  out=$(echo "${vcf}" | sed 's/\.vcf\.gz$/_reheader.vcf.gz/') ;
  echo -e "--output file is:\t${out}" ;
  # bcftools reheader takes no -O flag; write the result with --output
  bcftools reheader \
    --samples id_lookup.txt \
    --output "${out}" \
    "${vcf}" ;
  echo "Done." ;
done ;

id_lookup.txt looks like (space-delimited):

oldname1 newname1
oldname2 newname2
oldname3 newname3
oldname4 newname4

Kevin

EDIT: Oops, I just realised it's given the last name in id_lookup.txt to each sample in the subset I tested it on, any ideas?? :/

Perhaps paste some of the IDs that you have? If there are special characters, this can cause a problem

Hi Kevin, thanks for getting back to me. The IDs are just alphanumeric, no symbols, e.g. P0811Y23, P0811Y24, etc.

There may be an issue with line endings in your samples file, but I am not sure, as I cannot see how the names appear in your VCF or samples file. If I recall, the delimiter is a space. Also, keep in mind that we are referring to a single multi-sample VCF file, right? Are you saying that, e.g., if you supplied my file above, it would convert everything to newname4?
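
If line endings are a suspect, a quick check along these lines might help. This is only a sketch: count_crlf_lines and strip_cr are hypothetical helper names, not part of bcftools, and it assumes the samples file is plain text that may have been saved with Windows (CRLF) line endings.

```shell
# Hypothetical helper: count lines in a file that end with a carriage return (CR).
count_crlf_lines() {
  cr=$(printf '\r')
  # grep -c prints the number of matching lines; it exits non-zero when
  # there are none, so "|| true" keeps the function's exit status at 0
  grep -c "${cr}\$" "$1" || true
}

# Hypothetical helper: write a copy of the file ($1) with all CRs removed to $2.
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}
```

For example, `count_crlf_lines id_lookup.txt` prints the number of affected lines, and `strip_cr id_lookup.txt id_lookup.unix.txt` writes a cleaned copy (the second file name is just an example).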

Ah no, this is where the confusion lies! I have 150 single-sample vcf files and each one has the same sample name. So I want to rehead the nth vcf file with the nth row from id_lookup.txt. This is why I think I’d need a nested loop but I can’t get it to work :/

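In that case, you could still use a single loop, and your input [to the loop] would be the list of VCFs plus the list of new sample names, and both of these would be perfectly aligned by row. Here is a sketch, assuming that each VCF holds a single sample and that a _reheader suffix is acceptable for the output names:

paste VCFs.list newnames.list | while read -r vcf newname ;
do
  # extract old name from input VCF (see https://www.biostars.org/p/139362/)
  oldname=$(bcftools query -l "${vcf}") ;
  # write old name and new name to a space-delimited temporary file, rename.txt.tmp
  echo "${oldname} ${newname}" > rename.txt.tmp ;
  # rename via bcftools reheader -s rename.txt.tmp (output name is an assumption)
  bcftools reheader -s rename.txt.tmp -o "${vcf%.vcf.gz}_reheader.vcf.gz" "${vcf}" ;
done ;
rm rename.txt.tmp ;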
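
Because this approach relies on the two lists being aligned row by row, it may be worth checking them before renaming anything. The following is a minimal sketch: check_lists is a hypothetical helper, and it assumes one entry per line with a trailing newline on the last line (wc -l counts newline characters). It also flags duplicated new names, since those would trigger the same GenomicsDBImport error again.

```shell
# Hypothetical helper: return 0 only if the list of VCFs ($1) and the list of
# new names ($2) have the same number of lines and the new names are unique.
check_lists() {
  n_vcfs=$(wc -l < "$1" | tr -d ' ')
  n_names=$(wc -l < "$2" | tr -d ' ')
  if [ "$n_vcfs" -ne "$n_names" ]; then
    echo "Line counts differ: $n_vcfs VCFs vs $n_names names" >&2
    return 1
  fi
  # uniq -d prints only the names that occur more than once
  dups=$(sort "$2" | uniq -d)
  if [ -n "$dups" ]; then
    echo "Duplicated new names: $dups" >&2
    return 1
  fi
  return 0
}
```

Running `check_lists VCFs.list newnames.list && paste VCFs.list newnames.list` then prints the pairs as a dry run only when both checks pass.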