In the bash below the unique headers of each vcf.gz are stored in a text file with the same name. There can be multiple vcf.gz in a directory, usually 3 files, that I need to fix the header in each file before further processing it. My question is how can I match each text file (using only the numerical prefix) with its vcf.gz and pass the stored variables of each to the reheader code (# edit the header) ? Thank you :). I apologize for the long post just trying to include all the details.
The steps for the bash are below:
1. vcf.gz created (#gunzip files) - works as expected
2. unique undefined annotations for each vcf.gz - creates the _header.txt file but doesn't strip off the prefix
3. match and store vcf.gz as $file1 and .txt as $file2
4. run the bcftools reheader
5. repeat for other files
There are always 3 vcf.gz files in the following format below in the directory:
Files in /home/cmccabe/Desktop/NGS/test
16-0000_File-A_variant_strandbias_readcount.vcf.gz
16-0002_File-A_variant_strandbias_readcount.vcf.gz
16-0005_File-A_variant_strandbias_readcount.vcf.gz
The numeric prefix of each is always 7 characters xx-xxxx and unique. The portion of code in the portion of code below up until the # match files uses each vcf.gz file to create a corresponding .txt file with only the unique numeric prefix followed by _header.
Files in /home/cmccabe/Desktop/NGS/test (if code works as expected)
16-0000_File-A_variant_strandbias_readcount.vcf.gz
16-0002_File-B_variant_strandbias_readcount.vcf.gz
16-0005_File-A_variant_strandbias_readcount.vcf.gz
16-0000_header.txt
16-0002_header.txt
16-0005_header.txt
I am trying to match the 16-0000.vcf.gz with the 16-0000.txt and read 16-0000.vcf.gz into $file1 and 16-0000.txt into $file2.
These two variables $file1 and $file2 would be passed to the reheader command (# edit the header), below which is part of a loop.
After the $file1 and $file2 are processed for 16-0000, the process is repeated for the remaining two files, 16-0002 and 16-0005.
There is only one directory that contains the files and that is /home/cmccabe/Desktop/NGS/test. The numerical prefixes are not being stripped off and used to perform the match between files nor are the variables being passed to reheader. I hope this is a start, but there probably is a better way. Thank you :).
#!/bin/bash
# gunzip files
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf ; do
echo "Start vcf.gz creation: $(date) - file: $f"
bname=`basename $f`
gzip $f
echo "End vcf.gz creation: $(date) - file: $f"
done >> "$logfile"
# find undefined annotations
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
echo "Start vcf missing header creation: $(date) - file: $f"
bname=`basename $f`
pref=${bname%%.vcf.gz}
bcftools view -h $f > /home/cmccabe/Desktop/NGS/test/${pref}_header.txt
echo "End missing header creation: $(date) - file: $f"
done >> "$logfile"
# match files
IAm=${0##*/}
InDir1='/home/cmccabe/Desktop/NGS/test'
InDir2='/home/cmccabe/Desktop/NGS/test'
OutDir='/home/cmccabe/Desktop/NGS/test'
cd "$InDir1"
for file1 in *.txt
do # Grab file prefix.
p=${file1%%_*}
# Find matching file2.
file2=$(printf '%s' "$InDir2/$p"_*.vcf)
if [ ! -f "$file2" ]
then printf '%s: No single file matching %s found.\n' "$IAm" \
"$file1" >&2
continue
fi
# store matches
out=${file1##*/} && ${file2##*/}
done
# edit the header
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
echo "Start vcf edit header creation: $(date) - file: $f"
bname=`basename $f`
pref=${bname%%.vcf.gz}
bcftools reheader -h $file1 $file2 > ${pref}_fixed.vcf.gz
echo "End edit header creation: $(date) - file: $f"
done >> "$logfile"
Hi cmccabe! Were you able to solve this yourself? If not, what is your question exactly? How we can help you?