Hi all, I just received a dataset of 2217 human genotypes on which I would like to apply some machine learning algorithm. However, it is in the form of 22 multi-samples chromosomes vcf files and for my applications, I would prefer to have 2217 files containing the full genotypes of each sample. I just discovered bcftools last week and nobody in my lab was able to help me as no one of us had previous experience in bioinformatics before starting this project. Thanks to previous posts on biostars, I managed to come with this naive script to extract the samples from a specific file.
NUMBERSAMPLES=2217
_file=$1.vcf.gz
CHROM=`echo $_file | cut -d. -f1`
ITERATION=0
echo "Commencing extraction of samples from $_file"
for sample in `bcftools view -h $_file | grep "^#CHROM" | cut -f10-`; do
ITERATION=$((ITERATION+1))
#### Create one folder for each sample
USERNAME=`echo $sample | cut -d. -f1`
FILENAME="$CHROM.$USERNAME.vcf.gz"
if [ ! -d ./NEWSAMPLES/$USERNAME ]; then
mkdir ./NEWSAMPLES/$USERNAME
fi
echo "Extracting $USERNAME from $_file. Name of the extracted sample file : $FILENAME"
bcftools view -c1 -Oz -s $sample -o $FILENAME $_file
echo "Moving $FILENAME to ./NEWSAMPLES/$USERNAME"
mv -v ./$FILENAME ./NEWSAMPLES/$USERNAME
echo "$ITERATION samples done over $NUMBERSAMPLES in $_file"
done
However, it appears to be very inefficient so I wonder if you had some tips to make it run on decent amount of time (e.g. would the parallelize() or extracting batches of samples help ?) Also, for those who are already familiar with machine learning on this kind of data, is it necessary to keep the vcf format or would it be better to convert it ?
Thanks in advance for your answers !
+1 for using
make
. My suggestion using pattern rules:Thank you both for your answers ! I tried your solutions during interactive sessions (so with limited resources) on the cluster I have access to. I got a segmentation fault (core dumped) and as I am not really comfortable with make yet I still didn't find its origin. I don't know if it was due to a lack of resources (each chromosome file is over a few GB) or to an error from the script itself (tell me if you have other suggestions). I'll let you know how it did after I solved that.
Edit : I fixed that issue but I am still trying to adapt the script to fit my needs. GNU make is a world that didn't know at all !