Optimizing sample extraction and concatenation from vcf files
1
2
Entering edit mode
8.2 years ago
Theo ▴ 10

Hi all, I just received a dataset of 2217 human genotypes on which I would like to apply some machine learning algorithm. However, it is in the form of 22 multi-samples chromosomes vcf files and for my applications, I would prefer to have 2217 files containing the full genotypes of each sample. I just discovered bcftools last week and nobody in my lab was able to help me as no one of us had previous experience in bioinformatics before starting this project. Thanks to previous posts on biostars, I managed to come with this naive script to extract the samples from a specific file.

  NUMBERSAMPLES=2217 
  _file=$1.vcf.gz
  CHROM=`echo $_file | cut -d. -f1`
  ITERATION=0

  echo "Commencing extraction of samples from $_file"

  for sample in `bcftools view -h $_file | grep "^#CHROM" | cut -f10-`; do

    ITERATION=$((ITERATION+1))

    #### Create one folder for each sample
    USERNAME=`echo $sample | cut -d. -f1`
    FILENAME="$CHROM.$USERNAME.vcf.gz"


     if [ ! -d ./NEWSAMPLES/$USERNAME ]; then

        mkdir ./NEWSAMPLES/$USERNAME

     fi

     echo "Extracting $USERNAME from $_file. Name of the extracted sample file : $FILENAME"
     bcftools view -c1 -Oz -s $sample -o $FILENAME $_file

     echo "Moving $FILENAME to ./NEWSAMPLES/$USERNAME"
     mv -v ./$FILENAME ./NEWSAMPLES/$USERNAME

     echo "$ITERATION samples done over $NUMBERSAMPLES in $_file"

  done

However, it appears to be very inefficient so I wonder if you had some tips to make it run on decent amount of time (e.g. would the parallelize() or extracting batches of samples help ?) Also, for those who are already familiar with machine learning on this kind of data, is it necessary to keep the vcf format or would it be better to convert it ?

Thanks in advance for your answers !

SNP genome bcftools vcf • 1.8k views
ADD COMMENT
2
Entering edit mode
8.2 years ago

Using a simple Makefile

SHELL=/bin/bash
INPUT=your.input.vcf.gz

SAMPLES=$(shell gunzip -c ${INPUT}| grep -m1 '\#CHROM' | cut -f 10- )

define fun

$(1)/$(1).vcf.gz: ${INPUT}
    mkdir -p $$(dir $$@) && bcftools view -c1 -Oz -s "$(1)" -o $$@ $$<

endef

all:  $(foreach S,${SAMPLES}, ${S}/${S}.vcf.gz)

$(eval $(foreach S,${SAMPLES},$(call fun,$S)))

Then invoke this Makefile with option -j (number of parallel jobs)

make -j 123
ADD COMMENT
2
Entering edit mode

+1 for using make. My suggestion using pattern rules:

SHELL=/bin/bash
INPUT=your.input.vcf.gz

SAMPLES=$(shell gunzip -c ${INPUT}| grep -m1 '\#CHROM' | cut -f 10- )

all:  $(foreach S,${SAMPLES}, ${S}/${S}.vcf.gz)

%.vcf.gz: $(INPUT)
    mkdir -p $(dir $@)
    bcftools view -c1 -Oz -s "$(notdir $*)" -o $@ $<
ADD REPLY
0
Entering edit mode

Thank you both for your answers ! I tried your solutions during interactive sessions (so with limited resources) on the cluster I have access to. I got a segmentation fault (core dumped) and as I am not really comfortable with make yet I still didn't find its origin. I don't know if it was due to a lack of resources (each chromosome file is over a few GB) or to an error from the script itself (tell me if you have other suggestions). I'll let you know how it did after I solved that.

Edit : I fixed that issue but I am still trying to adapt the script to fit my needs. GNU make is a world that didn't know at all !

ADD REPLY

Login before adding your answer.

Traffic: 2015 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6