Question

Issues with GBS pipeline, samtools and bwa specifically

0

Entering edit mode

20 days ago

dimeresia ▴ 10

I have been working through some GBS data trying to use GB-eaSy pipeline (https://github.com/dpwickland/GB-eaSy). But I am having issues after excluding certain samples from the data set.

Because there is no straightforward way to exclude samples from a data set, or combine samples from several data sets, I have broken the pipeline into two halves. The first is to set-up the directory structure and demultiplex the reads. Which works flawlessly every time.

#!/bin/bash

set -e
set -x
. ./GB-eaSy_parameters.txt

##SET UP DIRECTORY STRUCTURE
mkdir -p Intermediate_files/1.Demultiplexed_reads Intermediate_files/2.bam_alignments/ Intermediate_files/3.mpileup/ Intermediate_files/4.Raw_SNPs/ Results/

##DEMULTIPLEX READS
if [ -z $RAW_READS_R2 ] #if RAW_READS_R2 variable not set in parameters file, then demultiplex single-end reads
    then
        java -jar $GBSX --Demultiplexer -t $NUM_CORES -f1 $RAW_READS_R1 -i $BARCODES_FILE -gzip true -ca $ADAPTER_SEQ -minsl $MIN_LENGTH -o Intermediate_files/1.Demultiplexed_reads; rm Intermediate_files/1.Demultiplexed_reads/*undetermined*

elif [ -n $RAW_READS_R2 ] #if RAW_READS_R2 variable set in parameters file, then demultiplex paired-end reads
    then
        java -jar $GBSX --Demultiplexer -t $NUM_CORES -f1 $RAW_READS_R1 -f2 $RAW_READS_R2 -i $BARCODES_FILE -gzip true -ca $ADAPTER_SEQ -minsl $MIN_LENGTH -o Intermediate_files/1.Demultiplexed_reads; rm Intermediate_files/1.Demultiplexed_reads/*undetermined*

The second half is where i am having issues. After demultiplexing I typically go in and manually remove or add samples in the 1.Demultiplexed_reads directory that I want to be included/excluded from the final analysis.

This is the second half of the pipeline, and I get errors on the "##ALIGN TO REFERENCE portion:

#!/bin/bash

set -e
set -x
. ./GB-eaSy_parameters.txt

##ALIGN TO REFERENCE
if [ -z $RAW_READS_R2 ] #if RAW_READS_R2 variable not set in parameters file, then align single-end reads to reference genome
    then
        parallel --max-procs $NUM_CORES --keep-order --link "bwa mem  $REF_GENOME {} | samtools sort -o Intermediate_files/2.bam_alignments/{/.}.sorted_bam; samtools index Intermediate_files/2.bam_alignments/{/.}.sorted_bam" ::: `ls Intermediate_files/1.Demultiplexed_reads/*.R1.fastq.gz` 

elif [ -n $RAW_READS_R2 ] #if RAW_READS_R2 variable set in parameters file, then align paired-end reads to reference genome
    then
        parallel --max-procs $NUM_CORES --keep-order --link "bwa mem  $REF_GENOME {1} {2} | samtools sort -o Intermediate_files/2.bam_alignments/{1/.}.sorted_bam; samtools index Intermediate_files/2.bam_alignments/{1/.}.sorted_bam" ::: `ls Intermediate_files/1.Demultiplexed_reads/*.R1.fastq.gz` ::: `ls Intermediate_files/1.Demultiplexed_reads/*.R2.fastq.gz`
fi

##CREATE LIST OF SORTED BAM FILES
ls -1 Intermediate_files/2.bam_alignments/*.sorted_bam > Intermediate_files/2.bam_alignments/samples_list.txt

#start GNU parallel to run each region (e.g. chromosome, scaffold) on separate CPU core
parallel  --gnu --max-procs $NUM_CORES --keep-order "\

##GENERATE PILEUP 
bcftools mpileup --regions {} --output-type z --skip-indels --annotate AD,DP --fasta-ref $REF_GENOME --min-MQ 20 --min-BQ 20  --no-version -b Intermediate_files/2.bam_alignments/samples_list.txt -o Intermediate_files/3.mpileup/mpileup_{}.vcf.gz;\

##CALL SNPS
bcftools call --multiallelic-caller --variants-only --no-version Intermediate_files/3.mpileup/mpileup_{}.vcf.gz | sed -e 's|$(pwd)\/||g' -e 's/Intermediate_files\/2\.bam_alignments\///g' -e  's/\.R.\.fastq.sorted_bam//g'  > Intermediate_files/4.Raw_SNPs/raw_SNPs_{}.vcf;\

" ::: `grep ">" $REF_GENOME | cut -d ' ' -f1 | sed 's/>//g'`

##COMBINE SNPS FROM ALL REGIONS
bcftools concat --no-version `ls -v Intermediate_files/4.Raw_SNPs/*.vcf` > Results/all_SNPs_raw.vcf 

##FILTER VCF 
vcftools --vcf Results/all_SNPs_raw.vcf --minDP $MIN_DEPTH --recode --stdout | awk '/#/ || /[0-9]\/[0-9]/' >Results/all_SNPs_minDP$MIN_DEPTH.vcf

I keep getting the following errors from the linux cluster that I am running this on:

[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"
[E::hts_open_format] Failed to open file "Intermediate_files/2.bam_alignments/Verdant-1.R1.fastq.sorted_bam" : No such file or directory
samtools index: failed to open "Intermediate_files/2.bam_alignments/Verdant-1.R1.fastq.sorted_bam": No such file or directory

This error is repeated for the majority of the samples but not all.

Am I doing something wrong? or is there something that I can modify to repair this pipeline?

samtools bwa genotyping linux genomics • 229 views

ADD COMMENT • link 20 days ago by dimeresia ▴ 10