Issues with GBS pipeline, samtools and bwa specifically
0
0
Entering edit mode
20 days ago
dimeresia ▴ 10

I have been working through some GBS data trying to use GB-eaSy pipeline (https://github.com/dpwickland/GB-eaSy). But I am having issues after excluding certain samples from the data set.

Because there is no straightforward way to exclude samples from a data set, or combine samples from several data sets, I have broken the pipeline into two halves. The first is to set-up the directory structure and demultiplex the reads. Which works flawlessly every time.

#!/bin/bash

set -e
set -x
. ./GB-eaSy_parameters.txt

##SET UP DIRECTORY STRUCTURE
mkdir -p Intermediate_files/1.Demultiplexed_reads Intermediate_files/2.bam_alignments/ Intermediate_files/3.mpileup/ Intermediate_files/4.Raw_SNPs/ Results/

##DEMULTIPLEX READS
if [ -z $RAW_READS_R2 ] #if RAW_READS_R2 variable not set in parameters file, then demultiplex single-end reads
    then
        java -jar $GBSX --Demultiplexer -t $NUM_CORES -f1 $RAW_READS_R1 -i $BARCODES_FILE -gzip true -ca $ADAPTER_SEQ -minsl $MIN_LENGTH -o Intermediate_files/1.Demultiplexed_reads; rm Intermediate_files/1.Demultiplexed_reads/*undetermined*

elif [ -n $RAW_READS_R2 ] #if RAW_READS_R2 variable set in parameters file, then demultiplex paired-end reads
    then
        java -jar $GBSX --Demultiplexer -t $NUM_CORES -f1 $RAW_READS_R1 -f2 $RAW_READS_R2 -i $BARCODES_FILE -gzip true -ca $ADAPTER_SEQ -minsl $MIN_LENGTH -o Intermediate_files/1.Demultiplexed_reads; rm Intermediate_files/1.Demultiplexed_reads/*undetermined*

The second half is where i am having issues. After demultiplexing I typically go in and manually remove or add samples in the 1.Demultiplexed_reads directory that I want to be included/excluded from the final analysis.

This is the second half of the pipeline, and I get errors on the "##ALIGN TO REFERENCE portion:

#!/bin/bash

set -e
set -x
. ./GB-eaSy_parameters.txt

##ALIGN TO REFERENCE
if [ -z $RAW_READS_R2 ] #if RAW_READS_R2 variable not set in parameters file, then align single-end reads to reference genome
    then
        parallel --max-procs $NUM_CORES --keep-order --link "bwa mem  $REF_GENOME {} | samtools sort -o Intermediate_files/2.bam_alignments/{/.}.sorted_bam; samtools index Intermediate_files/2.bam_alignments/{/.}.sorted_bam" ::: `ls Intermediate_files/1.Demultiplexed_reads/*.R1.fastq.gz` 

elif [ -n $RAW_READS_R2 ] #if RAW_READS_R2 variable set in parameters file, then align paired-end reads to reference genome
    then
        parallel --max-procs $NUM_CORES --keep-order --link "bwa mem  $REF_GENOME {1} {2} | samtools sort -o Intermediate_files/2.bam_alignments/{1/.}.sorted_bam; samtools index Intermediate_files/2.bam_alignments/{1/.}.sorted_bam" ::: `ls Intermediate_files/1.Demultiplexed_reads/*.R1.fastq.gz` ::: `ls Intermediate_files/1.Demultiplexed_reads/*.R2.fastq.gz`
fi

##CREATE LIST OF SORTED BAM FILES
ls -1 Intermediate_files/2.bam_alignments/*.sorted_bam > Intermediate_files/2.bam_alignments/samples_list.txt

#start GNU parallel to run each region (e.g. chromosome, scaffold) on separate CPU core
parallel  --gnu --max-procs $NUM_CORES --keep-order "\

##GENERATE PILEUP 
bcftools mpileup --regions {} --output-type z --skip-indels --annotate AD,DP --fasta-ref $REF_GENOME --min-MQ 20 --min-BQ 20  --no-version -b Intermediate_files/2.bam_alignments/samples_list.txt -o Intermediate_files/3.mpileup/mpileup_{}.vcf.gz;\

##CALL SNPS
bcftools call --multiallelic-caller --variants-only --no-version Intermediate_files/3.mpileup/mpileup_{}.vcf.gz | sed -e 's|$(pwd)\/||g' -e 's/Intermediate_files\/2\.bam_alignments\///g' -e  's/\.R.\.fastq.sorted_bam//g'  > Intermediate_files/4.Raw_SNPs/raw_SNPs_{}.vcf;\

" ::: `grep ">" $REF_GENOME | cut -d ' ' -f1 | sed 's/>//g'`

##COMBINE SNPS FROM ALL REGIONS
bcftools concat --no-version `ls -v Intermediate_files/4.Raw_SNPs/*.vcf` > Results/all_SNPs_raw.vcf 

##FILTER VCF 
vcftools --vcf Results/all_SNPs_raw.vcf --minDP $MIN_DEPTH --recode --stdout | awk '/#/ || /[0-9]\/[0-9]/' >Results/all_SNPs_minDP$MIN_DEPTH.vcf

I keep getting the following errors from the linux cluster that I am running this on:

[W::hts_set_opt] Cannot change block size for this format
samtools sort: failed to read header from "-"
[E::hts_open_format] Failed to open file "Intermediate_files/2.bam_alignments/Verdant-1.R1.fastq.sorted_bam" : No such file or directory
samtools index: failed to open "Intermediate_files/2.bam_alignments/Verdant-1.R1.fastq.sorted_bam": No such file or directory

This error is repeated for the majority of the samples but not all.

Am I doing something wrong? or is there something that I can modify to repair this pipeline?

samtools bwa genotyping linux genomics • 229 views
ADD COMMENT

Login before adding your answer.

Traffic: 2517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6