My aim is to make teak pangenome using short reads and then call SVs from it.
bam_dir="/home/tnau104/Documents/bam_files"
echo "Creating sample.lst from BAM files..."
ls ${bam_dir}/SO_11111_*.bam | sed 's|.*SO_11111_\([0-9A-Za-z_]*\)_.*|\1|' | sort | uniq > sample.lst
echo "Completed sample.lst creation."
fq="myfastq"
TRIMMOMATIC="/usr/share/Trimmomatic-0.39/dist/jar/trimmomatic-0.39.jar"
ILLUMINACLIP="/usr/share/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10"
trimmed_fastq="myfastq_trimmed"
# Create the trimmed FASTQ directory if it doesn't exist
mkdir -p "$trimmed_fastq"
echo "Trimming reads for all samples..."
while read sample; do
java -jar "$TRIMMOMATIC" PE -phred33 \
"${fq}/SO_11111_${sample}_R1.fq.gz" "${fq}/SO_11111_${sample}_R2.fq.gz" \
"${trimmed_fastq}/SO_11111_${sample}_R1_paired_trimmed.fq.gz" "${trimmed_fastq}/SO_11111_${sample}_R1_unpaired_trimmed.fq.gz" \
"${trimmed_fastq}/SO_11111_${sample}_R2_paired_trimmed.fq.gz" "${trimmed_fastq}/SO_11111_${sample}_R2_unpaired_trimmed.fq.gz" \
ILLUMINACLIP:$ILLUMINACLIP LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
done < sample.lst
end_time=$(date +"%Y-%m-%d %H:%M:%S")
echo "Completed trimming reads."
bam="mybam"
sorted_bam_dir="mybam_sorted"
SAMTOOLS=$(command -v samtools)
# Create the directory
mkdir -p "$sorted_bam_dir"
mkdir -p "$bam"
echo "Sorting and indexing BAM files..."
for bam_file in ${bam}/SO_11111_*_RG.bam; do
# Generate the sorted BAM file path within the mybam_sorted directory
sorted_bam="${sorted_bam_dir}/$(basename ${bam_file/.bam/.sorted.bam})"
# Sort the BAM file and output it to the specified directory
$SAMTOOLS sort -o "$sorted_bam" "$bam_file"
# Index the sorted BAM file
$SAMTOOLS index "$sorted_bam"
done
echo "Completed sorting and indexing BAM files."
bam="mybam_sorted"
fa="teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa"
DELLY=$(command -v delly)
# Create the directory
mkdir -p "$bam"
echo "Calling variants using Delly..."
# Collect all the BAM files into a single variable
bam_files=""
for sorted_bam in ${bam}/SO_11111_*_RG.sorted.bam; do
bam_files+="$sorted_bam "
done
# Call variants on all BAM files together
$DELLY call -o samples_delly.bcf -g "$fa" $bam_files
# Convert BCF to VCF
bcftools view samples_delly.bcf > samples_delly.vcf
echo "Compressing VCF files for all samples..."
# Create necessary output directories if they don't exist
mkdir -p myvcf mypack mygam myvcf_compressed
# Compress the VCF file
bgzip -o samples_delly.vcf.gz samples_delly.vcf
# Index the compressed VCF
tabix -f -p vcf samples_delly.vcf.gz
echo "Completed calling variants using Delly."
VG=$(which vg) # or VG=/usr/bin/vg, or unalias vg && VG=$(command -v vg)
sudo chmod -R 777 /tmp
fa="teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa"
vcf="samples_delly.vcf.gz"
prefix="pangenome_graph"
date
echo "Autoindexing the graph..."
$VG autoindex --workflow giraffe \
--prefix "$prefix" \
--ref-fasta "$fa" \
--vcf "$vcf" \
--threads 4 \
-R XG # generate an XG index alongside GBZ
echo "Completed autoindexing the graph."
date
fq="myfastq_trimmed"
prefix="pangenome_graph"
VG=$(which vg)
mkdir -p mygam
date
echo "Giraffe mapping of short reads..."
while read sample; do
echo "Processing sample $sample..."
$VG giraffe -Z "${prefix}.giraffe.gbz" -m "${prefix}.min" -d "${prefix}.dist" \
-f "${fq}/SO_11111_${sample}_R1_paired_trimmed.fq.gz" -f "${fq}/SO_11111_${sample}_R2_paired_trimmed.fq.gz" \
-t 4 -N "SO_11111_$sample" > "mygam/SO_11111_${sample}.gam"
done < sample.lst
echo "Completed Giraffe mapping"
date
I am getting warning in the final command
Giraffe mapping of short reads... Processing sample 10... Guessing that pangenome_graph.xg is XG warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 65 bp against 39878 bp dagified subgraph for read A00609:581:H7KGJDSX5:1:1101:20392:5149 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.
I have tried with single sample SO_11111_10 (fastq R1 & R2) + pangenome graphs (gbz, min, dist) but still getting it.
But the output GAM is being generated. What shall I do now? Should I proceed or not? If not, how to solve? Could it be related to this warning I got when running autoindex (but all indeces got created without errors)
warning:[vg::Constructor] Lowercase characters found in Scaffold_Un687; coercing to uppercase. Warning: insertion END and POS do not agree (complex insertions not canonicalizeable) [canonicalize] Scaffold_Un696 11586 INS00360992 T <INS> 1000
PASS
PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv0.9.1;END=11587;SVLEN=27;PE=0;MAPQ=0;CT=NtoN;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=60;INSLEN=27;HOMLEN=7;SR=20;SRQ=1;CONSENSUS=AATTAATTGCCCATTAATACGATTCTTGAATTCCAATAACAAAAACGAAAATTTTAAATTAAAAATTTAGGTTTGTGCTCTCTCGCTCTCATTAAAAATTTAGGTTTGTGCTCTCTCTCTCTCTCTCTCTCTCTCTCCCCACCCTAAAAGAAAATTAACATAAATAAATTCTGAAGCGTATTTAA;CE=1.84967;SPAN=27 END: 11587 POS: 11586
But isn't that a problem? Will my output not be incomplete if that read is skipped?
As a side note, I am following the pipeline
vg autoindex --> giraffe map --> pack --> snarls --> vg call for SV
Should I change anything? Or use xg instead? I came across a github issue thread but it is 3 years old so I am not sure if it is applicable now (https://github.com/vgteam/vg/issues/3369). But my pipeline works fine for lesser samples, and I was told autoindex pipeline is latest, and "XG graph should be unnecessary these days" ...
No, it shouldn't be a problem. It's missing just one read out of the whole set, and for this one its mate pair was mapped so you'll still have at least some of the information. It's pretty normal for some reads to not be mapped or to be mapped ambiguously or to the wrong place; genomes are complicated and repetitive and reads have errors so there will always be some amount of mapping ambiguity but downstream tools take this into account and use the whole read set to try to correct for the errors. There will always be errors though :)
The autoindex and giraffe step are definitely right, I'm less familiar with the SV calling part though. I think this is our most up-to-date pipeline https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling
Thank you for the clarity! Will continue with my pipeline then.