Question

VG giraffe autoindex and alignement mapping warning "Refusing to perform too-large rescue alignment"

0

Entering edit mode

4 weeks ago

s_135 ▴ 10

My aim is to make teak pangenome using short reads and then call SVs from it.

bam_dir="/home/tnau104/Documents/bam_files"


echo "Creating sample.lst from BAM files..."

ls ${bam_dir}/SO_11111_*.bam | sed 's|.*SO_11111_\([0-9A-Za-z_]*\)_.*|\1|' | sort | uniq > sample.lst
        echo "Completed sample.lst creation."


  fq="myfastq"
  TRIMMOMATIC="/usr/share/Trimmomatic-0.39/dist/jar/trimmomatic-0.39.jar"
  ILLUMINACLIP="/usr/share/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa:2:30:10"
  trimmed_fastq="myfastq_trimmed"

  # Create the trimmed FASTQ directory if it doesn't exist
  mkdir -p "$trimmed_fastq"

  echo "Trimming reads for all samples..."
  while read sample; do
        java -jar "$TRIMMOMATIC" PE -phred33 \
            "${fq}/SO_11111_${sample}_R1.fq.gz" "${fq}/SO_11111_${sample}_R2.fq.gz" \
            "${trimmed_fastq}/SO_11111_${sample}_R1_paired_trimmed.fq.gz" "${trimmed_fastq}/SO_11111_${sample}_R1_unpaired_trimmed.fq.gz" \
            "${trimmed_fastq}/SO_11111_${sample}_R2_paired_trimmed.fq.gz" "${trimmed_fastq}/SO_11111_${sample}_R2_unpaired_trimmed.fq.gz" \
            ILLUMINACLIP:$ILLUMINACLIP LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
  done < sample.lst
  end_time=$(date +"%Y-%m-%d %H:%M:%S")
  echo "Completed trimming reads."


    bam="mybam"
    sorted_bam_dir="mybam_sorted"
    SAMTOOLS=$(command -v samtools)

    # Create the directory
    mkdir -p "$sorted_bam_dir"
    mkdir -p "$bam"

    echo "Sorting and indexing BAM files..."
    for bam_file in ${bam}/SO_11111_*_RG.bam; do
        # Generate the sorted BAM file path within the mybam_sorted directory
        sorted_bam="${sorted_bam_dir}/$(basename ${bam_file/.bam/.sorted.bam})"

        # Sort the BAM file and output it to the specified directory
        $SAMTOOLS sort -o "$sorted_bam" "$bam_file"

        # Index the sorted BAM file
        $SAMTOOLS index "$sorted_bam"
    done
    echo "Completed sorting and indexing BAM files."

    bam="mybam_sorted"
    fa="teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa"
    DELLY=$(command -v delly)

    # Create the directory
    mkdir -p "$bam"

    echo "Calling variants using Delly..."

    # Collect all the BAM files into a single variable
    bam_files=""
    for sorted_bam in ${bam}/SO_11111_*_RG.sorted.bam; do
        bam_files+="$sorted_bam "
    done

    # Call variants on all BAM files together
    $DELLY call -o samples_delly.bcf -g "$fa" $bam_files

    # Convert BCF to VCF
    bcftools view samples_delly.bcf > samples_delly.vcf

echo "Compressing VCF files for all samples..."

# Create necessary output directories if they don't exist
mkdir -p myvcf mypack mygam myvcf_compressed

# Compress the VCF file
bgzip -o samples_delly.vcf.gz samples_delly.vcf

# Index the compressed VCF
tabix -f -p vcf samples_delly.vcf.gz

echo "Completed calling variants using Delly."

VG=$(which vg)  # or VG=/usr/bin/vg, or unalias vg && VG=$(command -v vg)
sudo chmod -R 777 /tmp

fa="teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa"
vcf="samples_delly.vcf.gz"
prefix="pangenome_graph"

date
echo "Autoindexing the graph..."
$VG autoindex --workflow giraffe \
    --prefix "$prefix" \
    --ref-fasta "$fa" \
    --vcf "$vcf" \
    --threads 4 \
    -R XG  # generate an XG index alongside GBZ
echo "Completed autoindexing the graph."
date

fq="myfastq_trimmed"
prefix="pangenome_graph"
VG=$(which vg)

mkdir -p mygam 
date
echo "Giraffe mapping of short reads..."

while read sample; do
    echo "Processing sample $sample..."
    $VG giraffe -Z "${prefix}.giraffe.gbz" -m "${prefix}.min" -d "${prefix}.dist" \
        -f "${fq}/SO_11111_${sample}_R1_paired_trimmed.fq.gz" -f "${fq}/SO_11111_${sample}_R2_paired_trimmed.fq.gz" \
        -t 4 -N "SO_11111_$sample" > "mygam/SO_11111_${sample}.gam"
done < sample.lst

echo "Completed Giraffe mapping"
date

I am getting warning in the final command

Giraffe mapping of short reads... Processing sample 10... Guessing that pangenome_graph.xg is XG warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 65 bp against 39878 bp dagified subgraph for read A00609:581:H7KGJDSX5:1:1101:20392:5149 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.

I have tried with single sample SO_11111_10 (fastq R1 & R2) + pangenome graphs (gbz, min, dist) but still getting it.

But the output GAM is being generated. What shall I do now? Should I proceed or not? If not, how to solve? Could it be related to this warning I got when running autoindex (but all indeces got created without errors)

warning:[vg::Constructor] Lowercase characters found in Scaffold_Un687; coercing to uppercase. Warning: insertion END and POS do not agree (complex insertions not canonicalizeable) [canonicalize] Scaffold_Un696 11586 INS00360992 T <INS> 1000
PASS
PRECISE;SVTYPE=INS;SVMETHOD=EMBL.DELLYv0.9.1;END=11587;SVLEN=27;PE=0;MAPQ=0;CT=NtoN;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=60;INSLEN=27;HOMLEN=7;SR=20;SRQ=1;CONSENSUS=AATTAATTGCCCATTAATACGATTCTTGAATTCCAATAACAAAAACGAAAATTTTAAATTAAAAATTTAGGTTTGTGCTCTCTCGCTCTCATTAAAAATTTAGGTTTGTGCTCTCTCTCTCTCTCTCTCTCTCTCTCCCCACCCTAAAAGAAAATTAACATAAATAAATTCTGAAGCGTATTTAA;CE=1.84967;SPAN=27 END: 11587 POS: 11586

variant calling pangenome giraffe vg • 415 views

ADD COMMENT • link 4 weeks ago by s_135 ▴ 10

score 2 · Accepted Answer · 2024-10-28

2

Entering edit mode

4 weeks ago

Xian ▴ 80

If you got an output GAM then you should be able to proceed. That warning is normal, it's just letting you know that the alignment problem for that read would use too much memory, so it didn't align it.

ADD COMMENT • link 4 weeks ago by Xian ▴ 80

0

Entering edit mode

it's just letting you know that the alignment problem for that read would use too much memory, so it didn't align it

But isn't that a problem? Will my output not be incomplete if that read is skipped?

As a side note, I am following the pipeline

vg autoindex --> giraffe map --> pack --> snarls --> vg call for SV

Should I change anything? Or use xg instead? I came across a github issue thread but it is 3 years old so I am not sure if it is applicable now (https://github.com/vgteam/vg/issues/3369). But my pipeline works fine for lesser samples, and I was told autoindex pipeline is latest, and "XG graph should be unnecessary these days" ...

ADD REPLY • link 4 weeks ago by s_135 ▴ 10

0

Entering edit mode

No, it shouldn't be a problem. It's missing just one read out of the whole set, and for this one its mate pair was mapped so you'll still have at least some of the information. It's pretty normal for some reads to not be mapped or to be mapped ambiguously or to the wrong place; genomes are complicated and repetitive and reads have errors so there will always be some amount of mapping ambiguity but downstream tools take this into account and use the whole read set to try to correct for the errors. There will always be errors though :)

The autoindex and giraffe step are definitely right, I'm less familiar with the SV calling part though. I think this is our most up-to-date pipeline https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling