Why is the output file so small after combining Hi-C reads with the Arima Genomics mapping pipeline?
rj.rezwan · 8 months ago

Hi, I am using the Arima Genomics pipeline on Hi-C data to scaffold my assembled contigs, and I ran the command below on SLURM. My input files, abc_f1.filter.bam and abc_f2.filter.bam, are 42 GB and 43 GB respectively, and my assembled contigs file (abc_assembly.hic.p_ctg.fasta) is 1.65 GB. After the combining and quality-check command, the output file (abc_combine2.bam) is only 72 MB, produced after about 2 hours of runtime. I have also shared the tail of the log file below, and there is no error in it. Since the code runs without errors, can I assume my output file is fine for the next step?

#!/bin/bash
#
#SBATCH --job-name=combine
#SBATCH --output=combine.%j.out
#SBATCH --partition=batch
#SBATCH --cpus-per-task=32
#SBATCH --time=25:00:00
#SBATCH --mem=200G

MAPQ_FILTER=10
module load samtools/1.16.1 perl/5.38.0/intel2022.3
export PATH=~path/arima_yahs/arima_pipeline/:$PATH

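# Merge the two filtered single-end BAMs into one paired-end BAM, drop pairs
# below the MAPQ cutoff, then coordinate-sort the result.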
two_read_bam_combiner.pl abc_f1.filter.bam abc_f2.filter.bam samtools $MAPQ_FILTER | samtools view -bS -t abc_assembly.hic.p_ctg.fasta - | samtools sort -@ 32 -o abc_combine2.bam

Here is the tail of the log file:

413000000
414000000
415000000
416000000
417000000
418000000
419000000
420000000
421000000
422000000
[bam_sort_core] merging from 0 files and 32 in-memory blocks...
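The numbers in the log are just the combiner's progress counter of read pairs processed, so roughly 422 million pairs went in. A quick way to see how many alignments actually ended up in the output (a sketch, assuming samtools is on the PATH) is:

samtools flagstat abc_f1.filter.bam    # alignments in the read-1 input
samtools flagstat abc_f2.filter.bam    # alignments in the read-2 input
samtools flagstat abc_combine2.bam     # alignments kept in the combined output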

I would be grateful for any help.

bwa mapping genome samtools plants

Since $MAPQ_FILTER is set to 10, one guess is that it is somehow excluding/filtering a large number of alignments.
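One way to test that guess (a sketch using the file names from the question, assuming samtools is on the PATH) is to count how many alignments would clear the cutoff:

samtools view -c abc_f1.filter.bam          # total alignments in the read-1 BAM
samtools view -c -q 10 abc_f1.filter.bam    # alignments with MAPQ >= 10
samtools view -c -f 4 abc_f1.filter.bam     # unmapped records, which are also dropped

Running the same counts on abc_f2.filter.bam shows whether the second mate is the limiting factor.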


What do you suggest here? Should I remove the MAPQ_FILTER=10 parameter to keep all alignments?


I don't know what else the Perl script is doing, but that would be worth a try.


Here is the two_read_bam_combiner.pl script:

#!/usr/bin/perl
use strict;
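# Read two name-sorted single-end BAMs in lock-step via samtools, check that
# their @SQ header lines match, rebuild paired-end FLAG and mate fields, and
# print only pairs in which both mates are mapped with MAPQ >= the threshold.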

MAIN : {
    my ($read1_bam, $read2_bam, $samtools, $mq) = @ARGV;

        if ( (! defined($read1_bam)) || (! defined($read2_bam)) ){
        die ("Usage: ./two_read_bam_combiner.pl <read 1 bam> <read 2 bam> <path to samtools> <minimum map quality filter>\n");
    }

    open(FILE1, "$samtools view -h $read1_bam |");
    open(FILE2, "$samtools view -h $read2_bam |");

    my $line1 = <FILE1>;
    my $line2 = <FILE2>;

    my $counter = 0;
    my $new_counter = 0;

    while (defined($line1)){
        if ($line1 =~ /^(\@)SQ/){
            if ($line1 ne $line2){
                print($line1);
                print($line2);
                die ("Inconsistent BAM headers. BAM files must be aligned to same reference.");
            }
            else{
                print($line1);
            }
            $line1 = <FILE1>;
            $line2 = <FILE2>;
            next;
        }

        $counter++;
        if ($counter == ($new_counter + 1000000)){
            print(STDERR $counter . "\n");
            $new_counter = $counter;
        }

        chomp $line1;
        chomp $line2;

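        # SAM fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL, ...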
        my ($id1, $flag1, $chr_from1, $loc_from1, $mapq1, $cigar1, $d1_1, $d2_1, $d3_1, $read1, $read_qual1, @rest1) = split(/\t/, $line1);
        my ($id2, $flag2, $chr_from2, $loc_from2, $mapq2, $cigar2, $d1_2, $d2_2, $d3_2, $read2, $read_qual2, @rest2) = split(/\t/, $line2);

        if ($id1 ne $id2){
            die ("The read id's of the two files do not match up at line number $counter. Files should be from the same sample and sorted in identical order.\n");
        }

        my $bin1 = reverse(dec2bin($flag1));
        my $bin2 = reverse(dec2bin($flag2));

        my @binary1 = split(//, $bin1);
        my @binary2 = split(//, $bin2);

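        # A pair is flagged as "trouble" (and skipped below) if either mate is
        # unmapped (FLAG bit 0x4) or has MAPQ below the supplied threshold.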
        my $trouble = 0;
        if (($binary1[2] == 1) || ($mapq1 < $mq)){
            $trouble = 1;
        }
        if (($binary2[2]== 1) || ($mapq2 < $mq)) {
            $trouble = 1;
        }

        my $proper_pair1;
        my $proper_pair2;
        my $dist1;
        my $dist2;

        if (($binary1[2] == 0) && ($binary2[2] == 0)){
            $proper_pair1 = 1;
            $proper_pair2 = 1;
            if ($chr_from1 eq $chr_from2){
                my $dist = abs($loc_from1 - $loc_from2);
                if ($loc_from1 >= $loc_from2){
                    $dist1 = -1 * $dist;
                    $dist2 = $dist;
                }
                else{
                    $dist1 = $dist;
                    $dist2 = -1 * $dist;
                }
            }
            else{
                $dist1 = 0;
                $dist2 = 0;
            }
        }
        else{
            $proper_pair1 = 0;
            $proper_pair2 = 0;
            $dist1 = 0;
            $dist2 = 0;
        }

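        # Rebuild each mate's FLAG: paired (0x1), proper pair (0x2), self/mate
        # unmapped (0x4/0x8), self/mate reverse strand (0x10/0x20), first/second
        # in pair (0x40/0x80), carrying over the QC-fail (0x200) and duplicate
        # (0x400) bits.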
        my $new_bin1 = join("", "0" x 21, $binary1[10], $binary1[9], "001", $binary2[4], $binary1[4], $binary2[2], $binary1[2], $proper_pair1, "1");
        my $new_bin2 = join("", "0" x 21, $binary2[10], $binary2[9], "010", $binary1[4], $binary2[4], $binary1[2], $binary2[2], $proper_pair2, "1");

        my $new_flag1 = bin2dec($new_bin1);
        my $new_flag2 = bin2dec($new_bin2);

        unless ($trouble == 1){
            print(join("\t", $id1, $new_flag1, $chr_from1, $loc_from1, $mapq1, $cigar1, $chr_from2, $loc_from2, $dist1, $read1, $read_qual1, @rest1) . "\n");
            print(join("\t", $id2, $new_flag2, $chr_from2, $loc_from2, $mapq2, $cigar2, $chr_from1, $loc_from1, $dist2, $read2, $read_qual2, @rest2) . "\n");
        }
        $line1 = <FILE1>;
        $line2 = <FILE2>;
    }
}

sub dec2bin {
    return unpack("B32", pack("N", shift));
}

sub bin2dec {
    return unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}

Did you try the script without the filter? Did that restore the alignments in the final file?
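For example, something along these lines (a sketch reusing the command from the question; the _noMQ output name is just a placeholder):

MAPQ_FILTER=0   # keep every pair in which both mates are mapped
two_read_bam_combiner.pl abc_f1.filter.bam abc_f2.filter.bam samtools $MAPQ_FILTER \
  | samtools view -bS -t abc_assembly.hic.p_ctg.fasta - \
  | samtools sort -@ 32 -o abc_combine2_noMQ.bam

If the new BAM comes out much larger, the MAPQ cutoff is what is shrinking the output; if it is still around 72 MB, the loss is happening upstream of the filter.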


I have not tried that option yet. I will try it and share the results if they look good.
