Why is the output file so small after combining the Hi-C reads using the Arima Genomics mapping pipeline?
8 months ago by rj.rezwan ▴ 10

Hi, I am using the Arima Genomics pipeline for Hi-C data to scaffold my assembled contigs, and I ran the command below on SLURM. My input files, abc_f1.filter.bam and abc_f2.filter.bam, are 42 GB and 43 GB respectively, and my assembled contigs file (abc_assembly.hic.p_ctg.fasta) is 1.65 GB. After the combining and quality-filter step, the output file (abc_combine2.bam) is only 72 MB, after about 2 hours of runtime. I have also shared the tail of the log file and there is no error in it. Since the code ran without errors, can I assume my output file is fine for the next step?

#!/bin/bash
#
#SBATCH --job-name=combine
#SBATCH --output=combine.%j.out
#SBATCH --partition=batch
#SBATCH --cpus-per-task=32
#SBATCH --time=25:00:00
#SBATCH --mem=200G

MAPQ_FILTER=10   # minimum mapping quality required of each mate for a pair to be kept
module load samtools/1.16.1 perl/5.38.0/intel2022.3
export PATH=~path/arima_yahs/arima_pipeline/:$PATH

# combine the two single-end BAMs into paired records, convert to BAM, and coordinate-sort
two_read_bam_combiner.pl abc_f1.filter.bam abc_f2.filter.bam samtools $MAPQ_FILTER | samtools view -bS -t abc_assembly.hic.p_ctg.fasta - | samtools sort -@ 32 -o abc_combine2.bam
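
For reference, one way to judge whether the 72 MB output is plausible is to compare read counts rather than file sizes (a minimal sketch, using the same samtools module loaded above and the file names from the job script):

samtools quickcheck abc_combine2.bam && echo "BAM is intact"   # check the file is not truncated
samtools flagstat abc_combine2.bam                             # how many pairs were actually written
samtools view -c abc_f1.filter.bam                             # alignments going in (read 1)
samtools view -c abc_f2.filter.bam                             # alignments going in (read 2)

Because the combiner writes a pair only when both mates are mapped with MAPQ at or above the filter (see the script further down in this thread), the output can legitimately be much smaller than the inputs; the counts show how much was dropped.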

Here is the tail of the log file:

413000000
414000000
415000000
416000000
417000000
418000000
419000000
420000000
421000000
422000000
[bam_sort_core] merging from 0 files and 32 in-memory blocks...

I shall be grateful to you.

bwa mapping genome samtools plants

$MAPQ_FILTER

Since this is set to 10, one guess is that it is somehow excluding/filtering a large number of alignments.
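
One way to test that guess before re-running anything is to look at the MAPQ distribution of one of the inputs (a sketch; head just takes a rough sample of the first million records to keep it quick, and column 5 of the SAM output is MAPQ):

samtools view abc_f1.filter.bam | head -n 1000000 | cut -f5 | sort -n | uniq -c

If most records sit below 10, the filter alone would largely explain the small output.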


What do you suggest here? Should I remove this MAPQ_FILTER=10 parameter to get all alignments?


I don't know what else the Perl script is doing, but that would be worth a try.
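
For example, something like the following, mirroring the original command but with the threshold set to 0 (a sketch; abc_combine_nofilter.bam is just a placeholder output name):

MAPQ_FILTER=0   # keep pairs regardless of mapping quality; the script still drops pairs with an unmapped mate
two_read_bam_combiner.pl abc_f1.filter.bam abc_f2.filter.bam samtools $MAPQ_FILTER | samtools view -bS -t abc_assembly.hic.p_ctg.fasta - | samtools sort -@ 32 -o abc_combine_nofilter.bam
samtools flagstat abc_combine_nofilter.bam   # compare the pair count with the MAPQ-10 run

If the count barely changes, the MAPQ filter is not the cause and the problem is more likely upstream, in the mapping/filtering step.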


Here is the two_read_bam_combiner.pl script:

#!/usr/bin/perl
use strict;

MAIN : {
    my ($read1_bam, $read2_bam, $samtools, $mq) = @ARGV;

    if ((! defined($read1_bam)) || (! defined($read2_bam))){
        die ("Usage: ./two_read_bam_combiner.pl <read 1 bam> <read 2 bam> <path to samtools> <minimum map quality filter>\n");
    }

    open(FILE1, "$samtools view -h $read1_bam |");
    open(FILE2, "$samtools view -h $read2_bam |");

    my $line1 = <FILE1>;
    my $line2 = <FILE2>;

    my $counter = 0;
    my $new_counter = 0;

    while (defined($line1)){
        if ($line1 =~ /^(\@)SQ/){
            if ($line1 ne $line2){
                print($line1);
                print($line2);
                die ("Inconsistent BAM headers. BAM files must be aligned to same reference.");
            }
            else{
                print($line1);
            }
            $line1 = <FILE1>;
            $line2 = <FILE2>;
            next;
        }

        $counter++;
        if ($counter == ($new_counter + 1000000)){
            print(STDERR $counter . "\n");
            $new_counter = $counter;
        }

        chomp $line1;
        chomp $line2;

        my ($id1, $flag1, $chr_from1, $loc_from1, $mapq1, $cigar1, $d1_1, $d2_1, $d3_1, $read1, $read_qual1, @rest1) = split(/\t/, $line1);
        my ($id2, $flag2, $chr_from2, $loc_from2, $mapq2, $cigar2, $d1_2, $d2_2, $d3_2, $read2, $read_qual2, @rest2) = split(/\t/, $line2);

        if ($id1 ne $id2){
            die ("The read id's of the two files do not match up at line number $counter. Files should be from the same sample and sorted in identical order.\n");
        }

        my $bin1 = reverse(dec2bin($flag1));
        my $bin2 = reverse(dec2bin($flag2));

        my @binary1 = split(//, $bin1);
        my @binary2 = split(//, $bin2);

        # mark the pair as "trouble" (skipped below) if either mate is unmapped (FLAG bit 0x4) or has MAPQ below the threshold
        my $trouble = 0;
        if (($binary1[2] == 1) || ($mapq1 < $mq)){
            $trouble = 1;
        }
        if (($binary2[2] == 1) || ($mapq2 < $mq)) {
            $trouble = 1;
        }

        my $proper_pair1;
        my $proper_pair2;
        my $dist1;
        my $dist2;

        if (($binary1[2] == 0) && ($binary2[2] == 0)){
            $proper_pair1 = 1;
            $proper_pair2 = 1;
            if ($chr_from1 eq $chr_from2){
                my $dist = abs($loc_from1 - $loc_from2);
                if ($loc_from1 >= $loc_from2){
                    $dist1 = -1 * $dist;
                    $dist2 = $dist;
                }
                else{
                    $dist1 = $dist;
                    $dist2 = -1 * $dist;
                }
            }
            else{
                $dist1 = 0;
                $dist2 = 0;
            }
        }
        else{
            $proper_pair1 = 0;
            $proper_pair2 = 0;
            $dist1 = 0;
            $dist2 = 0;
        }

        # rebuild paired-end FLAG values: set the paired and read1/read2 bits, keep each mate's own strand/unmapped bits, copy the other mate's into the mate fields, and use the proper-pair bit computed above
        my $new_bin1 = join("", "0" x 21, $binary1[10], $binary1[9], "001", $binary2[4], $binary1[4], $binary2[2], $binary1[2], $proper_pair1, "1");
        my $new_bin2 = join("", "0" x 21, $binary2[10], $binary2[9], "010", $binary1[4], $binary2[4], $binary1[2], $binary2[2], $proper_pair2, "1");

        my $new_flag1 = bin2dec($new_bin1);
        my $new_flag2 = bin2dec($new_bin2);

        unless ($trouble == 1){
            print(join("\t", $id1, $new_flag1, $chr_from1, $loc_from1, $mapq1, $cigar1, $chr_from2, $loc_from2, $dist1, $read1, $read_qual1, @rest1) . "\n");
            print(join("\t", $id2, $new_flag2, $chr_from2, $loc_from2, $mapq2, $cigar2, $chr_from1, $loc_from1, $dist2, $read2, $read_qual2, @rest2) . "\n");
        }
        $line1 = <FILE1>;
        $line2 = <FILE2>;
    }
}

sub dec2bin {
    return unpack("B32", pack("N", shift));
}

sub bin2dec {
    return unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}
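
So a pair is written only when both mates are mapped and both have MAPQ at or above the threshold. To get a feel for how many records in one input satisfy that per-mate test (a sketch; worth running on abc_f2.filter.bam as well):

samtools view -c -f 4 abc_f1.filter.bam        # records flagged unmapped, always dropped
samtools view -c -F 4 -q 10 abc_f1.filter.bam  # mapped records with MAPQ >= 10, i.e. mates that can survive

The number of pairs in the combined BAM can be at most the smaller of the two per-file surviving counts.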

Did you try the script without the filter? Did that restore the alignments in the final file?


I have not tried that option yet. I will try it and share the results if they look good.
