R/perl/python script to bin scaffold coverage by cM for many samples
2
0
Entering edit mode
9.0 years ago
rob234king ▴ 610

I have two files, a file with each row showing the coverage for scaffold across many samples, and one file with position information as scaffold, chromosome and cM location tab delineated. please note file 1 is not scaffold ordered to the same as file2 it is just in this example.

file1:

scaffold_id       sample1   sample2 
Scaffold86180     10        5
Scaffold863688    20        0
Scaffold868772    23        6
Scaffold934477    1         7
Scaffold992750    23        67
Scaffold2030550   15        16
Scaffold2282709   156       17
Scaffold383332    178       1895
Scaffold3943711   10        185
Scaffold6328630   189       1975
Scaffold6682550   15        88
Scaffold844703    13        98
Scaffold876147    12        1
Scaffold1237644   12        5
Scaffold1243015   16        8
Scaffold1251638   18        36
Scaffold1422442   195       33
Scaffold1440480   14        3

file 2: position information

Scaffold86180       1A      0
Scaffold863688      1A      0
Scaffold868772      1A      0
Scaffold934477      1A      0
Scaffold992750      1A      0
Scaffold2030550     1A      1.7075
Scaffold2282709     1A      1.7075
Scaffold383332      1A      1.7075
Scaffold3943711     1A      1.7075
Scaffold6328630     1A      1.7075
Scaffold6682550     1A      1.7075
Scaffold844703      1A      1.7075
Scaffold876147      1A      1.7075
Scaffold1237644     1A      3.415
Scaffold1243015     1A      3.415
Scaffold1251638     1A      3.415
Scaffold1422442     1A      3.415
Scaffold1440480     1A      3.415

What I would like to do is maybe in R but perl/python/linux if easier, to operate on each column to sum the coverage by chromosome cM, for example to file 3 (note not real results and the file is 2gb so will not work in excel). Any ideas how to simply do this?

file 3

Chrom  cM        samp1    samp2
1A     0         10       34
1A     1.7075    23       45
perl python R • 2.1k views
ADD COMMENT
0
Entering edit mode
9.0 years ago
ka6AjxQY ▴ 80

Assuming your files are tab-delimited and sort according to column 1, and also may not be the most ideal solution you were looking for, but try this:

join -t$'\t' -a1 -1 1 -2 1 f1 f2 | awk '{samp1[$4"\t"$5]+=$2; samp2[$4"\t"$5]+=$3} END { print "Chrom\tcM\tsamp1\tsamp2"; for (i in samp1) {print i"\t" samp1[i] "\t" samp2[i]}}' > file3
ADD COMMENT
0
Entering edit mode
9.0 years ago

Dear rob234king,

I wrote a commented version of the script. To use the script you should have perl installed. place the files in the same folder as the script and on the command line write:

perl script.pl -file1 1.txt -file2 2.txt -cMlimit 4 -stepsize 2 -chr 1A

here, -file1 and -file2 refer to the options for your files 1.txt and 2.txt. -cMlimit is the maximum cM (be careful with big M) cM you want to compute to. -stepsize is the bin width of cM intervals, so if it is 2, than your bins will be 0-2, then 2-4 an so on. -chr is the chromosome name, in your case 1A. I commented the script and paste it here so everyone can use it. Please test it and tell me if there are mistakes etc. It might be useful to me later as well. Your files 1 and 2 need not to be ordered. Your result file will be in the same folder with name resultFile.txt.

I hope this helps,

Good luck with your research,

############################################CODE HERE###################################################

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

#the global variables
my $stepCounter = 0;
my $stepSize;
my $chr;
my @chrList;
my $fileName1;
my $fileName2;
my $sampleNumber;
my $cMlimit;

#get the options
GetOptions("stepsize:f" => \$stepSize,"chr:s" => \$chr,"file1=s"=>\$fileName1,"file2=s"=>\$fileName2, "cMlimit=i"=>\$cMlimit);
$fileName1 = "./".$fileName1;
$fileName2 = "./".$fileName2;

#if chr is not defined it uses a standard list of human chromosome names
if (defined $chr) {
    push(@chrList,$chr);
} else {
    @chrList = (1..22,"X","Y");
}
$stepSize = defined $stepSize ? $stepSize : 1; 

#open your file1 and file 2 which is in the same location of the script.
open(my $file1, '<',$fileName1) or die("cant open file 1!");
open(my $file2, '<',$fileName2) or die("cant open file 2!");
#skip the headers
skipHeader($file1);
skipHeader($file2);

#form a hash that will contain information for each scaffold the first key will be scaffold name, the entries will be name,chr,cM,sample1,sample2...
my %data = ();
while (<$file1>) {
    chomp;
    my @line = split(/\s+|\t+/,$_);
    $data{$line[0]}{"name"} = $line[0];
    for (my $i=1;$i<= $#line;$i++) {
        $data{$line[0]}{"sample".$i} = $line[$i]; 
    }
}
while (<$file2>) {    
    chomp;
    my @line = split(/\s+|\t+/,$_);
    $data{$line[0]}{"chr"} = $line[1];
    $data{$line[0]}{"cM"} = $line[2];
}

#initiate a file handle which will contain your result.
open(my $resultFile, '>',"./resultFile.txt");
#print the first header..
print $resultFile "chr"."\t"."cM"."\t"."samples..\n";

#iterate over the hash
for (my $i = 0;$i<scalar(@chrList);$i++) {
    $stepCounter = 0;
    for (my $j = $stepCounter*$stepSize;$j<=$cMlimit; $j+=$stepSize) {
        print $resultFile $chrList[$i]."\t".$stepCounter*$stepSize."\t";
        for (my $k=1;$k <=$sampleNumber;$k++) {
            my $sum = 0;
            my $count = 0;
            foreach my $key (keys %data) {
                if($data{$key}{"chr"} eq $chrList[$i] && $data{$key}{"cM"}>=$stepSize*$stepCounter && $data{$key}{"cM"}<$stepSize*($stepCounter+1)) {
                    $sum += $data{$key}{"sample".$k};
                    $count++;
                }
            }
            my $result;
            if ($count != 0) {
                $result = sprintf("%.1f",$sum/$count);
            } else {
                $result = 0;
            }
            if ($k == $sampleNumber) {
                print $resultFile $result."\n";
            } else {
                print $resultFile $result."\t";
            }
        }
        $stepCounter++;
    }
}

sub skipHeader {
    my $handle = $_[0];
    if ($_[0] == $file1) {
        my $line = <$handle>;
        $sampleNumber =   scalar(split(/\s+|\t+/,$line))-1;
    } else {
        <$handle>;
    }
}

############################################CODE HERE###################################################
ADD COMMENT
0
Entering edit mode

Hi Ibrahim, Thanks for the script. I'm just looking through it to understand it. There is an arbitrary 2cM bin that the user specifies, is it an easy fix to have a variable bin size according to the data as below example. Due to varying sizes of scaffolds the bin sizes are variable too. I think just changing below line with the $cMlimit of the data should do it but I'm not sure yet how to implement the logic..

for (my $j = $stepCounter*$stepSize;$j<=$cMlimit; $j+=$stepSize) {

cM

0
1.7075
3.415
12.608
14.7975
16.987
18.213
19.439
ADD REPLY
0
Entering edit mode

Dear rob234king,

I did what you wanted. Now, instead of using -stepSize argument you use a -stepFile argument. This should be a HEADERLESS file with bin start values separated by new line characters (here it looks like as if there are double newlines, not like here, like 1 newline), for example:

0
2
3

So here your bins will be 0-2 and 2-3. In the command line type (just an example):

perl scriptNew.pl -chr 1A -file1 1.txt -file2 2.txt -stepFile stepList.txt -cMlimit 5

Of course, all three files are assumed to be in the same directory as the script. I paste the commented script below. Compare both scripts to see how the flow is modified.

I hope this helps,

############################################CODE HERE###########################################

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

#the global variables
my $stepCounter = 0;
my $stepFile;
my @stepList;
my $chr;
my @chrList;
my $fileName1;
my $fileName2;
my $sampleNumber;
my $cMlimit;

#get the options
GetOptions("stepFile=s" => \$stepFile,"chr:s" => \$chr,"file1=s"=>\$fileName1,"file2=s"=>\$fileName2, "cMlimit=i"=>\$cMlimit);
$fileName1 = "./".$fileName1;
$fileName2 = "./".$fileName2;
$stepFile = "./".$stepFile;

#if chr is not defined it uses a standard list of human chromosome names
if (defined $chr) {
    push(@chrList,$chr);
} else {
    @chrList = (1..22,"X","Y");
} 

#open your file1 and file 2 which is in the same location of the script. Now you need 3 files, the last one is a step file with steps separated by newlines.It should be HEADERLESS.
open(my $file1, '<',$fileName1) or die("cant open file 1!");
open(my $file2, '<',$fileName2) or die("cant open file 2!");
open(my $file3, '<',$stepFile) or die("cant open stepFile!");
#skip the headers
skipHeader($file1);
skipHeader($file2);

#form a hash that will contain information for each scaffold the first key will be scaffold name, the entries will be name,chr,cM,sample1,sample2...
my %data = ();
while (<$file1>) {
    chomp;
    my @line = split(/\s+|\t+/,$_);
    $data{$line[0]}{"name"} = $line[0];
    for (my $i=1;$i<= $#line;$i++) {
        $data{$line[0]}{"sample".$i} = $line[$i]; 
    }
}
while (<$file2>) {    
    chomp;
    my @line = split(/\s+|\t+/,$_);
    $data{$line[0]}{"chr"} = $line[1];
    $data{$line[0]}{"cM"} = $line[2];
}
#push the bin steps in a array called @stepList
while (<$file3>) {    
    chomp;
    push(@stepList,$_);
}

#initiate a file handle which will contain your result.
open(my $resultFile, '>',"./resultFile.txt");
#print the first header..
print $resultFile "chr"."\t"."cM"."\t"."samples..\n";

#iterate over the hash
for (my $i = 0;$i<scalar(@chrList);$i++) {
    $stepCounter = 0;
    for (my $j = $stepList[$stepCounter];$j<=$cMlimit;) {
        print $resultFile $chrList[$i]."\t".$stepList[$stepCounter]."\t";
        for (my $k=1;$k <=$sampleNumber;$k++) {
            my $sum = 0;
            my $count = 0;
            foreach my $key (keys %data) {
                if($data{$key}{"chr"} eq $chrList[$i] &amp;&amp; $data{$key}{"cM"}>=$stepList[$stepCounter] && $data{$key}{"cM"}<$stepList[$stepCounter+1]) {
                    $sum += $data{$key}{"sample".$k};
                    $count++;
                }
            }
            my $result;
            if ($count != 0) {
                $result = sprintf("%.1f",$sum/$count);
            } else {
                $result = 0;
            }
            if ($k == $sampleNumber) {
                print $resultFile $result."\n";
            } else {
                print $resultFile $result."\t";
            }
        }
        $stepCounter++;
        #while you are iterating over the hash, it might that the cM limit is still not reached while the step list of bins are finished, if so, then terminate the loop.
        if ($stepCounter == scalar(@stepList)) {
            last;
        } else {
            $j = $stepList[$stepCounter];
        }
    }
}

sub skipHeader {
    my $handle = $_[0];
    if ($_[0] == $file1) {
        my $line = <$handle>;
        $sampleNumber =   scalar(split(/\s+|\t+/,$line))-1;
    } else {
        <$handle>;
    }
}

############################################CODE HERE###########################################
ADD REPLY

Login before adding your answer.

Traffic: 1512 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6