Question

R/perl/python script to bin scaffold coverage by cM for many samples

0

Entering edit mode

9.9 years ago

rob234king ▴ 610

I have two files, a file with each row showing the coverage for scaffold across many samples, and one file with position information as scaffold, chromosome and cM location tab delineated. please note file 1 is not scaffold ordered to the same as file2 it is just in this example.

file1:

scaffold_id       sample1   sample2 
Scaffold86180     10        5
Scaffold863688    20        0
Scaffold868772    23        6
Scaffold934477    1         7
Scaffold992750    23        67
Scaffold2030550   15        16
Scaffold2282709   156       17
Scaffold383332    178       1895
Scaffold3943711   10        185
Scaffold6328630   189       1975
Scaffold6682550   15        88
Scaffold844703    13        98
Scaffold876147    12        1
Scaffold1237644   12        5
Scaffold1243015   16        8
Scaffold1251638   18        36
Scaffold1422442   195       33
Scaffold1440480   14        3

file 2: position information

Scaffold86180       1A      0
Scaffold863688      1A      0
Scaffold868772      1A      0
Scaffold934477      1A      0
Scaffold992750      1A      0
Scaffold2030550     1A      1.7075
Scaffold2282709     1A      1.7075
Scaffold383332      1A      1.7075
Scaffold3943711     1A      1.7075
Scaffold6328630     1A      1.7075
Scaffold6682550     1A      1.7075
Scaffold844703      1A      1.7075
Scaffold876147      1A      1.7075
Scaffold1237644     1A      3.415
Scaffold1243015     1A      3.415
Scaffold1251638     1A      3.415
Scaffold1422442     1A      3.415
Scaffold1440480     1A      3.415

What I would like to do is maybe in R but perl/python/linux if easier, to operate on each column to sum the coverage by chromosome cM, for example to file 3 (note not real results and the file is 2gb so will not work in excel). Any ideas how to simply do this?

file 3

Chrom  cM        samp1    samp2
1A     0         10       34
1A     1.7075    23       45

perl python R • 2.6k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 9.9 years ago by rob234king ▴ 610

Ram · Answer 1 · 2015-11-13

0

Entering edit mode

9.9 years ago

ka6AjxQY ▴ 80

Assuming your files are tab-delimited and sort according to column 1, and also may not be the most ideal solution you were looking for, but try this:

join -t$'\t' -a1 -1 1 -2 1 f1 f2 | awk '{samp1[$4"\t"$5]+=$2; samp2[$4"\t"$5]+=$3} END { print "Chrom\tcM\tsamp1\tsamp2"; for (i in samp1) {print i"\t" samp1[i] "\t" samp2[i]}}' > file3

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 9.9 years ago by ka6AjxQY ▴ 80

Ram · Answer 2 · 2015-11-15

Dear rob234king,

I wrote a commented version of the script. To use the script you should have perl installed. place the files in the same folder as the script and on the command line write:

perl script.pl -file1 1.txt -file2 2.txt -cMlimit 4 -stepsize 2 -chr 1A

here, -file1 and -file2 refer to the options for your files 1.txt and 2.txt. -cMlimit is the maximum cM (be careful with big M) cM you want to compute to. -stepsize is the bin width of cM intervals, so if it is 2, than your bins will be 0-2, then 2-4 an so on. -chr is the chromosome name, in your case 1A. I commented the script and paste it here so everyone can use it. Please test it and tell me if there are mistakes etc. It might be useful to me later as well. Your files 1 and 2 need not to be ordered. Your result file will be in the same folder with name resultFile.txt.

I hope this helps,

Good luck with your research,

############################################CODE HERE###################################################

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

#the global variables
my $stepCounter = 0;
my $stepSize;
my $chr;
my @chrList;
my $fileName1;
my $fileName2;
my $sampleNumber;
my $cMlimit;

#get the options
GetOptions("stepsize:f" => \$stepSize,"chr:s" => \$chr,"file1=s"=>\$fileName1,"file2=s"=>\$fileName2, "cMlimit=i"=>\$cMlimit);
$fileName1 = "./".$fileName1;
$fileName2 = "./".$fileName2;

#if chr is not defined it uses a standard list of human chromosome names
if (defined $chr) {
    push(@chrList,$chr);
} else {
    @chrList = (1..22,"X","Y");
}
$stepSize = defined $stepSize ? $stepSize : 1; 

#open your file1 and file 2 which is in the same location of the script.
open(my $file1, '<',$fileName1) or die("cant open file 1!");
open(my $file2, '<',$fileName2) or die("cant open file 2!");
#skip the headers
skipHeader($file1);
skipHeader($file2);

#form a hash that will contain information for each scaffold the first key will be scaffold name, the entries will be name,chr,cM,sample1,sample2...
my %data = ();
while (<$file1>) {
    chomp;
    my @line = split(/\s+|\t+/,$_);
    $data{$line[0]}{"name"} = $line[0];
    for (my $i=1;$i<= $#line;$i++) {
        $data{$line[0]}{"sample".$i} = $line[$i]; 
    }
}
while (<$file2>) {    
    chomp;
    my @line = split(/\s+|\t+/,$_);
    $data{$line[0]}{"chr"} = $line[1];
    $data{$line[0]}{"cM"} = $line[2];
}

#initiate a file handle which will contain your result.
open(my $resultFile, '>',"./resultFile.txt");
#print the first header..
print $resultFile "chr"."\t"."cM"."\t"."samples..\n";

#iterate over the hash
for (my $i = 0;$i<scalar(@chrList);$i++) {
    $stepCounter = 0;
    for (my $j = $stepCounter*$stepSize;$j<=$cMlimit; $j+=$stepSize) {
        print $resultFile $chrList[$i]."\t".$stepCounter*$stepSize."\t";
        for (my $k=1;$k <=$sampleNumber;$k++) {
            my $sum = 0;
            my $count = 0;
            foreach my $key (keys %data) {
                if($data{$key}{"chr"} eq $chrList[$i] && $data{$key}{"cM"}>=$stepSize*$stepCounter && $data{$key}{"cM"}<$stepSize*($stepCounter+1)) {
                    $sum += $data{$key}{"sample".$k};
                    $count++;
                }
            }
            my $result;
            if ($count != 0) {
                $result = sprintf("%.1f",$sum/$count);
            } else {
                $result = 0;
            }
            if ($k == $sampleNumber) {
                print $resultFile $result."\n";
            } else {
                print $resultFile $result."\t";
            }
        }
        $stepCounter++;
    }
}

sub skipHeader {
    my $handle = $_[0];
    if ($_[0] == $file1) {
        my $line = <$handle>;
        $sampleNumber =   scalar(split(/\s+|\t+/,$line))-1;
    } else {
        <$handle>;
    }
}

############################################CODE HERE###################################################