UPDATE
So far, the provided code plus some extra commands gets me very close to my goal of having the output like this (for example):
CHR POS REF ALT Homozygous_Ref Homozygous_Alt Heterozygous
1 10177 A AC 694 320 1490
1 10190 T AG 0 100 3000
1 94848 C CT 482 0 3410
*The "0" represents a genotype that is not listed for this chr pos + ref/alt combination. (see the definition section below). The "0" is a place holder and keeps the correct info in the correct column.
The only problem is making that "0" place holder appear when the genotype is not present.
If anyone knows how to make that "0" place holder appear, let me know. Thanks!
Also these type of lines contain genotype counts that are also challenging to get into the right columns:
chr1:15274 A/G,T A|A=2 A|G=5 A|T=21 G|A=3 G|G=89 G|T=774 T|A=26 T|G=779 T|T=805
chr1:536895 T/C,G C|C=3 C|T=84 G|C=1 G|T=11 T|C=82 T|G=19 T|T=2304
chr1:565736 A/G,T A|A=2487 A|G=1 A|T=8 G|G=1 T|A=7
Also I only need ref and alts that are mono or bi-allelic (i.e. A and AC or TA or GC, etc)
Here is the perl script thus far:
#!/usr/bin/env perl
# Usage:
# cat PLINK.txt | perl plink-convert3.pl > CONV.txt
use warnings;
use strict;
my $header = <>; # ignore header, remove if plink has no header
while (<>) {
chomp();
my ($chr, $pos, $ref_alt, @gts) = split(/[:\s]+/, $_);
my @hom;
my @homc;
my $hetc;
for my $gt (@gts) {
my ($v1, $v2, $c) = split("[|=]", $gt);
if ($v1 eq $v2) { # homo
push @hom, $v1;
push @homc, $c;
} else { # het
$hetc+=$c;
}
}
my @hom_order = sort{$homc[$a] <=> $homc[$b]}(0..$#homc);
my ($ref, @alt) = @hom[@hom_order];
my ($refc, @altc) = @homc[@hom_order];
@alt = (join(",", @alt)) if @alt > 1;
@altc = (join(",", @altc)) if @altc > 1;
my @refalts = split /(\/)/, $ref_alt;
print join(" ", $chr, $pos, @refalts, $refc, @altc, $hetc),"\n";
}
# Next step is to run the following code from the command line
# perl -pe 's/(\/)/ /g' gcounts_file.txt > file.txt
# perl -pe 's/[chr]//g' gcounts_file.txt > file.txt
Original Post
I am trying to figure out how to use perl to reformat a g counts output file from
plink/seq. The header in the output file plus the 2nd line looks like this:
VAR REF/ALT GENOTYPES
chr1:10177 A/AC AC|A=751 AC|AC=320 A|A=694 A|AC=739
I managed to parse the file to look like this:
CHR POS REF ALT C1 C2 C3----> C30
1 10177 A AC AC A 751 AC AC 320 A A 694 A AC 739
Each line has varying lengths so C30 is a guess as to how many columns there are in this file.
Goal: I need to reformat this file to have the following headers AND content style:
CHR POS REF ALT Homozygous_Ref Homozygous_Alt Heterozygous
1 10177 A AC 694 320 1490
Definitions:
Homozygous_Ref
A A, value =694
Homozygous_Alt
AC AC, value=320
Heterozygous
AC A, value =751
A AC, value = 739
ultimate value = 751 + 739
Problem: I need to write a perl script to search and replace Homozygous Ref ( A A 694) and Homozygous Alt ( AC AC 320) with their numerical values (694 and 320). As well as, add the two heterozygous values and display only their sum in the heterozygous column.
The challenging part is that each line in the parsed file displays the heterozygous and homozygous info in various ways (i.e., no particular order). I want to use the REF and ALT columns in some sort of perl if then statement. But am open to using any other approach (egrep, sed, awk).
Can anyone help me with this? I appreciate the help and suggestions.
Ah, a more complex scenario probably would have more than one ALT... If so, could you post such a plink formatted entry?
Thanks so much Thackl. I tried the script but the output was different from the command line example you provided above. Here is the output (CONV.txt) using Perl 5 version 18: (looks like some info wasn't added and some info was left out)
Here is a snippet of the actual g-counts output from plink/seq, we plan on only keeping the mono-allelic and bi-allelic entries (i.e, A T | AB C | AG TC) for our ref and alt columns.
Here are examples of a multiple entry for an ALT
Okay, the problem probably is the whitespace delimiter between columns. I can't really tell from the posted data sample it if is tabs or spaces or even multiple spaces. I assumed a single space in my first code. And in your output, do you need tabs or spaces.
Code is updated, should handle tabs and spaces and multiple ALTs. Problems might be records with only one homozygous allele (
chr1:10235 T/TA TA|T=4 T|T=2498 T|TA=2
), the "pos..pos" annotation (chr1:10616..10637
..) and the formatting of multi ALT output. Currently I use "," as a delimiter...Thanks again Thackl, great job with the script. I can fix the REF/ALT column, everything else in the output looks correct.