Question

Perl script for reformatting a plink/seq output file

3

Entering edit mode

9.1 years ago

plink_9857 ▴ 50

UPDATE

So far, the provided code plus some extra commands gets me very close to my goal of having the output like this (for example):

CHR   POS         REF    ALT    Homozygous_Ref     Homozygous_Alt    Heterozygous
  1       10177       A       AC              694                       320                         1490
  1       10190       T       AG                 0                        100                          3000
  1       94848       C      CT                 482                        0                           3410

*The "0" represents a genotype that is not listed for this chr pos + ref/alt combination. (see the definition section below). The "0" is a place holder and keeps the correct info in the correct column.

The only problem is making that "0" place holder appear when the genotype is not present.

If anyone knows how to make that "0" place holder appear, let me know. Thanks!

Also these type of lines contain genotype counts that are also challenging to get into the right columns:

chr1:15274    A/G,T    A|A=2 A|G=5 A|T=21 G|A=3 G|G=89 G|T=774 T|A=26 T|G=779 T|T=805
chr1:536895    T/C,G    C|C=3 C|T=84 G|C=1 G|T=11 T|C=82 T|G=19 T|T=2304
chr1:565736    A/G,T    A|A=2487 A|G=1 A|T=8 G|G=1 T|A=7

Also I only need ref and alts that are mono or bi-allelic (i.e. A and AC or TA or GC, etc)

Here is the perl script thus far:

#!/usr/bin/env perl

# Usage:
#   cat PLINK.txt | perl plink-convert3.pl > CONV.txt

use warnings;
use strict;

my $header = <>; # ignore header, remove if plink has no header

while (<>) {
    chomp();
    my ($chr, $pos, $ref_alt, @gts) = split(/[:\s]+/, $_);

    my @hom;
    my @homc;
    my $hetc;

    for my $gt (@gts) {
        my ($v1, $v2, $c) = split("[|=]", $gt);
        if ($v1 eq $v2) {       # homo
            push @hom, $v1;
            push @homc, $c;
        } else {                # het
            $hetc+=$c;
        }
    }

    my @hom_order = sort{$homc[$a] <=> $homc[$b]}(0..$#homc);
    my ($ref, @alt) = @hom[@hom_order];
    my ($refc, @altc) = @homc[@hom_order];

    @alt = (join(",", @alt)) if @alt > 1;
    @altc = (join(",", @altc)) if @altc > 1;


    my @refalts = split /(\/)/, $ref_alt;

    print join(" ", $chr, $pos, @refalts, $refc, @altc, $hetc),"\n";
}
 # Next step is to run the following code from the command line
 # perl -pe 's/(\/)/ /g' gcounts_file.txt > file.txt
 # perl -pe 's/[chr]//g' gcounts_file.txt > file.txt

Original Post

I am trying to figure out how to use perl to reformat a g counts output file from

plink/seq. The header in the output file plus the 2nd line looks like this:

VAR REF/ALT GENOTYPES
chr1:10177 A/AC AC|A=751 AC|AC=320 A|A=694 A|AC=739

I managed to parse the file to look like this:

CHR POS REF ALT C1 C2 C3----> C30
1 10177 A AC      AC A 751   AC AC 320    A A 694   A AC 739

Each line has varying lengths so C30 is a guess as to how many columns there are in this file.

Goal: I need to reformat this file to have the following headers AND content style:

CHR POS REF ALT Homozygous_Ref Homozygous_Alt Heterozygous
1 10177 A AC 694 320 1490

Definitions:

Homozygous_Ref
A A, value =694

Homozygous_Alt
AC AC, value=320

Heterozygous
AC A, value =751
A AC, value = 739

ultimate value = 751 + 739

Problem: I need to write a perl script to search and replace Homozygous Ref ( A A 694) and Homozygous Alt ( AC AC 320) with their numerical values (694 and 320). As well as, add the two heterozygous values and display only their sum in the heterozygous column.

The challenging part is that each line in the parsed file displays the heterozygous and homozygous info in various ways (i.e., no particular order). I want to use the REF and ALT columns in some sort of perl if then statement. But am open to using any other approach (egrep, sed, awk).

Can anyone help me with this? I appreciate the help and suggestions.

allele plinkseq genotype reformat perl • 3.5k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 9.1 years ago by plink_9857 ▴ 50

Ram · Accepted Answer · 2015-10-09

3

Entering edit mode

9.1 years ago

thackl ★ 3.0k

EDIT: Updated according to Update

Handles "0" columns, ignores records with "long" variants.

in.tsv:

VAR     REF/ALT GENOTYPES
chr1:10177	A/AC	AC|A=751 AC|AC=320 A|A=694 A|AC=739
chr1:10235	T/TA	TA|T=4 T|T=2498 T|TA=2
chr1:10505	A/T	T|T=343 A|T=534
chr1:10616..10637    CCGCCGTTGCAAAGGCGCGCCG/C   CCGCCGTTGCAAAGGCGCGCCG|C=14 C|C=2469 C|CCGCCGTTGCAAAGGCGCGCCG=21
chr1:15274	A/G,T	A|A=2 A|G=5 A|T=21 G|A=3 G|G=89 G|T=774 T|A=26 T|G=779 T|T=805

out.tsv:

1	10177	A	AC	694	320	1490
1	10235	T	TA	2498	0	6
1	10505	A	T	0	343	534
1	15274	A	G,T	2	89,805	1

plink-convert.pl:

use warnings;
use strict;

my $header = <>; # ignore header, remove if plink has no header

while (<>) {
    chomp();
    my ($chr, $pos, $ref_alts, @gts) = split(/[:\s]+/, $_);
    $chr =~ s/^\D+//; # remove "chr" prefix
    
    my ($ref, @alts) = split(/[\/,]/, $ref_alts);

    # ignore non-mono/bi allelic
    # this will also ignore something like A/T,TTA !!
    next if grep{length($_)>2}($ref, @alts);
    
    my %homc;
    @homc{$ref, @alts} = (0) x (@alts+1);
    
    my $hetc;

    for my $gt (@gts) {
        my ($v1, $v2, $c) = split(/[|=]/, $gt);
        if ($v1 eq $v2) {       # homo
            $homc{$v1} = $c;
        } else {                # het
            $hetc+=$c;
        }
    }

    my $alt = join(",", @alts); 
    my $altc = join(",", @homc{@alts});

    print join("\t", $chr, $pos, $ref, $alt, $homc{$ref}, $altc, $hetc),"\n";
}

ADD COMMENT • link 9.1 years ago by thackl ★ 3.0k

0

Entering edit mode

Ah, a more complex scenario probably would have more than one ALT... If so, could you post such a plink formatted entry?

ADD REPLY • link 9.1 years ago by thackl ★ 3.0k

0

Entering edit mode

Thanks so much Thackl. I tried the script but the output was different from the command line example you provided above. Here is the output (CONV.txt) using Perl 5 version 18: (looks like some info wasn't added and some info was left out)

VAR    REF/ALT    GENOTYPES  
chr1 10177    A/AC    AC|A=751 A 694 739
chr1 10235    T/TA    TA|T=4 2
chr1 10352    T/TA    TA|T=1010 T 396 1015
chr1 10505    A/T    A|A=2503
chr1 10506    C/G    C|C=2503
chr1 10511    G/A    A|G=1
chr1 10539    C/A    C|A=3
chr1 10542    C/T    C|C=2503
chr1 10579    C/A    C|A=1

Here is a snippet of the actual g-counts output from plink/seq, we plan on only keeping the mono-allelic and bi-allelic entries (i.e, A T | AB C | AG TC) for our ref and alt columns.

VAR    REF/ALT    GENOTYPES
chr1:10177    A/AC    AC|A=751 AC|AC=320 A|A=694 A|AC=739
chr1:10235    T/TA    TA|T=4 T|T=2498 T|TA=2
chr1:10352    T/TA    TA|T=1010 TA|TA=83 T|T=396 T|TA=1015
chr1:10505    A/T    A|A=2503 A|T=1
chr1:10506    C/G    C|C=2503 C|G=1
chr1:10511    G/A    A|G=1 G|G=2503
chr1:10539    C/A    C|A=3 C|C=2501
chr1:10542    C/T    C|C=2503 T|C=1
chr1:10579    C/A    C|A=1 C|C=2503
chr1:10616..10637    CCGCCGTTGCAAAGGCGCGCCG/C    CCGCCGTTGCAAAGGCGCGCCG|C=14 C|C=2469 C|CCGCCGTTGCAAAGGCGCGCCG=21
chr1:10642    G/A    A|G=10 G|A=11 G|G=2483
chr1:11008    C/G    C|C=2082 C|G=207 G|C=196 G|G=19
chr1:11012    C/G    C|C=2082 C|G=207 G|C=196 G|G=19
chr1:11063    T/G    G|T=11 T|G=4 T|T=2489
chr1:13011    T/G    G|T=1 T|G=2 T|T=2501
chr1:13110    G/A    A|G=75 G|A=59 G|G=2370
chr1:13116    T/G    G|G=36 G|T=207 T|G=207 T|T=2054
chr1:13118    A/G    A|A=2054 A|G=207 G|A=207 G|G=36
chr1:13156    G/C    C|G=1 G|G=2503
chr1:13259    G/A    G|A=2 G|G=2502
chr1:13273    G/C    C|C=16 C|G=193 G|C=251 G|G=2044
chr1:13284    G/A    A|G=3 G|A=4 G|G=2497
chr1:13289    C/T    C|C=2501 C|T=3
chr1:13289..13291    CCT/C    CCT|C=11 CCT|CCT=2484 C|CCT=9
chr1:13313    T/G    T|G=1 T|T=2503
chr1:13365    C/T    C|C=2503 C|T=1
chr1:13380    C/G    C|C=2463 C|G=17 G|C=24
chr1:13382    C/G    C|C=2503 G|C=1
chr1:13445    C/G    C|C=2501 C|G=2 G|C=1
chr1:13453    T/C    C|T=1 T|C=3 T|T=2500
chr1:13482    G/C    C|G=2 G|G=2502
chr1:13483    G/C    C|G=4 G|C=6 G|G=2494
chr1:13494    A/G    A|A=2497 A|G=4 G|A=3
chr1:13543    T/G    G|T=1 T|G=2 T|T=2501
chr1:13550    G/A    A|G=10 G|A=7 G|G=2487
chr1:14462    A/G    A|A=2503 A|G=1
chr1:14464    A/T    A|A=2050 A|T=207 T|A=221 T|T=26
chr1:14564    G/A    G|A=1 G|G=2503
chr1:14599    T/A    A|A=14 A|T=356 T|A=355 T|T=1779
chr1:14604    A/G    A|A=1779 A|G=355 G|A=356 G|G=14
chr1:14674    G/A    G|A=1 G|G=2503
chr1:14719    C/A    A|C=1 C|C=2503
chr1:14728    C/A    A|C=1 C|A=1 C|C=2502
chr1:14775    C/T    C|C=2502 C|T=1 T|C=1
chr1:14860    G/A    G|A=1 G|G=2503
chr1:14874    G/C    G|C=1 G|G=2503
chr1:14930    A/G    A|A=198 A|G=1081 G|A=1116 G|G=109

Here are examples of a multiple entry for an ALT

chr1:15274    A/G,T    A|A=2 A|G=5 A|T=21 G|A=3 G|G=89 G|T=774 T|A=26 T|G=779 T|T=805
chr1:536895    T/C,G    C|C=3 C|T=84 G|C=1 G|T=11 T|C=82 T|G=19 T|T=2304
chr1:565736    A/G,T    A|A=2487 A|G=1 A|T=8 G|G=1 T|A=7

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by plink_9857 ▴ 50

1

Entering edit mode

Okay, the problem probably is the whitespace delimiter between columns. I can't really tell from the posted data sample it if is tabs or spaces or even multiple spaces. I assumed a single space in my first code. And in your output, do you need tabs or spaces.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by thackl ★ 3.0k

1

Entering edit mode

Code is updated, should handle tabs and spaces and multiple ALTs. Problems might be records with only one homozygous allele (chr1:10235 T/TA TA|T=4 T|T=2498 T|TA=2), the "pos..pos" annotation (chr1:10616..10637 ..) and the formatting of multi ALT output. Currently I use "," as a delimiter...

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by thackl ★ 3.0k

0

Entering edit mode

Thanks again Thackl, great job with the script. I can fix the REF/ALT column, everything else in the output looks correct.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by plink_9857 ▴ 50