Question

How to extract specific columns based on a set of different header patterns in a big tab-delimited file (PERL)

1

Entering edit mode

7.9 years ago

cmcouto.silva ▴ 60

Hello folk, I'm beginner in perl and I'm trying to extract the whole column with a set of specific patterns in tab-delimited file (very large). I saw some scripts but only when you know which column are you interested in (small file).

My file structure looks like this:

##info1
##info2
##info3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ID01    ID02    ID03    etc...
3   66894   rs9681213   0   1   .   PASS    .   GT  0|1 0|1 0|1 etc...
3   95973   rs1400176   0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
3   104972  rs990284    0   1   .   PASS    .   GT  0|1 0|1 0|0 etc...
3   114133  rs954824    0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
and so on...

Let's suppose that I want to extract four columns (ID01, ID03, ID45, ID80). These IDs could be (optionally) in another text file for parsing. I suppose I have to storage these patterns in an array, right!?

I tried to split the @_ data (for getting the lines) but I couldn't get the correct column $_[?] to print off. The file is too big and I can't see in which column number they are. I really get stuck...

That's the code I've been working on:

#! usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0] || die;
my @fname = split(/_/, $file);      #my file name is chr3_etc.vcf


open(FILE, '<', "$file") || die;
open(OUT,'>', "$fname[0]_NAD.vcf");

my @NAD_ids = qw(NAD15 NAD54 NAD55 NAD56 NAD57 NAD58 NAD59 NAD64 NAD93 NAD93 NAD98);

while (<FILE>) {
    next if m/##/;      #exclude the additional information above the table
    chomp; @_=split;    #getting the columns
        if (@NAD_ids) {
            print OUT "$_[0]\t$_[1]\t$_[2]\t$_[3]\t$_[4]\t$_[5]\t$_[6]\t$_[7]\t$_[8]\t$_[($NAD_ids[0])]" . "\n";    # see the last one (problem);
        }

    }

close FILE; close OUT;
exit;

Please consider the "NADs" like the IDs.

SNP snp perl • 5.6k views

ADD COMMENT • link updated 7.9 years ago by SES 8.6k • written 7.9 years ago by cmcouto.silva ▴ 60

1

Entering edit mode

So, what error messages are you getting when you run your code, why is it not working the way you want it to?

I don't understand what

       if (@NAD_ids)

Is supposed to do, are you trying to compare each line of the file to the list of IDs, are you trying to print entire lines where the IDs in certain columns match your list?

ADD REPLY • link 7.9 years ago by mastal511 ★ 2.1k

1

Entering edit mode

I think OP is trying to ensure they actually do need to pick a column.

ADD REPLY • link 7.9 years ago by Ram 44k

0

Entering edit mode

No, I don't! That's exactly what Ram said as a reply here. Just to make sure the IDs exist. Also, I'm trying to do what he said in his comment. That's actually the way I'm thinking to solve this, but if anyone knows another way, I entirely open to see how you 'd do that!

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

1

Entering edit mode

Ah, OK. This looks like genotype data, so first you want to check that the file contains the IDs/samples you want. But then, for the computer to print out the correct columns, you will have to tell it which columns to print (give it the array index of the matching columns), which is essentially what Ram said.

ADD REPLY • link 7.9 years ago by mastal511 ★ 2.1k

2

Entering edit mode

7.9 years ago

Ram 44k

$_[($NAD_ids[0])]"

As far as I can see, you're trying to index an array with the contents of the line ($_) using the column name ($NAD_ids[0]) as the subscript. You know it's a column name, but the program doesn't.

You may want to store the header as an array, find the index of the target columns in that header array and then use those indices to index each line.

ADD COMMENT • link 7.9 years ago by Ram 44k

0

Entering edit mode

Yeah, that's it! Makes sense...

I store the header as an array so that I find the index (e.g. using grep or modules like List::MoreUtils), right!? I didn't try it yet, but I think I can do it. But I'll have to find the targets one by one? Anyway, thank you for your feedback!

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

1

Entering edit mode

I've given you the lead to follow - the rest is on you :)

It's a fairly trivial solution, I'm sure you'll figure it out.

ADD REPLY • link 7.9 years ago by Ram 44k

1

Entering edit mode

Also, you realize awk might be super useful here, right? grep non info (^[^#][^#]) lines to pipe to awk, at NR==1, pick the header indices you'll use, for rest of the NR, print those indices.

ADD REPLY • link 7.9 years ago by Ram 44k

score 3 · Accepted Answer · 2016-12-16

You are very close and I appreciate the effort. I'll share the thing you are missing and some other helpful tips because you basically have all the logic right. (I personally think it would be better to use vcf-tools or one of the associated VCF modules to work on the indexed files more efficiently. That being said, let's work with what you posted.) The most important thing to remember is to use lexical variables to limit the scope of your actions. When you are using loops and modifying the topic $_ or trying to directly modify things like @_ (this special notation only applies to subroutine arguments, so it was misplaced) it can get confusing. That is one reason it's best to work with lexical variables and avoid the magic (also, readability). Exceptions are in map/grep/sort blocks. Also important is that bare filehandles have global scope, so they are to be avoided in all cases. These can cause problems from a large distance in the code.

For the question, it sounds like you want to pick genotypes from one file and select those from a VCF file, right? It's not completely obvious at first because your examples don't match up. Also, there were duplicate IDs in your example array, which may or may not matter but it's important to check your inputs (you can solve this problem using the same module I used below using the uniq method). I modified the files to demonstrate, otherwise there was nothing to be done. The ID file:

And the VCF:

Here is some code:

The first thing to do is read in the IDs, which is what the map block is doing. What that is doing specifically is removing newlines and spaces so the IDs will match those in the file, and then assigning the keys a value of 1 (to say it's 'defined' in Perl-speak). The space/newline issue is important in any programming language. The rest is basically what you had with some minor changes. To pick the correct genotypes you find the column (array index), then test if any matches were found with if (@indexes) {, and if not you stop reading to be efficient (literally, you say that's the last line we need). The data is printed with an "array slice" and join, and these two idioms will save you a lot of typing. If you are unsure about what is happening here, use Data::Printer (the usage of which will be very familiar if you know Ruby) or Data::Dump::Color (personal preference) to peek and the data structures in the script.

Running this prints:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NAD93   NAD98   NAD58
3   66894   rs9681213   0   1   .   PASS    .   GT  0|1 0|1 0|1
3   95973   rs1400176   0   1   .   PASS    .   GT  1|1 1|1 1|1
3   104972  rs990284    0   1   .   PASS    .   GT  0|1 0|0 0|1
3   114133  rs954824    0   1   .   PASS    .   GT  1|1 1|1 1|1

That's what I think you wanted. If that's not exactly what you were asking then I'm confident you can modify it to your needs. That will likely be good enough, but as I mentioned above, I'd personally use a toolkit or some kind of VCF library for this since it has to be a solved problem (I recall doing this recently and I’ll update if I remember the command).