Hello folks, I'm a beginner in Perl and I'm trying to extract whole columns matching a set of specific patterns from a very large tab-delimited file. The scripts I've seen only work when you already know which columns you are interested in (small files).
My file structure looks like this:
##info1
##info2
##info3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ID01 ID02 ID03 etc...
3 66894 rs9681213 0 1 . PASS . GT 0|1 0|1 0|1 etc...
3 95973 rs1400176 0 1 . PASS . GT 1|1 1|1 1|1 etc...
3 104972 rs990284 0 1 . PASS . GT 0|1 0|1 0|0 etc...
3 114133 rs954824 0 1 . PASS . GT 1|1 1|1 1|1 etc...
and so on...
Let's suppose that I want to extract four columns (ID01, ID03, ID45, ID80). These IDs could (optionally) be read from another text file. I suppose I have to store these patterns in an array, right?
I tried to split each line into @_ (to get the columns), but I couldn't work out the correct column $_[?] to print. The file is too big for me to see which column numbers they are in. I'm really stuck...
This is the code I've been working on:
#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0] || die;
my @fname = split(/_/, $file); #my file name is chr3_etc.vcf
open(FILE, '<', "$file") || die;
open(OUT,'>', "$fname[0]_NAD.vcf");
my @NAD_ids = qw(NAD15 NAD54 NAD55 NAD56 NAD57 NAD58 NAD59 NAD64 NAD93 NAD93 NAD98);
while (<FILE>) {
next if m/##/; #exclude the additional information above the table
chomp; @_=split; #getting the columns
if (@NAD_ids) {
print OUT "$_[0]\t$_[1]\t$_[2]\t$_[3]\t$_[4]\t$_[5]\t$_[6]\t$_[7]\t$_[8]\t$_[($NAD_ids[0])]" . "\n"; # see the last one (problem);
}
}
close FILE; close OUT;
exit;
Please treat the "NADs" as the IDs.
So, what error messages are you getting when you run your code? Why is it not working the way you want it to?
I don't understand what
if (@NAD_ids) {
is supposed to do. Are you trying to compare each line of the file to the list of IDs? Are you trying to print entire lines where the IDs in certain columns match your list?
I think OP is trying to ensure they actually do need to pick a column.
No, I don't! That's exactly what Ram said in his reply here. It's just to make sure the IDs exist. Also, I'm trying to do what he said in his comment. That's actually how I'm thinking of solving this, but if anyone knows another way, I'm entirely open to seeing how you'd do it!
Ah, OK. This looks like genotype data, so first you want to check that the file contains the IDs/samples you want. But then, for the computer to print out the correct columns, you will have to tell it which columns to print (give it the array index of the matching columns), which is essentially what Ram said.
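A rough sketch of that idea, in case it helps: read the #CHROM header line once, build a hash from column name to array index, and then use those indices to slice every data line. The helper names find_columns, pick_fields and main are made up for illustration, the script assumes the file really is tab-delimited with the nine fixed columns shown in the question, and the optional second argument (a text file with one ID per line) is only a guess at how the ID list would be supplied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: look up each wanted ID in the header fields.
# Returns two array refs: the matching column indices and the IDs not found.
sub find_columns {
    my ($header, $wanted) = @_;
    my %index_of;
    $index_of{ $header->[$_] } = $_ for 0 .. $#$header;
    my (@indices, @missing);
    for my $id (@$wanted) {
        if (exists $index_of{$id}) { push @indices, $index_of{$id} }
        else                       { push @missing, $id }
    }
    return (\@indices, \@missing);
}

# Hypothetical helper: keep the nine fixed columns (CHROM .. FORMAT)
# plus the selected sample columns, joined with tabs.
sub pick_fields {
    my ($fields, $indices) = @_;
    return join "\t", @{$fields}[0 .. 8, @$indices];
}

sub main {
    my ($file, $id_file) = @_;
    my @wanted = qw(ID01 ID03 ID45 ID80);    # example IDs from the question
    if (defined $id_file) {                  # assumed format: one ID per line
        open(my $ids, '<', $id_file) or die "Cannot open $id_file: $!";
        chomp(@wanted = <$ids>);
        @wanted = grep { length } @wanted;   # skip empty lines
        close $ids;
    }
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    my ($indices, $missing);
    while (my $line = <$in>) {
        next if $line =~ /^##/;              # skip the meta-information lines
        chomp $line;
        my @fields = split /\t/, $line;
        if ($line =~ /^#CHROM/) {            # header line: resolve the indices once
            ($indices, $missing) = find_columns(\@fields, \@wanted);
            warn "ID not found in header: $_\n" for @$missing;
        }
        next unless $indices;                # nothing to slice before the header
        print pick_fields(\@fields, $indices), "\n";
    }
    close $in;
}

# Process a file only when one is given on the command line.
main(@ARGV) if @ARGV;
```

It could be run as something like "perl extract_cols.pl chr3_etc.vcf ids.txt > chr3_NAD.vcf" (the script name is arbitrary). Because the file is read line by line and the indices are looked up only once, the size of the file should not matter much.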