Hello folks!
I'm working with a VCF file, and this is what it looks like:
##info1
##info2
##info3
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ID01 ID02 ID03 etc...
3 66894 rs9681213 0 1 . PASS . GT 0|1 0|1 0|1 etc...
3 95973 rs1400176 0 1 . PASS . GT 1|1 1|1 1|1 etc...
3 104972 rs990284 0 1 . PASS . GT 0|1 0|1 0|0 etc...
3 114133 rs954824 0 1 . PASS . GT 1|1 1|1 1|1 etc...
and so on...
As you can see, the general format is:
- each line holds information about my main targets: the SNPs and their alleles;
- each column after the 9th one holds an individual with its respective alleles for each SNP.
So, for each ID column, the lines hold the info I'm looking for...
I have a Perl script for extracting specific individuals (IDs), and I just realized that some IDs have missing values, which is impairing my later analysis. The script prints the empty values as they are (empty), and vcftools prints a dot (.) in place of the empty value, but that doesn't help me either. So I want to know which IDs aren't coded.
Basically, I want to print just the columns whose IDs have missing ("null" or "undefined") values, i.e. they're blank.
Since my files are huge in both directions, it's not easy to see by eye which IDs have no coding in their lines. From my limited knowledge, I believe the easiest way is to check for undefined values in each column and somehow print just the lines from that column.
So what I have to do is:
1) split the lines in order to get the columns (OK);
2) print the first nine columns anyway through a loop (OK);
3) check whether there are any missing values in the columns (partially OK);
4) print only the columns whose lines have missing data (I got stuck here).
My problem with this last part is that, even after "discovering" the columns with undefined values (actually, the program discovers them at this point, not me), I do not know which ones they are in order to print only them. I have to find a way to tell the program to print just the columns with missing data, and I don't know how to specify those columns...
Could someone please help me out?
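To make the four steps concrete, here is a minimal Python sketch (not the Perl script mentioned above). It assumes the file is tab-delimited, that the first nine columns are the fixed VCF columns, and that a missing genotype appears as an empty cell or as ".", "./." or ".|.". The names `samples_with_missing`, `extract_columns`, `MISSING` and `DEMO` are made up for illustration. It also assumes that a row shorter than the header is missing its trailing samples, which may not be true for a real file.

```python
N_FIXED = 9  # CHROM .. FORMAT; sample columns start after these
MISSING = {"", ".", "./.", ".|."}  # assumed spellings of a missing genotype


def samples_with_missing(lines):
    """Return {sample_id: number of records with a missing genotype}."""
    header = None
    counts = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            header = fields
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header")
        # Assumption: a row shorter than the header lacks its trailing samples.
        fields += [""] * (len(header) - len(fields))
        for name, cell in zip(header[N_FIXED:], fields[N_FIXED:]):
            # Only the first sub-field (GT) is inspected.
            if cell.split(":")[0].strip() in MISSING:
                counts[name] = counts.get(name, 0) + 1
    return counts


def extract_columns(lines, wanted):
    """Yield each line reduced to the nine fixed columns plus the wanted samples."""
    keep = None
    width = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            width = len(fields)
            keep = list(range(N_FIXED)) + [
                i for i in range(N_FIXED, width) if fields[i] in wanted
            ]
        elif keep is None:
            raise ValueError("data line found before the #CHROM header")
        else:
            fields += [""] * (width - len(fields))  # same padding assumption
        yield "\t".join(fields[i] for i in keep)


# Tiny made-up example: ID02 is blank once and "." once, ID03 is cut off once.
DEMO = [
    "##info1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tID01\tID02\tID03",
    "3\t66894\trs9681213\t0\t1\t.\tPASS\t.\tGT\t0|1\t\t0|1",
    "3\t95973\trs1400176\t0\t1\t.\tPASS\t.\tGT\t1|1\t.\t1|1",
    "3\t104972\trs990284\t0\t1\t.\tPASS\t.\tGT\t0|1\t0|1",
]

if __name__ == "__main__":
    flagged = samples_with_missing(DEMO)
    print(flagged)  # {'ID02': 2, 'ID03': 1}
    for row in extract_columns(DEMO, set(flagged)):
        print(row)
```

For a real file, the same two functions could be fed an open file handle instead of `DEMO`, reading the file twice: once to collect the flagged IDs and once to print the reduced columns. Two passes keep memory use low when the file is huge in both directions.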
It's not clear to me. Please use the correct wording, e.g. VARIANT/INFO/FILTER/FORMAT/GENOTYPE/SAMPLE/ATTRIBUTE..., not 'value', 'data'...
Here is an original piece of the file:
The problem is: there are individuals (columns) with no "coding" information (i.e. columns without binary info in their lines). I'm trying to be clear, but I know I'm being confusing. Sorry about that...
Can you provide an example of one of this "missing" values? The VCF file format has a specifications document (https://samtools.github.io/hts-specs/VCFv4.3.pdf). I'm not sure where you got your file from but it doesn't actually conform to the VCF specifications as "0" and "1" are not valid REF and ALT alleles, but this may not matter for your purposes.
Are you looking for records that do not include genotype information for at least one sample? If so, can you provide an example of a "missing" genotype? Is it "." or "./." or ".|." or some other value?
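For reference, the spellings asked about here can be told apart from a real call with a small check. This is only a sketch: `is_missing_genotype` is a hypothetical helper, and it assumes GT is the first colon-separated sub-field of the sample column. It treats a genotype as missing only when every allele is "." or empty, so a partially called genotype such as "0|." is not counted as missing.

```python
import re


def is_missing_genotype(sample_field):
    """True when every allele in the GT sub-field is '.' or empty."""
    gt = sample_field.split(":", 1)[0].strip()  # GT is assumed to come first
    alleles = re.split(r"[/|]", gt)             # '/' unphased, '|' phased
    return all(allele in ("", ".") for allele in alleles)


if __name__ == "__main__":
    for value in ["0|1", ".", "./.", ".|.", "", "0|."]:
        print(repr(value), is_missing_genotype(value))
```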
Yeah, sure!
The genotype for those individuals who don't have one is simply absent from the file: empty, not shown. Please see the following pictures to see what I mean:
Actually, if I filter the VCF file with my Perl script to produce another VCF file (recoded, specific to population X), the missing genotype stays empty. If I filter the VCF file with vcftools, the missing genotypes are shown as a single dot ".". Please see the pictures:
Another example: if I count the number of tabs (and therefore columns) per line, I can see that the header has 2356 tabs and the data lines only 2339. I ran this command in the Unix shell to see that: $ awk '{print gsub(/\t/,"")}' chr22_hg19_phased_selscan.vcf >out.vcf
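The same tab count can be summarised rather than printed line by line. The sketch below is in Python instead of awk, and `field_count_report` is a made-up name. It assumes a tab-delimited file and returns the number of fields in the #CHROM header together with a tally of how many data lines have each field count. Note that if data lines are simply shorter than the header, the count alone shows how many sample fields are absent but not which samples they belong to.

```python
from collections import Counter


def field_count_report(lines):
    """Return (header_field_count, {field_count: number_of_data_lines})."""
    header_fields = None
    tally = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        n_fields = line.count("\t") + 1  # fields = tabs + 1
        if line.startswith("#CHROM"):
            header_fields = n_fields
        else:
            tally[n_fields] += 1
    return header_fields, dict(tally)


# Made-up example: the header has 12 fields, one data line has only 11.
SAMPLE_LINES = [
    "##info1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tID01\tID02\tID03",
    "3\t66894\trs9681213\t0\t1\t.\tPASS\t.\tGT\t0|1\t0|1\t0|1",
    "3\t104972\trs990284\t0\t1\t.\tPASS\t.\tGT\t0|1\t0|1",
]

if __name__ == "__main__":
    print(field_count_report(SAMPLE_LINES))  # (12, {12: 1, 11: 1})
```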
I'm entirely new to programming and bioinformatics. I want to extract individuals from specific populations and run tests to analyze the influence of natural selection...