Question

How to print the lines exclusively from unknown columns with missing(undef) value? BIG FILE!

0

7.9 years ago

cmcouto.silva ▴ 60

Hello folks!

I'm working with a VCF file, and this is what it looks like:

##info1
##info2
##info3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ID01    ID02    ID03    etc...
3   66894   rs9681213   0   1   .   PASS    .   GT  0|1 0|1 0|1 etc...
3   95973   rs1400176   0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
3   104972  rs990284    0   1   .   PASS    .   GT  0|1 0|1 0|0 etc...
3   114133  rs954824    0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
and so on...

As you can see, the general format is:
- the rows hold information about my main targets: SNPs and their alleles;
- the columns (after the 9th one) hold the individuals with their respective alleles for each SNP.

So, for each column's ID there are lines with the info I'm looking for...

I have a Perl script for extracting specific individuals (IDs), and I just realized there are missing values for some IDs, which is impairing my later analysis. The script prints the empty values as they are (empty), and vcftools prints a dot (.) in place of the empty value, but that doesn't help me either. So I want to know which IDs aren't coded.

Basically, I wanna print just the columns in which IDs present missing ("null" or "undefined") values, i.e. they're blank.

Since my files are huge in both directions, it's not easy to see manually which IDs have no coding in their lines. From my limited knowledge, I believe the easiest way is to check for undefined values in each column and somehow print just the lines from that column.

So what I have to do is: 1) split the lines to get the columns (OK); 2) print the first nine columns anyway through a loop (OK); 3) check whether there are any missing values in the columns (partially OK); 4) print only the columns whose lines have missing data (I got stuck here).

My problem with this last part is that even after discovering the columns with undefined values (actually, the program discovered them at this point, not me), I do not know which ones they are in order to print only them. I have to find a way to tell the program to print just the columns with missing data, and I don't know how to specify those columns...

Could someone please help me out?

SNP vcf perl shell • 3.5k views

ADD COMMENT • link updated 7.9 years ago by Ketil 4.1k • written 7.9 years ago by cmcouto.silva ▴ 60

0

It's not clear to me. Use the correct wording please e.g: VARIANT/INFO/FILTER/FORMAT/GENOTYPE/SAMPLE/ATTRIBUTE:... not 'value', 'data'...

ADD REPLY • link 7.9 years ago by Pierre Lindenbaum 164k

1

That's an original piece of the file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HGDP00876   PT-91DX PT-91YR 4256126086_A    PT-911T
22  17154984    rs9605028   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17155383    rs1892844   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17178213    rs2845371   0   1   .   PASS    .   GT  1|1 1|1 1|0 1|0 0|0
22  17178586    rs5993924   0   1   .   PASS    .   GT  1|1 0|1 1|1 0|1 1|1
22  17202602    rs2845379   0   1   .   PASS    .   GT  0|0 1|0 1|1 0|1 0|0
22  17214252    rs2845346   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17254399    rs2190742   0   1   .   PASS    .   GT  0|0 0|0 1|1 1|1 1|1
22  17264904    rs9605145   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17265194    rs9605146   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1

The problem is: there are individuals (columns) with no "coding" information (i.e. columns without binary info in their lines). I'm trying to be clear, but I know I'm being confusing. Sorry about that...

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

1

Can you provide an example of one of these "missing" values? The VCF file format has a specification document (https://samtools.github.io/hts-specs/VCFv4.3.pdf). I'm not sure where you got your file from, but it doesn't actually conform to the VCF specification, as "0" and "1" are not valid REF and ALT alleles. This may not matter for your purposes, though.

Are you looking for records that do not include genotype information for at least one sample? If so, can you provide an example of a "missing" genotype? Is it "." or "./." or ".|." or some other value?

ADD REPLY • link 7.9 years ago by d-cameron ★ 2.9k

0

Yeah, sure!

The genotype for those individuals who don't have one is simply absent from the file: empty, not shown. Please see the following pictures to see what I mean:

Original VCF files with "missing" genotypes visualized in R (you can see the empty space between columns): https://i.imgsafe.org/a701cc31dd.png https://i.imgsafe.org/a701e3cb71.jpg

Actually, if I filter the VCF file to produce another VCF file (recoded, specific to population X) with my Perl script, the missing genotypes stay empty. If I filter the VCF file with vcftools, the missing genotypes are shown as a single dot ".". Please see the pictures:

VCF produced by the Perl script (empty values): https://i.imgsafe.org/a70237965a.jpg
VCF produced by vcftools (dots indicating missing values): https://i.imgsafe.org/a7024e1120.jpg

As another example, if I count the number of tabs (and therefore columns) per line, I can see that the header has 2356 tabs and the data lines only 2339. I ran this command in the Unix shell to check: $ awk '{print gsub(/\t/,"")}' chr22_hg19_phased_selscan.vcf >out.vcf
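The per-line tab count above can be condensed into a tally of how many lines share each field count, which makes a header/data mismatch visible at a glance. This is only a minimal sketch: the three-sample file and the names S1, S2, S3 are made up for illustration, and the ## meta lines are skipped on the assumption that they should not be counted.

```shell
# Hypothetical three-sample file: the second data line lacks one genotype field.
vcf=$(mktemp)
printf '##info1\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> "$vcf"
printf '22\t100\trs1\t0\t1\t.\tPASS\t.\tGT\t1|1\t0|1\t1|1\n' >> "$vcf"
printf '22\t200\trs2\t0\t1\t.\tPASS\t.\tGT\t1|1\t0|1\n' >> "$vcf"

# Tally how many lines have each number of tab-separated fields,
# skipping the ## meta lines. Output format: "<field count> <line count>".
field_counts=$(awk -F '\t' '
  !/^##/ { n[NF]++ }
  END    { for (k in n) print k, n[k] }
' "$vcf" | sort -n)

echo "$field_counts"
rm -f "$vcf"
```

If more than one field count shows up, some rows are short of tabs rather than holding empty fields between tabs. In that case the genotype columns no longer line up with the header, so column positions alone may not identify which samples are missing.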

I'm entirely new to programming and bioinformatics. I want to extract individuals from specific populations and run tests to analyze the influence of natural selection...

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

2

7.9 years ago

Ketil 4.1k

I wanna print just the columns in which IDs present missing ("null" or "undefined") values, i.e. they're blank.

If this is actually what you want, you could identify columns with blanks, something like:

for i in {1..10}; do cut -f$i < file | grep -q '^$' && echo $i; done

and then to print those columns (using the result of the above):

cut -f1,2,4,5 < file

ADD COMMENT • link 7.9 years ago by Ketil 4.1k

0

I tested it with my file and with simple simulated one, but unfortunately it didn't work :/

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

1

7.9 years ago

dyollluap ▴ 310

You could try grep if your accession IDs are all of the form rsXXXXXXX: grep -v 'tabrs' yourfilename.vcf >nullid_yourfilename.vcf (where "tab" stands for a literal tab character; press Ctrl+V then Tab to type it in the grep pattern). This selects every row that doesn't have an rs accession ID.

Usually anything without an accession for the ID field would use something like chr_position_variant.

ADD COMMENT • link 7.9 years ago by dyollluap ▴ 310

0

I'm sorry for being unclear. Please see my reply to Pierre's comment with an original piece of the file. The individuals have different ID patterns. Thanks for the reply!

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

1

7.9 years ago

Pierre Lindenbaum 164k

Basically, I wanna print just the columns in which IDs present missing ("null" or "undefined")

I'm still not sure I understand, but here is a solution with awk:

awk -F '\t' '/^#CHROM/ {N=split($0,header);next;} /^#/ {next;} {for(i=10;i<=NF;++i) printf("%s\t%s\t%s\t%s\n",$1,$2,header[i],$i);}' input.txt | grep  -v -E '([01]\|[01])$'
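As a possible follow-up to the one-liner above, the per-site listing can be collapsed into one count per sample, which may be easier to read on a file with thousands of columns. This is a sketch under assumptions: the toy file and the sample names S1, S2, S3 are hypothetical, fields are split strictly on tabs so that empty fields keep their position, and any genotype that is not exactly 0 or 1 on both sides of the | counts as missing.

```shell
# Hypothetical toy VCF: sample S2 has an empty genotype and a "." genotype.
vcf=$(mktemp)
printf '##info1\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> "$vcf"
printf '22\t100\trs1\t0\t1\t.\tPASS\t.\tGT\t1|1\t\t1|1\n' >> "$vcf"
printf '22\t200\trs2\t0\t1\t.\tPASS\t.\tGT\t1|1\t.\t1|0\n' >> "$vcf"
printf '22\t300\trs3\t0\t1\t.\tPASS\t.\tGT\t0|1\t0|0\t1|0\n' >> "$vcf"

# For every sample column (10 onwards), count the sites whose genotype
# is not a phased 0/1 pair, then print "<sample name> TAB <count>".
missing_per_sample=$(awk -F '\t' '
  /^##/     { next }                                   # skip meta lines
  /^#CHROM/ { for (i = 10; i <= NF; i++) sample[i] = $i; ncol = NF; next }
            { for (i = 10; i <= ncol; i++) if ($i !~ /^[01][|][01]$/) bad[i]++ }
  END       { for (i = 10; i <= ncol; i++) if (bad[i] > 0) print sample[i] "\t" bad[i] }
' "$vcf")

echo "$missing_per_sample"
rm -f "$vcf"
```

Because the loop runs up to the header's column count, fields missing from the end of a short row are counted as well. Those counts are only reliable if the row is otherwise aligned with the header.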

ADD COMMENT • link 7.9 years ago by Pierre Lindenbaum 164k

0

I tested it with my file and with simple simulated one, but unfortunately it didn't work :/

ADD REPLY • link 7.9 years ago by cmcouto.silva ▴ 60

score 3 · Accepted Answer · 2017-02-07

It seems that you want to filter out sample columns containing any missing data (empty or ".").

Here's a dummy example, in which the PT-91YR column is the one to be found.

[update] Note that, for convenience, I deleted the meta lines starting with ## and removed the leading # from the header line, using cat d.txt | grep '^##' -v | sed '1s/^#//'.

$ cat d.txt 
CHROM   POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HGDP00876       PT-91YR 4256126086_A
22      17154984        rs9605028       0       1       .       PASS    .       GT      1|1             1|1
22      17155383        rs1892844       0       1       .       PASS    .       GT      1|1             1|1
22      17178213        rs2845371       0       1       .       PASS    .       GT      1|1             1|0

A better view:

$ csvtk -t pretty d.txt 
CHROM   POS        ID          REF   ALT   QUAL   FILTER   INFO   FORMAT   HGDP00876   PT-91YR   4256126086_A
22      17154984   rs9605028   0     1     .      PASS     .      GT       1|1                   1|1
22      17155383   rs1892844   0     1     .      PASS     .      GT       1|1                   1|1
22      17178213   rs2845371   0     1     .      PASS     .      GT       1|1                   1|0

Steps:

Transpose the table using csvtk or GNU datamash

# $ cat d.txt | datamash transpose
$ cat d.txt | csvtk -t transpose 
CHROM   22      22      22
POS     17154984        17155383        17178213
ID      rs9605028       rs1892844       rs2845371
REF     0       0       0
ALT     1       1       1
QUAL    .       .       .
FILTER  PASS    PASS    PASS
INFO    .       .       .
FORMAT  GT      GT      GT
HGDP00876       1|1     1|1     1|1
PT-91YR
4256126086_A    1|1     1|1     1|0

Search for rows in which any column from the second onward contains missing data.

# short-flag version:
# $ cat d.txt | csvtk -t transpose \
#    | awk 'FNR>9' \
#    | csvtk grep -H -t -f 2-10000 -r -p '(^$)|\.' \
#    | cut -f 1

$ cat d.txt | csvtk -t transpose \
    | awk 'FNR>9' \
    | csvtk grep --no-header-row --tabs --fields 2-10000 --use-regexp --pattern '(^$)|\.'\
    | cut -f 1
PT-91YR

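You can then delete those columns in R or with csvtk (the column names found above are assumed to be saved, one per line, in column_list.txt).

# prepare the list of columns to delete, e.g. to delete col1 and col2: "-col1,-col2"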
$ colnames=$(perl -e 'print "-", join ",-", ( map {chomp; $_ } <> );' column_list.txt)
$ echo $colnames
-PT-91YR

$ cat d.txt | csvtk cut -t -f $colnames | csvtk -t pretty
CHROM   POS        ID          REF   ALT   QUAL   FILTER   INFO   FORMAT   HGDP00876   4256126086_A
22      17154984   rs9605028   0     1     .      PASS     .      GT       1|1         1|1
22      17155383   rs1892844   0     1     .      PASS     .      GT       1|1         1|1
22      17178213   rs2845371   0     1     .      PASS     .      GT       1|1         1|0
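If csvtk is not available, the same find-then-drop idea can be approximated with awk alone by reading the file twice, which avoids transposing a very large table. This is only a sketch, not the method used above: the toy file and sample names S1, S2, S3 are hypothetical, the original ## and # prefixes are kept, and a genotype is treated as missing when it is empty or contains a dot, mirroring the '(^$)|\.' pattern.

```shell
# Hypothetical toy VCF: sample S2 has an empty genotype and a "." genotype.
vcf=$(mktemp)
printf '##info1\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> "$vcf"
printf '22\t100\trs1\t0\t1\t.\tPASS\t.\tGT\t1|1\t\t1|1\n' >> "$vcf"
printf '22\t200\trs2\t0\t1\t.\tPASS\t.\tGT\t1|1\t.\t1|0\n' >> "$vcf"

# The file is named twice so awk reads it twice:
#   pass 1 (FNR == NR) flags sample columns with an empty or dotted genotype;
#   pass 2 prints every line with the flagged columns left out.
cleaned=$(awk -F '\t' -v OFS='\t' '
  FNR == NR {
    if ($0 !~ /^#/)
      for (i = 10; i <= NF; i++)
        if ($i == "" || $i ~ /[.]/) drop[i] = 1
    next
  }
  /^##/ { print; next }                 # meta lines pass through unchanged
  {
    line = $1
    for (i = 2; i <= NF; i++)
      if (!(i in drop)) line = line OFS $i
    print line
  }
' "$vcf" "$vcf")

echo "$cleaned"
rm -f "$vcf"
```

Only columns 10 and later are ever flagged, so the QUAL and INFO dots in the fixed columns are left alone. Rows that are short of tabs are not detected by this sketch and would need to be checked separately.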