How to print only the sample columns that have missing (undef) values? BIG FILE!
7.9 years ago

Hello folks!

I'm working with a VCF file that looks like this:

##info1
##info2
##info3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ID01    ID02    ID03    etc...
3   66894   rs9681213   0   1   .   PASS    .   GT  0|1 0|1 0|1 etc...
3   95973   rs1400176   0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
3   104972  rs990284    0   1   .   PASS    .   GT  0|1 0|1 0|0 etc...
3   114133  rs954824    0   1   .   PASS    .   GT  1|1 1|1 1|1 etc...
and so on...

As you can see, the general format is:

- the lines (rows) hold the information about my main targets: the SNPs and their alleles;
- the columns after the 9th one hold the individuals, with their respective alleles for each SNP.

So, for each column's ID there are lines with the info I'm looking for...

I have a Perl script for extracting specific individuals (IDs), and I just realized that some IDs have missing values, which is impairing my later analysis. The script prints the empty values as they are (empty). If I use vcftools instead, it prints a dot (.) in place of the empty value, but that doesn't help me either. So I want to know which IDs have no coded genotypes.

Basically, I wanna print just the columns in which IDs present missing ("null" or "undefined") values, i.e. they're blank.

Since my files are huge in both dimensions, it's not easy to see by eye which IDs have no genotypes in their lines. From my limited knowledge, I believe the easiest way is to check for undefined values in each column and somehow print just the entries of that column.

So what I have to do is:

1) split the lines in order to get the columns (OK);
2) print the first nine columns in any case, through a loop (OK);
3) check whether there are any missing values in the columns (partially OK);
4) print only the columns that have missing data in their lines (I got stuck here).

My problem with this last part is that, even though the program finds the columns with undefined values, I do not know which ones they are in order to print only them. I need a way to tell the program to print just the columns with missing data, and I don't know how to specify those columns.
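
A minimal sketch of steps 1 to 4 with awk, under some assumptions not stated in the post: the file is tab-separated, the sample names are on the #CHROM line, and a genotype counts as missing when the field is empty, is ".", "./." or ".|.", or is absent because the row is shorter than the header. The function name missing_samples is made up for illustration. The idea is to keep the header names in an array indexed by column number, so that a column found to have missing data can be reported by name:

```shell
# Hypothetical helper: list the sample columns (10th onward) of a VCF that
# have at least one missing genotype, with the number of affected records.
missing_samples() {
    awk -F'\t' '
        /^##/ { next }                       # skip meta lines
        /^#CHROM/ {                          # header: remember sample names
            ncol = split($0, name, "\t")
            next
        }
        {
            # Loop up to the header width so that truncated rows
            # (fewer fields than the header) also count as missing.
            for (i = 10; i <= ncol; i++) {
                gt = (i <= NF) ? $i : ""
                if (gt == "" || gt == "." || gt == "./." || gt == ".|.")
                    miss[i]++
            }
        }
        END {
            # Report in column order: sample name, number of missing records.
            for (i = 10; i <= ncol; i++)
                if (miss[i] > 0)
                    printf "%s\t%d\n", name[i], miss[i]
        }
    ' "$1"
}
```

Calling `missing_samples chr22_hg19_phased_selscan.vcf` would then print one line per affected sample: its name and the number of records in which its genotype is missing. It reads the file once, line by line, so memory use should not grow with the number of SNPs.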

Could someone please help me out?

SNP vcf perl shell • 3.5k views

It's not clear to me. Please use the correct terminology, e.g. VARIANT/INFO/FILTER/FORMAT/GENOTYPE/SAMPLE/ATTRIBUTE..., not 'value' or 'data'.


Here is an original piece of the file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HGDP00876   PT-91DX PT-91YR 4256126086_A    PT-911T
22  17154984    rs9605028   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17155383    rs1892844   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17178213    rs2845371   0   1   .   PASS    .   GT  1|1 1|1 1|0 1|0 0|0
22  17178586    rs5993924   0   1   .   PASS    .   GT  1|1 0|1 1|1 0|1 1|1
22  17202602    rs2845379   0   1   .   PASS    .   GT  0|0 1|0 1|1 0|1 0|0
22  17214252    rs2845346   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17254399    rs2190742   0   1   .   PASS    .   GT  0|0 0|0 1|1 1|1 1|1
22  17264904    rs9605145   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1
22  17265194    rs9605146   0   1   .   PASS    .   GT  1|1 1|1 1|1 1|1 1|1

The problem is: there are individuals (columns) with no "coding" information (i.e. columns without the binary genotype codes in their lines). I'm trying to be clear, but I know I'm being confusing. Sorry about that...


Can you provide an example of one of these "missing" values? The VCF file format has a specification document (https://samtools.github.io/hts-specs/VCFv4.3.pdf). I'm not sure where you got your file from, but it doesn't actually conform to the VCF specification, since "0" and "1" are not valid REF and ALT alleles. This may not matter for your purposes, though.

Are you looking for records that do not include genotype information for at least one sample? If so, can you provide an example of a "missing" genotype? Is it "." or "./." or ".|." or some other value?


Yeah, sure!

For the individuals who don't have a genotype, it is literally missing from the file: the field is empty and nothing is shown. Please see the following pictures to see what I mean:

Original VCF files with "missing" genotypes, viewed in R (you can see the empty space between columns): https://i.imgsafe.org/a701cc31dd.png https://i.imgsafe.org/a701e3cb71.jpg

Actually, if I filter the VCF file with my Perl script to produce another VCF file (recoded, specific to population X), the missing genotypes stay empty. If I filter the VCF file with vcftools, the missing genotypes are shown as a single dot ".". Please see the pictures:

VCF produced by the Perl script (empty values): https://i.imgsafe.org/a70237965a.jpg
VCF produced by vcftools (dots indicate missing values): https://i.imgsafe.org/a7024e1120.jpg

Another example: if I count the number of tabs (and therefore columns) per line, I can see that the header has 2356 tabs while the lines with data have only 2339. I ran this command in a Unix shell to see that:

$ awk '{print gsub(/\t/,"")}' chr22_hg19_phased_selscan.vcf >out.vcf
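
Counting fields rather than tabs (fields = tabs + 1) and tabulating the counts gives a more compact summary of the same check than one number per line. This is only a sketch: field_count_summary is a hypothetical name, and the file is assumed to be tab-separated with a #CHROM header line.

```shell
# Hypothetical helper: show the field count of the #CHROM header and how
# many data records have each distinct field count.
field_count_summary() {
    awk -F'\t' '
        /^##/     { next }                       # meta lines carry no columns
        /^#CHROM/ { print "header\t" NF; next }  # width of the header line
                  { n[NF]++ }                    # tally the record widths
        END       { for (k in n) print "records\t" k "\t" n[k] }
    ' "$1" | LC_ALL=C sort
}
```

If every record had the same width as the header, the output would contain a single "records" line whose field count equals the header's. Any extra "records" line points to rows that are shorter or longer than the header.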

I'm entirely new to programming and bioinformatics. I want to extract individuals from specific populations and run tests to analyze the influence of natural selection...

7.9 years ago

It seems that you want to filter out the sample columns that contain any missing data (an empty field) or a dot (.).

Here's a dummy example in which the PT-91YR column is the one to be found.

[update] Note that, for convenience, I deleted the meta lines starting with ## and removed the leading # from the header line, using cat d.txt | grep '^##' -v | sed '1s/^#//'.

$ cat d.txt 
CHROM   POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HGDP00876       PT-91YR 4256126086_A
22      17154984        rs9605028       0       1       .       PASS    .       GT      1|1             1|1
22      17155383        rs1892844       0       1       .       PASS    .       GT      1|1             1|1
22      17178213        rs2845371       0       1       .       PASS    .       GT      1|1             1|0

A better view:

$ csvtk -t pretty d.txt 
CHROM   POS        ID          REF   ALT   QUAL   FILTER   INFO   FORMAT   HGDP00876   PT-91YR   4256126086_A
22      17154984   rs9605028   0     1     .      PASS     .      GT       1|1                   1|1
22      17155383   rs1892844   0     1     .      PASS     .      GT       1|1                   1|1
22      17178213   rs2845371   0     1     .      PASS     .      GT       1|1                   1|0

Steps:

  1. Transpose the table using csvtk or GNU datamash

    # $ cat d.txt | datamash transpose
    $ cat d.txt | csvtk -t transpose 
    CHROM   22      22      22
    POS     17154984        17155383        17178213
    ID      rs9605028       rs1892844       rs2845371
    REF     0       0       0
    ALT     1       1       1
    QUAL    .       .       .
    FILTER  PASS    PASS    PASS
    INFO    .       .       .
    FORMAT  GT      GT      GT
    HGDP00876       1|1     1|1     1|1
    PT-91YR
    4256126086_A    1|1     1|1     1|0
    
  2. Search for the lines whose columns from the 2nd onward contain missing data (an empty field or a dot).

    # short-flag version:
    # $ cat d.txt | csvtk -t transpose \
    #    | awk 'FNR>9' \
    #    | csvtk grep -H -t -f 2-10000 -r -p '(^$)|\.' \
    #    | cut -f 1
    
    $ cat d.txt | csvtk -t transpose \
        | awk 'FNR>9' \
        | csvtk grep --no-header-row --tabs --fields 2-10000 --use-regexp --pattern '(^$)|\.'\
        | cut -f 1
    PT-91YR
    
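  3. You can now delete those columns in R or with csvtk. The commands below assume that the column names found in step 2 were saved to column_list.txt.

    # prepare the list of columns to delete, e.g. to delete col1 and col2: "-col1,-col2"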
    $ colnames=$(perl -e 'print "-", join ",-", ( map {chomp; $_ } <> );' column_list.txt)
    $ echo $colnames
    -PT-91YR
    
    $ cat d.txt | csvtk cut -t -f $colnames | csvtk -t pretty
    CHROM   POS        ID          REF   ALT   QUAL   FILTER   INFO   FORMAT   HGDP00876   4256126086_A
    22      17154984   rs9605028   0     1     .      PASS     .      GT       1|1         1|1
    22      17155383   rs1892844   0     1     .      PASS     .      GT       1|1         1|1
    22      17178213   rs2845371   0     1     .      PASS     .      GT       1|1         1|0
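
If csvtk is not at hand, the column removal of step 3 can also be sketched in plain awk. This assumes the same preprocessed, tab-separated table with a header row (like d.txt) and a list file with one column name per line (like column_list.txt). drop_columns is a hypothetical name, not part of csvtk.

```shell
# Hypothetical helper: drop the columns named in a list file (one name per
# line) from a tab-separated table whose first line is the header.
# usage: drop_columns column_list.txt d.txt
drop_columns() {
    awk -F'\t' -v OFS='\t' '
        FILENAME == ARGV[1] { drop[$0] = 1; next }   # list file: names to drop
        FNR == 1 {                                   # table header: pick columns
            for (i = 1; i <= NF; i++)
                if (!($i in drop)) keep[++k] = i
        }
        {
            # Rebuild every line (header included) from the kept columns.
            line = ""
            for (j = 1; j <= k; j++)
                line = line (j > 1 ? OFS : "") $(keep[j])
            print line
        }
    ' "$1" "$2"
}
```

Only the header is matched against the list, so the table is streamed once and never held in memory, which may matter for a file that is large in both directions.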
    

Such an awesome tool! It worked and, furthermore, provided me a new tool for working with my data. I really liked it! Thank u very much, shenwei356!

Actually, I'd like to take this opportunity to ask a question regarding this tool. It worked nicely with my filtered data, but it didn't work with the original one, which reports " [ERRO] read /dev/stdin: file already closed". Do you know why?

regards,


Sorry, I have no idea without seeing the data and the commands.


Regarding the original VCF file, I've observed that if I remove the "#" from any of the initial comment lines or from the header, csvtk reports that error, but only with the original file. Comparing it with the filtered ones, I see the following:

- the original VCF file has 2356 tabs in the header and 2339 in the following lines with genotype data;
- the filtered VCF file has 437 tabs in the header and the same 437 tabs in the following lines.

Do you think that could be the problem?

I'm asking because maybe you're familiar with this. If it would take too much of your time, please disregard. You've helped me a lot already!


Sorry, I have never encountered the error [ERRO] read /dev/stdin: file already closed. Please try again after removing the entire initial lines starting with ## (not just the hash signs) and deleting the leading # from the header line, like this:

cat data.vcf | grep '^##' -v | sed '1s/^#//' > new.vcf

Then process the new VCF file.

7.9 years ago
Ketil 4.1k

I wanna print just the columns in which IDs present missing ("null" or "undefined") values, i.e. they're blank.

If this is actually what you want, you could identify columns with blanks, something like:

for i in {1..10}; do cut -f$i < file | grep -q '^$' && echo $i; done

and then to print those columns (using the result of the above):

cut -f1,2,4,5 < file
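
The two steps above can be chained so that the column list does not have to be typed by hand. A rough sketch, assuming a tab-separated VCF whose sample columns start at column 10 and treating only truly empty fields as missing (print_blank_columns is a made-up name):

```shell
# Hypothetical helper: print the nine fixed VCF columns plus only those
# sample columns that contain at least one empty field.
print_blank_columns() {
    # Pass 1: collect the indices of the sample columns with an empty field,
    # as a string such as ",11,15" (empty if no column qualifies).
    cols=$(awk -F'\t' '
        /^#CHROM/ { ncol = NF; next }    # header width = number of columns
        /^#/      { next }               # skip the ## meta lines
        { for (i = 10; i <= ncol; i++) if (i > NF || $i == "") bad[i] = 1 }
        END { for (i = 10; i <= ncol; i++) if (i in bad) printf ",%d", i }
    ' "$1")
    # Pass 2: let cut extract those columns. Meta lines are dropped first
    # because cut prints lines without any tab unchanged.
    grep -v '^##' "$1" | cut -f "1-9$cols"
}
```

This reads the file twice instead of once per column, which should be noticeably faster than the loop on a file with thousands of samples.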


I tested it with my file and with a simple simulated one, but unfortunately it didn't work :/

7.9 years ago
dyollluap ▴ 310

If your accession IDs are all of the form rsXXXXXXX, you could try grep:

grep -v 'tabrs' yourfilename.vcf >nullid_yourfilename.vcf

(type Ctrl+V then Tab in place of "tab" to get a literal tab character in the grep pattern). This selects every row that doesn't have an rs accession ID.

Usually anything without an accession for the ID field would use something like chr_position_variant.


I'm sorry for being unclear. Please see my reply to Pierre's comment with an original piece of the file. The individuals have different ID patterns. Thanks for the reply!

7.9 years ago

Basically, I wanna print just the columns in which IDs present missing ("null" or "undefined")

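I'm still not sure I understand, but here is a solution with awk. It lists CHROM, POS, sample name and genotype for every genotype that is not 0|0, 0|1, 1|0 or 1|1 (fields are split on tabs so that empty genotypes are not skipped):

 awk -F'\t' '/^#CHROM/ {N=split($0,header);next;} {for(i=10;i<=N;++i) printf("%s\t%s\t%s\t%s\n",$1,$2,header[i],$i);}' input.txt | grep  -v -E '([01]\|[01])$'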

I tested it with my file and with a simple simulated one, but unfortunately it didn't work :/
