Filtering rows based on a list if all the entries of the list correspond to a specific row value
2
0
Entering edit mode
18 months ago
bionix ▴ 10

Hi,

I have a table with millions of columns and hundreds of rows. There are two columns namely Genome and Gene. I also have a list of genes of interest that matches some of the genes from the Gene column of the data table. I want to extract ONLY those Genomes from the Genome column with ALL the genes in the list. Here is an example data set and the desired output.

Input data table:

ID  Genome  Gene    Factor  Samp1   Samp2   Samp3
Var001  Genome1 Gene1   Fact001 0.259391466 0.831387803 0.410587301
Var002  Genome3 Gene1   Fact002 0.898459563 0.314926729 0.40721684
Var003  Genome2 Gene5   Fact001 0.909905803 0.63259759  0.318972387
Var004  Genome2 Gene2   Fact003 0.9085035   0.376803799 0.166725974
Var005  Genome1 Gene2   Fact001 0.919946522 0.852811493 0.411401126
Var006  Genome3 Gene7   Fact004 0.138077309 0.539512705 0.729510187
Var007  Genome4 Gene3   Fact005 0.871523277 0.688500219 0.732045
Var008  Genome2 Gene2   Fact001 0.841650093 0.525697925 0.459021533
Var009  Genome4 Gene6   Fact006 0.044513563 0.074029822 0.252681177
Var010  Genome1 Gene3   Fact006 0.983563686 0.407575447 0.365852195
Var011  Genome1 Gene5   Fact001 0.169699922 0.936933803 0.417067634
Var012  Genome5 Gene2   Fact007 0.134943505 0.201524375 0.904785499
Var013  Genome3 Gene3   Fact008 0.632858717 0.622417174 0.529329237
Var014  Genome4 Gene2   Fact010 0.844965387 0.981636875 0.983639709
Var015  Genome5 Gene3   Fact007 0.651606152 0.188821571 0.95424102
Var016  Genome1 Gene3   Fact009 0.305152183 0.654209057 0.496756239
Var017  Genome3 Gene2   Fact002 0.223418605 0.878336839 0.323197942

The list of genes of interest:

Gene1
Gene2
Gene3

Desired output:

ID  Genome  Gene    Factor  Samp1   Samp2   Samp3
Var001  Genome1 Gene1   Fact001 0.259391466 0.831387803 0.410587301
Var005  Genome1 Gene2   Fact001 0.919946522 0.852811493 0.411401126
Var010  Genome1 Gene3   Fact006 0.983563686 0.407575447 0.365852195
Var016  Genome1 Gene3   Fact009 0.305152183 0.654209057 0.496756239
Var002  Genome3 Gene1   Fact002 0.898459563 0.314926729 0.40721684
Var017  Genome3 Gene2   Fact002 0.223418605 0.878336839 0.323197942
Var013  Genome3 Gene3   Fact008 0.632858717 0.622417174 0.529329237

Since Genes 1, 2, and 3 are collectively present only in Genomes 1 and 3, the output should be in this manner. Please note that a simple output as Genome1, Geneome3 is also fine. I can use grep to get all the row values. I tried to find a solution using awk and sort but I hit a wall after sorting and filtering the genes and associated genomes. But I couldn't figure out how to filter ONLY those genomes which have ALL the genes from the list. Any suggestion and/or advice would be very helpful.

Many thanks for your time and help!

row-matching sorting filtering • 2.1k views
ADD COMMENT
2
Entering edit mode
18 months ago

a one liner. Assuming tab-delimited iinput , assuming grep only finds gene names in column3 and only finds genomes in column2.

$ head -n1 table.tsv && grep -Fw -f <(cat table.tsv | cut -f2,3 | sort | uniq | grep -F -w -f genes.txt | cut -f1 |sort | uniq -c | awk '($1==3) {print $2}') table.tsv | grep -w -F -f genes.txt | sort -t $'\t' -k2,2 -k3,3


ID  Genome  Gene    Factor  Samp1   Samp2   Samp3
Var001  Genome1 Gene1   Fact001 0.259391466 0.831387803 0.410587301
Var005  Genome1 Gene2   Fact001 0.919946522 0.852811493 0.411401126
Var010  Genome1 Gene3   Fact006 0.983563686 0.407575447 0.365852195
Var016  Genome1 Gene3   Fact009 0.305152183 0.654209057 0.496756239
Var002  Genome3 Gene1   Fact002 0.898459563 0.314926729 0.40721684
Var017  Genome3 Gene2   Fact002 0.223418605 0.878336839 0.323197942
Var013  Genome3 Gene3   Fact008 0.632858717 0.622417174 0.529329237
ADD COMMENT
0
Entering edit mode

Thank you very much! It works just fine. Just wondering if there is a way to print the column headers too. I can always make a separate file with the headers and then >> to the same file for printing the filtered rows, just curious if there is a better way!

Many thanks!

ADD REPLY
0
Entering edit mode

Printing the header is what head -n1 table.tsv is for.

ADD REPLY
0
Entering edit mode

Sorry for the confusion! Yes, head -n1 table.tsv produces STDOUT of the headers and doesn't save it to a file when I do:

head -n1 table.tsv && grep -Fw -f <(cat table.tsv | cut -f2,3 | sort | uniq | grep -F -w -f genes.txt | cut -f1 |sort | uniq -c | awk '($1==3) {print $2}') table.tsv | grep -w -F -f genes.txt | sort -t $'\t' -k2,2 -k3,3 > output_file.tsv

That's what I meant!

ADD REPLY
1
Entering edit mode

just curious if there is a better way!

create a subshell and redirect it to a file

(my one-liner....) > output.tsv
ADD REPLY
0
Entering edit mode

Thanks! Solved.

ADD REPLY
1
Entering edit mode

Oops, my bad - I forgot that && doesn't play well with redirection. I usually do cat <(head -n1 ...) <(grep ...) > out.file but (one-liner) > out.file as suggested by Pierre is a better way of doing that.

ADD REPLY
0
Entering edit mode

Thank you for the alternative.

ADD REPLY
0
Entering edit mode

Hi @Pierre Lindenbaum thank you for the help. Could you please let me know what awk '($1==3) {print $2}' is doing? Is it fetching the exact number of genes, i.e., 3 in the example data? The number of output genomes changes if I change that 3 number and if I keep it to 3, the genomes that have even two of the genes from the list are printed when I used my actual data. The Genome and Gene columns are exactly at the same position as in the example data. If I use the actual number of the gene (11), the one-liner doesn't work and I should also mention that I have ~10 genes on the list. Also, my genes are EC number 2.7.2.1, 4.6.7.8, 1.2.3.4, etc. Since we used -w flag, it should be fine I guess.

Any suggestion would be very helpful!

Many thanks.

ADD REPLY
1
Entering edit mode

Please read the awk manual. awk works essentially as CONDITION { STATEMENTS_TO_EXECUTE } so here, '($1==3) { print $2 }' translates to "print second column if first column equals 3. The first column is the number of unique Genes per Genome

the one-liner doesn't work

Does it error out or produce no results? If the latter, are you sure that a non-empty result set is the right answer i.e., there is at least 1 Genome with 11 Genes?

ADD REPLY
0
Entering edit mode

Thanks! So I need to write as '($1==<unique_gene_number_in_the_list>) { print $2 }', right? Unfortunately, it produced the wrong output. It gave some genomes even if they had a subset of the genes on the list. Yes, at least 1 Genome has all 11 Genes. I am not sure though if it is because of the gene names. They are EC numbers. Just the dot-separated numbers (x.x.x.x) where x is [0-9].

Thanks!

ADD REPLY
0
Entering edit mode

So I need to write as '($1==<unique_gene_number_in_the_list>) { print $2 }', right?

Yes, that's correct.

Unfortunately, it produced the wrong output. It gave some genomes even if they had a subset of the genes on the list.

Can you show us an example?

ADD REPLY
0
Entering edit mode

Hi, @Ram, thanks a lot for the suggestions and clarification. As I suspected, the problem was with my gene names (3rd column). Since they were just the EC numbers (i.e., x.x.x.x where x is [0-9]), grep somehow started permutation-combination although -w flag was used. I appended a small string with the numbers to make it alpha-numeric, and it works just fine. If anyone is reading this thread for a similar situation, just be careful with these points. The one-liner works just fine!

Many thanks for your kind help!

ADD REPLY
0
Entering edit mode

although -f flag was used

-f is "from file". -F is for fixed, meaning "don't treat the input pattern as a regular expression."

ADD REPLY
0
Entering edit mode

Sorry, my bad! I meant -w for the whole 'word'! Edited it.

Thanks!

ADD REPLY
1
Entering edit mode

OK that makes sense - .s can throw a wrench in the plans when -w is involved.

ADD REPLY
0
Entering edit mode
18 months ago
Ram 44k

Use python or R. Group by Genome, create a new column named gene_csv which contains all unique values of Gene per Genome group sorted and concatenated as a comma-separated list, then filter for gene_csv == "Gene1,Gene2,Gene3"

ADD COMMENT

Login before adding your answer.

Traffic: 2662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6