Hi,
I have a table with millions of columns and hundreds of rows. There are two columns namely Genome
and Gene
. I also have a list of genes of interest that matches some of the genes from the Gene
column of the data table. I want to extract ONLY those Genomes from the Genome
column with ALL the genes in the list. Here is an example data set and the desired output.
Input data table:
ID Genome Gene Factor Samp1 Samp2 Samp3
Var001 Genome1 Gene1 Fact001 0.259391466 0.831387803 0.410587301
Var002 Genome3 Gene1 Fact002 0.898459563 0.314926729 0.40721684
Var003 Genome2 Gene5 Fact001 0.909905803 0.63259759 0.318972387
Var004 Genome2 Gene2 Fact003 0.9085035 0.376803799 0.166725974
Var005 Genome1 Gene2 Fact001 0.919946522 0.852811493 0.411401126
Var006 Genome3 Gene7 Fact004 0.138077309 0.539512705 0.729510187
Var007 Genome4 Gene3 Fact005 0.871523277 0.688500219 0.732045
Var008 Genome2 Gene2 Fact001 0.841650093 0.525697925 0.459021533
Var009 Genome4 Gene6 Fact006 0.044513563 0.074029822 0.252681177
Var010 Genome1 Gene3 Fact006 0.983563686 0.407575447 0.365852195
Var011 Genome1 Gene5 Fact001 0.169699922 0.936933803 0.417067634
Var012 Genome5 Gene2 Fact007 0.134943505 0.201524375 0.904785499
Var013 Genome3 Gene3 Fact008 0.632858717 0.622417174 0.529329237
Var014 Genome4 Gene2 Fact010 0.844965387 0.981636875 0.983639709
Var015 Genome5 Gene3 Fact007 0.651606152 0.188821571 0.95424102
Var016 Genome1 Gene3 Fact009 0.305152183 0.654209057 0.496756239
Var017 Genome3 Gene2 Fact002 0.223418605 0.878336839 0.323197942
The list of genes of interest:
Gene1
Gene2
Gene3
Desired output:
ID Genome Gene Factor Samp1 Samp2 Samp3
Var001 Genome1 Gene1 Fact001 0.259391466 0.831387803 0.410587301
Var005 Genome1 Gene2 Fact001 0.919946522 0.852811493 0.411401126
Var010 Genome1 Gene3 Fact006 0.983563686 0.407575447 0.365852195
Var016 Genome1 Gene3 Fact009 0.305152183 0.654209057 0.496756239
Var002 Genome3 Gene1 Fact002 0.898459563 0.314926729 0.40721684
Var017 Genome3 Gene2 Fact002 0.223418605 0.878336839 0.323197942
Var013 Genome3 Gene3 Fact008 0.632858717 0.622417174 0.529329237
Since Genes 1, 2, and 3 are collectively present only in Genomes 1 and 3, the output should be in this manner. Please note that a simple output as Genome1, Geneome3
is also fine. I can use grep
to get all the row values.
I tried to find a solution using awk
and sort
but I hit a wall after sorting and filtering the genes and associated genomes. But I couldn't figure out how to filter ONLY those genomes which have ALL the genes from the list. Any suggestion and/or advice would be very helpful.
Many thanks for your time and help!
Thank you very much! It works just fine. Just wondering if there is a way to print the column headers too. I can always make a separate file with the headers and then
>>
to the same file for printing the filtered rows, just curious if there is a better way!Many thanks!
Printing the header is what
head -n1 table.tsv
is for.Sorry for the confusion! Yes,
head -n1 table.tsv
produces STDOUT of the headers and doesn't save it to a file when I do:That's what I meant!
create a subshell and redirect it to a file
Thanks! Solved.
Oops, my bad - I forgot that
&&
doesn't play well with redirection. I usually docat <(head -n1 ...) <(grep ...) > out.file
but(one-liner) > out.file
as suggested by Pierre is a better way of doing that.Thank you for the alternative.
Hi @Pierre Lindenbaum thank you for the help. Could you please let me know what
awk '($1==3) {print $2}'
is doing? Is it fetching the exact number of genes, i.e.,3
in the example data? The number of output genomes changes if I change that3
number and if I keep it to3
, the genomes that have even two of the genes from the list are printed when I used my actual data. TheGenome
andGene
columns are exactly at the same position as in the example data. If I use the actual number of the gene (11), the one-liner doesn't work and I should also mention that I have ~10 genes on the list. Also, my genes are EC number 2.7.2.1, 4.6.7.8, 1.2.3.4, etc. Since we used-w
flag, it should be fine I guess.Any suggestion would be very helpful!
Many thanks.
Please read the awk manual. awk works essentially as
CONDITION { STATEMENTS_TO_EXECUTE }
so here,'($1==3) { print $2 }'
translates to "print second column if first column equals 3. The first column is the number of uniqueGene
s perGenome
Does it error out or produce no results? If the latter, are you sure that a non-empty result set is the right answer i.e., there is at least 1 Genome with 11 Genes?
Thanks! So I need to write as
'($1==<unique_gene_number_in_the_list>) { print $2 }'
, right? Unfortunately, it produced the wrong output. It gave some genomes even if they had a subset of the genes on the list. Yes, at least 1 Genome has all 11 Genes. I am not sure though if it is because of the gene names. They are EC numbers. Just the dot-separated numbers (x.x.x.x) where x is [0-9].Thanks!
Yes, that's correct.
Can you show us an example?
Hi, @Ram, thanks a lot for the suggestions and clarification. As I suspected, the problem was with my gene names (3rd column). Since they were just the EC numbers (i.e., x.x.x.x where x is [0-9]),
grep
somehow started permutation-combination although-w
flag was used. I appended a small string with the numbers to make it alpha-numeric, and it works just fine. If anyone is reading this thread for a similar situation, just be careful with these points. The one-liner works just fine!Many thanks for your kind help!
-f
is "from file".-F
is for fixed, meaning "don't treat the input pattern as a regular expression."Sorry, my bad! I meant
-w
for the whole 'word'! Edited it.Thanks!
OK that makes sense -
.
s can throw a wrench in the plans when-w
is involved.