I have a data frame in R with 14 columns and 4.4 million rows.
Column 1 has the query id and column 4 has the gene name.
I want to make a data frame that shows which genes, and how many of each, correspond to each query id.
I have 44K different query ids, and each query has at most ~100 gene hits.
CSAI_contig04661_6 sp O65396 GCST ARATH 86.03 408 56 1 72 478 1 408 0.0e+00 738.0
CSAI_contig04661_6 sp Q681Y3 Y1099 ARATH 22.55 337 244 10 140 474 103 424 8.0e-09 56.6
CSAI_contig04661_6 sp Q9FLR5 SMC6A ARATH 24.27 103 66 3 04. Jun 249 342 441 4.6e+00 28. Sep
CSAI_contig04661_6 sp Q9LQI7 GCST ARATH 24.28 74 47 2 17. Aug 300 31 100 8.1e+00 27. Jul
CSAI_contig04661_6 sp P56795 RK22 ARATH 28.95 76 49 4 11. Mrz 509 15 87 8.4e+00 27. Mrz
CSAI_isotig00001_4 sp Q8VZE4 PP299 ARATH 29.63 108 55 5 31. Jul 307 10 109 1.6e+00 30. Apr
I am interested in this type of output.
CSAI_contig04661_6 GCST 2
Y1099 1
SMC6A 1
RK22 1
How can I write a loop that scans column 1 for as long as the query id stays the same (6 rows in this example), then goes to column 4, finds which genes are present, and counts how many times each one occurs (in this example, GCST is present 2 times for the first query)?
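An explicit loop is probably not needed for this; grouping on the two columns and counting should give the same result. The following is only a minimal sketch in base R. It assumes the data frame is called df and that its columns carry the default names V1 (query id) and V4 (gene name), as they would after read.table() without a header. Those names and the toy rows are assumptions for illustration, not taken from the real table.

```r
# Toy stand-in for the real table; only the query id and gene name columns are kept.
# The names df, V1 and V4 are assumptions (read.table defaults without a header).
df <- data.frame(
  V1 = c(rep("CSAI_contig04661_6", 5), "CSAI_isotig00001_4"),
  V4 = c("GCST", "Y1099", "SMC6A", "GCST", "RK22", "PP299"),
  stringsAsFactors = FALSE
)

# Add a helper column of ones, then sum it within each (query id, gene) pair.
pairs <- transform(df[, c("V1", "V4")], n = 1L)
counts <- aggregate(n ~ V1 + V4, data = pairs, FUN = sum)

# Sort by query id, then by descending count, then by gene name.
counts <- counts[order(counts$V1, -counts$n, counts$V4), ]
rownames(counts) <- NULL

print(counts)
```

On 4.4 million rows aggregate() may be slow; a package such as data.table or dplyr would likely be faster, but the grouping idea is the same.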
Did you try grep in Linux?
I tried it in R with the following:
I received the output in this form
I want to print each query id only once. For example:
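One possible way to get that layout, sketched under assumptions: if the per-query counts are already in a data frame sorted by query id, the repeated ids can be blanked before printing. The data frame name counts and its columns query, gene and n are hypothetical, and the rows below are toy values.

```r
# Toy count table, already sorted by query id (names and values are hypothetical).
counts <- data.frame(
  query = c(rep("CSAI_contig04661_6", 4), "CSAI_isotig00001_4"),
  gene  = c("GCST", "RK22", "SMC6A", "Y1099", "PP299"),
  n     = c(2L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Work on a display copy so the full ids stay available in counts.
shown <- counts

# duplicated() is TRUE for every occurrence after the first; blank those ids.
shown$query[duplicated(shown$query)] <- ""

print(shown, row.names = FALSE)
```

This changes only how the table is displayed. For further analysis it is safer to keep the original counts data frame with the id on every row.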