Question

How can I write a loop to count the genes for each query ID?

9.3 years ago

tcf.hcdg ▴ 70

I have a data frame in R with 14 columns and 4.4 million rows.

Column 1 has the query ID and column 4 has the gene name.

I want to make a data frame that shows which genes, and how many of each, correspond to each query ID.

I have 44K different query IDs, and each query has at most ~100 gene hits.

CSAI_contig04661_6     sp     O65396     GCST      ARATH     86.03     408     56      1      72          478     1       408     0.0e+00     738.0
CSAI_contig04661_6     sp     Q681Y3     Y1099     ARATH     22.55     337     244     10     140         474     103     424     8.0e-09     56.6
CSAI_contig04661_6     sp     Q9FLR5     SMC6A     ARATH     24.27     103     66      3      04. Jun     249     342     441     4.6e+00     28. Sep
CSAI_contig04661_6     sp     Q9LQI7     GCST      ARATH     24.28     74      47      2      17. Aug     300     31      100     8.1e+00     27. Jul
CSAI_contig04661_6     sp     P56795     RK22      ARATH     28.95     76      49      4      11. Mrz     509     15      87      8.4e+00     27. Mrz
CSAI_isotig00001_4     sp     Q8VZE4     PP299     ARATH     29.63     108     55      5      31. Jul     307     10      109     1.6e+00     30. Apr

I am interested in this type of output.

CSAI_contig04661_6                GCST       2
                                  Y1099      1
                                  SMC6A      1
                                  RK22       1

How can I write a loop that reads column 1 for as long as the query ID stays the same (6 in this example), then goes to column 4, finds which genes are present, and counts how often each one occurs? In this example, GCST is present 2 times for the first query.

updated 2.2 years ago by Ram 44k • written 9.3 years ago by tcf.hcdg ▴ 70

Did you try grep in Linux?

9.2 years ago by Maxime Lamontagne ★ 2.4k

I tried it in R with the following:

group_by(t38kbat, query_id, gene) %>% summarise(n())

I received the output in this form:

query_id  gene n()
1  CSAI_contig04661_6  GCST   3
2  CSAI_contig04661_6 SMC6A   1
3  CSAI_contig04661_6 Y1099   1
4  CSAI_isotig00001_4 AMSH3   1
5  CSAI_isotig00001_4 C98A9   1
6  CSAI_isotig00001_4 MOB2A   1
7  CSAI_isotig00001_4 PP299   1
8  CSAI_isotig00001_4  QORL   1
9  CSAI_isotig00001_4 WAKLP   1
10 CSAI_isotig00004_3  GCST   1
..                ...   ... ...

I want to print each query ID only once. For example:

CSAI_contig04661_6
                                               GCST    3
                                               SMC6A   1
                                               Y1099   1

CSAI_isotig00001_4
                                               AMSH3   1
                                               C98A9   1
                                               MOB2A   1
                                               PP299   1
                                               QORL    1
                                               WAKLP   1

updated 2.2 years ago by Ram 44k • written 9.2 years ago by tcf.hcdg ▴ 70

Ram · Answer 1 · 2015-08-26

Hi,

Here is one possible solution:

# A helper vector of ones, one per row; summing it per group gives the counts.
count <- rep(1, nrow(data))

# A data frame with only the columns of interest.
df <- data.frame(geneID = data$geneID, geneName = data$geneName, count = count)

# aggregate() applies sum() to count within each distinct
# (geneID, geneName) combination, i.e. it counts the rows per pair.
ag <- aggregate(count ~ ., data = df, FUN = sum)
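
To see the idea end to end, here is a small self-contained sketch using columns 1 and 4 of the six example rows from the question. The data frame name `hits` and the column names `query_id` and `gene` are assumptions made for illustration; substitute the real names from your own table.

```r
# Toy data: columns 1 and 4 of the six example rows from the question.
# The names `hits`, `query_id` and `gene` are assumed for illustration.
hits <- data.frame(
  query_id = c(rep("CSAI_contig04661_6", 5), "CSAI_isotig00001_4"),
  gene     = c("GCST", "Y1099", "SMC6A", "GCST", "RK22", "PP299"),
  stringsAsFactors = FALSE
)

# One helper column of ones; summing it per group gives the counts.
hits$count <- 1

# Sum `count` within every (query_id, gene) combination.
ag <- aggregate(count ~ query_id + gene, data = hits, FUN = sum)

# Sort by query ID, then by gene, for readability.
ag <- ag[order(ag$query_id, ag$gene), ]
rownames(ag) <- NULL
print(ag)
```

On this toy input, GCST gets a count of 2 for CSAI_contig04661_6 and every other pair gets 1. The `group_by(...) %>% summarise(n())` call from the reply above should compute the same kind of counts.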

Whenever you can, avoid loops in R. ;)
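
As for printing each query ID only once, that can also be done without a loop. This is a minimal sketch, assuming a data frame of counts shaped like the `summarise()` output shown in the thread and already sorted by query ID; the names `counts`, `display` and `n` are made up for the example.

```r
# Assumed input: per-(query_id, gene) counts, sorted by query_id,
# shaped like the summarise() output shown in the thread.
counts <- data.frame(
  query_id = c(rep("CSAI_contig04661_6", 3), rep("CSAI_isotig00001_4", 2)),
  gene     = c("GCST", "SMC6A", "Y1099", "AMSH3", "C98A9"),
  n        = c(3, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Copy for display only; keep `counts` intact for further analysis.
display <- counts

# duplicated() is TRUE for every repeat of an ID after its first occurrence;
# blank those so each query ID is printed once.
display$query_id[duplicated(display$query_id)] <- ""

print(display, row.names = FALSE)
```

Blanking the IDs is only for presentation, so it is safer to do it on a copy. If you need the groups as separate objects instead, `split(counts[c("gene", "n")], counts$query_id)` returns a named list with one small data frame per query ID.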