Question

Identifying repeated genes from multiple lists

9.7 years ago

Megatron ▴ 10

Hello,

I have 25 lists of genes.

Each list contains anywhere from 1 to 50 genes.

I want to process these lists to find, between these 25 lists, which genes show up most frequently.

Can anyone help?

What I have tried in R:

Loading all 25 lists, and then

Reduce(intersect, list(a,b,c))

However, with all 25 lists as input this usually returns an empty result, because no single gene appears on all 25 lists.

My aim is a list of genes ordered by how often they appear across these 25 lists.

Thanks

cat list.* | sort | uniq -c | sort -n | tail

?

written 9.7 years ago by Pierre Lindenbaum

Can you be a little more specific? I'm new at R.

I loaded the lists into my Global Environment, so I'm not sure whether I have to cat list or sort.

When I ran unique(list1, list2, list3, ..., list25) it said "hash table is full". I'm also not sure what -c means.

Thanks for your patience; this is probably very easy for you.

If it helps, here is my data specifically that I enter into the R Console:

Load lists:

list1 <- c(65,84,137,159,164,209,209,221,330)
list10 <- c(3,7,25,28,44,44,46,54,58,66,69,85,88,109,129,155,155,168,187,187,187,190,191,196,204,208,233,247,262,275,288,316,333,347,350,356)
list11 <- c(12,33,52,61,63,67,75,79,81,82,87,95,99,101,108,114,121,130,132,138,144,147,147,148,165,171,173,178,182,189,197,202,220,229,234,236,238,240,246,247,259,262,263,274,276,280,280,290,298,308,312,326,329,331,335,337,339,341)
list13 <- c(17,36,71,73,74,91,96,123,150,205,211,213,255,277,307,318,339,342,358)
list14 <- c(4,4,5,5,7,15,20,29,31,62,78,80,104,109,117,127,130,132,161,179,184,188,192,194,195,200,202,206,218,230,232,235,242,245,257,257,259,261,281,292,293,302,304,306,310,311,324,327,336,345,354)
list15 <- c(50,103,121,136,156,174,187,247,251,253,258,310,319,336,343)
list16 <- c(11,109,128,140,172,181,188,201,207,247,247,265,279,344,356,358)
list17 <- c(10,21,59,199,299)
list18 <- c(53,57,63,90,165,176,198,243,315,338,351)
list19 <- c(6,9,23,35,53,94,106,107,113,118,124,126,146,146,203,216,237,244,248,266,268,285,286,289,296,298,300,300,314,340)
list2 <- c(20,35,39,49,79,105,111,116,119,130,141,143,147,147,151,159,160,167,174,180,212,214,239,250,252,256,267,271,291,301,305,307,318,322,351)
list20 <- c(320)
list21 <- c(346)
list3 <- c(2,13,38,55,70,81,88,98,115,133,133,153,154,162,169,183,212,274,340,348,349,355)
list4 <- c(270,278,354)
list5 <- c(32,135,196,297)
list6 <- c(290,316,317)
list7 <- c(14,26,34,41,42,76,132,163,186,222,225,231,232,239,269,272,303,313,334,352,353,356,357)
list8 <- c(4,8,16,30,40,43,47,56,97,98,110,122,130,149,185,217,236,236,282,321)
list9 <- c(1,11,16,18,19,20,22,24,27,32,37,45,48,51,60,63,64,68,69,70,72,77,83,86,89,91,92,93,100,102,104,112,116,120,122,123,125,131,134,139,142,145,152,157,158,162,164,166,170,170,171,175,177,183,193,210,215,219,223,224,226,227,228,232,241,247,249,250,254,260,264,272,273,280,280,280,283,284,287,294,295,298,301,309,310,313,320,323,324,325,328,332,356)

written 9.7 years ago by Megatron

Ram · Answer 1 · 2015-05-28

9.7 years ago

ethan.kaufman ▴ 380

sort(table(c(list1, list2, ..., list25)), decreasing=TRUE)
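
Note that the example vectors posted above contain repeats within a single list (209 appears twice in list1, for instance), so concatenating everything counts those repeats as well. If the goal is the number of lists each gene appears in, one option is to de-duplicate each list before tabulating. A minimal sketch, assuming the vectors are collected in a named list (the name gene_lists and the toy values are made up for illustration):

```r
# Hypothetical toy data standing in for the 25 real lists.
gene_lists <- list(
  list1 = c(65, 84, 209, 209),
  list2 = c(20, 84, 209),
  list3 = c(84, 340)
)

# Drop repeats within each list so a gene counts at most once per list.
per_list_unique <- lapply(gene_lists, unique)

# Tabulate how many lists each gene appears in, most frequent first.
list_counts <- sort(table(unlist(per_list_unique)), decreasing = TRUE)
print(list_counts)
```

If the vectors already sit in the Global Environment under names like list1 to list25 (an assumption based on the names in the post), mget(ls(pattern = "^list[0-9]+$")) should gather them into such a named list.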

Ram · Answer 2 · 2015-05-28

9.7 years ago

Alex Reynolds 36k

What is the format of your gene list?

If it is just a text file split by newlines, then you could easily do this on the command line with awk:

$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk ' \
        { \
            geneCounts[$0]++; \
        } \
        END { \
            for (geneName in geneCounts) { \
                print geneName"\t"geneCounts[geneName]; \
            } \
        }' - \
    > unsortedCounts.txt

The file unsortedCounts.txt is an unsorted two-column file containing the gene name and its count across files geneListA.txt through geneListN.txt.

To sort this by counts, just pipe the output of the awk statement to GNU sort and do a (descending) numeric sort on the second column:

$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk ' \
        { \
            geneCounts[$0]++; \
        } \
        END { \
            for (geneName in geneCounts) { \
                print geneName"\t"geneCounts[geneName]; \
            } \
        }' - \
    | sort -k2,2nr - \
    > sortedCounts.txt

Thanks, let me try this and report back

written 9.7 years ago by Megatron

I just tried to download gawk for windows + source files, accessed Gnuwin32/bin/awk etc on MS DOS and placed the genelists in the directory - I am completely lost though. Maybe staying on R is a better option

edit: or if you could provide some simpler steps

cheers

written 9.7 years ago by Megatron

From MSDOS, I realized that type is the equivalent of cat

so I did cat genelist1.txt genelist2.txt genelist3.txt and in the cmd all the lists were printed out

Then, gawk { \geneCounts[$0]++; \} gives me an invalid character

written 9.7 years ago by Megatron

Don't do bioinformatics on Windows. Sorry to be a snob about it, but you'll otherwise have to jump through numerous hoops to do common command-line tasks like these. Either swap out your OS or run your analyses within a Linux VM in VirtualBox or similar.

written 9.7 years ago by Alex Reynolds

Ram · Answer 3 · 2015-05-29

Megatron,

I had to do something similar in R a while ago. You'll need to create a list of lists, a list of unique gene IDs and then a matrix of counts. Here is my code:

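# Toy example: a named list holding one vector of gene IDs per gene list
my.lists <- list(list1=c(123,234,345), list2=c(45,23,12,78,43,87,123), list3=c(123,432,234,45,23))
unique_genes <- unique(unlist(my.lists))
# Set up an empty matrix: one row per list, one column per unique gene
mtx <- matrix(0, nrow=length(my.lists), ncol=length(unique_genes))
rownames(mtx) <- names(my.lists)
colnames(mtx) <- unique_genes
# Populate the matrix: 1 if the gene is present in that list, 0 otherwise
for(i in rownames(mtx)){
    mtx[i, colnames(mtx) %in% my.lists[[i]]] <- 1
}
# Number of lists containing each gene, most frequent first
freqSorted <- sort(colSums(mtx), decreasing=TRUE)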
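
The same presence/absence idea can also be written without an explicit loop. A minimal sketch, reusing the toy my.lists from the code above (the names presence, freq_sorted, top_gene and lists_with_top are introduced here for illustration only):

```r
# Same toy data as in the answer above.
my.lists <- list(list1 = c(123, 234, 345),
                 list2 = c(45, 23, 12, 78, 43, 87, 123),
                 list3 = c(123, 432, 234, 45, 23))
unique_genes <- unique(unlist(my.lists))

# One column per list, one row per gene; TRUE where the gene is in the list.
presence <- sapply(my.lists, function(genes) unique_genes %in% genes)
rownames(presence) <- unique_genes

# Number of lists containing each gene, most frequent first.
freq_sorted <- sort(rowSums(presence), decreasing = TRUE)

# Names of the lists that contain the most frequent gene.
top_gene <- names(freq_sorted)[1]
lists_with_top <- colnames(presence)[presence[top_gene, ]]
```

Keeping the logical matrix around makes it easy to look up which lists a given gene came from, not just how many.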