Identifying repeated genes from multiple lists
9.5 years ago
Megatron ▴ 10

Hello,

I have 25 lists of genes.

Each list contains anywhere from 1 to 50 genes.

I want to find which genes show up most frequently across these 25 lists.

Can anyone help?

What I have tried in R:

I loaded all 25 lists and then ran:

Reduce(intersect, list(a,b,c))

However, with all 25 lists as input it returns an empty result, because no single gene appears on all 25 lists.

My aim is a list of genes ranked by how often they appear across these 25 lists.

Thanks

gene R • 2.5k views
cat list.* | sort | uniq -c | sort -n | tail

?


Can you be a little more specific? I'm new at R.

I loaded the lists into my Global Environment, so I'm not sure whether I have to cat the lists or sort them.

When I ran unique(list1, list2, list 3...list 25), it said the hash table is full. I'm also not sure what -c means.

Thanks for your patience; this is probably so easy for you.

If it helps, here is my data specifically that I enter into the R Console:

Load lists:

list1 <- c(65,84,137,159,164,209,209,221,330)
list10 <- c(3,7,25,28,44,44,46,54,58,66,69,85,88,109,129,155,155,168,187,187,187,190,191,196,204,208,233,247,262,275,288,316,333,347,350,356)
list11 <- c(12,33,52,61,63,67,75,79,81,82,87,95,99,101,108,114,121,130,132,138,144,147,147,148,165,171,173,178,182,189,197,202,220,229,234,236,238,240,246,247,259,262,263,274,276,280,280,290,298,308,312,326,329,331,335,337,339,341)
list13 <- c(17,36,71,73,74,91,96,123,150,205,211,213,255,277,307,318,339,342,358)
list14 <- c(4,4,5,5,7,15,20,29,31,62,78,80,104,109,117,127,130,132,161,179,184,188,192,194,195,200,202,206,218,230,232,235,242,245,257,257,259,261,281,292,293,302,304,306,310,311,324,327,336,345,354)
list15 <- c(50,103,121,136,156,174,187,247,251,253,258,310,319,336,343)
list16 <- c(11,109,128,140,172,181,188,201,207,247,247,265,279,344,356,358)
list17 <- c(10,21,59,199,299)
list18 <- c(53,57,63,90,165,176,198,243,315,338,351)
list19 <- c(6,9,23,35,53,94,106,107,113,118,124,126,146,146,203,216,237,244,248,266,268,285,286,289,296,298,300,300,314,340)
list2 <- c(20,35,39,49,79,105,111,116,119,130,141,143,147,147,151,159,160,167,174,180,212,214,239,250,252,256,267,271,291,301,305,307,318,322,351)
list20 <- c(320)
list21 <- c(346)
list3 <- c(2,13,38,55,70,81,88,98,115,133,133,153,154,162,169,183,212,274,340,348,349,355)
list4 <- c(270,278,354)
list5 <- c(32,135,196,297)
list6 <- c(290,316,317)
list7 <- c(14,26,34,41,42,76,132,163,186,222,225,231,232,239,269,272,303,313,334,352,353,356,357)
list8 <- c(4,8,16,30,40,43,47,56,97,98,110,122,130,149,185,217,236,236,282,321)
list9 <- c(1,11,16,18,19,20,22,24,27,32,37,45,48,51,60,63,64,68,69,70,72,77,83,86,89,91,92,93,100,102,104,112,116,120,122,123,125,131,134,139,142,145,152,157,158,162,164,166,170,170,171,175,177,183,193,210,215,219,223,224,226,227,228,232,241,247,249,250,254,260,264,272,273,280,280,280,283,284,287,294,295,298,301,309,310,313,320,323,324,325,328,332,356)
9.5 years ago
ethan.kaufman ▴ 380
sort(table(c(list1, list2, ..., list25)), decreasing=TRUE)
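
A minimal runnable sketch of the one-liner above, using a small made-up set of lists (the values and the names gene_lists, raw_counts and per_list_counts are hypothetical, not taken from the poster's data). It also shows one caveat: table() counts every occurrence, so a gene repeated within a single list is counted more than once unless each list is de-duplicated first.

```r
# Hypothetical toy data; substitute your own vectors.
gene_lists <- list(
  list1 = c(65, 84, 209, 209),  # 209 is duplicated within this list
  list2 = c(84, 209, 300),
  list3 = c(209, 500)
)

# Raw counts: every occurrence is counted, so 209 is counted 4 times here.
raw_counts <- sort(table(unlist(gene_lists)), decreasing = TRUE)

# Per-list counts: de-duplicate each list first, so each gene is counted
# at most once per list (209 appears in 3 of the 3 lists).
per_list_counts <- sort(table(unlist(lapply(gene_lists, unique))),
                        decreasing = TRUE)

print(raw_counts)
print(per_list_counts)
```

If the vectors already exist in the workspace as list1, list2 and so on, something like mget(ls(pattern = "^list[0-9]+$")) should collect them into one named list that can be used in place of gene_lists.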
9.5 years ago

What is the format of your gene list?

If it is just a text file split by newlines, then you could easily do this on the command line with awk:

$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk '
        # count each input line (one gene name per line)
        {
            geneCounts[$0]++;
        }
        # after all input is read, print each gene and its count
        END {
            for (geneName in geneCounts) {
                print geneName"\t"geneCounts[geneName];
            }
        }' - \
    > unsortedCounts.txt

The file unsortedCounts.txt is an unsorted two-column file containing the gene name and its count across files geneListA.txt through geneListN.txt.

To sort this by counts, just pipe the output of the awk statement to GNU sort and do a (descending) numeric sort on the second column:

$ cat geneListA.txt geneListB.txt ... geneListN.txt \
    | awk '
        {
            geneCounts[$0]++;
        }
        END {
            for (geneName in geneCounts) {
                print geneName"\t"geneCounts[geneName];
            }
        }' - \
    | sort -k2,2nr - \
    > sortedCounts.txt
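
For anyone who would rather stay in R, the same count over newline-separated text files can be sketched roughly as follows. This is only an illustration under stated assumptions: the gene names, the file names and the output name sortedCounts.tsv are hypothetical, and the files are written to a temporary directory so the example is self-contained.

```r
# Write three small hypothetical gene lists (one gene per line) to a temp dir.
tmp <- tempfile("genelists")
dir.create(tmp)
writeLines(c("BRCA1", "TP53", "EGFR"), file.path(tmp, "geneListA.txt"))
writeLines(c("TP53", "MYC"),           file.path(tmp, "geneListB.txt"))
writeLines(c("TP53", "EGFR", "KRAS"),  file.path(tmp, "geneListC.txt"))

# Read every .txt file in the directory and pool the gene names.
files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)
genes <- unlist(lapply(files, readLines))

# Count occurrences and sort, most frequent first.
counts <- sort(table(genes), decreasing = TRUE)
print(counts)

# Save as a two-column, tab-separated file (gene name, count).
write.table(as.data.frame(counts), file.path(tmp, "sortedCounts.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```

With real data, point list.files() at the folder that holds the gene list files instead of the temporary directory.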

Thanks, let me try this and report back


I just tried to download gawk for Windows plus the source files, ran Gnuwin32/bin/awk etc. from MS-DOS, and placed the gene lists in that directory, but I am completely lost. Maybe staying in R is a better option.

edit: or if you could provide some simpler steps

cheers


From MS-DOS, I realized that type is the equivalent of cat,

so I did cat genelist1.txt genelist2.txt genelist3.txt and all the lists were printed out in cmd.

Then gawk { \geneCounts[$0]++; \} gives me an invalid character error.


Don't do bioinformatics on Windows. Sorry to be a snob about it, but you'll otherwise have to jump through numerous hoops to do common command-line tasks like these. Either swap out your OS or run your analyses within a Linux VM in VirtualBox or similar.

9.5 years ago
alolex ▴ 960

Megatron,

I had to do something similar in R a while ago. You'll need to create a list of lists, a list of unique gene IDs and then a matrix of counts. Here is my code:

my.lists <- list(list1=c(123,234,345), list2=c(45,23,12,78,43,87,123), list3=c(123,432,234,45,23))
unique_genes <- unique(unlist(my.lists))
# set up an empty matrix: one row per list, one column per unique gene
mtx <- matrix(0, nrow=length(my.lists), ncol=length(unique_genes))
rownames(mtx) <- names(my.lists)
colnames(mtx) <- unique_genes
# populate the matrix: 1 if the gene appears in that list, 0 otherwise
for(i in rownames(mtx)){
    mtx[i, colnames(mtx) %in% my.lists[[i]]] <- 1
}
# number of lists containing each gene, most frequent first
freqSorted <- sort(colSums(mtx), decreasing=TRUE)
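
As a possible variation on the code above, the same presence/absence matrix can be built without an explicit loop, and it can then be used to look up which lists contain a given gene. This is a sketch rather than a replacement for the code above; the names presence, freq, shared and lists_with_123 are introduced here for illustration only.

```r
my.lists <- list(list1 = c(123, 234, 345),
                 list2 = c(45, 23, 12, 78, 43, 87, 123),
                 list3 = c(123, 432, 234, 45, 23))
unique_genes <- unique(unlist(my.lists))

# Logical matrix: one row per gene, one column per list.
presence <- sapply(my.lists, function(genes) unique_genes %in% genes)
rownames(presence) <- unique_genes

# Number of lists containing each gene, most frequent first.
freq <- sort(rowSums(presence), decreasing = TRUE)

# Genes found in at least two lists.
shared <- names(freq)[freq >= 2]

# Which lists contain gene 123?
lists_with_123 <- colnames(presence)[presence["123", ]]

print(freq)
print(shared)
print(lists_with_123)
```

Because the matrix records presence rather than raw occurrences, a gene repeated within one list is still counted only once for that list.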