Question

find all the genes present in all files


8.6 years ago

bulbul ▴ 10

I have six files, each containing lakhs (hundreds of thousands) of gene_id entries. I need to find all the genes that are present in every file. If a gene is missing from even a single file, I have to delete it. How should I do this?

perl • 1.5k views

updated 8.6 years ago by natasha.sernova ★ 4.0k • written 8.6 years ago by bulbul ▴ 10


If you give details about the content of your files, I can send you a bash script that will give you all the genes that are common to the six files. Run head on your files and describe their format.

8.6 years ago by Antonio R. Franco ★ 5.2k

score 0 · Answer 1 · 2016-05-11

My algorithm would be the following:

Build a hash whose keys are all the DISTINCT gene names found across all your files.

Then check each file for a given gene, for example gene1. Do this 6 times, once per file.

Count the number of files in which that gene occurs. Count a gene at most once per file, even if it is listed there several times; otherwise a gene repeated within one file could reach 6 without being present in all of them.

Store this count (1-6) as the value for that gene name (the key) in the hash.

Do the same for the next gene, and repeat until you have gone through all your keys (gene names). Your hash will look like this:

key     value
gene1   4
gene2   2
gene3   6
gene4   5
...

Select only the genes (keys) whose value is 6 in this hash; these are the genes present in all six files.
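As a rough illustration of the counting approach above, here is a minimal sketch in Python, where a Counter stands in for the Perl hash. It assumes each file lists one gene ID per line, which the question does not confirm; the function names and the file names in the usage comment are made up for illustration.

```python
from collections import Counter


def read_gene_ids(path):
    """Return the set of distinct gene IDs in one file.

    Assumption: one gene ID per line; blank lines are skipped.
    """
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def genes_in_all_files(paths):
    """Count how many files each gene appears in; keep genes seen in every file."""
    paths = list(paths)
    file_counts = Counter()  # key: gene ID, value: number of files containing it
    for path in paths:
        # Using a set per file means a gene repeated inside one file counts once.
        file_counts.update(read_gene_ids(path))
    return sorted(gene for gene, n in file_counts.items() if n == len(paths))


# Hypothetical usage with six files named file1.txt ... file6.txt:
# common = genes_in_all_files(f"file{i}.txt" for i in range(1, 7))
```

Because each file is reduced to a set first, keeping the genes whose count equals the number of files gives the same result as intersecting the six sets directly. The counting version also tells you how many files each of the other genes appears in.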

See also this post; it may help you do it faster:

Finding common genes