find all the genes present in all files
8.6 years ago
bulbul ▴ 10

I have six files, each containing hundreds of thousands of gene IDs. I need to find all the genes that are present in every file. If a gene is missing from even a single file, I have to delete it. How should I do this?

perl • 1.5k views

If you give details about the content of your files, I can send you a bash script that will give you all the genes common to the six files. Run head on your files and describe what they contain.

8.6 years ago
natasha.sernova ★ 4.0k

My algorithm would be the following:

Make a hash of all the DISTINCT gene names from all files (these gene names will be your keys).

Then check each file for a given gene, for example gene1. Do this 6 times, once per file.

Count how many of the 6 files contain this gene.

Assign this count (1-6) to the gene name (the key); it becomes the gene's value in the hash.

Do the same for the next gene, and repeat until you have gone through all your keys (gene names). Your hash will look like this:

key    value
gene1  4
gene2  2
gene3  6
gene4  5
...

Select only the genes (keys) whose value is 6 from this hash; these are the genes present in all six files.
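The counting approach above could be sketched roughly as follows. This is only an illustration in Python (a dict/Counter plays the role of the hash), and it assumes each file holds one gene ID per line, which the question does not actually state. The function names and file names are made up for the example.

```python
from collections import Counter


def read_gene_ids(path):
    """Return the set of distinct gene IDs in one file (ASSUMES one ID per line)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def genes_in_all_files(paths):
    """Count in how many files each gene ID occurs; keep those found in every file."""
    counts = Counter()
    for path in paths:
        # Use a set per file so a gene repeated within one file is counted once.
        counts.update(read_gene_ids(path))
    return {gene for gene, n in counts.items() if n == len(paths)}


def filter_file(path, keep, out_path):
    """Write a copy of `path` containing only the lines whose gene ID is in `keep`."""
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip() in keep:
                dst.write(line)
```

With six hypothetical files named file1.txt to file6.txt, you would call genes_in_all_files on the list of those names and then filter_file on each one to drop the genes that are not shared. If the gene ID is only one column of a larger table, the line.strip() parts would need to be replaced by something that extracts that column.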

See also this post, it may help to do it faster:

Finding common genes
