I have six files. Each file has lakhs (hundreds of thousands) of gene_ids. I need to find all the genes that are present in every file. If a gene is missing from even a single file, I have to delete it. How should I do this?
My algorithm would be the following:
Make a hash whose keys are all the DIFFERENT gene names from all six files.
Then, for each gene (for example, gene1), check whether it is present in each file. Do this 6 times, once per file.
Count the number of files in which this particular gene occurs.
Assign this count (1-6) to the gene name (the key);
it becomes its value in the hash.
Do the same for the next gene, and repeat until you run out of keys (gene names). Your hash will look like this:
key value
gene1 4
gene2 2
gene3 6
gene4 5
...
Select only the genes (keys) whose value is 6 from this hash; those are the genes present in all six files.
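The counting idea above could be sketched in Python roughly as follows. This is only a minimal sketch, not a tested pipeline: it assumes each file holds one gene_id per line, which the question does not confirm, and the function names read_gene_ids and genes_in_all_files are made up for illustration.

```python
from collections import Counter


def read_gene_ids(path):
    """Return the distinct gene IDs in one file.

    Assumption: one gene ID per line; blank lines are ignored.
    """
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def genes_in_all_files(paths):
    """Return the sorted gene IDs that occur in every file in `paths`."""
    counts = Counter()
    for path in paths:
        # Using a set per file means a gene adds at most 1 per file,
        # so each count is "number of files containing this gene",
        # even if the gene is repeated inside a single file.
        counts.update(read_gene_ids(path))
    # Keep only the genes seen in all files (value 6 for six files).
    return sorted(gene for gene, n in counts.items() if n == len(paths))
```

You would call it with your own six paths, for example genes_in_all_files(["file1.txt", "file2.txt", ...]) where the file names are placeholders. If your gene_ids sit in one column of a larger table, read_gene_ids would need to split each line and pick that column instead.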
See also this post, it may help to do it faster:
If you give details about the content of your files, I can send you a bash script that will give you all the genes that are common in the six files. Do a head of your files and describe