Question

Extracting list of identical items from several excel files

0

Entering edit mode

13 months ago

otieno43 ▴ 30

Hi guys,

I have 10 excel files with several columns and rows. The column headers are similar in all file but for the rows some items are similar while others are not. Below is an excerpt for one of the files:

tag id  score   probability of true positive    read count  significant mature miRBase miRNA    example miRBase miRNA with the same seed    consensus mature sequence   consensus star sequence consensus precursor sequence    precursor coordinate
scf7180000638994_294    40938.9 76 +/- 17%  80293   yes Gmo-Mir-8-5p    dps-miR-8   uaauacugucagguaaagauguc caucuuaccgggcagcauuaga  caucuuaccgggcagcauuagauuucauugaacuaaauuucuaauacugucagguaaagauguc    GmorY1_scf7180000638994:57393..57457:-
scf7180000646270_8382   21683.4 76 +/- 17%  42524   yes Gmo-Mir-1-5p    dps-miR-1   uggaauguaaagaaguauggag  ccaugcuuccuugcauucaaua  ccaugcuuccuugcauucaauaguauuuaaauucgcauauggaauguaaagaaguauggag   GmorY1_scf7180000646270:15242..15303:-
scf7180000641396_3727   16329.7 76 +/- 17%  32022   yes Gmo-Mir-276-2-5p    dps-miR-276a    uaggaacuucauaccgugcucu  agcgagguauagaguuccuacg  agcgagguauagaguuccuacguuauaauauauuaucuguaggaacuucauaccgugcucu   GmorY1_scf7180000641396:101253..101314:+
scf7180000642685_5346   14242.6 76 +/- 17%  27930   yes Gmo-Mir-375-5p  dps-miR-375-3p  uuuguucguuuggcuuaaguu   acuuaagccaagugaauacaaaca    acuuaagccaagugaauacaaacauauaaaaauugauuccauacgguuuguucguuuggcuuaaguu GmorY1_scf7180000642685:7898..7965:-
scf7180000648092_10507  6965.5  76 +/- 17%  13654   yes Gmo-Mir-2a-5-5p dps-miR-2a  uaucacagccagcuuugaugagcu    ccucaucaagugguugugaua   ccucaucaagugguugugauauggauuaucaacgcauaucacagccagcuuugaugagcu    GmorY1_scf7180000648092:46212..46272:+
scf7180000648092_10505  6279.1  76 +/- 17%  12309   yes Gmo-Mir-2a-3-5p dps-miR-2a  uaucacagccagcuuugaugagcu    ccucacaaaguggcugugaaaug ccucacaaaguggcugugaaauguugugaauuauaugauuuuaguucuaauucaaucuaaaauugcauaucacagccagcuuugaugagcu GmorY1_scf7180000648092:45974..46065:+
scf7180000652159_25218  5249.2  76 +/- 17%  10289   yes Gmo-let-7-5p    dps-let-7   ugagguaguagguuguauagu   cuauacaacgugcuagcuuucu  ugagguaguagguuguauaguaguaaauuugaaaacauuaaacauacuauacaacgugcuagcuuucu    GmorY1_scf7180000652159:291524..291592:+
scf7180000647953_9806   4570.4  76 +/- 17%  8958    yes Gmo-Mir-281-1-5p    dps-miR-281 ugucauggaauugcucucuuug  aagagagcuguccgucgacaguc aagagagcuguccgucgacaguccaguucagacauauuaauacugucauggaauugcucucuuug   GmorY1_scf7180000647953:3329..3394:-
scf7180000652160_25883  4094.1  76 +/- 17%  8025    yes Gmo-Mir-33-5p   ame-miR-33-5p   gugcauuguagucgcauuguc   caauacuucugcaaugcaaacu  gugcauuguagucgcauuguccguguuauaaggaagaagucggccaauacuucugcaaugcaaacu  GmorY1_scf7180000652160:6155773..6155839:-
scf7180000648035_10282  4043.6  76 +/- 17%  7925    yes Gmo-Mir-999-5p  aae-miR-999 uguuaacuguaagacugugucu  acauagucguacagaaaauauu  acauagucguacagaaaauauuguguuauacgaguccaauguuaacuguaagacugugucu   GmorY1_scf7180000648035:31983..32044:-
scf7180000648092_10503  3907.2  76 +/- 17%  7656    yes Gmo-Mir-2a-2-5p dps-miR-2a  uaucacagccagcuuugaggagc uucuucaaaguggcugcgaaaug uucuucaaaguggcugcgaaauguuguacacagcuagauuuaauauucauaucacagccagcuuugaggagc    GmorY1_scf7180000648092:45766..45838:+

I would like to consolidate the row items which are similar based on the following columns:

mature mirbase MiRNA
example miRBase with the same seed
consensus mature sequence
consensus star sequence
consensus precursor sequence

I am wondering if there is a script (AWK, GREP, or just python) one can use do this at once. Some help please.

from excel items Extract file • 771 views

ADD COMMENT • link updated 13 months ago by DonPhager • 0 • written 13 months ago by otieno43 ▴ 30

0

Entering edit mode

By excel file do you mean an excel -specific format like xlsx?

Also, can you expand on what you mean by consolidate the rows that are similar between the different files?

ADD REPLY • link 13 months ago by rpolicastro 13k

0

Entering edit mode

Others might not be in favor of this but for me as a complete beginner in this field, it was able to help me with tasks like this: ChatGPT. It might not be the most elegant way but it was able to give me some good working scripts for tasks like this and it worked pretty good. Sometimes you have to adjust a bit but if you are not that experienced with informatics and shell scripting like me, it is a good option.

ADD REPLY • link 13 months ago by DonPhager • 0

score 0 · Answer 1 · 2023-11-26

I am surprised nobody told you yet not to use Excel in bioinformatics. There are jokesters on this site that take most opportunities to do so. And they are right. That aside, pandas can do what you want. It has a function to read Excel files, and the rest is simply merging multiple files by columns and throwing away rows that are not found in both files.

import pandas as pd

columns=['mature mirbase MiRNA','example miRBase with the same seed']
f1 = pd.read_excel(file_name_1)
f2 = pd.read_excel(file_name_2)
f3 = f1.merge(f2, left_on=columns, right_on=columns, how='outer')
f3.dropna(axis=0, inplace=True)
f3.to_excel(output_file, index=False)

I was too lazy to paste all the column names above, and you need to enter proper names for file_name_1/2 and output_file.