Hi all, I have output from the gene prediction tool getORF, which contains protein sequences along with their FASTA headers. I have many such files (around 2 GB), and I need to compile their output into an Excel sheet with two columns: one containing the FASTA header information and the other containing the protein sequence. The getORF output is pasted below.
The getORF output is pasted here. I need a script that creates an Excel table with two columns (header and sequence).
>Mid_Vagina_WUGI_1_1 [1 - 69]
LLFSVIPQTLCTFLYYLSTKPQK
>Mid_Vagina_WUGI_1_2 [3 - 101]
AILSYSPNPLYISVLPQYQTSKIIHFYYLQHMS
>Mid_Vagina_WUGI_1_3 [2 - 124]
CYSQLFPKPSVHFCTTSVPNLKNNPLLLSTTYVLEAFLCFH
>Mid_Vagina_WUGI_1_4 [105 - 149]
RPFCVFINVNSNKDS
>Mid_Vagina_WUGI_1_5 [73 - 159]
STSIIYNICLRGLFVFSLMLIATKTADFY
>Mid_Vagina_WUGI_1_6 [137 - 187]
QQRQLTFIECLLCVLVF
>Mid_Vagina_WUGI_1_7 [163 - 195]
VPTMCSCVLSF
>Mid_Vagina_WUGI_1_8 [153 - 212]
LLLSAYYVFLCSEFLMNKLG
>Mid_Vagina_WUGI_1_9 [216 - 263]
FLQKKLRTIILLILQV
>Mid_Vagina_WUGI_1_10 [206 - 277]
TWLILTKKATYNYSSHFTSVEKEV
>Mid_Vagina_WUGI_1_11 [202 - 288]
INLVNSYKKSYVQLFFSFYKCRKGGMMLT
>Mid_Vagina_WUGI_1_12 [267 - 359]
KRRYDAYLSSLIKLLLPSHTTKKVVSHQWYQ
>Mid_Vagina_WUGI_1_13 [332 - 388]
KGGFPSMVSVRRKARVFIS
>Mid_Vagina_WUGI_1_14 [363 - 392]
GGKQEFSFHN
>Mid_Vagina_WUGI_1_15 [396 - 425]
VLFCGTFSQK
>Mid_Vagina_WUGI_1_16 [392 - 487]
LSPVLWHIFSEITWRYFLHTNQTWGKDSWKIY
>Mid_Vagina_WUGI_1_17 [301 - 501]
LSCSSQAIPLKRWFPINGISEEESKSFHFITKSCFVAHFLRNNLEIFSTYQPNVGKGLME
NLLDFLI
>Mid_Vagina_WUGI_1_18 [429 - 509]
LGDIFYIPTKRGERTHGKFIRFPDLIN
>Mid_Vagina_WUGI_1_19 [508 - 537]
IKTSNCNYCQ
>Mid_Vagina_WUGI_1_20 [500 - 550]
SDKLKQAIVITANDRTS
>Mid_Vagina_WUGI_1_21 [528 - 575]
LLPMTELPDQNQYLSQ
>Mid_Vagina_WUGI_1_22 [541 - 579]
QNFLTKINICLNE
>Mid_Vagina_WUGI_1_23 [554 - 601]
PKSISVSMSSVWLFSA
>Mid_Vagina_WUGI_1_24 [579 - 626]
VVFGYSQPNQDKQKPG
>Mid_Vagina_WUGI_1_25 [583 - 708]
CLAILSLIKTSRNLDDISSGRSGSKKPSASSGPRDSYNLVHL
>Mid_Vagina_WUGI_1_26 [605 - 751]
SRQAETWMTSAVVGQGQKSLQLPQAPETPTIWFTFKEKSPKSINCVINC
>Mid_Vagina_WUGI_1_27 [755 - 793]
RKFCDCFSHFLCA
>Mid_Vagina_WUGI_1_28 [769 - 819]
LFLSFSVCVDISSLMLR
>Mid_Vagina_WUGI_1_29 [823 - 855]
SIIKVAKYNGK
>Mid_Vagina_WUGI_1_30 [818 - 862]
DNQSSKLQNTMGNRS
>Mid_Vagina_WUGI_1_31 [859 - 897]
KLIFFLSSIMLGY
>Mid_Vagina_WUGI_1_32 [630 - 944]
HQQWSVRVKKAFSFLRPQRLLQFGSPLKKRAPNPLTVLLTARESSVTVSLIFCVRRYFIF
DAEIINHQSCKIQWEIEVDFFFKQHHAGILRPPQFTGFAFQSKVA
>Mid_Vagina_WUGI_1_33 [878 - 979]
AASCWDTEAPSVHRFCFSIKSGLSHHQAIYLLLF
>Mid_Vagina_WUGI_1_34 [901 - 1023]
GPLSSQVLLFNQKWPESSPSNLSTFILIKSSNISSPLPKPA
>Mid_Vagina_WUGI_1_35 [986 - 1024]
NPQIYRLLYPSQH
>Mid_Vagina_WUGI_1_36 [948 - 1025]
VITKQFIYFYFNKILKYIVSSTQAST
>Mid_Vagina_WUGI_1_37 [1024 - 977] (REVERSE SENSE)
VLAWVEETIYLRILLK
>Mid_Vagina_WUGI_1_38 [1025 - 936] (REVERSE SENSE)
GAGLGRGDDIFEDFIKIKVDKLLGDDSGHF
>Mid_Vagina_WUGI_1_39 [967 - 932] (REVERSE SENSE)
INCLVMTQATFD
>Mid_Vagina_WUGI_1_40 [948 - 916] (REVERSE SENSE)
LRPLLIEKQNL
>Mid_Vagina_WUGI_1_41 [912 - 877] (REVERSE SENSE)
TEGASVSQHDAA
>Mid_Vagina_WUGI_1_42 [883 - 833] (REVERSE SENSE)
CCLKKKSTSISHCILQL
>Mid_Vagina_WUGI_1_43 [873 - 826] (REVERSE SENSE)
KKNQLLFPIVFCNFDD
>Mid_Vagina_WUGI_1_44 [932 - 747] (REVERSE SENSE)
LKSKTCELRGPQYPSMMLLKKKINFYFPLYFATLMIDYLSIKDEISTHTENERNSHRTFS
SS
>Mid_Vagina_WUGI_1_45 [750 - 676] (REVERSE SENSE)
QLITQLMDLGLFSLKVNQIVGVSGA
>Mid_Vagina_WUGI_1_46 [826 - 635] (REVERSE SENSE)
LIISASKMKYLRTQKMRETVTELSLAVNNTVNGFGALFFKGEPNCRSLWGLRKLKAFLTL
TDHC
>Mid_Vagina_WUGI_1_47 [645 - 604] (REVERSE SENSE)
PTTADVIQVSACLD
>Mid_Vagina_WUGI_1_48 [689 - 531] (REVERSE SENSE)
ESLGPEEAEGFFDPDRPLLMSSRFLLVLIRLRIAKHYSLRQILILVRKFCHWQ
>Mid_Vagina_WUGI_1_49 [559 - 512] (REVERSE SENSE)
FWSGSSVIGSNYNCLF
>Mid_Vagina_WUGI_1_50 [600 - 490] (REVERSE SENSE)
AENSQTLLIETDIDFGQEVLSLAVITIACFNLSDQEI
>Mid_Vagina_WUGI_1_51 [508 - 476] (REVERSE SENSE)
FIRSGNLINFP
>Mid_Vagina_WUGI_1_52 [527 - 384] (REVERSE SENSE)
LQLLVLIYQIRKSNKFSMSPFPTFGWYVENISKLFLRKCATKQDLVMK
>Mid_Vagina_WUGI_1_53 [483 - 331] (REVERSE SENSE)
IFHESFPHVWLVCRKYLQVISEKMCHKTGLSYEMKTLAFLLTDTIDGKPPF
>Mid_Vagina_WUGI_1_54 [380 - 291] (REVERSE SENSE)
KLLLSSSLIPLMGNHLFSGMAWEEQLNQAA
>Mid_Vagina_WUGI_1_55 [346 - 287] (REVERSE SENSE)
WETTFLVVWLGRSNLIKLLK
>Mid_Vagina_WUGI_1_56 [287 - 258] (REVERSE SENSE)
VSIIPPFLHL
>Mid_Vagina_WUGI_1_57 [300 - 253] (REVERSE SENSE)
SSCLSKHHTSFSTLVK
>Mid_Vagina_WUGI_1_58 [283 - 215] (REVERSE SENSE)
ASYLLFYTCKMRRIIVRSFFCKN
>Mid_Vagina_WUGI_1_59 [211 - 173] (REVERSE SENSE)
PSLFIKNSEHKNT
>Mid_Vagina_WUGI_1_60 [194 - 159] (REVERSE SENSE)
KLRTQEHIVGTQ
>Mid_Vagina_WUGI_1_61 [240 - 133] (REVERSE SENSE)
LYVAFFVRINQVYSLKTQNTRTHSRHSIKVSCLCCY
>Mid_Vagina_WUGI_1_62 [166 - 104] (REVERSE SENSE)
ALNKSQLSLLLLTLMKTQKGL
>Mid_Vagina_WUGI_1_63 [155 - 90] (REVERSE SENSE)
KSAVFVAININENTKRPLRHML
>Mid_Vagina_WUGI_1_64 [49 - 17] (REVERSE SENSE)
GSTEMYRGFGE
>Mid_Vagina_WUGI_1_65 [120 - 13] (REVERSE SENSE)
KHKKASKTYVVDNRSGLFLRFGTEVVQKCTEGLGNN
>Mid_Vagina_WUGI_1_66 [44 - 3] (REVERSE SENSE)
YRNVQRVWGITENS
>Mid_Vagina_WUGI_2_1 [3 - 47]
CHSGYLGIGLSLYLV
>Mid_Vagina_WUGI_2_2 [28 - 57]
VCLCILSESC
>Mid_Vagina_WUGI_2_3 [61 - 96]
ALSMETFKICFG
>Mid_Vagina_WUGI_2_4 [2 - 100]
MSLRVSWYRSVSVSCLNLVEPFQWKHSRFALGK
So you basically just need to replace the right-most space with a comma or tab so you can load it as a csv, or do you need something in addition?
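For reference, here is a minimal sketch of that conversion in Python. It assumes the files look like the sample above: each record is a `>` header line followed by one or more sequence lines (records 17, 32, 44 and 46 wrap onto a second line, so the lines have to be joined). The function names `fasta_records` and `fasta_to_tsv` and the column titles are made up for illustration. It writes tab-separated text, which Excel can open, and streams line by line so large files are not read into memory at once.

```python
import csv
import io


def fasta_records(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            # Sequence lines may be wrapped; collect them for joining.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_tsv(src, dst):
    """Write one 'header<TAB>sequence' row per record; return the record count."""
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(["header", "sequence"])  # hypothetical column titles
    count = 0
    for header, seq in fasta_records(src):
        writer.writerow([header, seq])
        count += 1
    return count


if __name__ == "__main__":
    # Two records copied from the sample output, one of them wrapped.
    sample = io.StringIO(
        ">Mid_Vagina_WUGI_1_44 [932 - 747] (REVERSE SENSE)\n"
        "LKSKTCELRGPQYPSMMLLKKKINFYFPLYFATLMIDYLSIKDEISTHTENERNSHRTFS\n"
        "SS\n"
        ">Mid_Vagina_WUGI_2_1 [3 - 47]\n"
        "CHSGYLGIGLSLYLV\n"
    )
    out = io.StringIO()
    fasta_to_tsv(sample, out)
    print(out.getvalue(), end="")
```

For real files, pass file objects opened with `open(path)` and `open(out_path, "w", newline="")` instead of the `StringIO` objects used in the demo.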
FYI, Excel has a limit of about 1 million rows (1,048,576 per worksheet), so you might be over that.
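If the row limit does turn out to be a problem, one possible workaround is to split the table into several files that each stay under it. The sketch below is only an illustration: the helper name `write_in_chunks`, the `_001.tsv` naming scheme and the column titles are assumptions, and it takes any iterable of (header, sequence) pairs as input.

```python
import csv
import itertools
import os
import tempfile

EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet, title row included


def write_in_chunks(rows, prefix, max_rows=EXCEL_MAX_ROWS - 1):
    """Write (header, sequence) rows to prefix_001.tsv, prefix_002.tsv, ...

    Each file gets a title row plus at most max_rows data rows.
    Returns the list of paths written.
    """
    rows = iter(rows)
    paths = []
    for part in itertools.count(1):
        # Pull the next block of rows; an empty block means we are done.
        chunk = list(itertools.islice(rows, max_rows))
        if not chunk:
            break
        path = f"{prefix}_{part:03d}.tsv"
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
            writer.writerow(["header", "sequence"])  # hypothetical column titles
            writer.writerows(chunk)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Tiny demo: five made-up rows, at most two data rows per file.
    demo_rows = [(f"orf_{i}", "MKV") for i in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        written = write_in_chunks(demo_rows, os.path.join(tmp, "orfs"), max_rows=2)
        print([os.path.basename(p) for p in written])
```

The default of `EXCEL_MAX_ROWS - 1` data rows leaves room for the title row in each file.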
Edit: thanks to whoever reformatted this!