What is the best way to open an Excel file containing info on over 6 million SNPs? Each row is a SNP, with its relative p-value and other info.
I'd like to split the file into several different files.
Is there a way to do it?
Thank you
It seems Excel has a limit of 1,048,576 rows so I suppose your SNPs are split across different sheets.
In case it's useful, I wrote an excelToCsv converter. It's Java, so it should work on any system. To convert all the sheets you could do:
excelToCsv -f -i snps.xlsx > snps.csv
Using pandas in Python to convert to both comma- and tab-separated text files:
import pandas as pd
df = pd.read_excel('my_file.xlsx')  # reads the first sheet by default
df.to_csv('my_file.csv', index=False)
df.to_csv('my_file.tsv', index=False, sep='\t')
For Excel files with multiple sheets:
import pandas as pd
# sheet_name=None returns a dict of DataFrames, one per sheet; concat stacks them
df = pd.concat(pd.read_excel('my_file.xlsx', sheet_name=None), ignore_index=True)
df.to_csv('my_file.csv', index=False)
df.to_csv('my_file.tsv', index=False, sep='\t')
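Since the original question was about splitting the data into several files, here is a minimal sketch of chunk-wise writing with pandas. The stand-in DataFrame, the `chunk_size` of 4, and the `snps_part` file names are illustrative; in practice `df` would come from `pd.read_excel` as in the snippets above and the chunk size would be closer to a million rows.

```python
import pandas as pd

# Stand-in for the real data; in practice this would come from
# pd.concat(pd.read_excel('my_file.xlsx', sheet_name=None), ignore_index=True)
df = pd.DataFrame({'snp': [f'rs{i}' for i in range(10)],
                   'pvalue': [0.05] * 10})

chunk_size = 4  # rows per output file; something like 1_000_000 for real data
paths = []
for i, start in enumerate(range(0, len(df), chunk_size)):
    path = f'snps_part{i}.csv'
    # iloc slicing past the end is safe: the last chunk is simply shorter
    df.iloc[start:start + chunk_size].to_csv(path, index=False)
    paths.append(path)
```

Each output file keeps the header row, so the parts can be read back independently.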
I kind of advise against pandas for this. For example, see this question of mine on stackoverflow. In general, I think pandas and R-based solutions don't handle input well unless it is already nicely tabular and well behaved.
While writing excelToCsv I realised there are lots of funny things happening with Excel conversion. For example, 0.0123 may be exported as 1E-02, even when exporting using Excel itself (!!). If a number is formatted with commas as thousands separators (e.g. 1,000,000), Excel exports it as the string "1,000,000"; good luck reading it back as a number. Dates are a mess. My program should handle these cases quite well (as far as I can tell...)
The geeky detail is that I use the Java POI package which, it seems to me, is way more sophisticated than the various Python/R counterparts and gives a lot of low-level options to manipulate cells. In general, it seems to do "the right thing". But basically, by default my program gives output that is very similar to a manual export from Excel and, therefore, could result in data loss. Better, I think, is to use the -f option, which discards all the formatting from numeric cells: you avoid 0.0123 becoming 1E-02, 1,000 staying "1,000", FALSE/TRUE becoming sometimes strings and sometimes numbers, and possibly other quirks I don't know about. Optionally, dates are converted to ISO format, again using the POI engine. The relevant bit of code is around here https://github.com/dariober/excelToCsv/blob/master/src/main/java/excelToCsv/Main.java#L170.
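If you do end up with thousands-separated strings like "1,000,000" in an exported CSV, pandas can recover the numbers at read time via the `thousands` parameter of `read_csv`. A small sketch with made-up data:

```python
import io
import pandas as pd

# A CSV as Excel might export it, with thousands separators baked into the text
csv_text = 'snp,count\nrs1,"1,000,000"\nrs2,"2,500"\n'

# thousands=',' strips the separators so the column parses as numeric
df = pd.read_csv(io.StringIO(csv_text), thousands=',')
```

Without `thousands=','` the column would come back as strings.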
I like this answer, as a way to import an Excel file into a Python script or notebook for analysis. @Mensur Could you comment on how well Pandas read_excel() works? Does it have strange formatting issues with certain types of cells, or fonts, or characters from Excel?
It has worked in my hands, but keep in mind that many people use space characters, commas instead of decimal points, and other notations that can throw off any conversion effort, no matter how carefully executed. Fortunately there are many options to customize the import - see here - and pandas will in most cases make a good guess regarding the type of data in each column. In the case of multiple sheets with different data structures, I think it is safest to import them one at a time.
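To illustrate two of those customization options: `dtype` and `na_values` are keyword arguments shared by `read_excel` and `read_csv`, so a CSV keeps this sketch dependency-free. The column names and sentinel values here are made up.

```python
import io
import pandas as pd

# 'NA' and '.' are treated as missing, and the SNP id column is forced to
# string so identifiers like '001' keep their leading zeros.
text = 'snp,pvalue\n001,0.05\n002,NA\n003,.\n'
df = pd.read_csv(io.StringIO(text),
                 dtype={'snp': str},
                 na_values=['NA', '.'])
```

Without `dtype={'snp': str}`, pandas would guess the id column is numeric and drop the leading zeros.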
Many people I know think that dask is even better for these conversions than pandas. I don't have much experience with it, but there are solutions out there for both options.
If you cannot open it to export to another format (csv, tab-delimited txt, etc.) for command-line processing, try an online conversion server able to handle your file size. What are the criteria for splitting? Just blocks of rows, e.g., 1-1000, 1001-2000 [...]? If so you can try this tool: https://splitmyexcelfile.com/
I found this tidyverse tutorial helpful: https://readxl.tidyverse.org/
You can first list the sheets of the Excel file (if there are multiple), and then decide which one you want to load.
There are parameters to limit the number of rows the function reads (n_max), so you can implement chunk-wise loading if memory becomes a concern.
Unfortunately there is no function to automatically send an email with a proper rant to the authors of this Excel file of yours, telling them what a terrible idea it is to store large data in such a format rather than a simple compressed text-based file. It is a pity :-P
Convert it (i.e., save it as a tab-delimited txt file) and process it in Linux using awk or such?
Is it an xls/xlsx file, or simply a csv file that your computer opens in Excel?
Are you on a Mac or Windows computer? On a Mac, you have access to a wide range of Unix tools, including split, which will split up a large file into several small ones (see here). Your process might be:
1. Convert the exported text file to Unix line endings (dos2unix)
2. Count the lines (wc -l)
3. Split the file into chunks (split -l N, replacing N with the desired number of lines per chunk)
If you're on Windows, then you might install a Unix environment (e.g., Cygwin) to get access to standard line-processing tools. Windows has PowerShell, but it isn't standard outside of Windows and these steps are somewhat more involved.
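When Unix tools are not at hand (e.g. plain Windows without Cygwin), the split step can be approximated in Python. This is a rough stand-in for `split -l N`; the function name and `chunk` file prefix are made up for the sketch.

```python
def split_by_lines(path, lines_per_file, prefix='chunk'):
    """Write `path` out as prefix000.txt, prefix001.txt, ... with at most
    lines_per_file lines each, roughly like `split -l N path`."""
    out_paths = []
    chunk, n = [], 0

    def flush():
        nonlocal chunk, n
        out = f'{prefix}{n:03d}.txt'
        with open(out, 'w') as oh:
            oh.writelines(chunk)
        out_paths.append(out)
        chunk, n = [], n + 1

    with open(path) as fh:
        for line in fh:  # streams line by line, so memory use stays flat
            chunk.append(line)
            if len(chunk) == lines_per_file:
                flush()
        if chunk:  # last, possibly shorter, chunk
            flush()
    return out_paths
```

Unlike `split -l`, this sketch does not replicate a header row into each part; for a file with a header you would handle the first line separately.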