tidypolars_extra.io¶
Classes¶
Read data into a tibble. |
Module Contents¶
- class tidypolars_extra.io.read_data[source]¶
Read data into a tibble.
Formats supported: csv, dta, xls, xlsx, ods, tsv, txt, tex, dat, sav, rds, Rdata, gspread
- Parameters:
fn (str) – Full path to file, including filename. The type of file is inferred from the file extension. Hierarchical headers are accepted (see Notes). To see accepted formats, run: “
read_data.get_accepted_file_formats(True)” To read from google spreadsheet directly, use “credentials” and “url” instead of “fn”. To read from a URL with the file other from a google spreadsheet, use “fn”.credentials (str) – Path to the .json file with Google API credentials to access the spreadsheet (see Notes).
url (str) – Google spreadsheet URL
sheet_name (str | int) – Name of the sheet to load.
cols (list of str) – List with names of the columns to return. Used with .sav files.
sep (str (Default ";")) – Specify the column separator for .csv files
big_data (bool) – If True, uses dask to load the data. Default: False
silently (bool (optional)) – If True, do now show a completion message
sheet_name – Sheet name or index.
n_headers (int) – Used for data with hierarchical header. Number of header rows at the top of the sheet that are header of the columns. See notes. Defaults 0.
header_combine_rule (callable(levels) -> str) – Used for data with hierarchical header. How to combine the list of non-empty levels into a final column name. Default (None) uses “level 1 (<level 2>, <level 3>… <level n>)” If combine=’_’, it uses ‘_’.join(levels).
combine_parenthesis_sep (str) – Used for data with hierarchical header. Used by default combine to separate levels grouped within parenthesis in the column name. Default uses ‘,’: “level 1 (<level 2>, <level 3>… <level n>)”
multi_col_sentinel (Any) – Used for data with hierarchical header. Value used in upper levels to indicate “continuation” of a merged header from the previous column (default: the string “None”).
Notes
Other keyword arguments are accepted based on the underlying method that reads the file, which can be found in their respective documentation provided by the original module.
Extension => underlying method:
.csv => polars.read_csv (uses sep=’,’ as default)
.tsv => polars.read_csv (uses sep=’ ‘ as default)
.dat => polars.read_csv (uses sep=’ ‘ as default)
.txt => polars.read_csv (lines into list)
.xls => pandas.read_excel
.xlsx => pandas.read_excel
.xlt => pandas.read_excel
.xltx => pandas.read_excel
.ods => pandas.read_excel
.dta => pandas.read_stata
.sav => pyreadstat.read_sav
.rds => pyreadr.read_r
.rda => pyreadr.read_r
.Rdata => pyreadr.read_r
Big data is handled with Dask
Hierarchical header:
Some data contains a hierarchical header, i.e., a multi-line header. Here is an example with 2 levels:
|----------------------------------------| | Party | Age | Gender | |---------------|---------------|--------| | Code | Value | value | group | | |------|--------|-------|-------|--------| | 1 | Dem | 23 | 20-29 | M | | 0 | Rep | 33 | 30-39 | F | |----------------------------------------|
When that is the case, the argument
n_headerscan be used to specify the number of header levels, or lines containing header information. The function flattens the levels and combines the information into the header name to maintain a tidy format. The rule is:In upper levels (all rows except the last), values equal to multi_col_sentinel, None, or empty string are treated as “merged” and forward-filled horizontally.
In the last level, None or multi_col_sentinel is treated as “missing label” and is simply ignored for that level.
The example above becomes:
|--------------------------------------------------------------------| | Party (code) | Party (value) | Age (value) | Age (group) | Gender | |---------------|---------------|-------------|-------------|--------| | 1 | Dem | 23 | 20-29 | M | | 0 | Rep | 33 | 30-39 | F | |--------------------------------------------------------------------|
See
header_combine_ruleandcombine_parenthesis_sepfor more settings.Load data from a google spreadsheet:
It requires Google credentials. The settings follow Google requirements and gspread steps. Steps available here: - https://docs.gspread.org/en/latest/oauth2.html#for-end-users-using-oauth-client-id
- Returns:
tibble when the file has no variable or value labels,
(tibble, DATA_LABELS) when it does