tidypolars_extra.io

Classes

read_data

Read data into a tibble.

Module Contents

class tidypolars_extra.io.read_data[source]

Read data into a tibble.

Formats supported: csv, dta, xls, xlsx, ods, tsv, txt, tex, dat, sav, rds, Rdata, gspread

Parameters:
  • fn (str) – Full path to file, including filename. The type of file is inferred from the file extension. Hierarchical headers are accepted (see Notes). To see accepted formats, run: “read_data.get_accepted_file_formats(True)” To read from google spreadsheet directly, use “credentials” and “url” instead of “fn”. To read from a URL with the file other from a google spreadsheet, use “fn”.

  • credentials (str) – Path to the .json file with Google API credentials to access the spreadsheet (see Notes).

  • url (str) – Google spreadsheet URL

  • sheet_name (str | int) – Name of the sheet to load.

  • cols (list of str) – List with names of the columns to return. Used with .sav files.

  • sep (str (Default ";")) – Specify the column separator for .csv files

  • big_data (bool) – If True, uses dask to load the data. Default: False

  • silently (bool (optional)) – If True, do now show a completion message

  • sheet_name – Sheet name or index.

  • n_headers (int) – Used for data with hierarchical header. Number of header rows at the top of the sheet that are header of the columns. See notes. Defaults 0.

  • header_combine_rule (callable(levels) -> str) – Used for data with hierarchical header. How to combine the list of non-empty levels into a final column name. Default (None) uses “level 1 (<level 2>, <level 3>… <level n>)” If combine=’_’, it uses ‘_’.join(levels).

  • combine_parenthesis_sep (str) – Used for data with hierarchical header. Used by default combine to separate levels grouped within parenthesis in the column name. Default uses ‘,’: “level 1 (<level 2>, <level 3>… <level n>)”

  • multi_col_sentinel (Any) – Used for data with hierarchical header. Value used in upper levels to indicate “continuation” of a merged header from the previous column (default: the string “None”).

Notes

Other keyword arguments are accepted based on the underlying method that reads the file, which can be found in their respective documentation provided by the original module.

Extension => underlying method:

  • .csv => polars.read_csv (uses sep=’,’ as default)

  • .tsv => polars.read_csv (uses sep=’ ‘ as default)

  • .dat => polars.read_csv (uses sep=’ ‘ as default)

  • .txt => polars.read_csv (lines into list)

  • .xls => pandas.read_excel

  • .xlsx => pandas.read_excel

  • .xlt => pandas.read_excel

  • .xltx => pandas.read_excel

  • .ods => pandas.read_excel

  • .dta => pandas.read_stata

  • .sav => pyreadstat.read_sav

  • .rds => pyreadr.read_r

  • .rda => pyreadr.read_r

  • .Rdata => pyreadr.read_r

Big data is handled with Dask

Hierarchical header:

Some data contains a hierarchical header, i.e., a multi-line header. Here is an example with 2 levels:

|----------------------------------------|
|     Party     |      Age      | Gender |
|---------------|---------------|--------|
| Code | Value  | value | group |        |
|------|--------|-------|-------|--------|
|    1 | Dem    | 23    | 20-29 |  M     |
|    0 | Rep    | 33    | 30-39 |  F     |
|----------------------------------------|

When that is the case, the argument n_headers can be used to specify the number of header levels, or lines containing header information. The function flattens the levels and combines the information into the header name to maintain a tidy format. The rule is:

  • In upper levels (all rows except the last), values equal to multi_col_sentinel, None, or empty string are treated as “merged” and forward-filled horizontally.

  • In the last level, None or multi_col_sentinel is treated as “missing label” and is simply ignored for that level.

The example above becomes:

|--------------------------------------------------------------------|
| Party (code)  | Party (value) | Age (value) | Age (group) | Gender |
|---------------|---------------|-------------|-------------|--------|
|    1          | Dem           | 23          | 20-29       |  M     |
|    0          | Rep           | 33          | 30-39       |  F     |
|--------------------------------------------------------------------|

See header_combine_rule and combine_parenthesis_sep for more settings.

Load data from a google spreadsheet:

It requires Google credentials. The settings follow Google requirements and gspread steps. Steps available here: - https://docs.gspread.org/en/latest/oauth2.html#for-end-users-using-oauth-client-id

Returns:

  • tibble when the file has no variable or value labels,

  • (tibble, DATA_LABELS) when it does

get_accepted_file_formats()[source]
read_Rdata()[source]
read_csv()[source]
read_dta()[source]
read_gspread()[source]
read_json()[source]

Read a JSON or NDJSON file

read_parquet()[source]

Read a Parquet file

read_sav()[source]
read_xls()[source]