tidypolars_extra¶
Submodules¶
Attributes¶
Classes¶
- DescCol – Expressions that can be used in various contexts.
- TibbleGroupBy – Starts a new GroupBy operation.
- read_data – Read data into a tibble.
- tibble – A data frame object that provides methods familiar to R tidyverse users.
Functions¶
- Absolute value
- Apply a function across a selection of columns
- Convert column to boolean. Alias to as_logical (R naming).
- Convert to factor. Alias for as_factor
- Convert to string. Defaults to Utf8.
- Convert a string to a Date
- Convert a string to a Datetime
- Convert to factor (R naming), equivalent to Enum or Categorical
- Convert to float. Defaults to Float64.
- Convert to integer. Defaults to Int64.
- Convert to a boolean (polars) or 'logical' (R naming)
- Convert column to string. Alias to as_character (R naming).
- Test if values of a column are between two values
- Case when
- General type conversion.
- Round date up to the nearest unit
- Coalesce missing values
- Contains a literal string
- Find the correlation of two columns
- Number of observations in each group
- Find the covariance of two columns
- Compute cumulative distribution (proportion of values <= current value)
- Cumulative maximum
- Cumulative minimum
- Cumulative product
- Cumulative sum
- Create a duration of n days
- Mark a column to order in descending order
- Compute time differences in specified units
- Round the datetime
- Ends with a suffix
- Selects all columns
- Collapse multiple factor levels into one
- Reorder factor levels by frequency (most common first)
- Collapse least frequent factor levels into 'Other'
- Manually recode factor levels
- Reverse factor level order
- Get first value
- Round numbers down to the lower integer
- Round date down to the nearest unit
- Convert from pandas DataFrame to tibble
- Convert from polars DataFrame to tibble
- Extract the hour from a datetime
- Create a duration of n hours
- If Else
- Compute the interquartile range (Q3 - Q1)
- Test if values are finite
- Test if values are in a list
- Test if values are infinite
- Negate a boolean expression
- Test if values are not in a list
- Test if values are not null
- Test if values are null
- Get lagging values
- Get last value
- Get leading values
- Number of observations in each group.
- Compute the natural logarithm of a column
- Compute the base 10 logarithm of a column
- Compute the median absolute deviation
- Create a date object
- Create a datetime object
- Apply function by row
- Matches pattern
- Get column max
- Extract the day of the month from a date (1 to 31).
- Get column mean
- Get column median
- Create a duration of n microseconds
- Create a duration of n milliseconds
- Get column minimum
- Extract the minute from a datetime
- Create a duration of n minutes
- Compute the statistical mode (most frequent value)
- Extract the month from a date
- Number of observations in each group
- Get number of distinct values in a column
- Count the number of null/missing values in a column
- Return the current datetime as a polars literal
- Divide values into n roughly equal groups
- Concatenate strings together
- Concatenate strings together with no separator
- Compute the percentage of null/missing values in a column
- Compute percent rank (values between 0 and 1)
- Get number of distinct values in a column
- Extract the quarter from a date
- Assigns a minimum rank to each element in the input list, handling ties by assigning the same (minimum) rank to all tied values
- Replicate the values in x
- Replace null values
- Round a column to the specified number of decimal places
- Return row number
- Standardize the input by scaling it to a mean of 0 and a standard deviation of 1.
- Get column standard deviation
- Extract the second from a datetime
- Create a duration of n seconds
- Get column square root
- Starts with a prefix
- Concatenate strings together.
- Count occurrences of a pattern in a string
- Detect the presence or absence of a pattern in a string
- Duplicate/repeat a string
- Detect the presence or absence of a pattern at the end of a string.
- Extract the target capture group from provided patterns
- Extract all matches of a pattern
- Length of a string
- Pad a string to a specified width
- Removes the first matched patterns in a string
- Removes all matched patterns in a string
- Replaces the first matched patterns in a string
- Replaces all matched patterns in a string
- Split a string by a pattern
- Remove leading/trailing whitespace and collapse internal whitespace
- Detect the presence or absence of a pattern at the beginning of a string.
- Extract portion of string based on start and end inputs
- Convert case of a string
- Convert string to Title Case
- Convert case of a string
- Trim whitespace
- Split string
- Get column sum
- Return the current date as a polars literal
- Get column variance
- Extract the weekday from a date (Sunday = 1 to Saturday = 7).
- Extract the week from a date
- Create a duration of n weeks
- Compute weighted mean
- Select columns by type using a string
- Extract the day of the year from a date (1 to 366).
- Extract the year from a date
- Standardize to z-scores (alias for scale)
Package Contents¶
- class tidypolars_extra.DescCol[source]¶
Bases:
polars.Expr
Expressions that can be used in various contexts.
- class tidypolars_extra.TibbleGroupBy(df, by, *args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.dataframe.group_by.GroupBy
Starts a new GroupBy operation.
Utility class for performing a group by operation over the given DataFrame.
Generated by calling df.group_by(…).
- Parameters:
df – DataFrame to perform the group by operation over.
*by – Column or columns to group by. Accepts expression input. Strings are parsed as column names.
maintain_order – Ensure that the order of the groups is consistent with the input data. This is slower than a default group by.
predicates – Predicate expressions to filter groups after aggregation.
**named_by – Additional column(s) to group by, specified as keyword arguments. The columns will be named as the keyword used.
- by¶
- df¶
- class tidypolars_extra.read_data[source]¶
Read data into a tibble.
Formats supported: csv, dta, xls, xlsx, ods, tsv, txt, tex, dat, sav, rds, Rdata, gspread
- Parameters:
fn (str) – Full path to file, including filename. The type of file is inferred from the file extension. Hierarchical headers are accepted (see Notes). To see accepted formats, run: read_data.get_accepted_file_formats(True). To read from a google spreadsheet directly, use “credentials” and “url” instead of “fn”. To read from a URL with a file other than a google spreadsheet, use “fn”.
credentials (str) – Path to the .json file with Google API credentials to access the spreadsheet (see Notes).
url (str) – Google spreadsheet URL
sheet_name (str | int) – Name or index of the sheet to load.
cols (list of str) – List with names of the columns to return. Used with .sav files.
sep (str (Default ";")) – Specify the column separator for .csv files
big_data (bool) – If True, uses dask to load the data. Default: False
silently (bool (optional)) – If True, do not show a completion message
n_headers (int) – Used for data with hierarchical header. Number of header rows at the top of the sheet that are headers of the columns. See Notes. Defaults to 0.
header_combine_rule (callable(levels) -> str) – Used for data with hierarchical header. How to combine the list of non-empty levels into a final column name. Default (None) uses “level 1 (<level 2>, <level 3>… <level n>)” If combine=’_’, it uses ‘_’.join(levels).
combine_parenthesis_sep (str) – Used for data with hierarchical header. Used by default combine to separate levels grouped within parenthesis in the column name. Default uses ‘,’: “level 1 (<level 2>, <level 3>… <level n>)”
multi_col_sentinel (Any) – Used for data with hierarchical header. Value used in upper levels to indicate “continuation” of a merged header from the previous column (default: the string “None”).
Notes
Other keyword arguments are accepted based on the underlying method that reads the file, which can be found in their respective documentation provided by the original module.
Extension => underlying method:
.csv => polars.read_csv (uses sep=’,’ as default)
.tsv => polars.read_csv (uses sep='\t' as default)
.dat => polars.read_csv (uses sep=’ ‘ as default)
.txt => polars.read_csv (lines into list)
.xls => pandas.read_excel
.xlsx => pandas.read_excel
.xlt => pandas.read_excel
.xltx => pandas.read_excel
.ods => pandas.read_excel
.dta => pandas.read_stata
.sav => pyreadstat.read_sav
.rds => pyreadr.read_r
.rda => pyreadr.read_r
.Rdata => pyreadr.read_r
Big data is handled with Dask
Hierarchical header:
Some data contains a hierarchical header, i.e., a multi-line header. Here is an example with 2 levels:
|----------------------------------------|
| Party          | Age           | Gender |
|---------------|---------------|--------|
| Code | Value  | value | group |        |
|------|--------|-------|-------|--------|
| 1    | Dem    | 23    | 20-29 | M      |
| 0    | Rep    | 33    | 30-39 | F      |
|----------------------------------------|
When that is the case, the argument n_headers can be used to specify the number of header levels, i.e., lines containing header information. The function flattens the levels and combines the information into the header name to maintain a tidy format. The rule is:
- In upper levels (all rows except the last), values equal to multi_col_sentinel, None, or empty string are treated as “merged” and forward-filled horizontally.
- In the last level, None or multi_col_sentinel is treated as a “missing label” and is simply ignored for that level.
The example above becomes:
|--------------------------------------------------------------------|
| Party (code)  | Party (value) | Age (value) | Age (group) | Gender |
|---------------|---------------|-------------|-------------|--------|
| 1             | Dem           | 23          | 20-29       | M      |
| 0             | Rep           | 33          | 30-39       | F      |
|--------------------------------------------------------------------|
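The flattening rule can be sketched in plain Python (an illustrative simplification, not the library's implementation; the function name flatten_header is hypothetical):

```python
def flatten_header(levels, sentinel="None"):
    """Flatten a hierarchical header into single column names.

    levels: list of header rows, top to bottom, one list of cells per row.
    Upper-level cells equal to the sentinel, None, or '' are forward-filled;
    missing labels in lower levels are dropped from the combined name.
    """
    n_cols = len(levels[-1])
    # Forward-fill "merged" cells in all upper levels.
    filled = []
    for row in levels[:-1]:
        out, last = [], ""
        for cell in row:
            if cell in (sentinel, None, ""):
                out.append(last)
            else:
                last = cell
                out.append(cell)
        filled.append(out)
    # Combine levels: "level 1 (<level 2>, <level 3> ... <level n>)".
    names = []
    for i in range(n_cols):
        top = filled[0][i] if filled else ""
        rest = [row[i] for row in filled[1:]] + [levels[-1][i]]
        rest = [r for r in rest if r not in (sentinel, None, "")]
        names.append(f"{top} ({', '.join(rest)})" if rest else top)
    return names

# The two-level example from the docs above:
header = [
    ["Party", None, "Age", None, "Gender"],
    ["Code", "Value", "value", "group", None],
]
flat = flatten_header(header)
```

With the example header, flat becomes ['Party (Code)', 'Party (Value)', 'Age (value)', 'Age (group)', 'Gender'] — one tidy name per column.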
See header_combine_rule and combine_parenthesis_sep for more settings.
Load data from a google spreadsheet:
It requires Google credentials. The settings follow Google requirements and gspread steps. Steps available here: - https://docs.gspread.org/en/latest/oauth2.html#for-end-users-using-oauth-client-id
- Returns:
tibble when the file has no variable or value labels,
(tibble, DATA_LABELS) when it does
- class tidypolars_extra.tibble(*args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.DataFrame
A data frame object that provides methods familiar to R tidyverse users.
- anti_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform an anti join (keep rows without a match in df)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that do not have a match in df.
- Return type:
Examples
>>> df1.anti_join(df2, on = 'x')
- arrange(*args)[source]¶
Arrange/sort rows
- Parameters:
*args (str) – Columns to sort by
Examples
>>> df = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> # Arrange in ascending order
>>> df.arrange('x', 'y')
>>> # Arrange some columns descending
>>> df.arrange(tp.desc('x'), 'y')
- Returns:
Original tibble ordered by
args.
- Return type:
tibble
- assert_no_nulls(*cols)[source]¶
Assert that specified columns contain no null values
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If any null values are found.
Examples
>>> df.assert_no_nulls('x', 'y')
- assert_unique(*cols)[source]¶
Assert that specified columns have unique combinations
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If duplicate combinations are found.
Examples
>>> df.assert_unique('id')
- bind_cols(*args)[source]¶
Bind data frames by columns
- Parameters:
*args (tibble) – Data frame to bind
- Returns:
The original tibble with added columns from the other tibble specified in
args.
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'a': ['c', 'c', 'c'], 'b': range(4, 7)})
>>> df1.bind_cols(df2)
- bind_rows(*args)[source]¶
Bind data frames by row
- Parameters:
*args (tibble, list) – Data frames to bind by row
- Returns:
The original tibble with added rows from the other tibble specified in
args.
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'x': ['c', 'c', 'c'], 'y': range(4, 7)})
>>> df1.bind_rows(df2)
- clean_names(case='snake')[source]¶
Standardize column names
- Parameters:
case (str) – Case style for column names. Options: ‘snake’ (default), ‘lower’, ‘upper’.
- Returns:
A tibble with cleaned column names.
- Return type:
Examples
>>> df = tp.tibble(**{"First Name": [1], "Last.Name": [2], "AGE (years)": [30]})
>>> df.clean_names()
- colnames(regex='.', type=None, include_factor=True)[source]¶
Return the names of the columns in self that match ‘regex’.
- Parameters:
regex (str) – Regular expression that column names must match. Defaults to ‘.’ (all columns).
type (str) – If provided, restrict the result to columns of that type (e.g., numeric or string).
include_factor (bool) – When type='string', whether to include factor columns.
- complete(*cols, fill=None)[source]¶
Complete a DataFrame with all combinations of specified columns
- Parameters:
*cols (str) – Column names to find all combinations of.
fill (dict, optional) – Dictionary of column names to fill values for missing combinations.
- Returns:
A tibble with all combinations of the specified columns, with missing values filled according to fill parameter.
- Return type:
Examples
>>> df = tp.tibble(x = [1, 1, 2], y = ['a', 'b', 'a'], val = [10, 20, 30])
>>> df.complete('x', 'y')
- count(*args, sort=False, name='n')[source]¶
Returns row counts of the dataset. If bare column names are provided, count() returns counts by group.
- Parameters:
*args (str, Expr) – Columns to group by
sort (bool) – Should columns be ordered in descending order by count
name (str) – The name of the new column in the output. If omitted, it will default to “n”.
- Returns:
If no argument is provided, returns the number of rows. If column names are provided, counts the unique value combinations across those columns.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 1, 2, 3],
...:               'b': ['a', 'a', 'b', 'b']})
>>> df.count()
shape: (1, 1)
┌─────┐
│ n   │
│ u32 │
╞═════╡
│ 4   │
└─────┘
>>> df.count('a', 'b')
shape: (3, 3)
┌─────────────────┐
│ a     b     n   │
│ i64   str   u32 │
╞═════════════════╡
│ 1     a     2   │
│ 2     b     1   │
│ 3     b     1   │
└─────────────────┘
- cross_join(df, suffix='_right')[source]¶
Perform a cross join (Cartesian product)
- Parameters:
df (tibble) – DataFrame to join with.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
All combinations of rows from both tibbles.
- Return type:
Examples
>>> df1.cross_join(df2)
- crossing(*args, **kwargs)[source]¶
Expands the existing tibble for each value of the variables used in the crossing() argument. See Returns.
- Parameters:
*args (list) – One unnamed list is accepted.
**kwargs (list) – Keyword will be the variable name, and the values in the list will be in the expanded tibble
- Returns:
A tibble with variables containing all combinations of the values in the arguments passed to crossing(). The original tibble will be replicated for each unique combination.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 2], "b": [3, 5]})
>>> df
shape: (2, 2)
┌───────────┐
│ a     b   │
│ i64   i64 │
╞═══════════╡
│ 1     3   │
│ 2     5   │
└───────────┘
>>> df.crossing(c = ['a', 'b', 'c'])
shape: (6, 3)
┌─────────────────┐
│ a     b     c   │
│ i64   i64   str │
╞═════════════════╡
│ 1     3     a   │
│ 1     3     b   │
│ 1     3     c   │
│ 2     5     a   │
│ 2     5     b   │
│ 2     5     c   │
└─────────────────┘
- describe()[source]¶
Generate summary statistics for all columns
- Returns:
A tibble with summary statistics including column name, type, count of non-null values, null count, unique count, and for numeric columns: mean, std, min, 25%, 50%, 75%, max.
- Return type:
Examples
>>> df.describe()
- descriptive_statistics(vars=None, groups=None, include_categorical=True, include_type=False)[source]¶
Compute descriptive statistics for numerical variables and optionally frequency statistics for categorical variables, with support for grouping.
- Parameters:
vars (str, list, dict, or None, default None) – The variables for which to compute statistics. - If None, all variables in the dataset (as given by self.names) are used. - If a string, it is interpreted as a single variable name. - If a list, each element is treated as a variable name. - If a dict, keys are variable names and values are their labels.
groups (str, list, dict, or None, default None) – Variable(s) to group by when computing statistics. - If None, overall statistics are computed. - If a string, it is interpreted as a single grouping variable. - If a list, each element is treated as a grouping variable. - If a dict, keys are grouping variable names and values are their labels.
include_categorical (bool, default True) – Whether to include frequency statistics for categorical variables in the output.
include_type (bool, default False) – If True, adds a column indicating the variable type (“Num” for numerical, “Cat” for categorical).
- Returns:
A tibble containing the descriptive statistics. For numerical variables, the statistics include N (count of non-missing values), Missing (percentage of missing values), Mean (average value), Std.Dev. (standard deviation), Min (minimum value), and Max (maximum value). If grouping is specified, these statistics are computed for each group. When
include_categorical is True, frequency statistics for categorical variables are appended to the result.
- Return type:
tibble
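For one numeric column, the numerical statistics listed above reduce to a few lines of plain Python (a sketch using the stdlib statistics module; describe_numeric is a hypothetical helper, not part of the package):

```python
import statistics

def describe_numeric(values):
    """N, Missing (%), Mean, Std.Dev., Min, Max for one column.

    Missing values are represented as None, mirroring nulls in a tibble.
    """
    present = [v for v in values if v is not None]
    return {
        "N": len(present),
        "Missing": 100 * (len(values) - len(present)) / len(values),
        "Mean": statistics.mean(present),
        "Std.Dev.": statistics.stdev(present),
        "Min": min(present),
        "Max": max(present),
    }

stats = describe_numeric([1.0, 2.0, None, 3.0])
```

Here stats reports N=3, Missing=25.0, Mean=2.0, Std.Dev.=1.0, Min=1.0, Max=3.0; the method computes the same quantities per variable (and per group when groups is given).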
- distinct(*args, keep_all=False)[source]¶
Select distinct/unique rows
- Parameters:
*args (str, Expr) – Columns to find distinct/unique rows
keep_all (bool) – If True, keep all columns. Otherwise, return only the ones used to select the distinct rows.
- Returns:
Tibble after removing the repeated rows based on
args.
- Return type:
tibble
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.distinct()
>>> df.distinct('b')
- drop(*args)[source]¶
Drop unwanted columns
- Parameters:
*args (str) – Columns to drop
- Returns:
Tibble with columns in
args dropped.
- Return type:
tibble
Examples
>>> df.drop('x', 'y')
- drop_na(*args)[source]¶
Drop rows containing missing values. Alias for
drop_null(), matching tidyr’s drop_na() spelling.
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows containing nulls in
args removed.
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c']) >>> df.drop_na() >>> df.drop_na('x')
- drop_null(*args)[source]¶
Drop rows containing missing values
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows in
args with missing values dropped.
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c'], z = range(3))
>>> df.drop_null()
>>> df.drop_null('x', 'y')
- fill(*args, direction='down', by=None)[source]¶
Fill in missing values with previous or next value
- Parameters:
*args (str) – Columns to fill
direction (str) – Direction to fill. One of [‘down’, ‘up’, ‘downup’, ‘updown’]
by (str, list) – Columns to group by
- Returns:
Tibble with missing values filled
- Return type:
Examples
>>> df = tp.tibble({'a': [1, None, 3, 4, 5],
...                 'b': [None, 2, None, None, 5],
...                 'groups': ['a', 'a', 'a', 'b', 'b']})
>>> df.fill('a', 'b')
>>> df.fill('a', 'b', by = 'groups')
>>> df.fill('a', 'b', direction = 'downup')
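The direction options can be illustrated on a plain list (fill_series is a hypothetical helper sketching the semantics, not the library's grouped, expression-based implementation):

```python
def fill_series(values, direction="down"):
    """Fill None with the previous ('down') or next ('up') value.

    'downup' fills down first, then up; 'updown' is the reverse.
    """
    def down(vals):
        out, last = [], None
        for v in vals:
            last = v if v is not None else last
            out.append(last)
        return out

    def up(vals):
        # Filling "up" is filling "down" on the reversed list.
        return down(vals[::-1])[::-1]

    if direction == "down":
        return down(values)
    if direction == "up":
        return up(values)
    if direction == "downup":
        return up(down(values))
    if direction == "updown":
        return down(up(values))
    raise ValueError(f"unknown direction: {direction!r}")

filled = fill_series([None, 2, None, None, 5], direction="downup")
```

On [None, 2, None, None, 5], 'down' leaves the leading None in place, while 'downup' also back-fills it, giving [2, 2, 2, 2, 5].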
- filter(*args, by=None)[source]¶
Filter rows on one or more conditions
- Parameters:
*args (Expr) – Conditions to filter by
by (str, list) – Columns to group by
- Returns:
A tibble with rows that match condition.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.filter(col('a') < 2, col('b') == 'a')
>>> df.filter((col('a') < 2) & (col('b') == 'a'))
>>> df.filter(col('a') <= tp.mean(col('a')), by = 'b')
- freq(vars=None, groups=None, na_rm=False, na_label=None)[source]¶
Compute frequency table.
- Parameters:
vars (str, list, or dict) – Variables to return value frequencies for. If a dict is provided, the key should be the variable name and the values the variable label for the output
groups (str, list, dict, or None, optional) – Variable names to condition marginal frequencies on. If a dict is provided, the key should be the variable name and the values the variable label for the output. Defaults to None (no grouping).
na_rm (bool, optional) – Whether to include NAs in the calculation. Defaults to False.
na_label (str) – Label to use for the NA values
- Returns:
A tibble with relative frequencies and counts.
- Return type:
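The core computation behind a frequency table can be sketched with collections.Counter (freq_table is a hypothetical stand-in for the method's per-variable logic):

```python
from collections import Counter

def freq_table(values, na_rm=False, na_label="NA"):
    """Counts and relative frequencies for one variable.

    If na_rm is True, None values are dropped; otherwise they are counted
    under na_label, mirroring the na_rm/na_label parameters above.
    """
    if na_rm:
        values = [v for v in values if v is not None]
    labeled = [na_label if v is None else v for v in values]
    counts = Counter(labeled)
    total = sum(counts.values())
    return {k: {"n": n, "freq": n / total} for k, n in counts.items()}

table = freq_table(['a', 'a', 'b', None])
```

Here 'a' gets n=2 and freq=0.5, and the None is reported under "NA"; with na_rm=True it would be excluded and the frequencies renormalized over the three remaining values.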
- full_join(df, left_on=None, right_on=None, on=None, suffix: str = '_right')[source]¶
Perform a full join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Union between the original and the df tibbles. The rows that don’t match in one of the tibbles will be completed with missing values.
- Return type:
Examples
>>> df1.full_join(df2)
>>> df1.full_join(df2, on = 'x')
>>> df1.full_join(df2, left_on = 'left_x', right_on = 'x')
- get_dupes(*cols)[source]¶
Find duplicate rows
- Parameters:
*cols (str) – Column names to check for duplicates. If empty, checks all columns.
- Returns:
A tibble containing duplicate rows with a ‘dupe_count’ column.
- Return type:
Examples
>>> df.get_dupes('x', 'y')
- glimpse(regex='.')[source]¶
Print compact information about the data
- Parameters:
regex (str, list, dict) – Return information of the variables that match the regular expression, the list, or the dictionary. If dictionary is used, the variable names must be the dictionary keys.
- Return type:
None
- group_by(group, *args, **kwargs)[source]¶
Takes an existing tibble and converts it into a grouped tibble where operations are performed “by group”. ungroup() happens automatically after the operation is performed.
- Parameters:
group (str, list) – Variable names to group by.
- Returns:
A tibble with values grouped by one or more columns.
- Return type:
Grouped tibble
- hoist(col_name, *, remove=False, **fields)[source]¶
Pull named elements out of a list- or struct-column into top-level columns.
- Parameters:
col_name (str) – Name of the list or struct column to reach into.
remove (bool) – If True, drop the original column after hoisting.
**fields (str, int, or list) – Each keyword defines a new top-level column. The value is a path into the list/struct column: a field name, an integer list index, or a list of such steps for nested access.
Examples
>>> df = tp.tibble(meta = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
>>> df.hoist('meta', a = 'a')
- inner_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform an inner join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
A tibble with intersection of cases in the original and df tibbles.
- Return type:
Examples
>>> df1.inner_join(df2)
>>> df1.inner_join(df2, on = 'x')
>>> df1.inner_join(df2, left_on = 'left_x', right_on = 'x')
- left_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a left join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
The original tibble with added columns from tibble df if they match columns in the original one. Columns to match on are given in the function parameters.
- Return type:
Examples
>>> df1.left_join(df2)
>>> df1.left_join(df2, on = 'x')
>>> df1.left_join(df2, left_on = 'left_x', right_on = 'x')
- mutate(*args, by=None, **kwargs)[source]¶
Add or modify columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
Original tibble with new column created.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.mutate(double_a = col('a') * 2,
...           a_plus_b = col('a') + col('b'))
>>> df.mutate(row_num = row_number(), by = 'c')
- nest(by, *args, **kwargs)[source]¶
Creates a nested tibble
- Parameters:
by (list, str) – Columns to nest on
**kwargs –
- data (list of column names) – Columns to include in the nested data. If not provided, include all columns except the ones used in ‘by’.
- key (str) – Name of the resulting nested column.
- names_sep (str) – If not provided (default), the names in the nested data will come from the former names. If a string, the new inner names in the nested dataframe will use the outer names with names_sep automatically stripped. This makes names_sep roughly symmetric between nesting and unnesting.
- Returns:
The resulting tibble will have a column that contains nested tibbles
- Return type:
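A plain-Python analogue of nesting (nest_rows is a hypothetical helper grouping rows of dicts; the actual method returns a tibble whose nested column holds tibbles):

```python
from itertools import groupby

def nest_rows(rows, by):
    """Group a list of row-dicts into {by-value: [inner rows]}.

    The 'by' column is removed from the inner rows, matching the default
    where the nested data holds all columns except the ones nested on.
    """
    keyfn = lambda r: r[by]
    rows = sorted(rows, key=keyfn)  # groupby needs sorted input
    return {
        k: [{c: v for c, v in r.items() if c != by} for r in grp]
        for k, grp in groupby(rows, key=keyfn)
    }

rows = [{'g': 'a', 'x': 1}, {'g': 'b', 'x': 2}, {'g': 'a', 'x': 3}]
nested = nest_rows(rows, by='g')
```

Each key of nested plays the role of a row in the outer tibble, and its list of inner dicts plays the role of the nested tibble in the key column.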
- pack(**groups)[source]¶
Pack several columns into one or more struct columns.
- Parameters:
**groups (list or str) – Each keyword defines a new struct column. The value is a list of existing column names to pack into that struct.
Examples
>>> df = tp.tibble(x = [1, 2], y = [3, 4], z = ['a', 'b'])
>>> df.pack(position = ['x', 'y'])
- pipe(fn, *args, **kwargs)[source]¶
Apply a function to the entire DataFrame
- Parameters:
fn (callable) – Function to apply. The tibble is passed as the first argument.
*args (any) – Additional positional arguments passed to fn.
**kwargs (any) – Additional keyword arguments passed to fn.
- Returns:
Result of
fn(self, *args, **kwargs).
- Return type:
any
Examples
>>> def add_column(df, name, value):
...     return df.mutate(**{name: value})
>>> df.pipe(add_column, 'new_col', 1)
- pivot_longer(cols=None, names_to='name', values_to='value')[source]¶
Pivot data from wide to long
- Parameters:
cols (Expr) – List of the columns to pivot. Defaults to all columns.
names_to (str) – Name of the new “names” column.
values_to (str) – Name of the new “values” column
- Returns:
Original tibble, but in long format.
- Return type:
Examples
>>> df = tp.tibble({'id': ['id1', 'id2'], 'a': [1, 2], 'b': [1, 2]})
>>> df.pivot_longer(cols = ['a', 'b'])
>>> df.pivot_longer(cols = ['a', 'b'], names_to = 'stuff', values_to = 'things')
- pivot_wider(names_from='name', values_from='value', id_cols=None, values_fn='first', values_fill=None)[source]¶
Pivot data from long to wide
- Parameters:
names_from (str) – Column to get the new column names from.
values_from (str) – Column to get the new column values from
id_cols (str, list) – A set of columns that uniquely identifies each observation. Defaults to all columns in the data table except for the columns specified in names_from and values_from.
values_fn (str) – Function for how multiple entries per group should be dealt with. Any of ‘first’, ‘count’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’
values_fill (str) – If values are missing/null, what value should be filled in. Can use: “backward”, “forward”, “mean”, “min”, “max”, “zero”, “one”
- Returns:
Original tibble, but in wide format.
- Return type:
Examples
>>> df = tp.tibble({'id': [1, 1], 'variable': ['a', 'b'], 'value': [1, 2]})
>>> df.pivot_wider(names_from = 'variable', values_from = 'value')
- print(n=1000, ncols=1000, str_length=1000, digits=2)[source]¶
Print the DataFrame
- Parameters:
n (int, default=1000) – Number of rows to print
ncols (int, default=1000) – Number of columns to print
str_length (int, default=1000) – Maximum length of the strings.
digits (int, default=2) – Number of digits to display for floating point numbers.
- Return type:
None
- pull(var=None)[source]¶
Extract a column as a series
- Parameters:
var (str) – Name of the column to extract. Defaults to the last column.
- Returns:
The series will contain the values of the column from var.
- Return type:
Series
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3)})
>>> df.pull('a')
- relevel(x, ref)[source]¶
Change the reference level of a string or factor column and convert it to factor
- Parameters:
x (str) – Variable name
ref (str) – Reference level
- Returns:
The original tibble with the column specified in x as an ordered factor, with the first category specified in ref.
- Return type:
tibble
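The level reordering that relevel performs can be sketched as follows (relevel_levels is a hypothetical helper; the actual method also casts the column to an ordered factor):

```python
def relevel_levels(values, ref):
    """Return the factor levels with `ref` first.

    Remaining levels keep their order of first appearance in the data.
    """
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    if ref not in seen:
        raise ValueError(f"{ref!r} is not a level of the column")
    return [ref] + [lvl for lvl in seen if lvl != ref]

levels = relevel_levels(['low', 'high', 'mid', 'low'], ref='mid')
```

Here levels is ['mid', 'low', 'high']: 'mid' becomes the reference (first) category, which is what downstream tools such as regression encoders key off.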
- relocate(*args, before=None, after=None)[source]¶
Move a column or columns to a new position
- Parameters:
*args (str, Expr) – Columns to move
- Returns:
Original tibble with columns relocated.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.relocate('a', before = 'c')
>>> df.relocate('b', after = 'c')
- rename(*args, regex=False, tolower=False, strict=False, **kwargs)[source]¶
Rename columns
- Parameters:
*args (str or dict) – If a single dict is provided, it is used as {old_name: new_name}. If strings are provided, they are treated as pairs: new_name, old_name, …
regex (bool, default False) – If True, uses regular expression replacement {<matched from>:<matched to>}
tolower (bool, default False) – If True, convert all to lower case
**kwargs (str) – Keyword arguments in the form new_name=’old_name’
- Returns:
Original tibble with columns renamed.
- Return type:
Examples
>>> df = tp.tibble({'x': range(3), 't': range(3), 'z': ['a', 'a', 'b']})
>>> df.rename({'x': 'new_x'})
>>> df.rename(new_x = 'x')
>>> df.rename('new_x', 'x')
- replace(rep, regex=False)[source]¶
Replace values of a column, using polars’ or pandas’ replace depending on the format of rep.
- Parameters:
rep (dict) –
- Format to use polars’ replace:
{<varname>:{<old value>:<new value>, …}}
- Format to use pandas’ replace:
{<old value>:<new value>, …}
regex (bool) – If True, replace using regular expressions. It uses pandas’ replace().
- Returns:
Original tibble with values of columns replaced based on rep.
- Return type:
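The two accepted formats of rep can be sketched in plain Python (replace_values is a hypothetical illustration of the dispatch, not the method itself, which delegates to polars or pandas):

```python
def replace_values(columns, rep):
    """Apply a replacement mapping to a {name: values} dict of columns.

    Per-column form: {<varname>: {<old>: <new>, ...}} (polars-style).
    Flat form:       {<old>: <new>, ...} applied to every column (pandas-style).
    """
    # The per-column form is recognized by all mapping values being dicts.
    per_column = bool(rep) and all(isinstance(v, dict) for v in rep.values())
    out = {}
    for name, vals in columns.items():
        mapping = rep.get(name, {}) if per_column else rep
        out[name] = [mapping.get(v, v) for v in vals]
    return out

cols = {'x': [1, 2, 1], 'y': ['a', 'b', 'a']}
replaced = replace_values(cols, {'x': {1: 10}})   # per-column form
flat = replace_values(cols, {'a': 'z'})           # flat form, all columns
```

In the per-column form only 'x' is touched (1 becomes 10); in the flat form every column is scanned, so the 'a' values in 'y' become 'z'.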
- replace_na(replace=None)[source]¶
Replace null values in specified columns
- Parameters:
replace (dict) – Dictionary mapping column names to replacement values.
- Returns:
A tibble with nulls replaced.
- Return type:
Examples
>>> df.replace_na({'x': 0, 'y': 'missing'})
- replace_null(replace=None)[source]¶
Replace null values
- Parameters:
replace (dict, str, int, or float) – Dictionary of column/replacement pairs, or values to replace null values. If not dict, replace in all columns. If replace is a string, it will replace nulls in all string columns, and so on.
- Returns:
Original tibble with missing/null values replaced.
- Return type:
Examples
>>> df = tp.tibble({'a': [None, 'abc', 'cde'], 'b': [None, 1, 2], 'c': [None, 1.1, 2.2]})
>>> df.replace_null({'a': 'New value'})
>>> df.replace_null({'a': 1})
>>> df.replace_null({'b': 1})
>>> df.replace_null({'b': 1.1})
>>> df.replace_null({'c': 1})
>>> df.replace_null('a')
>>> df.replace_null(1)
>>> df.replace_null(1.1)
- right_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a right join
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Every row of df with matching columns from self. Rows of df without a match in self receive null values in the columns coming from self.
- Return type:
Examples
>>> df1.right_join(df2)
>>> df1.right_join(df2, on = 'x')
>>> df1.right_join(df2, left_on = 'left_x', right_on = 'x')
- sample_frac(fraction, seed=None, with_replacement=False)[source]¶
Randomly sample a fraction of rows
- Parameters:
fraction (float) – Fraction of rows to sample (between 0 and 1).
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with a random fraction of rows.
- Return type:
Examples
>>> df.sample_frac(0.5, seed = 42)
- sample_n(n, seed=None, with_replacement=False)[source]¶
Randomly sample n rows
- Parameters:
n (int) – Number of rows to sample.
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with n randomly sampled rows.
- Return type:
Examples
>>> df.sample_n(5, seed = 42)
- save_data(fn, copies=None, sep=';', kws_latex=None, *args, **kws)[source]¶
Save data based on the filename.
- Parameters:
fn (callable, str) – Path and filename
copies (list of str) – List of file extensions. A copy of the file is saved for each extension, using the same filename and path as fn.
sep (str, optional) – Column separator for exporting to text-like files (.csv, .tsv, .txt, etc.)
kws_latex (dict) – Arguments of to_latex(). See tibble.to_latex()
Notes
Additional positional and keyword arguments are passed to the underlying method used to save the file, which is based on the file extension.
.tex => tidypolars_extra.tibble.to_latex
.csv => polars.write_csv (uses sep=’;’ as default)
.tsv => polars.write_csv (uses sep=’ ‘ as default)
.dat => polars.write_csv (uses sep=’ ‘ as default)
.txt => polars.write_csv (uses sep=’ ‘ as default)
.xls => polars.write_excel
.xlsx => polars.write_excel
.dta => pandas.DataFrame.to_stata
.parquet => polars.write_parquet
Use silently=True to save quietly (Default False).
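The extension dispatch above can be pictured as a simple suffix lookup. A hypothetical sketch (the writer names come from the table above; the dispatch function itself is an assumption, not the library's actual internals):

```python
# Sketch of extension-based writer dispatch, mirroring the table above.
# The writer names are the documented ones; the lookup code is hypothetical.
from pathlib import Path

WRITERS = {
    '.tex': 'tidypolars_extra.tibble.to_latex',
    '.csv': 'polars.write_csv',      # sep=';' by default
    '.tsv': 'polars.write_csv',
    '.dat': 'polars.write_csv',
    '.txt': 'polars.write_csv',
    '.xls': 'polars.write_excel',
    '.xlsx': 'polars.write_excel',
    '.dta': 'pandas.DataFrame.to_stata',
    '.parquet': 'polars.write_parquet',
}

def pick_writer(fn):
    """Return the writer name for a filename based on its extension."""
    ext = Path(fn).suffix.lower()
    try:
        return WRITERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported extension: {ext!r}")

print(pick_writer('results/table1.xlsx'))  # polars.write_excel
```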
- select(*args)[source]¶
Select or drop columns
- Parameters:
*args (str, list, dict, or combinations of them) – Columns to select. It can combine names, lists of names, and a dict. If a dict is given, columns are renamed based on it. It also accepts the helper functions tp.matches(<regex>), tp.contains(<str>), and tp.where(<str>).
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'abcba': ['a', 'a', 'b']})
>>> df.select('a', 'b')
>>> df.select(col('a'), col('b'))
>>> df.select({'a': 'new name'}, tp.matches("c"))
>>> df.select(tp.where('numeric'))
- semi_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform a semi join (keep rows with a match in df, no columns added)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that have a match in df.
- Return type:
Examples
>>> df1.semi_join(df2, on = 'x')
- separate(sep_col, into, sep='_', remove=True)[source]¶
Separate a character column into multiple columns
- Parameters:
sep_col (str) – Column to split into multiple columns
into (list) – List of new column names
sep (str) – Separator to split on. Defaults to '_'
remove (bool) – If True removes the input column from the output data frame
- Returns:
Original tibble with the column split based on sep.
- Return type:
Examples
>>> df = tp.tibble(x = ['a_a', 'b_b', 'c_c'])
>>> df.separate('x', into = ['left', 'right'])
- separate_longer_delim(sep_col, delim)[source]¶
Split a string column by delim into longer rows.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
Examples
>>> df = tp.tibble(x = ['a,b', 'c'])
>>> df.separate_longer_delim('x', ',')
- separate_longer_position(sep_col, width)[source]¶
Split each string into chunks of width characters and convert into longer rows.
- Parameters:
sep_col (str) – Column to split.
width (int) – Width of each chunk in characters.
Examples
>>> df = tp.tibble(x = ['abcd', 'efgh'])
>>> df.separate_longer_position('x', 2)
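The split rule is plain fixed-width slicing; a minimal pure-Python equivalent of the chunking step (a sketch of the semantics, not the library implementation):

```python
# Fixed-width chunking as used by separate_longer_position():
# each string becomes one row per `width`-character slice.
def chunk(s, width):
    return [s[i:i + width] for i in range(0, len(s), width)]

print(chunk('abcd', 2))   # ['ab', 'cd']
print(chunk('abcde', 2))  # ['ab', 'cd', 'e'] -- trailing short chunk kept
```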
- separate_rows(*cols, sep=',')[source]¶
Split the given columns on sep and explode them into longer rows. Superseded by separate_longer_delim() but kept for tidyr parity.
- Parameters:
*cols (str) – Columns to split and explode.
sep (str) – Delimiter to split on (default: ',').
Examples
>>> df = tp.tibble(x = ['a,b', 'c'], y = [1, 2])
>>> df.separate_rows('x', sep = ',')
- separate_wider_delim(sep_col, delim, names, *, remove=True, too_few='error', too_many='error')[source]¶
Split a string column into several columns using a delimiter.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
names (list) – Names of the resulting columns.
remove (bool) – If True (default) drop the original column.
too_few (str) – One of 'error' (default) or 'align_start'. When 'error', raises if a row produces fewer fields than len(names).
too_many (str) – One of 'error' (default) or 'drop'. When 'error', raises if a row produces more fields than len(names).
Examples
>>> df = tp.tibble(x = ['a_1', 'b_2'])
>>> df.separate_wider_delim('x', '_', names = ['letter', 'num'])
- separate_wider_position(sep_col, widths, *, remove=True)[source]¶
Split a string column into several columns by character positions.
- Parameters:
sep_col (str) – Column to split.
widths (dict) – Mapping of new column name → width in characters.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['2024Q1', '2025Q2'])
>>> df.separate_wider_position('x', widths = {'year': 4, 'q': 2})
- separate_wider_regex(sep_col, patterns, *, remove=True)[source]¶
Split a string column using a regular expression with named groups.
- Parameters:
sep_col (str) – Column to split.
patterns (str or dict) – Either a regex string containing named capturing groups, or a dict {name: sub_pattern} which is assembled into a single regex of named groups in the given order.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['id-001', 'id-002'])
>>> df.separate_wider_regex('x', {'prefix': '[a-z]+', '_sep': '-', 'num': '\d+'})
- set_names(nm=None)[source]¶
Change the column names of the data frame
- Parameters:
nm (list) – A list of new names for the data frame
Examples
>>> df = tp.tibble(x = range(3), y = range(3))
>>> df.set_names(['a', 'b'])
- slice(*args, by=None)[source]¶
Grab rows from a data frame
- Parameters:
*args (int, list) – Rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice(0, 1)
>>> df.slice(0, by = 'c')
- slice_head(n=5, *, by=None)[source]¶
Grab top rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_head(2)
>>> df.slice_head(1, by = 'c')
- slice_max(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the largest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (descending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_max('x', n = 1)
>>> df.slice_max('x', n = 1, by = 'g')
- slice_min(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the smallest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (ascending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_min('x', n = 1)
>>> df.slice_min('x', n = 1, by = 'g')
- slice_sample(n=None, *, prop=None, replace=False, seed=None, by=None)[source]¶
Randomly sample rows. Modern replacement for sample_n() and sample_frac().
- Parameters:
n (int, optional) – Number of rows to sample. Provide exactly one of n or prop.
prop (float, optional) – Fraction of rows to sample (between 0 and 1).
replace (bool) – Whether to sample with replacement.
seed (int, optional) – Random seed for reproducibility.
by (str, list, optional) – Columns to group by; sampling happens within each group.
Examples
>>> df.slice_sample(n = 3, seed = 42)
>>> df.slice_sample(prop = 0.5, by = 'g', seed = 42)
- slice_tail(n=5, *, by=None)[source]¶
Grab bottom rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_tail(2)
>>> df.slice_tail(1, by = 'c')
- summarize(*args, by=None, **kwargs)[source]¶
Aggregate data with summary statistics
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with the summaries
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.summarize(avg_a = tp.mean(col('a')))
>>> df.summarize(avg_a = tp.mean(col('a')), by = 'c')
>>> df.summarize(avg_a = tp.mean(col('a')), max_b = tp.max(col('b')))
- tab(row, col, groups=None, margins=True, normalize='all', margins_name='Total', stat='both', na_rm=True, na_label='NA', digits=2)[source]¶
Create a 2x2 contingency table for two categorical variables, with optional grouping, margins, and normalization.
- Parameters:
row (str) – Name of the variable to be used for the rows of the table.
col (str) – Name of the variable to be used for the columns of the table.
groups (str or list of str, optional) – Variable name(s) to use as grouping variables. When provided, a separate 2x2 table is generated for each group.
margins (bool, default True) – If True, include row and column totals (margins) in the table.
normalize ({'all', 'row', 'columns'}, default 'all') –
- Specifies how to compute the marginal percentages in each cell:
'all': percentages computed over the entire table.
'row': percentages computed across each row.
'columns': percentages computed down each column.
margins_name (str, default 'Total') – Name to assign to the row and column totals.
stat ({'both', 'perc', 'n'}, default 'both') –
- Determines the statistic to display in each cell:
'both': returns both percentages and sample size.
'perc': returns percentages only.
'n': returns sample size only.
na_rm (bool, default True) – If True, remove rows with missing values in the row or col variables.
na_label (str, default 'NA') – Label to use for missing values when na_rm is False.
digits (int, default 2) – Number of digits to round the percentages to.
- Returns:
A contingency table as a tibble. The table contains counts and/or percentages as specified by the stat parameter, includes margins if requested, and is formatted with group headers when grouping variables are provided.
- Return type:
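To make the normalize options concrete, here is a pure-Python sketch of how cell percentages are derived from pair counts (semantics only; the library's actual table construction, margins, and formatting differ):

```python
# Sketch of the three `normalize` modes of tab(): compute the percentage
# for each (row_value, col_value) cell from raw pair counts.
from collections import Counter

def cell_percentages(pairs, normalize='all', digits=2):
    """pairs: list of (row_value, col_value). Returns {cell: percent}."""
    counts = Counter(pairs)
    if normalize == 'all':
        total = sum(counts.values())
        denom = {cell: total for cell in counts}
    elif normalize == 'row':
        row_totals = Counter(r for r, _ in pairs)
        denom = {(r, c): row_totals[r] for r, c in counts}
    elif normalize == 'columns':
        col_totals = Counter(c for _, c in pairs)
        denom = {(r, c): col_totals[c] for r, c in counts}
    else:
        raise ValueError(normalize)
    return {cell: round(100 * n / denom[cell], digits)
            for cell, n in counts.items()}

pairs = [('m', 'yes'), ('m', 'no'), ('f', 'yes'), ('f', 'yes')]
print(cell_percentages(pairs, 'all'))  # each cell over the grand total
print(cell_percentages(pairs, 'row'))  # each cell over its row total
```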
- to_csv(*args, **kws)[source]¶
Save tibble to csv.
Details¶
See polars write_csv() for details.
- Return type:
None
- to_dict(*, as_series=True)[source]¶
Convert the tibble to a dictionary
- Parameters:
as_series (bool) – If True, returns the dict values as Series; if False, as lists.
Examples
>>> df.to_dict()
>>> df.to_dict(as_series = False)
- to_dta(*args, **kws)[source]¶
Save tibble to dta.
Details¶
See pandas DataFrame.to_stata() for details.
- Return type:
None
- to_excel(*args, **kws)[source]¶
Save tibble to excel.
Details¶
See polars write_excel() for details.
- Return type:
None
- to_latex(fn=None, header=None, digits=4, caption=None, label=None, align=None, na_rep='', position='!htb', group_rows_by=None, group_title_align='l', footnotes=None, footnotes_width='\\linewidth', index=False, escape=False, longtable=False, longtable_singlespace=True, rotate=False, scale=True, parse_linebreaks=True, tabular=False, *args, **kws)[source]¶
Convert the object to a LaTeX tabular representation.
- Parameters:
fn (str) – Path with filename
header (list of tuples, optional) –
The column headers for the LaTeX table. Each tuple corresponds to a column. Example creating upper level header with grouped columns:
[("", "col 1"), ("Group A", "col 2"), ("Group A", "col 3"), ("Group B", "col 4"), ("Group B", "col 5"), ]
Example creating two upper level headers with grouped columns:
[("Group 1", "" , "col 1"), ("Group 1", "Group A", "col 2"), ("Group 1", "Group A", "col 3"), ("" , "Group B", "col 4"), ("" , "Group B", "col 5"), ]
digits (int, default=4) – Number of decimal places to round the numerical values in the table.
caption (str, optional) – The caption for the LaTeX table.
label (str, optional) – The label for referencing the table in LaTeX.
align (str, optional) – Column alignment specifications (e.g., ‘lcr’).
na_rep (str, default='') – The representation for NaN values in the table.
position (str, default='!htb') – The placement option for the table in the LaTeX document.
footnotes (dict, optional) – A dictionary where keys are column alignments (‘c’, ‘r’, or ‘l’) and values are the respective footnote strings.
footnotes_width (str, None) – Width of the footnotes. Example: '\linewidth', '40pt'. If None, no restriction is imposed on the width.
group_rows_by (str, default=None) – Name of the variable in the data with values to group the rows by.
group_title_align (str, default='l') – Alignment of the title of each row group.
index (bool, default=False) – Whether to include the index in the LaTeX table.
escape (bool, default=False) – Whether to escape LaTeX special characters.
longtable (bool, default=False) – If True, the table spans multiple pages
longtable_singlespace (bool) – Force single space to longtables
rotate (bool) – Whether to use landscape table
scale (bool, default=True) – If True, scales the table to fit the linewidth when the table exceeds that size. Ignored when longtable=True (a LaTeX limitation, because longtable does not use tabular).
parse_linebreaks (bool, default=True) – If True, parse \n and replace it with \makecell to produce linebreaks
tabular (bool, default=False) – Whether to use a tabular format for the output.
- Returns:
A LaTeX formatted string of the tibble.
- Return type:
str
- to_markdown()[source]¶
Render the tibble as a Markdown table string
- Returns:
A Markdown-formatted table string.
- Return type:
str
Examples
>>> print(df.to_markdown())
- to_parquet(file=str, compression='snappy', use_pyarrow=False, silently=False, *args, **kws)[source]¶
Write the data frame to a parquet file
- transmute(*args, by=None, **kwargs)[source]¶
Add or modify columns, keeping only the new columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with only the newly created columns (and grouping columns if by is used).
- Return type:
Examples
>>> df.transmute(double_a = col('a') * 2)
- unite(col='_united', unite_cols=[], sep='_', remove=True)[source]¶
Unite multiple columns by pasting strings together
- Parameters:
col (str) – Name of the new column
unite_cols (list) – List of columns to unite
sep (str) – Separator to use between values
remove (bool) – If True removes input columns from the data frame
Examples
>>> df = tp.tibble(a = ["a", "a", "a"], b = ["b", "b", "b"], c = range(3))
>>> df.unite("united_col", unite_cols = ["a", "b"])
- unnest(col)[source]¶
Unnest a nested tibble
- Parameters:
col (str) – Column to unnest
- Returns:
The nested tibble is expanded into unnested rows of the original tibble.
- Return type:
- unnest_longer(col_name, *, values_to=None, indices_to=None)[source]¶
Turn each element of a list- or struct-column into its own row.
For list columns, this behaves like DataFrame.explode. For struct columns, each row is expanded into one row per field, with the field name going into indices_to and the field value into values_to.
- Parameters:
col_name (str) – Name of the list or struct column to unnest.
values_to (str, optional) – Name of the output value column. For list columns this renames the exploded column. For struct columns this names the value column; defaults to col_name.
indices_to (str, optional) – For struct columns, the name of the field-name column. Defaults to f"{col_name}_id".
Examples
>>> df = tp.tibble(id = [1, 2], vals = [[10, 20], [30]])
>>> df.unnest_longer('vals')
- unnest_wider(col_name, *, names_sep=None)[source]¶
Turn each element of a struct- or list-column into its own column.
- Parameters:
col_name (str) – Name of the column to unnest.
names_sep (str, optional) – If provided, the output column names become f"{col_name}{names_sep}{field}" to avoid collisions.
Examples
>>> df = tp.tibble(id = [1, 2], pt = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}])
>>> df.unnest_wider('pt')
- unpack(*cols)[source]¶
Unpack one or more struct columns into their component columns.
- Parameters:
*cols (str) – Names of the struct columns to unpack.
Examples
>>> df = tp.tibble(id = [1, 2]).pack(pt = ['id'])  # contrived
>>> df.unpack('pt')
- property names¶
Get column names
- Returns:
Names of the columns
- Return type:
list
Examples
>>> df.names
- property ncol¶
Get number of columns
- Returns:
Number of columns
- Return type:
int
Examples
>>> df.ncol
- property nrow¶
Get number of rows
- Returns:
Number of rows
- Return type:
int
Examples
>>> df.nrow
- tidypolars_extra.abs(x)[source]¶
Absolute value
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(abs_x = tp.abs('x'))
>>> df.mutate(abs_x = tp.abs(col('x')))
- tidypolars_extra.across(cols, fn=lambda x: ..., names_prefix=None, names_suffix=None)[source]¶
Apply a function across a selection of columns
- Parameters:
cols (list) – Columns to operate on
fn (lambda) – A function or lambda to apply to each column
names_prefix (str, optional) – Prefix to prepend to changed columns
names_suffix (str, optional) – Suffix to append to changed columns
Examples
>>> df = tp.tibble(x = ['a', 'a', 'b'], y = range(3), z = range(3))
>>> df.mutate(across(['y', 'z'], lambda x: x * 2))
>>> df.mutate(across(tp.Int64, lambda x: x * 2, names_prefix = "double_"))
>>> df.summarize(across(['y', 'z'], tp.mean), by = 'x')
- tidypolars_extra.as_character(x)[source]¶
Convert to string. Defaults to Utf8.
- Parameters:
x (Str) – Column to operate on
Examples
>>> df.mutate(string_x = tp.as_string('x'))  # or equivalently
>>> df.mutate(character_x = tp.as_character('x'))
- tidypolars_extra.as_date(x, fmt=None)[source]¶
Convert a string to a Date
- Parameters:
x (Expr, Series) – Column to operate on
fmt (str) – “yyyy-mm-dd”
Examples
>>> df = tp.tibble(x = ['2021-01-01', '2021-10-01'])
>>> df.mutate(date_x = tp.as_date(col('x')))
- tidypolars_extra.as_datetime(x, fmt=None)[source]¶
Convert a string to a Datetime
- Parameters:
x (Expr, Series) – Column to operate on
fmt (str) – “yyyy-mm-dd”
Examples
>>> df = tp.tibble(x = ['2021-01-01', '2021-10-01'])
>>> df.mutate(date_x = tp.as_datetime(col('x')))
- tidypolars_extra.as_factor(x, levels=None)[source]¶
Convert to factor (R naming), equivalent to Enum or Categorical (polars), depending on whether 'levels' is provided.
- Parameters:
x (Str) – Column to operate on
levels (list of str) – Categories to use in the factor. The categories will be ordered as they appear in the list. If None (default), an unordered factor (polars Categorical) is created.
Examples
>>> df.mutate(factor_x = tp.as_factor('x'))  # or equivalently
>>> df.mutate(categorical_x = tp.as_categorical('x'))
- tidypolars_extra.as_float(x)[source]¶
Convert to float. Defaults to Float64.
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(float_x = tp.as_float(col('x')))
- tidypolars_extra.as_integer(x)[source]¶
Convert to integer. Defaults to Int64.
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.mutate(int_x = tp.as_integer(col('x')))
- tidypolars_extra.as_logical(x)[source]¶
Convert to a boolean (polars) or ‘logical’ (R naming)
- Parameters:
x (Str) – Column to operate on
Examples
>>> df.mutate(bool_x = tp.as_boolean(col('x')))  # or equivalently
>>> df.mutate(logical_x = tp.as_logical(col('x')))
- tidypolars_extra.as_string(x)[source]¶
Convert column to string. Alias to as_character (R naming). Equivalent to Utf8 type (polars)
- tidypolars_extra.between(x, left, right)[source]¶
Test if values of a column are between two values
- Parameters:
x (Expr, Series) – Column to operate on
left (int) – Value to test if column is greater than or equal to
right (int) – Value to test if column is less than or equal to
Examples
>>> df = tp.tibble(x = range(4))
>>> df.filter(tp.between(col('x'), 1, 3))
- tidypolars_extra.case_when(*args, _default=None)[source]¶
Case when
- Parameters:
*args (Expr) – When called with a single expression, returns pl.when() for chaining (e.g., tp.case_when(cond).then(val).otherwise(val)). When called with paired args (condition, value, condition, value, …), builds the full case expression.
_default (optional) – Default value when no condition is met (used with paired args)
Examples
>>> df = tp.tibble(x = range(1, 4))
>>> # Chaining style
>>> df.mutate(case_x = tp.case_when(col('x') < 2).then(0)
...           .when(col('x') < 3).then(1)
...           .otherwise(0))
>>> # Paired args style
>>> df.mutate(case_x = tp.case_when(col('x') < 2, 1,
...                                 col('x') < 3, 2,
...                                 _default = 0))
- tidypolars_extra.cast(x, dtype)[source]¶
General type conversion.
- Parameters:
x (Expr, Series) – Column to operate on
dtype (DataType) – Type to convert to
Examples
>>> df.mutate(abs_x = tp.cast(col('x'), tp.Float64))
- tidypolars_extra.ceiling_date(x, unit='month', change_on_boundary=False)[source]¶
Round date up to the nearest unit
- Parameters:
x (Expr, str) – Date/datetime column
unit (str) – Unit to round to: ‘year’, ‘month’, ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’
change_on_boundary (bool) – If False (default), dates already at a boundary are unchanged. If True, boundary dates are bumped to the next unit.
- Returns:
Date/datetime rounded up.
- Return type:
Expr
Examples
>>> df.mutate(month_end = tp.ceiling_date('date', 'month'))
- tidypolars_extra.coalesce(*args)[source]¶
Coalesce missing values
- Parameters:
args (Expr) – Columns to coalesce
Examples
>>> df.mutate(x_or_y = tp.coalesce(col('x'), col('y')))
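coalesce keeps, per row, the first non-null value across its arguments. A pure-Python sketch of that rule (illustrative semantics, not the polars implementation):

```python
# Row-wise first-non-None across parallel columns, mirroring coalesce().
def coalesce_rows(*columns):
    """Return, per row, the first non-None value across the given columns."""
    return [next((v for v in row if v is not None), None)
            for row in zip(*columns)]

a = [1, None, None]
b = [9, 2, None]
print(coalesce_rows(a, b))  # [1, 2, None]
```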
- tidypolars_extra.contains(match, ignore_case=True)[source]¶
Contains a literal string
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True (default), ignores case when matching names.
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.select(contains('c'))
- tidypolars_extra.cor(x, y, method='pearson')[source]¶
Find the correlation of two columns
- Parameters:
x (Expr) – A column
y (Expr) – A column
method (str) – Type of correlation to find. Either ‘pearson’ or ‘spearman’.
Examples
>>> df.summarize(cor = tp.cor(col('x'), col('y')))
- tidypolars_extra.count(x)[source]¶
Number of observations in each group
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(count = tp.count(col('x')))
- tidypolars_extra.cov(x, y)[source]¶
Find the covariance of two columns
- Parameters:
x (Expr) – A column
y (Expr) – A column
Examples
>>> df.summarize(cov_xy = tp.cov(col('x'), col('y')))
- tidypolars_extra.cume_dist(x)[source]¶
Compute cumulative distribution (proportion of values <= current value)
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cd = tp.cume_dist('x'))
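The definition can be sketched in pure Python: for each value, the proportion of values less than or equal to it (assumed semantics matching the description above, not the library code):

```python
# Empirical cumulative distribution, mirroring cume_dist():
# proportion of values <= each value.
def cume_dist(values):
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]

print(cume_dist([1, 2, 2, 4]))  # [0.25, 0.75, 0.75, 1.0]
```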
- tidypolars_extra.cummax(x)[source]¶
Cumulative maximum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cmax = tp.cummax('x'))
- tidypolars_extra.cummin(x)[source]¶
Cumulative minimum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cmin = tp.cummin('x'))
- tidypolars_extra.cumprod(x)[source]¶
Cumulative product
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cprod = tp.cumprod('x'))
- tidypolars_extra.cumsum(x)[source]¶
Cumulative sum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(csum = tp.cumsum('x'))
- tidypolars_extra.days(n=1)[source]¶
Create a duration of n days
- Parameters:
n (int) – Number of days
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(tomorrow = col('date') + tp.days(1))
- tidypolars_extra.difftime(x, y, units='days')[source]¶
Compute time differences in specified units
- Parameters:
x (Expr, str) – Start date/datetime column
y (Expr, str) – End date/datetime column
units (str) – Units for the result: ‘days’, ‘hours’, ‘minutes’, ‘seconds’, ‘weeks’
- Returns:
Numeric expression with the time difference.
- Return type:
Expr
Examples
>>> df.mutate(diff = tp.difftime('date1', 'date2', units='days'))
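The unit conversion is plain timedelta arithmetic; a stdlib sketch, assuming the result is expressed as (end - start) in the requested unit:

```python
# Sketch of difftime()'s unit conversion using the stdlib datetime module.
from datetime import date

SECONDS = {'seconds': 1, 'minutes': 60, 'hours': 3600,
           'days': 86400, 'weeks': 604800}

def difftime(start, end, units='days'):
    """Difference end - start, expressed in the requested unit."""
    return (end - start).total_seconds() / SECONDS[units]

print(difftime(date(2024, 1, 1), date(2024, 1, 8), 'days'))   # 7.0
print(difftime(date(2024, 1, 1), date(2024, 1, 8), 'weeks'))  # 1.0
```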
- tidypolars_extra.dt_round(x, rule, n)[source]¶
Round the datetime
- Parameters:
x (Expr, Series) – Column to operate on
rule (str) – Units of the downscaling operation. Any of:
"month","week","day","hour","minute","second".n (int) – Number of units (e.g. 5 “day”, 15 “minute”.
Examples
>>> df.mutate(rounded_x = tp.dt_round(col('x'), 'minute', 15))
- tidypolars_extra.ends_with(match, ignore_case=True)[source]¶
Ends with a suffix
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True (default), ignores case when matching names.
Examples
>>> df = tp.tibble({'a': range(3), 'b_code': range(3), 'c_code': ['a', 'a', 'b']})
>>> df.select(ends_with('code'))
- tidypolars_extra.everything()[source]¶
Selects all columns
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.select(everything())
- tidypolars_extra.fct_collapse(x, **kwargs)[source]¶
Collapse multiple factor levels into one
- Parameters:
x (Expr, str) – Factor/categorical column
**kwargs – Mapping of new_level = [‘old1’, ‘old2’, …]
- Returns:
Expression with collapsed levels.
- Return type:
Expr
Examples
>>> df.mutate(x_collapsed = tp.fct_collapse('x', ab=['a', 'b'], cd=['c', 'd']))
- tidypolars_extra.fct_infreq(df, col_name)[source]¶
Reorder factor levels by frequency (most common first)
- Parameters:
df (tibble) – The DataFrame containing the column
col_name (str) – Name of the column to reorder
- Returns:
DataFrame with column cast to Enum with levels ordered by frequency.
- Return type:
Examples
>>> df = tp.tibble(x=['a', 'b', 'a', 'a', 'b', 'c'])
>>> df = tp.fct_infreq(df, 'x')
- tidypolars_extra.fct_lump(x, n=None, prop=None, other_level='Other')[source]¶
Collapse least frequent factor levels into ‘Other’
Uses a ranking approach: for each value, computes its frequency rank and replaces values outside the top n with other_level.
- Parameters:
x (Expr, str) – Factor/categorical column
n (int, optional) – Number of most frequent levels to keep
prop (float, optional) – Minimum proportion to keep a level (0 to 1)
other_level (str) – Label for collapsed levels (default: ‘Other’)
- Returns:
Expression with infrequent levels replaced.
- Return type:
Expr
Examples
>>> df.mutate(x_lumped = tp.fct_lump('x', n=3))
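The keep-top-n rule can be sketched in pure Python (tie-breaking here follows Counter.most_common order, which may differ from the library's frequency ranking):

```python
# Sketch of fct_lump()'s keep-top-n rule: values outside the n most
# frequent levels collapse to `other_level`.
from collections import Counter

def fct_lump(values, n, other_level='Other'):
    keep = {level for level, _ in Counter(values).most_common(n)}
    return [v if v in keep else other_level for v in values]

x = ['a', 'a', 'a', 'b', 'b', 'c']
print(fct_lump(x, n=2))  # ['a', 'a', 'a', 'b', 'b', 'Other']
```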
- tidypolars_extra.fct_recode(x, **kwargs)[source]¶
Manually recode factor levels
- Parameters:
x (Expr, str) – Factor/categorical column
**kwargs – Mapping of new_level = ‘old_level’ or new_level = [‘old1’, ‘old2’]
- Returns:
Expression with recoded levels.
- Return type:
Expr
Examples
>>> df.mutate(x_recoded = tp.fct_recode('x', good='a', bad='b'))
- tidypolars_extra.fct_rev(df, col_name)[source]¶
Reverse factor level order
- Parameters:
df (tibble) – The DataFrame containing the column
col_name (str) – Name of the column to reverse
- Returns:
DataFrame with column cast to Enum with reversed level order.
- Return type:
Examples
>>> df = tp.tibble(x=['a', 'b', 'c'])
>>> df = tp.fct_rev(df, 'x')
- tidypolars_extra.first(x)[source]¶
Get first value
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(first_x = tp.first('x'))
>>> df.summarize(first_x = tp.first(col('x')))
- tidypolars_extra.floor(x)[source]¶
Round numbers down to the lower integer
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(floor_x = tp.floor(col('x')))
- tidypolars_extra.floor_date(x, unit='month')[source]¶
Round date down to the nearest unit
- Parameters:
x (Expr, str) – Date/datetime column
unit (str) – Unit to round to: ‘year’, ‘month’, ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’
- Returns:
Date/datetime rounded down.
- Return type:
Expr
Examples
>>> df.mutate(month_start = tp.floor_date('date', 'month'))
- tidypolars_extra.from_pandas(df)[source]¶
Convert from pandas DataFrame to tibble
- Parameters:
df (DataFrame) – pd.DataFrame to convert to a tibble
- Return type:
Examples
>>> tp.from_pandas(df)
- tidypolars_extra.from_polars(df)[source]¶
Convert from polars DataFrame to tibble
- Parameters:
df (DataFrame) – pl.DataFrame to convert to a tibble
- Return type:
Examples
>>> tp.from_polars(df)
- tidypolars_extra.hour(x)[source]¶
Extract the hour from a datetime
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(hour = tp.hour(col('x')))
- tidypolars_extra.hours(n=1)[source]¶
Create a duration of n hours
- Parameters:
n (int) – Number of hours
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(later = col('datetime') + tp.hours(2))
- tidypolars_extra.if_else(condition, true, false)[source]¶
If Else
- Parameters:
condition (Expr) – A logical expression
true – Value if the condition is true
false – Value if the condition is false
Examples
>>> df = tp.tibble(x = range(1, 4))
>>> df.mutate(if_x = tp.if_else(col('x') < 2, 1, 2))
- tidypolars_extra.iqr(x)[source]¶
Compute the interquartile range (Q3 - Q1)
Use in summarize() context only. Not suitable for mutate().
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(iqr_val = tp.iqr('x'))
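A stdlib sketch of the Q3 - Q1 computation (the library's exact quantile interpolation method may differ):

```python
# Interquartile range via stdlib quartiles, mirroring iqr().
import statistics

def iqr(values):
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return q3 - q1

print(iqr([1, 2, 3, 4, 5]))  # 2.0
```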
- tidypolars_extra.is_finite(x)[source]¶
Test if values are finite
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(finite = tp.is_finite('x'))
- tidypolars_extra.is_in(x, values)[source]¶
Test if values are in a list
- Parameters:
x (Expr, Series) – Column to operate on
values (list) – List of values to check
Examples
>>> df.mutate(in_list = tp.is_in('x', [1, 2]))
- tidypolars_extra.is_infinite(x)[source]¶
Test if values are infinite
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(infinite = tp.is_infinite('x'))
- tidypolars_extra.is_not(x)[source]¶
Negate a boolean expression
- Parameters:
x (Expr) – Boolean expression to negate
Examples
>>> df.mutate(not_finite = tp.is_not(tp.is_finite(col('x'))))
- tidypolars_extra.is_not_in(x, values)[source]¶
Test if values are not in a list
- Parameters:
x (Expr, Series) – Column to operate on
values (list) – List of values to check
Examples
>>> df.mutate(not_in = tp.is_not_in('x', [1, 2]))
- tidypolars_extra.is_not_null(x)[source]¶
Test if values are not null
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(not_null = tp.is_not_null('x'))
- tidypolars_extra.is_null(x)[source]¶
Test if values are null
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(null = tp.is_null('x'))
- tidypolars_extra.lag(x, n: int = 1, default=None)[source]¶
Get lagging values
- Parameters:
x (Expr, Series) – Column to operate on
n (int) – Number of positions to lag by
default (optional) – Value to fill in missing values
Examples
>>> df.mutate(lag_x = tp.lag(col('x')))
>>> df.mutate(lag_x = tp.lag('x'))
- tidypolars_extra.last(x)[source]¶
Get last value
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(last_x = tp.last('x'))
>>> df.summarize(last_x = tp.last(col('x')))
- tidypolars_extra.lead(x, n: int = 1, default=None)[source]¶
Get leading values
- Parameters:
x (Expr, Series) – Column to operate on
n (int) – Number of positions to lead by
default (optional) – Value to fill in missing values
Examples
>>> df.mutate(lead_x = tp.lead(col('x')))
>>> df.mutate(lead_x = col('x').lead())
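The shift-and-fill semantics of lag() and lead() can be illustrated on a plain Python list (a sketch of the behaviour, not the library's polars-backed implementation):

```python
def lag(values, n=1, default=None):
    # Shift values down by n positions; the first n slots get `default`.
    return [default] * n + list(values)[: len(values) - n]

def lead(values, n=1, default=None):
    # Shift values up by n positions; the last n slots get `default`.
    return list(values)[n:] + [default] * n
```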
- tidypolars_extra.length(x)[source]¶
Number of observations in each group.
Alias for count().
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(length = tp.length(col('x')))
- tidypolars_extra.log(x)[source]¶
Compute the natural logarithm of a column
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.mutate(log = tp.log('x'))
- tidypolars_extra.log10(x)[source]¶
Compute the base 10 logarithm of a column
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.mutate(log10 = tp.log10('x'))
- tidypolars_extra.mad(x)[source]¶
Compute the median absolute deviation
Use in summarize() context only. Not suitable for mutate().
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(mad_val = tp.mad('x'))
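A stdlib sketch of the median absolute deviation (conceptual only; whether tidypolars_extra applies the normal-consistency scaling factor is an assumption to verify against its output):

```python
from statistics import median

def mad(values, constant=1.4826):
    # Median of absolute deviations from the median. The 1.4826 factor is
    # the usual normal-consistency scaling (assumption: the library's
    # scaling should be checked against its actual output).
    center = median(values)
    return constant * median(abs(v - center) for v in values)
```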
- tidypolars_extra.make_date(year=1970, month=1, day=1)[source]¶
Create a date object
- Parameters:
year (Expr, str, int) – Column or literal
month (Expr, str, int) – Column or literal
day (Expr, str, int) – Column or literal
Examples
>>> df.mutate(date = tp.make_date(2000, 1, 1))
- tidypolars_extra.make_datetime(year=1970, month=1, day=1, hour=0, minute=0, second=0)[source]¶
Create a datetime object
- Parameters:
year (Expr, str, int) – Column or literal
month (Expr, str, int) – Column or literal
day (Expr, str, int) – Column or literal
hour (Expr, str, int) – Column or literal
minute (Expr, str, int) – Column or literal
second (Expr, str, int) – Column or literal
Examples
>>> df.mutate(dt = tp.make_datetime(2000, 1, 1))
- tidypolars_extra.map(cols, _fun)[source]¶
Apply function by row
- Parameters:
cols (list of str) – Names of the columns to apply the function to
_fun (function) – The function to apply; it is called on each row separately
- tidypolars_extra.matches(match, ignore_case=False)[source]¶
Matches pattern
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True, ignores case when matching names. Defaults to False.
Examples
>>> df = tp.tibble({'a': range(3), 'add': range(3), 'sub': ['a', 'a', 'b']})
>>> df.select(tp.matches('a'))
- tidypolars_extra.max(x)[source]¶
Get column max
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(max_x = tp.max('x'))
>>> df.summarize(max_x = tp.max(col('x')))
- tidypolars_extra.mday(x)[source]¶
Extract the day of the month from a date (1 to 31).
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(monthday = tp.mday(col('x')))
- tidypolars_extra.mean(x)[source]¶
Get column mean
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(mean_x = tp.mean('x'))
>>> df.summarize(mean_x = tp.mean(col('x')))
- tidypolars_extra.median(x)[source]¶
Get column median
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(median_x = tp.median('x'))
>>> df.summarize(median_x = tp.median(col('x')))
- tidypolars_extra.microseconds(n=1)[source]¶
Create a duration of n microseconds
- Parameters:
n (int) – Number of microseconds
- Returns:
A duration literal.
- Return type:
Expr
- tidypolars_extra.milliseconds(n=1)[source]¶
Create a duration of n milliseconds
- Parameters:
n (int) – Number of milliseconds
- Returns:
A duration literal.
- Return type:
Expr
- tidypolars_extra.min(x)[source]¶
Get column minimum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(min_x = tp.min('x'))
>>> df.summarize(min_x = tp.min(col('x')))
- tidypolars_extra.minute(x)[source]¶
Extract the minute from a datetime
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(minute = tp.minute(col('x')))
- tidypolars_extra.minutes(n=1)[source]¶
Create a duration of n minutes
- Parameters:
n (int) – Number of minutes
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(later = col('datetime') + tp.minutes(30))
- tidypolars_extra.mode(x)[source]¶
Compute the statistical mode (most frequent value)
When there are ties, returns the first mode encountered, so the result is not deterministic across tied values. Use in summarize() context only.
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(m = tp.mode('x'))
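The "first mode wins" tie behaviour can be sketched with the standard library (a conceptual equivalent; Counter preserves first-seen order among equal counts, which mirrors but is not guaranteed to match the library's tie-breaking):

```python
from collections import Counter

def mode(values):
    # Most frequent value; among ties, the first-seen value is returned.
    return Counter(values).most_common(1)[0][0]
```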
- tidypolars_extra.month(x)[source]¶
Extract the month from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(month = tp.month(col('x')))
- tidypolars_extra.n()[source]¶
Number of observations in each group
Examples
>>> df.summarize(count = tp.n())
- tidypolars_extra.n_distinct(x)[source]¶
Get number of distinct values in a column
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(n_x = tp.n_distinct('x'))
>>> df.summarize(n_x = tp.n_distinct(col('x')))
- tidypolars_extra.n_missing(x)[source]¶
Count the number of null/missing values in a column
- Parameters:
x (Expr, str) – Column to operate on
- Returns:
Count of null values.
- Return type:
Expr
Examples
>>> df.summarize(missing = tp.n_missing('x'))
- tidypolars_extra.now()[source]¶
Return the current datetime as a polars literal
- Returns:
A literal expression with the current datetime.
- Return type:
Expr
Examples
>>> df.mutate(now = tp.now())
- tidypolars_extra.ntile(x, n)[source]¶
Divide values into n roughly equal groups
- Parameters:
x (Expr, Series) – Column to operate on
n (int) – Number of groups
Examples
>>> df.mutate(quartile = tp.ntile('x', 4))
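How "roughly equal" groups are formed can be sketched in plain Python. This follows dplyr's convention of putting the extra elements in the lower-numbered buckets; it is a conceptual model, not the library's implementation:

```python
def ntile(values, n):
    # Bucket values into n roughly equal groups by rank; when sizes cannot
    # be equal, lower-numbered buckets get the extra elements (dplyr-style).
    order = sorted(range(len(values)), key=lambda i: values[i])
    tiles = [0] * len(values)
    for rank, i in enumerate(order):
        tiles[i] = rank * n // len(values) + 1
    return tiles
```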
- tidypolars_extra.paste(*args, sep=' ')[source]¶
Concatenate strings together
- Parameters:
args (Expr, str) – Columns and/or strings to concatenate
Examples
>>> df = tp.tibble(x = ['a', 'b', 'c'])
>>> df.mutate(x_end = tp.paste(col('x'), 'end', sep = '_'))
- tidypolars_extra.paste0(*args)[source]¶
Concatenate strings together with no separator
- Parameters:
args (Expr, str) – Columns and/or strings to concatenate
Examples
>>> df = tp.tibble(x = ['a', 'b', 'c'])
>>> df.mutate(xend = tp.paste0(col('x'), 'end'))
- tidypolars_extra.pct_missing(x)[source]¶
Compute the percentage of null/missing values in a column
- Parameters:
x (Expr, str) – Column to operate on
- Returns:
Percentage of null values (0 to 100).
- Return type:
Expr
Examples
>>> df.summarize(pct = tp.pct_missing('x'))
- tidypolars_extra.percent_rank(x)[source]¶
Compute percent rank (values between 0 and 1)
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(prank = tp.percent_rank('x'))
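The percent-rank formula can be illustrated on a plain list (a sketch assuming the dplyr convention of (min_rank - 1) / (n - 1); verify against the library's output):

```python
def percent_rank(values):
    # (min_rank - 1) / (n - 1): smallest value maps to 0, largest to 1.
    svals = sorted(values)
    return [svals.index(v) / (len(values) - 1) for v in values]
```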
- tidypolars_extra.quantile(x, quantile=0.5)[source]¶
Get a column quantile
- Parameters:
x (Expr, Series) – Column to operate on
quantile (float) – Quantile to return
Examples
>>> df.summarize(quantile_x = tp.quantile('x', .25))
- tidypolars_extra.quarter(x)[source]¶
Extract the quarter from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(quarter = tp.quarter(col('x')))
- tidypolars_extra.rank(x, method='dense')[source]¶
Assign ranks to the elements of a column, with tie handling controlled by method. With the default 'dense' method, tied values share the same rank and the next distinct value's rank increases by one, leaving no gaps.
- Parameters:
x (str) – Column to operate on
method (str) – One of: 'dense' (default) assigns consecutive ranks without gaps, even for ties; 'average' assigns the average rank to tied values; 'min' assigns the minimum rank to tied values; 'max' assigns the maximum rank to tied values; 'ordinal' assigns a distinct rank to each value based on its order of appearance.
- Returns:
A list of ranks corresponding to the elements of x.
- Return type:
list of int
Examples
>>> rank([10, 20, 20, 30])
[1, 2, 2, 3]
>>> rank([3, 1, 2])
[3, 1, 2]
>>> rank(["b", "a", "a", "c"])
[2, 1, 1, 3]
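The difference between the 'dense' and 'min' methods can be sketched in plain Python (a conceptual model of two of the methods; the library presumably delegates to polars' rank implementation):

```python
def rank(values, method="dense"):
    # 'dense': consecutive ranks, no gaps after ties.
    # 'min': tied values share the lowest rank; gaps follow ties.
    if method == "dense":
        distinct = sorted(set(values))
        return [distinct.index(v) + 1 for v in values]
    if method == "min":
        svals = sorted(values)
        return [svals.index(v) + 1 for v in values]
    raise ValueError(f"unsupported method: {method}")
```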
- tidypolars_extra.rep(x, times=1)[source]¶
Replicate the values in x
- Parameters:
x (const, Series) – Value or Series to repeat
times (int) – Number of times to repeat
Examples
>>> tp.rep(1, 3)
>>> tp.rep(pl.Series(range(3)), 3)
- tidypolars_extra.replace_null(x, replace=None)[source]¶
Replace null values
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df = tp.tibble(x = [0, None], y = [None, None])
>>> df.mutate(x = tp.replace_null(col('x'), 1))
- tidypolars_extra.round(x, digits=0)[source]¶
Round a column to the specified number of decimal places
- Parameters:
x (Expr, Series) – Column to operate on
digits (int) – Decimals to round to
Examples
>>> df.mutate(x = tp.round(col('x')))
- tidypolars_extra.row_number()[source]¶
Return row number
Examples
>>> df.mutate(row_num = tp.row_number())
- tidypolars_extra.scale(x)[source]¶
Standardize the input by scaling it to a mean of 0 and a standard deviation of 1.
- Parameters:
x (Expr) – Column to operate on
- Returns:
The standardized version of the input data.
- Return type:
array-like
- tidypolars_extra.sd(x)[source]¶
Get column standard deviation
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(sd_x = tp.sd('x'))
>>> df.summarize(sd_x = tp.sd(col('x')))
- tidypolars_extra.second(x)[source]¶
Extract the second from a datetime
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(second = tp.second(col('x')))
- tidypolars_extra.seconds(n=1)[source]¶
Create a duration of n seconds
- Parameters:
n (int) – Number of seconds
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(later = col('datetime') + tp.seconds(10))
- tidypolars_extra.sqrt(x)[source]¶
Get column square root
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(sqrt_x = tp.sqrt('x'))
- tidypolars_extra.starts_with(match, ignore_case=True)[source]¶
Starts with a prefix
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True, the default, ignores case when matching names.
Examples
>>> df = tp.tibble({'a': range(3), 'add': range(3), 'sub': ['a', 'a', 'b']})
>>> df.select(starts_with('a'))
- tidypolars_extra.str_c(*args, sep='')[source]¶
Concatenate strings together.
Alias for paste().
- Parameters:
args (Expr, str) – Columns and/or strings to concatenate
Examples
>>> df = tp.tibble(x = ['a', 'b', 'c'])
>>> df.mutate(x_end = str_c(col('x'), 'end', sep = '_'))
- tidypolars_extra.str_count(string, pattern)[source]¶
Count occurrences of a pattern in a string
- Parameters:
string (Expr, str) – Column to operate on
pattern (str) – Regular expression pattern to count
Examples
>>> df.mutate(n = tp.str_count('x', 'a'))
- tidypolars_extra.str_detect(string, pattern, negate=False)[source]¶
Detect the presence or absence of a pattern in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
negate (bool) – If True, return non-matching elements
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_detect('name', 'a'))
>>> df.mutate(x = str_detect('name', ['a', 'e']))
- tidypolars_extra.str_dup(string, times)[source]¶
Duplicate/repeat a string
- Parameters:
string (Expr, str) – Column to operate on
times (int) – Number of times to repeat
Examples
>>> df.mutate(repeated = tp.str_dup('x', 3))
- tidypolars_extra.str_ends(string, pattern, negate=False)[source]¶
Detect the presence or absence of a pattern at the end of a string.
- Parameters:
string (Expr) – Column to operate on
pattern (str) – Pattern to look for
negate (bool) – If True, return non-matching elements
Examples
>>> df = tp.tibble(words = ['apple', 'bear', 'amazing'])
>>> df.filter(tp.str_ends(col('words'), 'ing'))
- tidypolars_extra.str_extract(string, pattern)[source]¶
Extract the target capture group from provided patterns
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_extract(col('name'), 'e'))
- tidypolars_extra.str_extract_all(string, pattern)[source]¶
Extract all matches of a pattern
- Parameters:
string (Expr, str) – Column to operate on
pattern (str) – Regular expression pattern with capture group
- Returns:
A list column with all matches.
- Return type:
Expr
Examples
>>> df.mutate(matches = tp.str_extract_all('x', r'\d+'))
- tidypolars_extra.str_length(string)[source]¶
Length of a string
- Parameters:
string (str) – Input series to operate on
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_length(col('name')))
- tidypolars_extra.str_pad(string, width, side='left', pad=' ')[source]¶
Pad a string to a specified width
- Parameters:
string (Expr, str) – Column to operate on
width (int) – Minimum width of resulting string
side (str) – Side to pad on: ‘left’, ‘right’, or ‘both’
pad (str) – Character to pad with (single character)
Examples
>>> df.mutate(padded = tp.str_pad('x', 10))
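The padding behaviour maps directly onto Python's built-in string methods; a stdlib sketch of stringr-style padding (conceptual only, not the library's implementation):

```python
def str_pad(s, width, side="left", pad=" "):
    # stringr-style padding via built-in string methods.
    if side == "left":
        return s.rjust(width, pad)
    if side == "right":
        return s.ljust(width, pad)
    return s.center(width, pad)  # side == "both"
```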
- tidypolars_extra.str_remove(string, pattern)[source]¶
Remove the first matched pattern in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_remove(col('name'), 'a'))
- tidypolars_extra.str_remove_all(string, pattern)[source]¶
Removes all matched patterns in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_remove_all(col('name'), 'a'))
- tidypolars_extra.str_replace(string, pattern, replacement)[source]¶
Replace the first matched pattern in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
replacement (str) – String that replaces anything that matches the pattern
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_replace(col('name'), 'a', 'A'))
- tidypolars_extra.str_replace_all(string, pattern, replacement)[source]¶
Replaces all matched patterns in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
replacement (str) – String that replaces anything that matches the pattern
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_replace_all(col('name'), 'a', 'A'))
- tidypolars_extra.str_split(string, pattern)[source]¶
Split a string by a pattern
- Parameters:
string (Expr, str) – Column to operate on
pattern (str) – Pattern to split on
- Returns:
A list column with split parts.
- Return type:
Expr
Examples
>>> df.mutate(parts = tp.str_split('x', '_'))
- tidypolars_extra.str_squish(string)[source]¶
Remove leading/trailing whitespace and collapse internal whitespace
- Parameters:
string (Expr, str) – Column to operate on
Examples
>>> df.mutate(clean = tp.str_squish('x'))
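The squish operation is a standard regex idiom; a stdlib sketch of the behaviour described above (conceptual equivalent, not the library's implementation):

```python
import re

def str_squish(s):
    # Collapse internal runs of whitespace to one space and trim the ends.
    return re.sub(r"\s+", " ", s).strip()
```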
- tidypolars_extra.str_starts(string, pattern, negate=False)[source]¶
Detect the presence or absence of a pattern at the beginning of a string.
- Parameters:
string (Expr) – Column to operate on
pattern (str) – Pattern to look for
negate (bool) – If True, return non-matching elements
Examples
>>> df = tp.tibble(words = ['apple', 'bear', 'amazing'])
>>> df.filter(tp.str_starts(col('words'), 'a'))
- tidypolars_extra.str_sub(string, start=0, end=None)[source]¶
Extract portion of string based on start and end inputs
- Parameters:
string (str) – Input series to operate on
start (int) – First position of the character to return
end (int) – Last position of the character to return
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_sub(col('name'), 0, 3))
- tidypolars_extra.str_to_lower(string)[source]¶
Convert a string to lower case
- Parameters:
string (str) – Convert case of this string
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_to_lower(col('name')))
- tidypolars_extra.str_to_title(string)[source]¶
Convert string to Title Case
- Parameters:
string (Expr, str) – Column to operate on
Examples
>>> df.mutate(titled = tp.str_to_title('x'))
- tidypolars_extra.str_to_upper(string)[source]¶
Convert a string to upper case
- Parameters:
string (str) – Convert case of this string
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_to_upper(col('name')))
- tidypolars_extra.str_trim(string, side='both')[source]¶
Trim whitespace
- Parameters:
string (Expr, Series) – Column or series to operate on
side (str) – One of: "both" (default), "left", "right"
Examples
>>> df = tp.tibble(x = [' a ', ' b ', ' c '])
>>> df.mutate(x = tp.str_trim(col('x')))
- tidypolars_extra.str_wrap(string, width, sep='list')[source]¶
Split string
- Parameters:
string (str) – Column name to operate on
width (int) – Width to split the string
sep (str) – One of: "n" (join the wrapped pieces with "\n" and return a single string) or "list" (the default; return a list of pieces split at width)
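The wrapping behaviour can be sketched with the standard library's textwrap module (a conceptual equivalent; the library's exact word-breaking rules may differ):

```python
import textwrap

def str_wrap(s, width, sep="list"):
    # sep="list" returns the wrapped pieces; sep="n" joins them with newlines.
    parts = textwrap.wrap(s, width)
    return parts if sep == "list" else "\n".join(parts)
```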
- tidypolars_extra.sum(x)[source]¶
Get column sum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(sum_x = tp.sum('x'))
>>> df.summarize(sum_x = tp.sum(col('x')))
- tidypolars_extra.today()[source]¶
Return the current date as a polars literal
- Returns:
A literal expression with today’s date.
- Return type:
Expr
Examples
>>> df.mutate(today = tp.today())
- tidypolars_extra.var(x)[source]¶
Get column variance
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.summarize(var_x = tp.var('x'))
>>> df.summarize(var_x = tp.var(col('x')))
- tidypolars_extra.wday(x)[source]¶
Extract the weekday from a date (Sunday = 1 to Saturday = 7).
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(weekday = tp.wday(col('x')))
- tidypolars_extra.week(x)[source]¶
Extract the week from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(week = tp.week(col('x')))
- tidypolars_extra.weeks(n=1)[source]¶
Create a duration of n weeks
- Parameters:
n (int) – Number of weeks
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(next_week = col('date') + tp.weeks(1))
- tidypolars_extra.weighted_mean(x, w)[source]¶
Compute weighted mean
- Parameters:
x (Expr, Series) – Column of values
w (Expr, Series) – Column of weights
Examples
>>> df.summarize(wm = tp.weighted_mean('x', 'w'))
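The weighted mean is the usual sum(w_i * x_i) / sum(w_i); a plain-Python sketch of the formula (not the library's polars-backed implementation):

```python
def weighted_mean(x, w):
    # sum(w_i * x_i) / sum(w_i)
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)
```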
- tidypolars_extra.where(col_type)[source]¶
Select columns by type using a string
- Options:
character : factors (ordered or unordered) and strings
string : only strings, excluding factors
factor : ordered or unordered factors
ordered : only ordered factors
unordered : only unordered factors
numeric : float or integer
float : only floats
integer : only integers
date : dates
datetime : dates and times
Examples
>>> from tidypolars_extra.data import mtcars
>>> df = mtcars
>>> df.select(tp.where("integer"))
>>> df.select(tp.where("numeric"))
>>> df.select(tp.where("string") | tp.where("integer"))
- tidypolars_extra.yday(x)[source]¶
Extract the day of the year from a date (1 to 366).
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(yearday = tp.yday(col('x')))
- tidypolars_extra.year(x)[source]¶
Extract the year from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(year = tp.year(col('x')))