tidypolars_extra¶
Submodules¶
Attributes¶
Classes¶
- DescCol – Expressions that can be used in various contexts.
- TibbleGroupBy – Starts a new GroupBy operation.
- read_data – Read data into a tibble.
- tibble – A data frame object that provides methods familiar to R tidyverse users.
Functions¶
- Absolute value
- Apply a function across a selection of columns
- Convert column to boolean. Alias to as_logical (R naming).
- Convert to factor. Alias for as_factor
- Convert to string. Defaults to Utf8.
- Convert a string to a Date
- Convert a string to a Datetime
- Convert to factor (R naming), equivalent to Enum or Categorical
- Convert to float. Defaults to Float64.
- Convert to integer. Defaults to Int64.
- Convert to a boolean (polars) or 'logical' (R naming)
- Convert column to string. Alias to as_character (R naming).
- Test if values of a column are between two values
- Case when
- General type conversion.
- Round date up to the nearest unit
- Coalesce missing values
- Contains a literal string
- Find the correlation of two columns
- Number of observations in each group
- Find the covariance of two columns
- Compute cumulative distribution (proportion of values <= current value)
- Cumulative maximum
- Cumulative minimum
- Cumulative product
- Cumulative sum
- Create a duration of n days
- Mark a column to order in descending order
- Compute time differences in specified units
- Round the datetime
- Ends with a suffix
- Selects all columns
- Collapse multiple factor levels into one
- Reorder factor levels by frequency (most common first)
- Collapse least frequent factor levels into 'Other'
- Manually recode factor levels
- Reverse factor level order
- Get first value
- Round numbers down to the lower integer
- Round date down to the nearest unit
- Convert from pandas DataFrame to tibble
- Convert from polars DataFrame to tibble
- Extract the hour from a datetime
- Create a duration of n hours
- If Else
- Compute the interquartile range (Q3 - Q1)
- Test if values are finite
- Test if values are in a list
- Test if values are infinite
- Negate a boolean expression
- Test if values are not in a list
- Test if values are not null
- Test if values are null
- Get lagging values
- Get last value
- Get leading values
- Number of observations in each group.
- Compute the natural logarithm of a column
- Compute the base 10 logarithm of a column
- Compute the median absolute deviation
- Create a date object
- Create a datetime object
- Apply function by row
- Matches pattern
- Get column max
- Extract the day of the month from a date (1 to 31).
- Get column mean
- Get column median
- Create a duration of n microseconds
- Create a duration of n milliseconds
- Get column minimum
- Extract the minute from a datetime
- Create a duration of n minutes
- Compute the statistical mode (most frequent value)
- Extract the month from a date
- Number of observations in each group
- Get number of distinct values in a column
- Count the number of null/missing values in a column
- Return the current datetime as a polars literal
- Divide values into n roughly equal groups
- Concatenate strings together
- Concatenate strings together with no separator
- Compute the percentage of null/missing values in a column
- Compute percent rank (values between 0 and 1)
- Get number of distinct values in a column
- Extract the quarter from a date
- Assigns a minimum rank to each element in the input list, handling ties by assigning the same (minimum) rank to all tied values
- Replicate the values in x
- Replace null values
- Round a column to the specified number of decimal places
- Return row number
- Standardize the input by scaling it to a mean of 0 and a standard deviation of 1.
- Get column standard deviation
- Extract the second from a datetime
- Create a duration of n seconds
- Get column square root
- Starts with a prefix
- Concatenate strings together.
- Count occurrences of a pattern in a string
- Detect the presence or absence of a pattern in a string
- Duplicate/repeat a string
- Detect the presence or absence of a pattern at the end of a string.
- Extract the target capture group from provided patterns
- Extract all matches of a pattern
- Length of a string
- Pad a string to a specified width
- Removes the first matched patterns in a string
- Removes all matched patterns in a string
- Replaces the first matched patterns in a string
- Replaces all matched patterns in a string
- Split a string by a pattern
- Remove leading/trailing whitespace and collapse internal whitespace
- Detect the presence or absence of a pattern at the beginning of a string.
- Extract portion of string based on start and end inputs
- Convert case of a string
- Convert string to Title Case
- Convert case of a string
- Trim whitespace
- Split string
- Get column sum
- Return the current date as a polars literal
- Get column variance
- Extract the weekday from a date (Sunday = 1 to Saturday = 7).
- Extract the week from a date
- Create a duration of n weeks
- Compute weighted mean
- Select columns by type using a string
- Extract the day of the year from a date (1 to 366).
- Extract the year from a date
- Standardize to z-scores (alias for scale)
Package Contents¶
- class tidypolars_extra.DescCol[source]¶
Bases:
polars.Expr
Expressions that can be used in various contexts.
- class tidypolars_extra.TibbleGroupBy(df, by, *args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.dataframe.group_by.GroupBy
Starts a new GroupBy operation.
Utility class for performing a group by operation over the given DataFrame.
Generated by calling df.group_by(…).
- Parameters:
df – DataFrame to perform the group by operation over.
*by – Column or columns to group by. Accepts expression input. Strings are parsed as column names.
maintain_order – Ensure that the order of the groups is consistent with the input data. This is slower than a default group by.
predicates – Predicate expressions to filter groups after aggregation.
**named_by – Additional column(s) to group by, specified as keyword arguments. The columns will be named as the keyword used.
- by¶
- df¶
- class tidypolars_extra.read_data[source]¶
Read data into a tibble.
Formats supported: csv, dta, xls, xlsx, ods, tsv, txt, tex, dat, sav, rds, Rdata, gspread
- Parameters:
fn (str) – Full path to file, including filename. The type of file is inferred from the file extension. Hierarchical headers are accepted (see Notes). To see accepted formats, run: read_data.get_accepted_file_formats(True). To read from a google spreadsheet directly, use “credentials” and “url” instead of “fn”. To read from a URL with a file other than a google spreadsheet, use “fn”.
credentials (str) – Path to the .json file with Google API credentials to access the spreadsheet (see Notes).
url (str) – Google spreadsheet URL
sheet_name (str | int) – Name or index of the sheet to load.
cols (list of str) – List with names of the columns to return. Used with .sav files.
sep (str (Default ";")) – Specify the column separator for .csv files
big_data (bool) – If True, uses dask to load the data. Default: False
silently (bool (optional)) – If True, do not show a completion message
n_headers (int) – Used for data with hierarchical header. Number of header rows at the top of the sheet that are headers of the columns. See Notes. Defaults to 0.
header_combine_rule (callable(levels) -> str) – Used for data with hierarchical header. How to combine the list of non-empty levels into a final column name. Default (None) uses “level 1 (<level 2>, <level 3>… <level n>)” If combine=’_’, it uses ‘_’.join(levels).
combine_parenthesis_sep (str) – Used for data with hierarchical header. Used by default combine to separate levels grouped within parenthesis in the column name. Default uses ‘,’: “level 1 (<level 2>, <level 3>… <level n>)”
multi_col_sentinel (Any) – Used for data with hierarchical header. Value used in upper levels to indicate “continuation” of a merged header from the previous column (default: the string “None”).
Notes
Other keyword arguments are accepted based on the underlying method that reads the file, which can be found in their respective documentation provided by the original module.
Extension => underlying method:
.csv => polars.read_csv (uses sep=’,’ as default)
.tsv => polars.read_csv (uses sep='\t' as default)
.dat => polars.read_csv (uses sep=’ ‘ as default)
.txt => polars.read_csv (lines into list)
.xls => pandas.read_excel
.xlsx => pandas.read_excel
.xlt => pandas.read_excel
.xltx => pandas.read_excel
.ods => pandas.read_excel
.dta => pandas.read_stata
.sav => pyreadstat.read_sav
.rds => pyreadr.read_r
.rda => pyreadr.read_r
.Rdata => pyreadr.read_r
Big data is handled with Dask
Hierarchical header:
Some data contains a hierarchical header, i.e., a multi-line header. Here is an example with 2 levels:
|----------------------------------------|
| Party          | Age           | Gender |
|---------------|---------------|--------|
| Code | Value  | value | group |        |
|------|--------|-------|-------|--------|
| 1    | Dem    | 23    | 20-29 | M      |
| 0    | Rep    | 33    | 30-39 | F      |
|----------------------------------------|
When that is the case, the argument n_headers can be used to specify the number of header levels, i.e., lines containing header information. The function flattens the levels and combines the information into the header name to maintain a tidy format. The rule is:
- In upper levels (all rows except the last), values equal to multi_col_sentinel, None, or empty string are treated as “merged” and forward-filled horizontally.
- In the last level, None or multi_col_sentinel is treated as a “missing label” and is simply ignored for that level.
The example above becomes:
|--------------------------------------------------------------------|
| Party (code)  | Party (value) | Age (value) | Age (group) | Gender |
|---------------|---------------|-------------|-------------|--------|
| 1             | Dem           | 23          | 20-29       | M      |
| 0             | Rep           | 33          | 30-39       | F      |
|--------------------------------------------------------------------|
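The flattening rule can be sketched in plain Python (an illustrative simplification, not the library's implementation; the function name flatten_header is hypothetical):

```python
def flatten_header(levels, sentinel="None"):
    """Flatten a hierarchical header into single column names.

    levels: list of header rows, top to bottom, one list of cells per row.
    Upper-level cells equal to the sentinel, None, or '' are forward-filled;
    missing labels in lower levels are dropped from the combined name.
    """
    n_cols = len(levels[-1])
    # Forward-fill "merged" cells in all upper levels.
    filled = []
    for row in levels[:-1]:
        out, last = [], ""
        for cell in row:
            if cell in (sentinel, None, ""):
                out.append(last)
            else:
                last = cell
                out.append(cell)
        filled.append(out)
    # Combine levels: "level 1 (<level 2>, <level 3> ... <level n>)".
    names = []
    for i in range(n_cols):
        top = filled[0][i] if filled else ""
        rest = [row[i] for row in filled[1:]] + [levels[-1][i]]
        rest = [r for r in rest if r not in (sentinel, None, "")]
        names.append(f"{top} ({', '.join(rest)})" if rest else top)
    return names

# The two-level example from the docs above:
header = [
    ["Party", None, "Age", None, "Gender"],
    ["Code", "Value", "value", "group", None],
]
flat = flatten_header(header)
```

With the example header, flat becomes ['Party (Code)', 'Party (Value)', 'Age (value)', 'Age (group)', 'Gender'] — one tidy name per column.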
See header_combine_rule and combine_parenthesis_sep for more settings.
Load data from a google spreadsheet:
It requires Google credentials. The settings follow Google requirements and gspread steps. Steps available here: - https://docs.gspread.org/en/latest/oauth2.html#for-end-users-using-oauth-client-id
- Returns:
tibble when the file has no variable or value labels,
(tibble, DATA_LABELS) when it does
- class tidypolars_extra.tibble(*args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.DataFrame
A data frame object that provides methods familiar to R tidyverse users.
- anti_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform an anti join (keep rows without a match in df)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that do not have a match in df.
- Return type:
Examples
>>> df1.anti_join(df2, on = 'x')
- arrange(*args)[source]¶
Arrange/sort rows
- Parameters:
*args (str) – Columns to sort by
Examples
>>> df = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> # Arrange in ascending order
>>> df.arrange('x', 'y')
>>> # Arrange some columns descending
>>> df.arrange(tp.desc('x'), 'y')
- Returns:
Original tibble ordered by
args.
- Return type:
tibble
- assert_no_nulls(*cols)[source]¶
Assert that specified columns contain no null values
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If any null values are found.
Examples
>>> df.assert_no_nulls('x', 'y')
- assert_unique(*cols)[source]¶
Assert that specified columns have unique combinations
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If duplicate combinations are found.
Examples
>>> df.assert_unique('id')
- bind_cols(*args)[source]¶
Bind data frames by columns
- Parameters:
*args (tibble) – Data frame to bind
- Returns:
The original tibble with added columns from the other tibble specified in
args.
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'a': ['c', 'c', 'c'], 'b': range(4, 7)})
>>> df1.bind_cols(df2)
- bind_rows(*args)[source]¶
Bind data frames by row
- Parameters:
*args (tibble, list) – Data frames to bind by row
- Returns:
The original tibble with added rows from the other tibble specified in
args.
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'x': ['c', 'c', 'c'], 'y': range(4, 7)})
>>> df1.bind_rows(df2)
- clean_names(case='snake')[source]¶
Standardize column names
- Parameters:
case (str) – Case style for column names. Options: ‘snake’ (default), ‘lower’, ‘upper’.
- Returns:
A tibble with cleaned column names.
- Return type:
Examples
>>> df = tp.tibble(**{"First Name": [1], "Last.Name": [2], "AGE (years)": [30]})
>>> df.clean_names()
- colnames(regex='.', type=None, include_factor=True)[source]¶
Return the names of the columns in self that match ‘regex’.
- Parameters:
regex (str) – Regular expression that column names must match. Defaults to ‘.’ (all columns).
type (str) – If provided, restrict the result to columns of that type (e.g., numeric or string).
include_factor (bool) – When type='string', whether to include factor columns.
- complete(*cols, fill=None)[source]¶
Complete a DataFrame with all combinations of specified columns
- Parameters:
*cols (str) – Column names to find all combinations of.
fill (dict, optional) – Dictionary of column names to fill values for missing combinations.
- Returns:
A tibble with all combinations of the specified columns, with missing values filled according to fill parameter.
- Return type:
Examples
>>> df = tp.tibble(x = [1, 1, 2], y = ['a', 'b', 'a'], val = [10, 20, 30])
>>> df.complete('x', 'y')
- count(*args, sort=False, name='n')[source]¶
Returns row counts of the dataset. If bare column names are provided, count() returns counts by group.
- Parameters:
*args (str, Expr) – Columns to group by
sort (bool) – Should columns be ordered in descending order by count
name (str) – The name of the new column in the output. If omitted, it will default to “n”.
- Returns:
If no argument is provided, returns the number of rows. If column names are provided, counts the unique value combinations across those columns.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 1, 2, 3],
...:               'b': ['a', 'a', 'b', 'b']})
>>> df.count()
shape: (1, 1)
┌─────┐
│ n   │
│ u32 │
╞═════╡
│ 4   │
└─────┘
>>> df.count('a', 'b')
shape: (3, 3)
┌─────────────────┐
│ a     b     n   │
│ i64   str   u32 │
╞═════════════════╡
│ 1     a     2   │
│ 2     b     1   │
│ 3     b     1   │
└─────────────────┘
- cross_join(df, suffix='_right')[source]¶
Perform a cross join (Cartesian product)
- Parameters:
df (tibble) – DataFrame to join with.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
All combinations of rows from both tibbles.
- Return type:
Examples
>>> df1.cross_join(df2)
- crossing(*args, **kwargs)[source]¶
Expands the existing tibble for each value of the variables used in the crossing() argument. See Returns.
- Parameters:
*args (list) – One unnamed list is accepted.
**kwargs (list) – Keyword will be the variable name, and the values in the list will be in the expanded tibble
- Returns:
A tibble with variables containing all combinations of the values in the arguments passed to crossing(). The original tibble will be replicated for each unique combination.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 2], "b": [3, 5]})
>>> df
shape: (2, 2)
┌───────────┐
│ a     b   │
│ i64   i64 │
╞═══════════╡
│ 1     3   │
│ 2     5   │
└───────────┘
>>> df.crossing(c = ['a', 'b', 'c'])
shape: (6, 3)
┌─────────────────┐
│ a     b     c   │
│ i64   i64   str │
╞═════════════════╡
│ 1     3     a   │
│ 1     3     b   │
│ 1     3     c   │
│ 2     5     a   │
│ 2     5     b   │
│ 2     5     c   │
└─────────────────┘
- describe()[source]¶
Generate summary statistics for all columns
- Returns:
A tibble with summary statistics including column name, type, count of non-null values, null count, unique count, and for numeric columns: mean, std, min, 25%, 50%, 75%, max.
- Return type:
Examples
>>> df.describe()
- descriptive_statistics(vars=None, groups=None, include_categorical=True, include_type=False)[source]¶
Compute descriptive statistics for numerical variables and optionally frequency statistics for categorical variables, with support for grouping.
- Parameters:
vars (str, list, dict, or None, default None) – The variables for which to compute statistics. - If None, all variables in the dataset (as given by self.names) are used. - If a string, it is interpreted as a single variable name. - If a list, each element is treated as a variable name. - If a dict, keys are variable names and values are their labels.
groups (str, list, dict, or None, default None) – Variable(s) to group by when computing statistics. - If None, overall statistics are computed. - If a string, it is interpreted as a single grouping variable. - If a list, each element is treated as a grouping variable. - If a dict, keys are grouping variable names and values are their labels.
include_categorical (bool, default True) – Whether to include frequency statistics for categorical variables in the output.
include_type (bool, default False) – If True, adds a column indicating the variable type (“Num” for numerical, “Cat” for categorical).
- Returns:
A tibble containing the descriptive statistics. For numerical variables, the statistics include N (count of non-missing values), Missing (percentage of missing values), Mean (average value), Std.Dev. (standard deviation), Min (minimum value), and Max (maximum value). If grouping is specified, these statistics are computed for each group. When
include_categorical is True, frequency statistics for categorical variables are appended to the result.
- Return type:
tibble
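For one numeric column, the numerical statistics listed above reduce to a few lines of plain Python (a sketch using the stdlib statistics module; describe_numeric is a hypothetical helper, not part of the package):

```python
import statistics

def describe_numeric(values):
    """N, Missing (%), Mean, Std.Dev., Min, Max for one column.

    Missing values are represented as None, mirroring nulls in a tibble.
    """
    present = [v for v in values if v is not None]
    return {
        "N": len(present),
        "Missing": 100 * (len(values) - len(present)) / len(values),
        "Mean": statistics.mean(present),
        "Std.Dev.": statistics.stdev(present),
        "Min": min(present),
        "Max": max(present),
    }

stats = describe_numeric([1.0, 2.0, None, 3.0])
```

Here stats reports N=3, Missing=25.0, Mean=2.0, Std.Dev.=1.0, Min=1.0, Max=3.0; the method computes the same quantities per variable (and per group when groups is given).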
- distinct(*args, keep_all=False)[source]¶
Select distinct/unique rows
- Parameters:
*args (str, Expr) – Columns to find distinct/unique rows
keep_all (bool) – If True, keep all columns. Otherwise, return only the ones used to select the distinct rows.
- Returns:
Tibble after removing the repeated rows based on
args.
- Return type:
tibble
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.distinct()
>>> df.distinct('b')
- drop(*args)[source]¶
Drop unwanted columns
- Parameters:
*args (str) – Columns to drop
- Returns:
Tibble with columns in
args dropped.
- Return type:
tibble
Examples
>>> df.drop('x', 'y')
- drop_na(*args)[source]¶
Drop rows containing missing values. Alias for
drop_null(), matching tidyr’s drop_na() spelling.
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows containing nulls in
args removed.
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c']) >>> df.drop_na() >>> df.drop_na('x')
- drop_null(*args)[source]¶
Drop rows containing missing values
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows in
args with missing values dropped.
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c'], z = range(3))
>>> df.drop_null()
>>> df.drop_null('x', 'y')
- fill(*args, direction='down', by=None)[source]¶
Fill in missing values with previous or next value
- Parameters:
*args (str) – Columns to fill
direction (str) – Direction to fill. One of [‘down’, ‘up’, ‘downup’, ‘updown’]
by (str, list) – Columns to group by
- Returns:
Tibble with missing values filled
- Return type:
Examples
>>> df = tp.tibble({'a': [1, None, 3, 4, 5],
...                 'b': [None, 2, None, None, 5],
...                 'groups': ['a', 'a', 'a', 'b', 'b']})
>>> df.fill('a', 'b')
>>> df.fill('a', 'b', by = 'groups')
>>> df.fill('a', 'b', direction = 'downup')
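The direction options can be illustrated on a plain list (fill_series is a hypothetical helper sketching the semantics, not the library's grouped, expression-based implementation):

```python
def fill_series(values, direction="down"):
    """Fill None with the previous ('down') or next ('up') value.

    'downup' fills down first, then up; 'updown' is the reverse.
    """
    def down(vals):
        out, last = [], None
        for v in vals:
            last = v if v is not None else last
            out.append(last)
        return out

    def up(vals):
        # Filling "up" is filling "down" on the reversed list.
        return down(vals[::-1])[::-1]

    if direction == "down":
        return down(values)
    if direction == "up":
        return up(values)
    if direction == "downup":
        return up(down(values))
    if direction == "updown":
        return down(up(values))
    raise ValueError(f"unknown direction: {direction!r}")

filled = fill_series([None, 2, None, None, 5], direction="downup")
```

On [None, 2, None, None, 5], 'down' leaves the leading None in place, while 'downup' also back-fills it, giving [2, 2, 2, 2, 5].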
- filter(*args, by=None)[source]¶
Filter rows on one or more conditions
- Parameters:
*args (Expr) – Conditions to filter by
by (str, list) – Columns to group by
- Returns:
A tibble with rows that match condition.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.filter(col('a') < 2, col('b') == 'a')
>>> df.filter((col('a') < 2) & (col('b') == 'a'))
>>> df.filter(col('a') <= tp.mean(col('a')), by = 'b')
- freq(vars=None, groups=None, na_rm=False, na_label=None)[source]¶
Compute frequency table.
- Parameters:
vars (str, list, or dict) – Variables to return value frequencies for. If a dict is provided, the key should be the variable name and the values the variable label for the output
groups (str, list, dict, or None, optional) – Variable names to condition marginal frequencies on. If a dict is provided, the key should be the variable name and the values the variable label for the output. Defaults to None (no grouping).
na_rm (bool, optional) – Whether to include NAs in the calculation. Defaults to False.
na_label (str) – Label to use for the NA values
- Returns:
A tibble with relative frequencies and counts.
- Return type:
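The core computation behind a frequency table can be sketched with collections.Counter (freq_table is a hypothetical stand-in for the method's per-variable logic):

```python
from collections import Counter

def freq_table(values, na_rm=False, na_label="NA"):
    """Counts and relative frequencies for one variable.

    If na_rm is True, None values are dropped; otherwise they are counted
    under na_label, mirroring the na_rm/na_label parameters above.
    """
    if na_rm:
        values = [v for v in values if v is not None]
    labeled = [na_label if v is None else v for v in values]
    counts = Counter(labeled)
    total = sum(counts.values())
    return {k: {"n": n, "freq": n / total} for k, n in counts.items()}

table = freq_table(['a', 'a', 'b', None])
```

Here 'a' gets n=2 and freq=0.5, and the None is reported under "NA"; with na_rm=True it would be excluded and the frequencies renormalized over the three remaining values.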
- full_join(df, left_on=None, right_on=None, on=None, suffix: str = '_right')[source]¶
Perform a full join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Union between the original and the df tibbles. The rows that don’t match in one of the tibbles will be completed with missing values.
- Return type:
Examples
>>> df1.full_join(df2)
>>> df1.full_join(df2, on = 'x')
>>> df1.full_join(df2, left_on = 'left_x', right_on = 'x')
- get_dupes(*cols)[source]¶
Find duplicate rows
- Parameters:
*cols (str) – Column names to check for duplicates. If empty, checks all columns.
- Returns:
A tibble containing duplicate rows with a ‘dupe_count’ column.
- Return type:
Examples
>>> df.get_dupes('x', 'y')
- glimpse(regex='.')[source]¶
Print compact information about the data
- Parameters:
regex (str, list, dict) – Return information of the variables that match the regular expression, the list, or the dictionary. If dictionary is used, the variable names must be the dictionary keys.
- Return type:
None
- group_by(group, *args, **kwargs)[source]¶
Takes an existing tibble and converts it into a grouped tibble where operations are performed “by group”. ungroup() happens automatically after the operation is performed.
- Parameters:
group (str, list) – Variable names to group by.
- Returns:
A tibble with values grouped by one or more columns.
- Return type:
Grouped tibble
- hoist(col_name, *, remove=False, **fields)[source]¶
Pull named elements out of a list- or struct-column into top-level columns.
- Parameters:
col_name (str) – Name of the list or struct column to reach into.
remove (bool) – If True, drop the original column after hoisting.
**fields (str, int, or list) – Each keyword defines a new top-level column. The value is a path into the list/struct column: a field name, an integer list index, or a list of such steps for nested access.
Examples
>>> df = tp.tibble(meta = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
>>> df.hoist('meta', a = 'a')
- inner_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform an inner join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
A tibble with intersection of cases in the original and df tibbles.
- Return type:
Examples
>>> df1.inner_join(df2)
>>> df1.inner_join(df2, on = 'x')
>>> df1.inner_join(df2, left_on = 'left_x', right_on = 'x')
- left_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a left join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
The original tibble with added columns from tibble df if they match columns in the original one. Columns to match on are given in the function parameters.
- Return type:
Examples
>>> df1.left_join(df2)
>>> df1.left_join(df2, on = 'x')
>>> df1.left_join(df2, left_on = 'left_x', right_on = 'x')
- mutate(*args, by=None, **kwargs)[source]¶
Add or modify columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
Original tibble with new column created.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.mutate(double_a = col('a') * 2,
...           a_plus_b = col('a') + col('b'))
>>> df.mutate(row_num = row_number(), by = 'c')
- nest(by, *args, **kwargs)[source]¶
Creates a nested tibble
- Parameters:
by (list, str) – Columns to nest on
**kwargs –
- data (list of column names) – Columns to include in the nested data. If not provided, include all columns except the ones used in ‘by’.
- key (str) – Name of the resulting nested column.
- names_sep (str) – If not provided (default), the names in the nested data will come from the former names. If a string, the new inner names in the nested dataframe will use the outer names with names_sep automatically stripped. This makes names_sep roughly symmetric between nesting and unnesting.
- Returns:
The resulting tibble will have a column that contains nested tibbles
- Return type:
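A plain-Python analogue of nesting (nest_rows is a hypothetical helper grouping rows of dicts; the actual method returns a tibble whose nested column holds tibbles):

```python
from itertools import groupby

def nest_rows(rows, by):
    """Group a list of row-dicts into {by-value: [inner rows]}.

    The 'by' column is removed from the inner rows, matching the default
    where the nested data holds all columns except the ones nested on.
    """
    keyfn = lambda r: r[by]
    rows = sorted(rows, key=keyfn)  # groupby needs sorted input
    return {
        k: [{c: v for c, v in r.items() if c != by} for r in grp]
        for k, grp in groupby(rows, key=keyfn)
    }

rows = [{'g': 'a', 'x': 1}, {'g': 'b', 'x': 2}, {'g': 'a', 'x': 3}]
nested = nest_rows(rows, by='g')
```

Each key of nested plays the role of a row in the outer tibble, and its list of inner dicts plays the role of the nested tibble in the key column.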
- pack(**groups)[source]¶
Pack several columns into one or more struct columns.
- Parameters:
**groups (list or str) – Each keyword defines a new struct column. The value is a list of existing column names to pack into that struct.
Examples
>>> df = tp.tibble(x = [1, 2], y = [3, 4], z = ['a', 'b'])
>>> df.pack(position = ['x', 'y'])
- pipe(fn, *args, **kwargs)[source]¶
Apply a function to the entire DataFrame
- Parameters:
fn (callable) – Function to apply. The tibble is passed as the first argument.
*args (any) – Additional positional arguments passed to fn.
**kwargs (any) – Additional keyword arguments passed to fn.
- Returns:
Result of
fn(self, *args, **kwargs).
- Return type:
any
Examples
>>> def add_column(df, name, value):
...     return df.mutate(**{name: value})
>>> df.pipe(add_column, 'new_col', 1)
- pivot_longer(cols=None, names_to='name', values_to='value')[source]¶
Pivot data from wide to long
- Parameters:
cols (Expr) – List of the columns to pivot. Defaults to all columns.
names_to (str) – Name of the new “names” column.
values_to (str) – Name of the new “values” column
- Returns:
Original tibble, but in long format.
- Return type:
Examples
>>> df = tp.tibble({'id': ['id1', 'id2'], 'a': [1, 2], 'b': [1, 2]})
>>> df.pivot_longer(cols = ['a', 'b'])
>>> df.pivot_longer(cols = ['a', 'b'], names_to = 'stuff', values_to = 'things')
- pivot_wider(names_from='name', values_from='value', id_cols=None, values_fn='first', values_fill=None)[source]¶
Pivot data from long to wide
- Parameters:
names_from (str) – Column to get the new column names from.
values_from (str) – Column to get the new column values from
id_cols (str, list) – A set of columns that uniquely identifies each observation. Defaults to all columns in the data table except for the columns specified in names_from and values_from.
values_fn (str) – Function for how multiple entries per group should be dealt with. Any of ‘first’, ‘count’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’
values_fill (str) – If values are missing/null, what value should be filled in. Can use: “backward”, “forward”, “mean”, “min”, “max”, “zero”, “one”
- Returns:
Original tibble, but in wide format.
- Return type:
Examples
>>> df = tp.tibble({'id': [1, 1], 'variable': ['a', 'b'], 'value': [1, 2]})
>>> df.pivot_wider(names_from = 'variable', values_from = 'value')
- print(n=1000, ncols=1000, str_length=1000, digits=2)[source]¶
Print the DataFrame
- Parameters:
n (int, default=1000) – Number of rows to print
ncols (int, default=1000) – Number of columns to print
str_length (int, default=1000) – Maximum length of the strings.
digits (int, default=2) – Number of digits to display for floating point numbers.
- Return type:
None
- pull(var=None)[source]¶
Extract a column as a series
- Parameters:
var (str) – Name of the column to extract. Defaults to the last column.
- Returns:
The series will contain the values of the column from var.
- Return type:
Series
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3)})
>>> df.pull('a')
- relevel(x, ref)[source]¶
Change the reference level of a string or factor column and convert it to factor
- Parameters:
x (str) – Variable name
ref (str) – Reference level
- Returns:
The original tibble with the column specified in x as an ordered factor, with the first category specified in ref.
- Return type:
tibble
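The level reordering that relevel performs can be sketched as follows (relevel_levels is a hypothetical helper; the actual method also casts the column to an ordered factor):

```python
def relevel_levels(values, ref):
    """Return the factor levels with `ref` first.

    Remaining levels keep their order of first appearance in the data.
    """
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    if ref not in seen:
        raise ValueError(f"{ref!r} is not a level of the column")
    return [ref] + [lvl for lvl in seen if lvl != ref]

levels = relevel_levels(['low', 'high', 'mid', 'low'], ref='mid')
```

Here levels is ['mid', 'low', 'high']: 'mid' becomes the reference (first) category, which is what downstream tools such as regression encoders key off.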
- relocate(*args, before=None, after=None)[source]¶
Move a column or columns to a new position
- Parameters:
*args (str, Expr) – Columns to move
- Returns:
Original tibble with columns relocated.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.relocate('a', before = 'c')
>>> df.relocate('b', after = 'c')
- rename(*args, regex=False, tolower=False, strict=False, **kwargs)[source]¶
Rename columns
- Parameters:
*args (str or dict) – If a single dict is provided, it is used as {old_name: new_name}. If strings are provided, they are treated as pairs: new_name, old_name, …
regex (bool, default False) – If True, uses regular expression replacement {<matched from>:<matched to>}
tolower (bool, default False) – If True, convert all to lower case
**kwargs (str) – Keyword arguments in the form new_name=’old_name’
- Returns:
Original tibble with columns renamed.
- Return type:
Examples
>>> df = tp.tibble({'x': range(3), 't': range(3), 'z': ['a', 'a', 'b']})
>>> df.rename({'x': 'new_x'})
>>> df.rename(new_x = 'x')
>>> df.rename('new_x', 'x')
- replace(rep, regex=False)[source]¶
Replace values of a column, using polars’ or pandas’ replace depending on the format of rep.
- Parameters:
rep (dict) –
- Format to use polars’ replace:
{<varname>:{<old value>:<new value>, …}}
- Format to use pandas’ replace:
{<old value>:<new value>, …}
regex (bool) – If True, replace using regular expressions. It uses pandas’ replace().
- Returns:
Original tibble with values of columns replaced based on rep.
- Return type:
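The two accepted formats of rep can be sketched in plain Python (replace_values is a hypothetical illustration of the dispatch, not the method itself, which delegates to polars or pandas):

```python
def replace_values(columns, rep):
    """Apply a replacement mapping to a {name: values} dict of columns.

    Per-column form: {<varname>: {<old>: <new>, ...}} (polars-style).
    Flat form:       {<old>: <new>, ...} applied to every column (pandas-style).
    """
    # The per-column form is recognized by all mapping values being dicts.
    per_column = bool(rep) and all(isinstance(v, dict) for v in rep.values())
    out = {}
    for name, vals in columns.items():
        mapping = rep.get(name, {}) if per_column else rep
        out[name] = [mapping.get(v, v) for v in vals]
    return out

cols = {'x': [1, 2, 1], 'y': ['a', 'b', 'a']}
replaced = replace_values(cols, {'x': {1: 10}})   # per-column form
flat = replace_values(cols, {'a': 'z'})           # flat form, all columns
```

In the per-column form only 'x' is touched (1 becomes 10); in the flat form every column is scanned, so the 'a' values in 'y' become 'z'.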
- replace_na(replace=None)[source]¶
Replace null values in specified columns
- Parameters:
replace (dict) – Dictionary mapping column names to replacement values.
- Returns:
A tibble with nulls replaced.
- Return type:
Examples
>>> df.replace_na({'x': 0, 'y': 'missing'})
- replace_null(replace=None)[source]¶
Replace null values
- Parameters:
replace (dict, str, int, or float) – Dictionary of column/replacement pairs, or values to replace null values. If not dict, replace in all columns. If replace is a string, it will replace nulls in all string columns, and so on.
- Returns:
Original tibble with missing/null values replaced.
- Return type:
Examples
>>> df = tp.tibble({'a': [None, 'abc', 'cde'], 'b': [None, 1, 2], 'c': [None, 1.1, 2.2]})
>>> df.replace_null({'a': 'New value'})
>>> df.replace_null({'a': 1})
>>> df.replace_null({'b': 1})
>>> df.replace_null({'b': 1.1})
>>> df.replace_null({'c': 1})
>>> df.replace_null('a')
>>> df.replace_null(1)
>>> df.replace_null(1.1)
- right_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a right join
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Every row of df with matching columns from self. Rows of df without a match in self receive null values in the columns coming from self.
- Return type:
Examples
>>> df1.right_join(df2)
>>> df1.right_join(df2, on = 'x')
>>> df1.right_join(df2, left_on = 'left_x', right_on = 'x')
- sample_frac(fraction, seed=None, with_replacement=False)[source]¶
Randomly sample a fraction of rows
- Parameters:
fraction (float) – Fraction of rows to sample (between 0 and 1).
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with a random fraction of rows.
- Return type:
Examples
>>> df.sample_frac(0.5, seed = 42)
- sample_n(n, seed=None, with_replacement=False)[source]¶
Randomly sample n rows
- Parameters:
n (int) – Number of rows to sample.
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with n randomly sampled rows.
- Return type:
Examples
>>> df.sample_n(5, seed = 42)
- save_data(fn, copies=None, sep=';', kws_latex=None, *args, **kws)[source]¶
Save data based on the filename.
- Parameters:
fn (callable, str) – Path and filename
copies (list of str) – List of file extensions. A copy of the file is saved for each extension, using the same filename and path as fn.
sep (str, optional) – Column separator for exporting to text-like files (.csv, .tsv, .txt, etc.)
kws_latex (dict) – Arguments of to_latex(). See tibble.to_latex()
Notes
Additional positional and keyword arguments are passed to the underlying method used to save the file, which is based on the file extension.
.tex => tidypolars_extra.tibble.to_latex
.csv => polars.write_csv (uses sep=’;’ as default)
.tsv => polars.write_csv (uses sep=’ ‘ as default)
.dat => polars.write_csv (uses sep=’ ‘ as default)
.txt => polars.write_csv (uses sep=’ ‘ as default)
.xls => polars.write_excel
.xlsx => polars.write_excel
.dta => pandas.DataFrame.to_stata
.parquet => polars.write_parquet
Use silently=True to save quietly (Default False).
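The extension dispatch above can be pictured as a simple suffix lookup. A hypothetical sketch (the writer names come from the table above; the dispatch function itself is an assumption, not the library's actual internals):

```python
# Sketch of extension-based writer dispatch, mirroring the table above.
# The writer names are the documented ones; the lookup code is hypothetical.
from pathlib import Path

WRITERS = {
    '.tex': 'tidypolars_extra.tibble.to_latex',
    '.csv': 'polars.write_csv',      # sep=';' by default
    '.tsv': 'polars.write_csv',
    '.dat': 'polars.write_csv',
    '.txt': 'polars.write_csv',
    '.xls': 'polars.write_excel',
    '.xlsx': 'polars.write_excel',
    '.dta': 'pandas.DataFrame.to_stata',
    '.parquet': 'polars.write_parquet',
}

def pick_writer(fn):
    """Return the writer name for a filename based on its extension."""
    ext = Path(fn).suffix.lower()
    try:
        return WRITERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported extension: {ext!r}")

print(pick_writer('results/table1.xlsx'))  # polars.write_excel
```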
- select(*args)[source]¶
Select or drop columns
- Parameters:
*args (str, list, dict, or combinations of them) – Columns to select. It can combine names, lists of names, and a dict. If a dict is given, columns are renamed based on it. It also accepts the helper functions tp.matches(<regex>), tp.contains(<str>), and tp.where(<str>).
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'abcba': ['a', 'a', 'b']})
>>> df.select('a', 'b')
>>> df.select(col('a'), col('b'))
>>> df.select({'a': 'new name'}, tp.matches("c"))
>>> df.select(tp.where('numeric'))
- semi_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform a semi join (keep rows with a match in df, no columns added)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that have a match in df.
- Return type:
Examples
>>> df1.semi_join(df2, on = 'x')
- separate(sep_col, into, sep='_', remove=True)[source]¶
Separate a character column into multiple columns
- Parameters:
sep_col (str) – Column to split into multiple columns
into (list) – List of new column names
sep (str) – Separator to split on. Defaults to '_'
remove (bool) – If True removes the input column from the output data frame
- Returns:
Original tibble with the column split based on sep.
- Return type:
Examples
>>> df = tp.tibble(x = ['a_a', 'b_b', 'c_c'])
>>> df.separate('x', into = ['left', 'right'])
- separate_longer_delim(sep_col, delim)[source]¶
Split a string column by delim into longer rows.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
Examples
>>> df = tp.tibble(x = ['a,b', 'c'])
>>> df.separate_longer_delim('x', ',')
- separate_longer_position(sep_col, width)[source]¶
Split each string into chunks of width characters and convert into longer rows.
- Parameters:
sep_col (str) – Column to split.
width (int) – Width of each chunk in characters.
Examples
>>> df = tp.tibble(x = ['abcd', 'efgh'])
>>> df.separate_longer_position('x', 2)
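The split rule is plain fixed-width slicing; a minimal pure-Python equivalent of the chunking step (a sketch of the semantics, not the library implementation):

```python
# Fixed-width chunking as used by separate_longer_position():
# each string becomes one row per `width`-character slice.
def chunk(s, width):
    return [s[i:i + width] for i in range(0, len(s), width)]

print(chunk('abcd', 2))   # ['ab', 'cd']
print(chunk('abcde', 2))  # ['ab', 'cd', 'e'] -- trailing short chunk kept
```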
- separate_rows(*cols, sep=',')[source]¶
Split the given columns on sep and explode them into longer rows. Superseded by separate_longer_delim() but kept for tidyr parity.
- Parameters:
*cols (str) – Columns to split and explode.
sep (str) – Delimiter to split on (default: ',').
Examples
>>> df = tp.tibble(x = ['a,b', 'c'], y = [1, 2])
>>> df.separate_rows('x', sep = ',')
- separate_wider_delim(sep_col, delim, names, *, remove=True, too_few='error', too_many='error')[source]¶
Split a string column into several columns using a delimiter.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
names (list) – Names of the resulting columns.
remove (bool) – If True (default) drop the original column.
too_few (str) – One of 'error' (default) or 'align_start'. When 'error', raises if a row produces fewer fields than len(names).
too_many (str) – One of 'error' (default) or 'drop'. When 'error', raises if a row produces more fields than len(names).
Examples
>>> df = tp.tibble(x = ['a_1', 'b_2'])
>>> df.separate_wider_delim('x', '_', names = ['letter', 'num'])
- separate_wider_position(sep_col, widths, *, remove=True)[source]¶
Split a string column into several columns by character positions.
- Parameters:
sep_col (str) – Column to split.
widths (dict) – Mapping of new column name → width in characters.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['2024Q1', '2025Q2'])
>>> df.separate_wider_position('x', widths = {'year': 4, 'q': 2})
- separate_wider_regex(sep_col, patterns, *, remove=True)[source]¶
Split a string column using a regular expression with named groups.
- Parameters:
sep_col (str) – Column to split.
patterns (str or dict) – Either a regex string containing named capturing groups, or a dict {name: sub_pattern} which is assembled into a single regex of named groups in the given order.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['id-001', 'id-002'])
>>> df.separate_wider_regex('x', {'prefix': '[a-z]+', '_sep': '-', 'num': '\d+'})
- set_names(nm=None)[source]¶
Change the column names of the data frame
- Parameters:
nm (list) – A list of new names for the data frame
Examples
>>> df = tp.tibble(x = range(3), y = range(3))
>>> df.set_names(['a', 'b'])
- slice(*args, by=None)[source]¶
Grab rows from a data frame
- Parameters:
*args (int, list) – Rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice(0, 1)
>>> df.slice(0, by = 'c')
- slice_head(n=5, *, by=None)[source]¶
Grab top rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_head(2)
>>> df.slice_head(1, by = 'c')
- slice_max(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the largest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (descending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_max('x', n = 1)
>>> df.slice_max('x', n = 1, by = 'g')
- slice_min(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the smallest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (ascending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_min('x', n = 1)
>>> df.slice_min('x', n = 1, by = 'g')
- slice_sample(n=None, *, prop=None, replace=False, seed=None, by=None)[source]¶
Randomly sample rows. Modern replacement for sample_n() and sample_frac().
- Parameters:
n (int, optional) – Number of rows to sample. Provide exactly one of n or prop.
prop (float, optional) – Fraction of rows to sample (between 0 and 1).
replace (bool) – Whether to sample with replacement.
seed (int, optional) – Random seed for reproducibility.
by (str, list, optional) – Columns to group by; sampling happens within each group.
Examples
>>> df.slice_sample(n = 3, seed = 42)
>>> df.slice_sample(prop = 0.5, by = 'g', seed = 42)
- slice_tail(n=5, *, by=None)[source]¶
Grab bottom rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_tail(2)
>>> df.slice_tail(1, by = 'c')
- summarize(*args, by=None, **kwargs)[source]¶
Aggregate data with summary statistics
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with the summaries
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.summarize(avg_a = tp.mean(col('a')))
>>> df.summarize(avg_a = tp.mean(col('a')), by = 'c')
>>> df.summarize(avg_a = tp.mean(col('a')), max_b = tp.max(col('b')))
- tab(row, col, groups=None, margins=True, normalize='all', margins_name='Total', stat='both', na_rm=True, na_label='NA', digits=2)[source]¶
Create a 2x2 contingency table for two categorical variables, with optional grouping, margins, and normalization.
- Parameters:
row (str) – Name of the variable to be used for the rows of the table.
col (str) – Name of the variable to be used for the columns of the table.
groups (str or list of str, optional) – Variable name(s) to use as grouping variables. When provided, a separate 2x2 table is generated for each group.
margins (bool, default True) – If True, include row and column totals (margins) in the table.
normalize ({'all', 'row', 'columns'}, default 'all') –
- Specifies how to compute the marginal percentages in each cell:
'all': percentages computed over the entire table.
'row': percentages computed across each row.
'columns': percentages computed down each column.
margins_name (str, default 'Total') – Name to assign to the row and column totals.
stat ({'both', 'perc', 'n'}, default 'both') –
- Determines the statistic to display in each cell:
'both': returns both percentages and sample size.
'perc': returns percentages only.
'n': returns sample size only.
na_rm (bool, default True) – If True, remove rows with missing values in the row or col variables.
na_label (str, default 'NA') – Label to use for missing values when na_rm is False.
digits (int, default 2) – Number of digits to round the percentages to.
- Returns:
A contingency table as a tibble. The table contains counts and/or percentages as specified by the stat parameter, includes margins if requested, and is formatted with group headers when grouping variables are provided.
- Return type:
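To make the normalize options concrete, here is a pure-Python sketch of how cell percentages are derived from pair counts (semantics only; the library's actual table construction, margins, and formatting differ):

```python
# Sketch of the three `normalize` modes of tab(): compute the percentage
# for each (row_value, col_value) cell from raw pair counts.
from collections import Counter

def cell_percentages(pairs, normalize='all', digits=2):
    """pairs: list of (row_value, col_value). Returns {cell: percent}."""
    counts = Counter(pairs)
    if normalize == 'all':
        total = sum(counts.values())
        denom = {cell: total for cell in counts}
    elif normalize == 'row':
        row_totals = Counter(r for r, _ in pairs)
        denom = {(r, c): row_totals[r] for r, c in counts}
    elif normalize == 'columns':
        col_totals = Counter(c for _, c in pairs)
        denom = {(r, c): col_totals[c] for r, c in counts}
    else:
        raise ValueError(normalize)
    return {cell: round(100 * n / denom[cell], digits)
            for cell, n in counts.items()}

pairs = [('m', 'yes'), ('m', 'no'), ('f', 'yes'), ('f', 'yes')]
print(cell_percentages(pairs, 'all'))  # each cell over the grand total
print(cell_percentages(pairs, 'row'))  # each cell over its row total
```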
- to_csv(*args, **kws)[source]¶
Save tibble to csv.
Details¶
See polars write_csv() for details.
- Return type:
None
- to_dict(*, as_series=True)[source]¶
Convert the tibble to a dictionary
- Parameters:
as_series (bool) – If True, returns the dict values as Series; if False, as lists.
Examples
>>> df.to_dict()
>>> df.to_dict(as_series = False)
- to_dta(*args, **kws)[source]¶
Save tibble to dta.
Details¶
See pandas DataFrame.to_stata() for details.
- Return type:
None
- to_excel(*args, **kws)[source]¶
Save tibble to excel.
Details¶
See polars write_excel() for details.
- Return type:
None
- to_latex(fn=None, header=None, digits=4, caption=None, label=None, align=None, na_rep='', position='!htb', group_rows_by=None, group_title_align='l', footnotes=None, footnotes_width='\\linewidth', index=False, escape=False, longtable=False, longtable_singlespace=True, rotate=False, scale=True, parse_linebreaks=True, tabular=False, *args, **kws)[source]¶
Convert the object to a LaTeX tabular representation.
- Parameters:
fn (str) – Path with filename
header (list of tuples, optional) –
The column headers for the LaTeX table. Each tuple corresponds to a column. Example creating upper level header with grouped columns:
[("", "col 1"), ("Group A", "col 2"), ("Group A", "col 3"), ("Group B", "col 4"), ("Group B", "col 5"), ]
Example creating two upper level headers with grouped columns:
[("Group 1", "" , "col 1"), ("Group 1", "Group A", "col 2"), ("Group 1", "Group A", "col 3"), ("" , "Group B", "col 4"), ("" , "Group B", "col 5"), ]
digits (int, default=4) – Number of decimal places to round the numerical values in the table.
caption (str, optional) – The caption for the LaTeX table.
label (str, optional) – The label for referencing the table in LaTeX.
align (str, optional) – Column alignment specifications (e.g., ‘lcr’).
na_rep (str, default='') – The representation for NaN values in the table.
position (str, default='!htb') – The placement option for the table in the LaTeX document.
footnotes (dict, optional) – A dictionary where keys are column alignments (‘c’, ‘r’, or ‘l’) and values are the respective footnote strings.
footnotes_width (str, None) – Width of the footnotes. Example: '\linewidth', '40pt'. If None, no restriction is imposed on the width.
group_rows_by (str, default=None) – Name of the variable in the data with values to group the rows by.
group_title_align (str, default='l') – Alignment of the title of each row group.
index (bool, default=False) – Whether to include the index in the LaTeX table.
escape (bool, default=False) – Whether to escape LaTeX special characters.
longtable (bool, default=False) – If True, the table spans multiple pages
longtable_singlespace (bool) – Force single space to longtables
rotate (bool) – Whether to use landscape table
scale (bool, default=True) – If True, scales the table to fit the linewidth when the table exceeds that size. Ignored when longtable=True (a LaTeX limitation, because longtable does not use tabular).
parse_linebreaks (bool, default=True) – If True, parse \n and replace it with \makecell to produce linebreaks
tabular (bool, default=False) – Whether to use a tabular format for the output.
- Returns:
A LaTeX formatted string of the tibble.
- Return type:
str
- to_markdown()[source]¶
Render the tibble as a Markdown table string
- Returns:
A Markdown-formatted table string.
- Return type:
str
Examples
>>> print(df.to_markdown())
- to_parquet(file=str, compression='snappy', use_pyarrow=False, silently=False, *args, **kws)[source]¶
Write the data frame to a parquet file
- transmute(*args, by=None, **kwargs)[source]¶
Add or modify columns, keeping only the new columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with only the newly created columns (and grouping columns if by is used).
- Return type:
Examples
>>> df.transmute(double_a = col('a') * 2)
- unite(col='_united', unite_cols=[], sep='_', remove=True)[source]¶
Unite multiple columns by pasting strings together
- Parameters:
col (str) – Name of the new column
unite_cols (list) – List of columns to unite
sep (str) – Separator to use between values
remove (bool) – If True removes input columns from the data frame
Examples
>>> df = tp.tibble(a = ["a", "a", "a"], b = ["b", "b", "b"], c = range(3))
>>> df.unite("united_col", unite_cols = ["a", "b"])
- unnest(col)[source]¶
Unnest a nested tibble
- Parameters:
col (str) – Column to unnest
- Returns:
The nested tibble is expanded into unnested rows of the original tibble.
- Return type:
- unnest_longer(col_name, *, values_to=None, indices_to=None)[source]¶
Turn each element of a list- or struct-column into its own row.
For list columns, this behaves like DataFrame.explode. For struct columns, each row is expanded into one row per field, with the field name going into indices_to and the field value into values_to.
- Parameters:
col_name (str) – Name of the list or struct column to unnest.
values_to (str, optional) – Name of the output value column. For list columns this renames the exploded column. For struct columns this names the value column; defaults to col_name.
indices_to (str, optional) – For struct columns, the name of the field-name column. Defaults to f"{col_name}_id".
Examples
>>> df = tp.tibble(id = [1, 2], vals = [[10, 20], [30]])
>>> df.unnest_longer('vals')
- unnest_wider(col_name, *, names_sep=None)[source]¶
Turn each element of a struct- or list-column into its own column.
- Parameters:
col_name (str) – Name of the column to unnest.
names_sep (str, optional) – If provided, the output column names become f"{col_name}{names_sep}{field}" to avoid collisions.
Examples
>>> df = tp.tibble(id = [1, 2], pt = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}])
>>> df.unnest_wider('pt')
- unpack(*cols)[source]¶
Unpack one or more struct columns into their component columns.
- Parameters:
*cols (str) – Names of the struct columns to unpack.
Examples
>>> df = tp.tibble(id = [1, 2]).pack(pt = ['id'])  # contrived
>>> df.unpack('pt')
- property names¶
Get column names
- Returns:
Names of the columns
- Return type:
list
Examples
>>> df.names
- property ncol¶
Get number of columns
- Returns:
Number of columns
- Return type:
int
Examples
>>> df.ncol
- property nrow¶
Get number of rows
- Returns:
Number of rows
- Return type:
int
Examples
>>> df.nrow
- tidypolars_extra.abs(x)[source]¶
Absolute value
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(abs_x = tp.abs('x'))
>>> df.mutate(abs_x = tp.abs(col('x')))
- tidypolars_extra.across(cols, fn=lambda x: ..., names_prefix=None, names_suffix=None)[source]¶
Apply a function across a selection of columns
- Parameters:
cols (list) – Columns to operate on
fn (lambda) – A function or lambda to apply to each column
names_prefix (str, optional) – Prefix to prepend to changed columns
names_suffix (str, optional) – Suffix to append to changed columns
Examples
>>> df = tp.tibble(x = ['a', 'a', 'b'], y = range(3), z = range(3))
>>> df.mutate(across(['y', 'z'], lambda x: x * 2))
>>> df.mutate(across(tp.Int64, lambda x: x * 2, names_prefix = "double_"))
>>> df.summarize(across(['y', 'z'], tp.mean), by = 'x')
- tidypolars_extra.as_character(x)[source]¶
Convert to string. Defaults to Utf8.
- Parameters:
x (Str) – Column to operate on
Examples
>>> df.mutate(string_x = tp.as_string('x'))  # or equivalently
>>> df.mutate(character_x = tp.as_character('x'))
- tidypolars_extra.as_date(x, fmt=None)[source]¶
Convert a string to a Date
- Parameters:
x (Expr, Series) – Column to operate on
fmt (str) – “yyyy-mm-dd”
Examples
>>> df = tp.tibble(x = ['2021-01-01', '2021-10-01'])
>>> df.mutate(date_x = tp.as_date(col('x')))
- tidypolars_extra.as_datetime(x, fmt=None)[source]¶
Convert a string to a Datetime
- Parameters:
x (Expr, Series) – Column to operate on
fmt (str) – “yyyy-mm-dd”
Examples
>>> df = tp.tibble(x = ['2021-01-01', '2021-10-01'])
>>> df.mutate(date_x = tp.as_datetime(col('x')))
- tidypolars_extra.as_factor(x, levels=None)[source]¶
Convert to factor (R naming), equivalent to Enum or Categorical (polars), depending on whether 'levels' is provided.
- Parameters:
x (Str) – Column to operate on
levels (list of str) – Categories to use in the factor. The categories will be ordered as they appear in the list. If None (default), an unordered factor (polars Categorical) is created.
Examples
>>> df.mutate(factor_x = tp.as_factor('x'))  # or equivalently
>>> df.mutate(categorical_x = tp.as_categorical('x'))
- tidypolars_extra.as_float(x)[source]¶
Convert to float. Defaults to Float64.
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(float_x = tp.as_float(col('x')))
- tidypolars_extra.as_integer(x)[source]¶
Convert to integer. Defaults to Int64.
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.mutate(int_x = tp.as_integer(col('x')))
- tidypolars_extra.as_logical(x)[source]¶
Convert to a boolean (polars) or ‘logical’ (R naming)
- Parameters:
x (Str) – Column to operate on
Examples
>>> df.mutate(bool_x = tp.as_boolean(col('x')))  # or equivalently
>>> df.mutate(logical_x = tp.as_logical(col('x')))
- tidypolars_extra.as_string(x)[source]¶
Convert column to string. Alias to as_character (R naming). Equivalent to Utf8 type (polars)
- tidypolars_extra.between(x, left, right)[source]¶
Test if values of a column are between two values
- Parameters:
x (Expr, Series) – Column to operate on
left (int) – Value to test if column is greater than or equal to
right (int) – Value to test if column is less than or equal to
Examples
>>> df = tp.tibble(x = range(4))
>>> df.filter(tp.between(col('x'), 1, 3))
- tidypolars_extra.case_when(*args, _default=None)[source]¶
Case when
- Parameters:
*args (Expr) – When called with a single expression, returns pl.when() for chaining (e.g., tp.case_when(cond).then(val).otherwise(val)). When called with paired args (condition, value, condition, value, …), builds the full case expression.
_default (optional) – Default value when no condition is met (used with paired args)
Examples
>>> df = tp.tibble(x = range(1, 4))
>>> # Chaining style
>>> df.mutate(case_x = tp.case_when(col('x') < 2).then(0)
...           .when(col('x') < 3).then(1)
...           .otherwise(0))
>>> # Paired args style
>>> df.mutate(case_x = tp.case_when(col('x') < 2, 1,
...                                 col('x') < 3, 2,
...                                 _default = 0))
- tidypolars_extra.cast(x, dtype)[source]¶
General type conversion.
- Parameters:
x (Expr, Series) – Column to operate on
dtype (DataType) – Type to convert to
Examples
>>> df.mutate(abs_x = tp.cast(col('x'), tp.Float64))
- tidypolars_extra.ceiling_date(x, unit='month', change_on_boundary=False)[source]¶
Round date up to the nearest unit
- Parameters:
x (Expr, str) – Date/datetime column
unit (str) – Unit to round to: ‘year’, ‘month’, ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’
change_on_boundary (bool) – If False (default), dates already at a boundary are unchanged. If True, boundary dates are bumped to the next unit.
- Returns:
Date/datetime rounded up.
- Return type:
Expr
Examples
>>> df.mutate(month_end = tp.ceiling_date('date', 'month'))
- tidypolars_extra.coalesce(*args)[source]¶
Coalesce missing values
- Parameters:
args (Expr) – Columns to coalesce
Examples
>>> df.mutate(x_or_y = tp.coalesce(col('x'), col('y')))
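coalesce keeps, per row, the first non-null value across its arguments. A pure-Python sketch of that rule (illustrative semantics, not the polars implementation):

```python
# Row-wise first-non-None across parallel columns, mirroring coalesce().
def coalesce_rows(*columns):
    """Return, per row, the first non-None value across the given columns."""
    return [next((v for v in row if v is not None), None)
            for row in zip(*columns)]

a = [1, None, None]
b = [9, 2, None]
print(coalesce_rows(a, b))  # [1, 2, None]
```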
- tidypolars_extra.contains(match, ignore_case=True)[source]¶
Contains a literal string
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True (default), ignores case when matching names.
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.select(contains('c'))
- tidypolars_extra.cor(x, y, method='pearson')[source]¶
Find the correlation of two columns
- Parameters:
x (Expr) – A column
y (Expr) – A column
method (str) – Type of correlation to find. Either ‘pearson’ or ‘spearman’.
Examples
>>> df.summarize(cor = tp.cor(col('x'), col('y')))
- tidypolars_extra.count(x)[source]¶
Number of observations in each group
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(count = tp.count(col('x')))
- tidypolars_extra.cov(x, y)[source]¶
Find the covariance of two columns
- Parameters:
x (Expr) – A column
y (Expr) – A column
Examples
>>> df.summarize(cov_xy = tp.cov(col('x'), col('y')))
- tidypolars_extra.cume_dist(x)[source]¶
Compute cumulative distribution (proportion of values <= current value)
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cd = tp.cume_dist('x'))
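The definition can be sketched in pure Python: for each value, the proportion of values less than or equal to it (assumed semantics matching the description above, not the library code):

```python
# Empirical cumulative distribution, mirroring cume_dist():
# proportion of values <= each value.
def cume_dist(values):
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]

print(cume_dist([1, 2, 2, 4]))  # [0.25, 0.75, 0.75, 1.0]
```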
- tidypolars_extra.cummax(x)[source]¶
Cumulative maximum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cmax = tp.cummax('x'))
- tidypolars_extra.cummin(x)[source]¶
Cumulative minimum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cmin = tp.cummin('x'))
- tidypolars_extra.cumprod(x)[source]¶
Cumulative product
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(cprod = tp.cumprod('x'))
- tidypolars_extra.cumsum(x)[source]¶
Cumulative sum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(csum = tp.cumsum('x'))
- tidypolars_extra.days(n=1)[source]¶
Create a duration of n days
- Parameters:
n (int) – Number of days
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(tomorrow = col('date') + tp.days(1))
- tidypolars_extra.difftime(x, y, units='days')[source]¶
Compute time differences in specified units
- Parameters:
x (Expr, str) – Start date/datetime column
y (Expr, str) – End date/datetime column
units (str) – Units for the result: ‘days’, ‘hours’, ‘minutes’, ‘seconds’, ‘weeks’
- Returns:
Numeric expression with the time difference.
- Return type:
Expr
Examples
>>> df.mutate(diff = tp.difftime('date1', 'date2', units='days'))
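The unit conversion is plain timedelta arithmetic; a stdlib sketch, assuming the result is expressed as (end - start) in the requested unit:

```python
# Sketch of difftime()'s unit conversion using the stdlib datetime module.
from datetime import date

SECONDS = {'seconds': 1, 'minutes': 60, 'hours': 3600,
           'days': 86400, 'weeks': 604800}

def difftime(start, end, units='days'):
    """Difference end - start, expressed in the requested unit."""
    return (end - start).total_seconds() / SECONDS[units]

print(difftime(date(2024, 1, 1), date(2024, 1, 8), 'days'))   # 7.0
print(difftime(date(2024, 1, 1), date(2024, 1, 8), 'weeks'))  # 1.0
```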
- tidypolars_extra.dt_round(x, rule, n)[source]¶
Round the datetime
- Parameters:
x (Expr, Series) – Column to operate on
rule (str) – Units of the downscaling operation. Any of:
"month","week","day","hour","minute","second".n (int) – Number of units (e.g. 5 “day”, 15 “minute”.
Examples
>>> df.mutate(rounded_x = tp.dt_round(col('x'), 'minute', 15))
- tidypolars_extra.ends_with(match, ignore_case=True)[source]¶
Ends with a suffix
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True (default), ignores case when matching names.
Examples
>>> df = tp.tibble({'a': range(3), 'b_code': range(3), 'c_code': ['a', 'a', 'b']})
>>> df.select(ends_with('code'))
- tidypolars_extra.everything()[source]¶
Selects all columns
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.select(everything())
- tidypolars_extra.fct_collapse(x, **kwargs)[source]¶
Collapse multiple factor levels into one
- Parameters:
x (Expr, str) – Factor/categorical column
**kwargs – Mapping of new_level = [‘old1’, ‘old2’, …]
- Returns:
Expression with collapsed levels.
- Return type:
Expr
Examples
>>> df.mutate(x_collapsed = tp.fct_collapse('x', ab=['a', 'b'], cd=['c', 'd']))
- tidypolars_extra.fct_infreq(df, col_name)[source]¶
Reorder factor levels by frequency (most common first)
- Parameters:
df (tibble) – The DataFrame containing the column
col_name (str) – Name of the column to reorder
- Returns:
DataFrame with column cast to Enum with levels ordered by frequency.
- Return type:
Examples
>>> df = tp.tibble(x=['a', 'b', 'a', 'a', 'b', 'c'])
>>> df = tp.fct_infreq(df, 'x')
- tidypolars_extra.fct_lump(x, n=None, prop=None, other_level='Other')[source]¶
Collapse least frequent factor levels into ‘Other’
Uses a ranking approach: for each value, computes its frequency rank and replaces values outside the top n with other_level.
- Parameters:
x (Expr, str) – Factor/categorical column
n (int, optional) – Number of most frequent levels to keep
prop (float, optional) – Minimum proportion to keep a level (0 to 1)
other_level (str) – Label for collapsed levels (default: ‘Other’)
- Returns:
Expression with infrequent levels replaced.
- Return type:
Expr
Examples
>>> df.mutate(x_lumped = tp.fct_lump('x', n=3))
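The keep-top-n rule can be sketched in pure Python (tie-breaking here follows Counter.most_common order, which may differ from the library's frequency ranking):

```python
# Sketch of fct_lump()'s keep-top-n rule: values outside the n most
# frequent levels collapse to `other_level`.
from collections import Counter

def fct_lump(values, n, other_level='Other'):
    keep = {level for level, _ in Counter(values).most_common(n)}
    return [v if v in keep else other_level for v in values]

x = ['a', 'a', 'a', 'b', 'b', 'c']
print(fct_lump(x, n=2))  # ['a', 'a', 'a', 'b', 'b', 'Other']
```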
- tidypolars_extra.fct_recode(x, **kwargs)[source]¶
Manually recode factor levels
- Parameters:
x (Expr, str) – Factor/categorical column
**kwargs – Mapping of new_level = ‘old_level’ or new_level = [‘old1’, ‘old2’]
- Returns:
Expression with recoded levels.
- Return type:
Expr
Examples
>>> df.mutate(x_recoded = tp.fct_recode('x', good='a', bad='b'))
- tidypolars_extra.fct_rev(df, col_name)[source]¶
Reverse factor level order
- Parameters:
df (tibble) – The DataFrame containing the column
col_name (str) – Name of the column to reverse
- Returns:
DataFrame with column cast to Enum with reversed level order.
- Return type:
Examples
>>> df = tp.tibble(x=['a', 'b', 'c'])
>>> df = tp.fct_rev(df, 'x')
- tidypolars_extra.first(x)[source]¶
Get first value
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(first_x = tp.first('x'))
>>> df.summarize(first_x = tp.first(col('x')))
- tidypolars_extra.floor(x)[source]¶
Round numbers down to the lower integer
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(floor_x = tp.floor(col('x')))
- tidypolars_extra.floor_date(x, unit='month')[source]¶
Round date down to the nearest unit
- Parameters:
x (Expr, str) – Date/datetime column
unit (str) – Unit to round to: ‘year’, ‘month’, ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’
- Returns:
Date/datetime rounded down.
- Return type:
Expr
Examples
>>> df.mutate(month_start = tp.floor_date('date', 'month'))
- tidypolars_extra.from_pandas(df)[source]¶
Convert from pandas DataFrame to tibble
- Parameters:
df (DataFrame) – pd.DataFrame to convert to a tibble
- Return type:
Examples
>>> tp.from_pandas(df)
- tidypolars_extra.from_polars(df)[source]¶
Convert from polars DataFrame to tibble
- Parameters:
df (DataFrame) – pl.DataFrame to convert to a tibble
- Return type:
Examples
>>> tp.from_polars(df)
- tidypolars_extra.hour(x)[source]¶
Extract the hour from a datetime
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(hour = tp.hour(col('x')))
- tidypolars_extra.hours(n=1)[source]¶
Create a duration of n hours
- Parameters:
n (int) – Number of hours
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(later = col('datetime') + tp.hours(2))
- tidypolars_extra.if_else(condition, true, false)[source]¶
If Else
- Parameters:
condition (Expr) – A logical expression
true – Value if the condition is true
false – Value if the condition is false
Examples
>>> df = tp.tibble(x = range(1, 4))
>>> df.mutate(if_x = tp.if_else(col('x') < 2, 1, 2))
- tidypolars_extra.iqr(x)[source]¶
Compute the interquartile range (Q3 - Q1)
Use in summarize() context only. Not suitable for mutate().
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(iqr_val = tp.iqr('x'))
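A stdlib sketch of the Q3 - Q1 computation (the library's exact quantile interpolation method may differ):

```python
# Interquartile range via stdlib quartiles, mirroring iqr().
import statistics

def iqr(values):
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return q3 - q1

print(iqr([1, 2, 3, 4, 5]))  # 2.0
```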
- tidypolars_extra.is_finite(x)[source]¶
Test if values are finite
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(finite = tp.is_finite('x'))
- tidypolars_extra.is_in(x, values)[source]¶
Test if values are in a list
- Parameters:
x (Expr, Series) – Column to operate on
values (list) – List of values to check
Examples
>>> df.mutate(in_list = tp.is_in('x', [1, 2]))
- tidypolars_extra.is_infinite(x)[source]¶
Test if values are infinite
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(infinite = tp.is_infinite('x'))
- tidypolars_extra.is_not(x)[source]¶
Negate a boolean expression
- Parameters:
x (Expr) – Boolean expression to negate
Examples
>>> df.mutate(not_finite = tp.is_not(tp.is_finite(col('x'))))
- tidypolars_extra.is_not_in(x, values)[source]¶
Test if values are not in a list
- Parameters:
x (Expr, Series) – Column to operate on
values (list) – List of values to check
Examples
>>> df.mutate(not_in = tp.is_not_in('x', [1, 2]))
- tidypolars_extra.is_not_null(x)[source]¶
Test if values are not null
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(not_null = tp.is_not_null('x'))
- tidypolars_extra.is_null(x)[source]¶
Test if values are null
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(null = tp.is_null('x'))
- tidypolars_extra.lag(x, n: int = 1, default=None)[source]¶
Get lagging values
- Parameters:
x (Expr, Series) – Column to operate on
n (int) – Number of positions to lag by
default (optional) – Value to fill in missing values
Examples
>>> df.mutate(lag_x = tp.lag(col('x')))
>>> df.mutate(lag_x = tp.lag('x'))
- tidypolars_extra.last(x)[source]¶
Get last value
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(last_x = tp.last('x'))
>>> df.summarize(last_x = tp.last(col('x')))
- tidypolars_extra.lead(x, n: int = 1, default=None)[source]¶
Get leading values
- Parameters:
x (Expr, Series) – Column to operate on
n (int) – Number of positions to lead by
default (optional) – Value to fill in missing values
Examples
>>> df.mutate(lead_x = tp.lead(col('x')))
>>> df.mutate(lead_x = col('x').lead())
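The shift-and-fill semantics of lag() and lead() can be illustrated on a plain Python list (a sketch of the behaviour, not the library's polars-backed implementation):

```python
def lag(values, n=1, default=None):
    # Shift values down by n positions; the first n slots get `default`.
    return [default] * n + list(values)[: len(values) - n]

def lead(values, n=1, default=None):
    # Shift values up by n positions; the last n slots get `default`.
    return list(values)[n:] + [default] * n
```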
- tidypolars_extra.length(x)[source]¶
Number of observations in each group.
Alias for count().
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(length = tp.length(col('x')))
- tidypolars_extra.log(x)[source]¶
Compute the natural logarithm of a column
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.mutate(log = tp.log('x'))
- tidypolars_extra.log10(x)[source]¶
Compute the base 10 logarithm of a column
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.mutate(log10 = tp.log10('x'))
- tidypolars_extra.mad(x)[source]¶
Compute the median absolute deviation
Use in summarize() context only. Not suitable for mutate().
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(mad_val = tp.mad('x'))
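A stdlib sketch of the median absolute deviation (conceptual only; whether tidypolars_extra applies the normal-consistency scaling factor is an assumption to verify against its output):

```python
from statistics import median

def mad(values, constant=1.4826):
    # Median of absolute deviations from the median. The 1.4826 factor is
    # the usual normal-consistency scaling (assumption: the library's
    # scaling should be checked against its actual output).
    center = median(values)
    return constant * median(abs(v - center) for v in values)
```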
- tidypolars_extra.make_date(year=1970, month=1, day=1)[source]¶
Create a date object
- Parameters:
year (Expr, str, int) – Column or literal
month (Expr, str, int) – Column or literal
day (Expr, str, int) – Column or literal
Examples
>>> df.mutate(date = tp.make_date(2000, 1, 1))
- tidypolars_extra.make_datetime(year=1970, month=1, day=1, hour=0, minute=0, second=0)[source]¶
Create a datetime object
- Parameters:
year (Expr, str, int) – Column or literal
month (Expr, str, int) – Column or literal
day (Expr, str, int) – Column or literal
hour (Expr, str, int) – Column or literal
minute (Expr, str, int) – Column or literal
second (Expr, str, int) – Column or literal
Examples
>>> df.mutate(dt = tp.make_datetime(2000, 1, 1))
- tidypolars_extra.map(cols, _fun)[source]¶
Apply function by row
- Parameters:
cols (list of str) – Names of the columns to apply the function to
_fun (function) – The function to apply; it is called on each row separately
- tidypolars_extra.matches(match, ignore_case=False)[source]¶
Matches pattern
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True, ignores case when matching names. Defaults to False.
Examples
>>> df = tp.tibble({'a': range(3), 'add': range(3), 'sub': ['a', 'a', 'b']})
>>> df.select(tp.matches('a'))
- tidypolars_extra.max(x)[source]¶
Get column max
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(max_x = tp.max('x'))
>>> df.summarize(max_x = tp.max(col('x')))
- tidypolars_extra.mday(x)[source]¶
Extract the day of the month from a date (1 to 31).
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(monthday = tp.mday(col('x')))
- tidypolars_extra.mean(x)[source]¶
Get column mean
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(mean_x = tp.mean('x'))
>>> df.summarize(mean_x = tp.mean(col('x')))
- tidypolars_extra.median(x)[source]¶
Get column median
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(median_x = tp.median('x'))
>>> df.summarize(median_x = tp.median(col('x')))
- tidypolars_extra.microseconds(n=1)[source]¶
Create a duration of n microseconds
- Parameters:
n (int) – Number of microseconds
- Returns:
A duration literal.
- Return type:
Expr
- tidypolars_extra.milliseconds(n=1)[source]¶
Create a duration of n milliseconds
- Parameters:
n (int) – Number of milliseconds
- Returns:
A duration literal.
- Return type:
Expr
- tidypolars_extra.min(x)[source]¶
Get column minimum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(min_x = tp.min('x'))
>>> df.summarize(min_x = tp.min(col('x')))
- tidypolars_extra.minute(x)[source]¶
Extract the minute from a datetime
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(minute = tp.minute(col('x')))
- tidypolars_extra.minutes(n=1)[source]¶
Create a duration of n minutes
- Parameters:
n (int) – Number of minutes
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(later = col('datetime') + tp.minutes(30))
- tidypolars_extra.mode(x)[source]¶
Compute the statistical mode (most frequent value)
When there are ties, returns the first mode encountered, so the result is not deterministic across tied values. Use in summarize() context only.
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(m = tp.mode('x'))
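The "first mode wins" tie behaviour can be sketched with the standard library (a conceptual equivalent; Counter preserves first-seen order among equal counts, which mirrors but is not guaranteed to match the library's tie-breaking):

```python
from collections import Counter

def mode(values):
    # Most frequent value; among ties, the first-seen value is returned.
    return Counter(values).most_common(1)[0][0]
```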
- tidypolars_extra.month(x)[source]¶
Extract the month from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(month = tp.month(col('x')))
- tidypolars_extra.n()[source]¶
Number of observations in each group
Examples
>>> df.summarize(count = tp.n())
- tidypolars_extra.n_distinct(x)[source]¶
Get number of distinct values in a column
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(n_x = tp.n_distinct('x'))
>>> df.summarize(n_x = tp.n_distinct(col('x')))
- tidypolars_extra.n_missing(x)[source]¶
Count the number of null/missing values in a column
- Parameters:
x (Expr, str) – Column to operate on
- Returns:
Count of null values.
- Return type:
Expr
Examples
>>> df.summarize(missing = tp.n_missing('x'))
- tidypolars_extra.now()[source]¶
Return the current datetime as a polars literal
- Returns:
A literal expression with the current datetime.
- Return type:
Expr
Examples
>>> df.mutate(now = tp.now())
- tidypolars_extra.ntile(x, n)[source]¶
Divide values into n roughly equal groups
- Parameters:
x (Expr, Series) – Column to operate on
n (int) – Number of groups
Examples
>>> df.mutate(quartile = tp.ntile('x', 4))
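How "roughly equal" groups are formed can be sketched in plain Python. This follows dplyr's convention of putting the extra elements in the lower-numbered buckets; it is a conceptual model, not the library's implementation:

```python
def ntile(values, n):
    # Bucket values into n roughly equal groups by rank; when sizes cannot
    # be equal, lower-numbered buckets get the extra elements (dplyr-style).
    order = sorted(range(len(values)), key=lambda i: values[i])
    tiles = [0] * len(values)
    for rank, i in enumerate(order):
        tiles[i] = rank * n // len(values) + 1
    return tiles
```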
- tidypolars_extra.paste(*args, sep=' ')[source]¶
Concatenate strings together
- Parameters:
args (Expr, str) – Columns and/or strings to concatenate
Examples
>>> df = tp.tibble(x = ['a', 'b', 'c'])
>>> df.mutate(x_end = tp.paste(col('x'), 'end', sep = '_'))
- tidypolars_extra.paste0(*args)[source]¶
Concatenate strings together with no separator
- Parameters:
args (Expr, str) – Columns and/or strings to concatenate
Examples
>>> df = tp.tibble(x = ['a', 'b', 'c'])
>>> df.mutate(xend = tp.paste0(col('x'), 'end'))
- tidypolars_extra.pct_missing(x)[source]¶
Compute the percentage of null/missing values in a column
- Parameters:
x (Expr, str) – Column to operate on
- Returns:
Percentage of null values (0 to 100).
- Return type:
Expr
Examples
>>> df.summarize(pct = tp.pct_missing('x'))
- tidypolars_extra.percent_rank(x)[source]¶
Compute percent rank (values between 0 and 1)
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(prank = tp.percent_rank('x'))
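The percent-rank formula can be illustrated on a plain list (a sketch assuming the dplyr convention of (min_rank - 1) / (n - 1); verify against the library's output):

```python
def percent_rank(values):
    # (min_rank - 1) / (n - 1): smallest value maps to 0, largest to 1.
    svals = sorted(values)
    return [svals.index(v) / (len(values) - 1) for v in values]
```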
- tidypolars_extra.quantile(x, quantile=0.5)[source]¶
Get a column quantile
- Parameters:
x (Expr, Series) – Column to operate on
quantile (float) – Quantile to return
Examples
>>> df.summarize(quantile_x = tp.quantile('x', .25))
- tidypolars_extra.quarter(x)[source]¶
Extract the quarter from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(quarter = tp.quarter(col('x')))
- tidypolars_extra.rank(x, method='dense')[source]¶
Assign ranks to the elements of a column, with tie handling controlled by method. With the default 'dense' method, tied values share the same rank and the next distinct value's rank increases by one, leaving no gaps.
- Parameters:
x (str) – Column to operate on
method (str) – One of: 'dense' (default) assigns consecutive ranks without gaps, even for ties; 'average' assigns the average rank to tied values; 'min' assigns the minimum rank to tied values; 'max' assigns the maximum rank to tied values; 'ordinal' assigns a distinct rank to each value based on its order of appearance.
- Returns:
A list of ranks corresponding to the elements of x.
- Return type:
list of int
Examples
>>> rank([10, 20, 20, 30])
[1, 2, 2, 3]
>>> rank([3, 1, 2])
[3, 1, 2]
>>> rank(["b", "a", "a", "c"])
[2, 1, 1, 3]
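The difference between the 'dense' and 'min' methods can be sketched in plain Python (a conceptual model of two of the methods; the library presumably delegates to polars' rank implementation):

```python
def rank(values, method="dense"):
    # 'dense': consecutive ranks, no gaps after ties.
    # 'min': tied values share the lowest rank; gaps follow ties.
    if method == "dense":
        distinct = sorted(set(values))
        return [distinct.index(v) + 1 for v in values]
    if method == "min":
        svals = sorted(values)
        return [svals.index(v) + 1 for v in values]
    raise ValueError(f"unsupported method: {method}")
```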
- tidypolars_extra.rep(x, times=1)[source]¶
Replicate the values in x
- Parameters:
x (const, Series) – Value or Series to repeat
times (int) – Number of times to repeat
Examples
>>> tp.rep(1, 3)
>>> tp.rep(pl.Series(range(3)), 3)
- tidypolars_extra.replace_null(x, replace=None)[source]¶
Replace null values
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df = tp.tibble(x = [0, None], y = [None, None])
>>> df.mutate(x = tp.replace_null(col('x'), 1))
- tidypolars_extra.round(x, digits=0)[source]¶
Round a column to the specified number of decimal places
- Parameters:
x (Expr, Series) – Column to operate on
digits (int) – Decimals to round to
Examples
>>> df.mutate(x = tp.round(col('x')))
- tidypolars_extra.row_number()[source]¶
Return row number
Examples
>>> df.mutate(row_num = tp.row_number())
- tidypolars_extra.scale(x)[source]¶
Standardize the input by scaling it to a mean of 0 and a standard deviation of 1.
- Parameters:
x (Expr) – Column to operate on
- Returns:
The standardized version of the input data.
- Return type:
array-like
- tidypolars_extra.sd(x)[source]¶
Get column standard deviation
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(sd_x = tp.sd('x'))
>>> df.summarize(sd_x = tp.sd(col('x')))
- tidypolars_extra.second(x)[source]¶
Extract the second from a datetime
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(second = tp.second(col('x')))
- tidypolars_extra.seconds(n=1)[source]¶
Create a duration of n seconds
- Parameters:
n (int) – Number of seconds
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(later = col('datetime') + tp.seconds(10))
- tidypolars_extra.sqrt(x)[source]¶
Get column square root
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(sqrt_x = tp.sqrt('x'))
- tidypolars_extra.starts_with(match, ignore_case=True)[source]¶
Starts with a prefix
- Parameters:
match (str) – String to match columns
ignore_case (bool) – If True, the default, ignores case when matching names.
Examples
>>> df = tp.tibble({'a': range(3), 'add': range(3), 'sub': ['a', 'a', 'b']})
>>> df.select(starts_with('a'))
- tidypolars_extra.str_c(*args, sep='')[source]¶
Concatenate strings together.
Alias for paste().
- Parameters:
args (Expr, str) – Columns and/or strings to concatenate
Examples
>>> df = tp.tibble(x = ['a', 'b', 'c'])
>>> df.mutate(x_end = str_c(col('x'), 'end', sep = '_'))
- tidypolars_extra.str_count(string, pattern)[source]¶
Count occurrences of a pattern in a string
- Parameters:
string (Expr, str) – Column to operate on
pattern (str) – Regular expression pattern to count
Examples
>>> df.mutate(n = tp.str_count('x', 'a'))
- tidypolars_extra.str_detect(string, pattern, negate=False)[source]¶
Detect the presence or absence of a pattern in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
negate (bool) – If True, return non-matching elements
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_detect('name', 'a'))
>>> df.mutate(x = str_detect('name', ['a', 'e']))
- tidypolars_extra.str_dup(string, times)[source]¶
Duplicate/repeat a string
- Parameters:
string (Expr, str) – Column to operate on
times (int) – Number of times to repeat
Examples
>>> df.mutate(repeated = tp.str_dup('x', 3))
- tidypolars_extra.str_ends(string, pattern, negate=False)[source]¶
Detect the presence or absence of a pattern at the end of a string.
- Parameters:
string (Expr) – Column to operate on
pattern (str) – Pattern to look for
negate (bool) – If True, return non-matching elements
Examples
>>> df = tp.tibble(words = ['apple', 'bear', 'amazing'])
>>> df.filter(tp.str_ends(col('words'), 'ing'))
- tidypolars_extra.str_extract(string, pattern)[source]¶
Extract the target capture group from provided patterns
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_extract(col('name'), 'e'))
- tidypolars_extra.str_extract_all(string, pattern)[source]¶
Extract all matches of a pattern
- Parameters:
string (Expr, str) – Column to operate on
pattern (str) – Regular expression pattern with capture group
- Returns:
A list column with all matches.
- Return type:
Expr
Examples
>>> df.mutate(matches = tp.str_extract_all('x', r'\d+'))
- tidypolars_extra.str_length(string)[source]¶
Length of a string
- Parameters:
string (str) – Input series to operate on
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_length(col('name')))
- tidypolars_extra.str_pad(string, width, side='left', pad=' ')[source]¶
Pad a string to a specified width
- Parameters:
string (Expr, str) – Column to operate on
width (int) – Minimum width of resulting string
side (str) – Side to pad on: ‘left’, ‘right’, or ‘both’
pad (str) – Character to pad with (single character)
Examples
>>> df.mutate(padded = tp.str_pad('x', 10))
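The padding behaviour maps directly onto Python's built-in string methods; a stdlib sketch of stringr-style padding (conceptual only, not the library's implementation):

```python
def str_pad(s, width, side="left", pad=" "):
    # stringr-style padding via built-in string methods.
    if side == "left":
        return s.rjust(width, pad)
    if side == "right":
        return s.ljust(width, pad)
    return s.center(width, pad)  # side == "both"
```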
- tidypolars_extra.str_remove(string, pattern)[source]¶
Remove the first matched pattern in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_remove(col('name'), 'a'))
- tidypolars_extra.str_remove_all(string, pattern)[source]¶
Removes all matched patterns in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_remove_all(col('name'), 'a'))
- tidypolars_extra.str_replace(string, pattern, replacement)[source]¶
Replace the first matched pattern in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
replacement (str) – String that replaces anything that matches the pattern
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_replace(col('name'), 'a', 'A'))
- tidypolars_extra.str_replace_all(string, pattern, replacement)[source]¶
Replaces all matched patterns in a string
- Parameters:
string (str) – Input series to operate on
pattern (str) – Pattern to look for
replacement (str) – String that replaces anything that matches the pattern
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_replace_all(col('name'), 'a', 'A'))
- tidypolars_extra.str_split(string, pattern)[source]¶
Split a string by a pattern
- Parameters:
string (Expr, str) – Column to operate on
pattern (str) – Pattern to split on
- Returns:
A list column with split parts.
- Return type:
Expr
Examples
>>> df.mutate(parts = tp.str_split('x', '_'))
- tidypolars_extra.str_squish(string)[source]¶
Remove leading/trailing whitespace and collapse internal whitespace
- Parameters:
string (Expr, str) – Column to operate on
Examples
>>> df.mutate(clean = tp.str_squish('x'))
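The squish operation is a standard regex idiom; a stdlib sketch of the behaviour described above (conceptual equivalent, not the library's implementation):

```python
import re

def str_squish(s):
    # Collapse internal runs of whitespace to one space and trim the ends.
    return re.sub(r"\s+", " ", s).strip()
```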
- tidypolars_extra.str_starts(string, pattern, negate=False)[source]¶
Detect the presence or absence of a pattern at the beginning of a string.
- Parameters:
string (Expr) – Column to operate on
pattern (str) – Pattern to look for
negate (bool) – If True, return non-matching elements
Examples
>>> df = tp.tibble(words = ['apple', 'bear', 'amazing'])
>>> df.filter(tp.str_starts(col('words'), 'a'))
- tidypolars_extra.str_sub(string, start=0, end=None)[source]¶
Extract portion of string based on start and end inputs
- Parameters:
string (str) – Input series to operate on
start (int) – First position of the character to return
end (int) – Last position of the character to return
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_sub(col('name'), 0, 3))
- tidypolars_extra.str_to_lower(string)[source]¶
Convert a string to lower case
- Parameters:
string (str) – Convert case of this string
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_to_lower(col('name')))
- tidypolars_extra.str_to_title(string)[source]¶
Convert string to Title Case
- Parameters:
string (Expr, str) – Column to operate on
Examples
>>> df.mutate(titled = tp.str_to_title('x'))
- tidypolars_extra.str_to_upper(string)[source]¶
Convert a string to upper case
- Parameters:
string (str) – Convert case of this string
Examples
>>> df = tp.tibble(name = ['apple', 'banana', 'pear', 'grape'])
>>> df.mutate(x = str_to_upper(col('name')))
- tidypolars_extra.str_trim(string, side='both')[source]¶
Trim whitespace
- Parameters:
string (Expr, Series) – Column or series to operate on
side (str) – One of: "both" (default), "left", "right"
Examples
>>> df = tp.tibble(x = [' a ', ' b ', ' c '])
>>> df.mutate(x = tp.str_trim(col('x')))
- tidypolars_extra.str_wrap(string, width, sep='list')[source]¶
Split string
- Parameters:
string (str) – Column name to operate on
width (int) – Width to split the string
sep (str) – One of: "n" (join the wrapped pieces with "\n" and return a single string) or "list" (the default; return a list of pieces split at width)
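The wrapping behaviour can be sketched with the standard library's textwrap module (a conceptual equivalent; the library's exact word-breaking rules may differ):

```python
import textwrap

def str_wrap(s, width, sep="list"):
    # sep="list" returns the wrapped pieces; sep="n" joins them with newlines.
    parts = textwrap.wrap(s, width)
    return parts if sep == "list" else "\n".join(parts)
```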
- tidypolars_extra.sum(x)[source]¶
Get column sum
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.summarize(sum_x = tp.sum('x'))
>>> df.summarize(sum_x = tp.sum(col('x')))
- tidypolars_extra.today()[source]¶
Return the current date as a polars literal
- Returns:
A literal expression with today’s date.
- Return type:
Expr
Examples
>>> df.mutate(today = tp.today())
- tidypolars_extra.var(x)[source]¶
Get column variance
- Parameters:
x (Expr) – Column to operate on
Examples
>>> df.summarize(var_x = tp.var('x'))
>>> df.summarize(var_x = tp.var(col('x')))
- tidypolars_extra.wday(x)[source]¶
Extract the weekday from a date (Sunday = 1 to Saturday = 7).
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(weekday = tp.wday(col('x')))
- tidypolars_extra.week(x)[source]¶
Extract the week from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(week = tp.week(col('x')))
- tidypolars_extra.weeks(n=1)[source]¶
Create a duration of n weeks
- Parameters:
n (int) – Number of weeks
- Returns:
A duration literal.
- Return type:
Expr
Examples
>>> df.mutate(next_week = col('date') + tp.weeks(1))
- tidypolars_extra.weighted_mean(x, w)[source]¶
Compute weighted mean
- Parameters:
x (Expr, Series) – Column of values
w (Expr, Series) – Column of weights
Examples
>>> df.summarize(wm = tp.weighted_mean('x', 'w'))
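The weighted mean is the usual sum(w_i * x_i) / sum(w_i); a plain-Python sketch of the formula (not the library's polars-backed implementation):

```python
def weighted_mean(x, w):
    # sum(w_i * x_i) / sum(w_i)
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)
```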
- tidypolars_extra.where(col_type)[source]¶
Select columns by type using a string
- Options:
character : factors (ordered or unordered) and strings
string : only strings, excluding factors
factor : ordered or unordered factors
ordered : only ordered factors
unordered : only unordered factors
numeric : float or integer
float : only floats
integer : only integers
date : dates
datetime : dates and times
Examples
>>> from tidypolars_extra.data import mtcars
>>> df = mtcars
>>> df.select(tp.where("integer"))
>>> df.select(tp.where("numeric"))
>>> df.select(tp.where("string") | tp.where("integer"))
- tidypolars_extra.yday(x)[source]¶
Extract the day of the year from a date (1 to 366).
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(yearday = tp.yday(col('x')))
- tidypolars_extra.year(x)[source]¶
Extract the year from a date
- Parameters:
x (Expr, Series) – Column to operate on
Examples
>>> df.mutate(year = tp.year(col('x')))