tidypolars_extra.tibble_df¶
Classes¶
TibbleGroupBy — Starts a new GroupBy operation.
tibble — A data frame object that provides methods familiar to R tidyverse users.
Functions¶
Convert from pandas DataFrame to tibble
Convert from polars DataFrame to tibble
Module Contents¶
- class tidypolars_extra.tibble_df.TibbleGroupBy(df, by, *args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.dataframe.group_by.GroupBy
Starts a new GroupBy operation.
Utility class for performing a group by operation over the given DataFrame.
Generated by calling df.group_by(…).
- Parameters:
df – DataFrame to perform the group by operation over.
*by – Column or columns to group by. Accepts expression input. Strings are parsed as column names.
maintain_order – Ensure that the order of the groups is consistent with the input data. This is slower than a default group by.
predicates – Predicate expressions to filter groups after aggregation.
**named_by – Additional column(s) to group by, specified as keyword arguments. The columns will be named as the keyword used.
- class tidypolars_extra.tibble_df.tibble(*args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.DataFrame
A data frame object that provides methods familiar to R tidyverse users.
- anti_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform an anti join (keep rows without a match in df)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that do not have a match in df.
- Return type:
Examples
>>> df1.anti_join(df2, on = 'x')
- arrange(*args)[source]¶
Arrange/sort rows
- Parameters:
*args (str) – Columns to sort by
Examples
>>> df = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> # Arrange in ascending order
>>> df.arrange('x', 'y')
>>> # Arrange some columns descending
>>> df.arrange(tp.desc('x'), 'y')
- Returns:
Original tibble ordered by args
- Return type:
tibble
- assert_no_nulls(*cols)[source]¶
Assert that specified columns contain no null values
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If any null values are found.
Examples
>>> df.assert_no_nulls('x', 'y')
- assert_unique(*cols)[source]¶
Assert that specified columns have unique combinations
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If duplicate combinations are found.
Examples
>>> df.assert_unique('id')
- bind_cols(*args)[source]¶
Bind data frames by columns
- Parameters:
*args (tibble) – Data frame to bind
- Returns:
The original tibble with added columns from the other tibble specified in args
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'a': ['c', 'c', 'c'], 'b': range(4, 7)})
>>> df1.bind_cols(df2)
- bind_rows(*args)[source]¶
Bind data frames by row
- Parameters:
*args (tibble, list) – Data frames to bind by row
- Returns:
The original tibble with added rows from the other tibble specified in args
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'x': ['c', 'c', 'c'], 'y': range(4, 7)})
>>> df1.bind_rows(df2)
- clean_names(case='snake')[source]¶
Standardize column names
- Parameters:
case (str) – Case style for column names. Options: ‘snake’ (default), ‘lower’, ‘upper’.
- Returns:
A tibble with cleaned column names.
- Return type:
Examples
>>> df = tp.tibble(**{"First Name": [1], "Last.Name": [2], "AGE (years)": [30]})
>>> df.clean_names()
- colnames(regex='.', type=None, include_factor=True)[source]¶
Return the names of the columns in self that match regex.
- Parameters:
regex (str) – Regular expression to match column names against.
type (str) – Column type to filter by.
include_factor (bool) – When type is string, whether to include factor columns.
- complete(*cols, fill=None)[source]¶
Complete a DataFrame with all combinations of specified columns
- Parameters:
*cols (str) – Column names to find all combinations of.
fill (dict, optional) – Dictionary of column names to fill values for missing combinations.
- Returns:
A tibble with all combinations of the specified columns, with missing values filled according to fill parameter.
- Return type:
Examples
>>> df = tp.tibble(x = [1, 1, 2], y = ['a', 'b', 'a'], val = [10, 20, 30])
>>> df.complete('x', 'y')
- count(*args, sort=False, name='n')[source]¶
Returns row counts of the dataset. If bare column names are provided, count() returns counts by group.
- Parameters:
*args (str, Expr) – Columns to group by
sort (bool) – Should columns be ordered in descending order by count
name (str) – The name of the new column in the output. If omitted, it will default to “n”.
- Returns:
If no argument is provided, returns the number of rows. If column names are provided, counts the unique values across those columns.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 1, 2, 3],
...                 'b': ['a', 'a', 'b', 'b']})
>>> df.count()
shape: (1, 1)
┌─────┐
│ n   │
│ u32 │
╞═════╡
│ 4   │
└─────┘
>>> df.count('a', 'b')
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   │ b   │ n   │
│ i64 │ str │ u32 │
╞═════╪═════╪═════╡
│ 1   │ a   │ 2   │
│ 2   │ b   │ 1   │
│ 3   │ b   │ 1   │
└─────┴─────┴─────┘
- cross_join(df, suffix='_right')[source]¶
Perform a cross join (Cartesian product)
- Parameters:
df (tibble) – DataFrame to join with.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
All combinations of rows from both tibbles.
- Return type:
Examples
>>> df1.cross_join(df2)
- crossing(*args, **kwargs)[source]¶
Expands the existing tibble for each value of the variables used in the crossing() argument. See Returns.
- Parameters:
*args (list) – One unnamed list is accepted.
**kwargs (list) – Keyword will be the variable name, and the values in the list will be in the expanded tibble
- Returns:
A tibble with variables containing all combinations of the values in the arguments passed to crossing(). The original tibble will be replicated for each unique combination.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 2], "b": [3, 5]})
>>> df
shape: (2, 2)
┌─────┬─────┐
│ a   │ b   │
│ i64 │ i64 │
╞═════╪═════╡
│ 1   │ 3   │
│ 2   │ 5   │
└─────┴─────┘
>>> df.crossing(c = ['a', 'b', 'c'])
shape: (6, 3)
┌─────┬─────┬─────┐
│ a   │ b   │ c   │
│ i64 │ i64 │ str │
╞═════╪═════╪═════╡
│ 1   │ 3   │ a   │
│ 1   │ 3   │ b   │
│ 1   │ 3   │ c   │
│ 2   │ 5   │ a   │
│ 2   │ 5   │ b   │
│ 2   │ 5   │ c   │
└─────┴─────┴─────┘
- describe()[source]¶
Generate summary statistics for all columns
- Returns:
A tibble with summary statistics including column name, type, count of non-null values, null count, unique count, and for numeric columns: mean, std, min, 25%, 50%, 75%, max.
- Return type:
Examples
>>> df.describe()
- descriptive_statistics(vars=None, groups=None, include_categorical=True, include_type=False)[source]¶
Compute descriptive statistics for numerical variables and optionally frequency statistics for categorical variables, with support for grouping.
- Parameters:
vars (str, list, dict, or None, default None) – The variables for which to compute statistics. - If None, all variables in the dataset (as given by self.names) are used. - If a string, it is interpreted as a single variable name. - If a list, each element is treated as a variable name. - If a dict, keys are variable names and values are their labels.
groups (str, list, dict, or None, default None) – Variable(s) to group by when computing statistics. - If None, overall statistics are computed. - If a string, it is interpreted as a single grouping variable. - If a list, each element is treated as a grouping variable. - If a dict, keys are grouping variable names and values are their labels.
include_categorical (bool, default True) – Whether to include frequency statistics for categorical variables in the output.
include_type (bool, default False) – If True, adds a column indicating the variable type (“Num” for numerical, “Cat” for categorical).
- Returns:
A tibble containing the descriptive statistics. For numerical variables, the statistics include N (count of non-missing values), Missing (percentage of missing values), Mean (average value), Std.Dev. (standard deviation), Min (minimum value), and Max (maximum value). If grouping is specified, these statistics are computed for each group. When include_categorical is True, frequency statistics for categorical variables are appended to the result.
- Return type:
tibble
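To make the numerical summaries concrete, here is a stdlib-only sketch (hypothetical data; plain lists stand in for a tibble column, not the library's implementation) computing the same N / Missing / Mean / Std.Dev. / Min / Max quantities for one variable:

```python
import statistics

# Hypothetical variable with one missing value (None).
values = [1.0, 2.0, None, 4.0]

non_missing = [v for v in values if v is not None]
stats = {
    'N': len(non_missing),
    'Missing': 100 * (len(values) - len(non_missing)) / len(values),
    'Mean': statistics.mean(non_missing),
    'Std.Dev.': statistics.stdev(non_missing),
    'Min': min(non_missing),
    'Max': max(non_missing),
}
```

With grouping, the same quantities would be computed once per group.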
- distinct(*args, keep_all=False)[source]¶
Select distinct/unique rows
- Parameters:
*args (str, Expr) – Columns to find distinct/unique rows
keep_all (bool) – If True, keep all columns. Otherwise, return only the ones used to select the distinct rows.
- Returns:
Tibble after removing the repeated rows based on args
- Return type:
tibble
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.distinct()
>>> df.distinct('b')
- drop(*args)[source]¶
Drop unwanted columns
- Parameters:
*args (str) – Columns to drop
- Returns:
Tibble with columns in args dropped
- Return type:
tibble
Examples
>>> df.drop('x', 'y')
- drop_na(*args)[source]¶
Drop rows containing missing values. Alias for drop_null(), matching tidyr's drop_na() spelling.
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows containing nulls in args removed.
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c'])
>>> df.drop_na()
>>> df.drop_na('x')
- drop_null(*args)[source]¶
Drop rows containing missing values
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows in args with missing values dropped
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c'], z = range(3))
>>> df.drop_null()
>>> df.drop_null('x', 'y')
- fill(*args, direction='down', by=None)[source]¶
Fill in missing values with previous or next value
- Parameters:
*args (str) – Columns to fill
direction (str) – Direction to fill. One of [‘down’, ‘up’, ‘downup’, ‘updown’]
by (str, list) – Columns to group by
- Returns:
Tibble with missing values filled
- Return type:
Examples
>>> df = tp.tibble({'a': [1, None, 3, 4, 5],
...                 'b': [None, 2, None, None, 5],
...                 'groups': ['a', 'a', 'a', 'b', 'b']})
>>> df.fill('a', 'b')
>>> df.fill('a', 'b', by = 'groups')
>>> df.fill('a', 'b', direction = 'downup')
- filter(*args, by=None)[source]¶
Filter rows on one or more conditions
- Parameters:
*args (Expr) – Conditions to filter by
by (str, list) – Columns to group by
- Returns:
A tibble with rows that match condition.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.filter(col('a') < 2, col('b') == 'a')
>>> df.filter((col('a') < 2) & (col('b') == 'a'))
>>> df.filter(col('a') <= tp.mean(col('a')), by = 'b')
- freq(vars=None, groups=None, na_rm=False, na_label=None)[source]¶
Compute frequency table.
- Parameters:
vars (str, list, or dict) – Variables to return value frequencies for. If a dict is provided, keys should be variable names and values the variable labels for the output.
groups (str, list, dict, or None, optional) – Variable names to condition marginal frequencies on. If a dict is provided, keys should be variable names and values the variable labels for the output. Defaults to None (no grouping).
na_rm (bool, optional) – Whether to remove NAs from the calculation. Defaults to False.
na_label (str) – Label to use for the NA values
- Returns:
A tibble with relative frequencies and counts.
- Return type:
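As an illustration of the frequencies freq() reports, here is a plain-Python sketch of the same computation (hypothetical data; plain lists stand in for a tibble column, and this is not the library's implementation):

```python
from collections import Counter

# Hypothetical column values; None plays the role of a missing value.
values = ['a', 'a', 'b', None, 'b', 'a']

# Dropping missing values mirrors na_rm=True.
non_missing = [v for v in values if v is not None]
counts = Counter(non_missing)
total = sum(counts.values())

# Counts and relative frequencies, the two quantities freq() returns.
freq_table = {k: {'n': n, 'freq': n / total} for k, n in counts.items()}
print(freq_table)  # {'a': {'n': 3, 'freq': 0.6}, 'b': {'n': 2, 'freq': 0.4}}
```

With groups given, the same table would be computed within each group.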
- full_join(df, left_on=None, right_on=None, on=None, suffix: str = '_right')[source]¶
Perform a full join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Union between the original and the df tibbles. The rows that don’t match in one of the tibbles will be completed with missing values.
- Return type:
Examples
>>> df1.full_join(df2)
>>> df1.full_join(df2, on = 'x')
>>> df1.full_join(df2, left_on = 'left_x', right_on = 'x')
- get_dupes(*cols)[source]¶
Find duplicate rows
- Parameters:
*cols (str) – Column names to check for duplicates. If empty, checks all columns.
- Returns:
A tibble containing duplicate rows with a ‘dupe_count’ column.
- Return type:
Examples
>>> df.get_dupes('x', 'y')
- glimpse(regex='.')[source]¶
Print compact information about the data
- Parameters:
regex (str, list, dict) – Return information of the variables that match the regular expression, the list, or the dictionary. If dictionary is used, the variable names must be the dictionary keys.
- Return type:
None
- group_by(group, *args, **kwargs)[source]¶
Takes an existing tibble and converts it into a grouped tibble where operations are performed “by group”. ungroup() happens automatically after the operation is performed.
- Parameters:
group (str, list) – Variable names to group by.
- Returns:
A tibble with values grouped by one or more columns.
- Return type:
Grouped tibble
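The "by group" idea can be pictured with plain dictionaries. This sketch (hypothetical data, not the actual implementation) groups rows by a key and aggregates each group; the flat result plays the role of the automatic ungroup():

```python
# Hypothetical rows of a tibble, with 'g' as the grouping column.
rows = [{'g': 'a', 'x': 1}, {'g': 'a', 'x': 2}, {'g': 'b', 'x': 3}]

# Collect each group's values ...
groups = {}
for r in rows:
    groups.setdefault(r['g'], []).append(r['x'])

# ... then aggregate per group; the flat dict mirrors the ungrouped result.
means = {g: sum(xs) / len(xs) for g, xs in groups.items()}
print(means)  # {'a': 1.5, 'b': 3.0}
```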
- hoist(col_name, *, remove=False, **fields)[source]¶
Pull named elements out of a list- or struct-column into top-level columns.
- Parameters:
col_name (str) – Name of the list or struct column to reach into.
remove (bool) – If True, drop the original column after hoisting.
**fields (str, int, or list) – Each keyword defines a new top-level column. The value is a path into the list/struct column: a field name, an integer list index, or a list of such steps for nested access.
Examples
>>> df = tp.tibble(meta = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
>>> df.hoist('meta', a = 'a')
- inner_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform an inner join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
A tibble with intersection of cases in the original and df tibbles.
- Return type:
Examples
>>> df1.inner_join(df2)
>>> df1.inner_join(df2, on = 'x')
>>> df1.inner_join(df2, left_on = 'left_x', right_on = 'x')
- left_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a left join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
All rows from the original tibble, with matching columns added from the df tibble. Columns to match on are given in the function parameters.
- Return type:
Examples
>>> df1.left_join(df2)
>>> df1.left_join(df2, on = 'x')
>>> df1.left_join(df2, left_on = 'left_x', right_on = 'x')
- mutate(*args, by=None, **kwargs)[source]¶
Add or modify columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
Original tibble with new column created.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.mutate(double_a = col('a') * 2,
...           a_plus_b = col('a') + col('b'))
>>> df.mutate(row_num = row_number(), by = 'c')
- nest(by, *args, **kwargs)[source]¶
Creates a nested tibble
- Parameters:
by (list, str) – Columns to nest on
**kwargs –
- data (list of column names) – Columns to include in the nested data. If not provided, include all columns except the ones used in 'by'.
- key (str) – Name of the resulting nested column.
- names_sep (str) – If not provided (default), the names in the nested data will come from the former names. If a string, the new inner names in the nested dataframe will use the outer names with names_sep automatically stripped. This makes names_sep roughly symmetric between nesting and unnesting.
- Returns:
The resulting tibble will have a column that contains nested tibbles
- Return type:
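The nesting idea can be sketched with itertools.groupby (hypothetical data; inner dicts stand in for the nested tibbles, and this is not the library's implementation):

```python
from itertools import groupby

# Hypothetical rows; 'g' plays the role of the `by` column.
rows = [{'g': 'a', 'x': 1}, {'g': 'b', 'x': 3}, {'g': 'a', 'x': 2}]

# groupby needs sorted input.
rows_sorted = sorted(rows, key=lambda r: r['g'])
nested = {
    key: [{'x': r['x']} for r in grp]  # all columns except 'g' go inside
    for key, grp in groupby(rows_sorted, key=lambda r: r['g'])
}
print(nested)  # {'a': [{'x': 1}, {'x': 2}], 'b': [{'x': 3}]}
```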
- pack(**groups)[source]¶
Pack several columns into one or more struct columns.
- Parameters:
**groups (list or str) – Each keyword defines a new struct column. The value is a list of existing column names to pack into that struct.
Examples
>>> df = tp.tibble(x = [1, 2], y = [3, 4], z = ['a', 'b'])
>>> df.pack(position = ['x', 'y'])
- pipe(fn, *args, **kwargs)[source]¶
Apply a function to the entire DataFrame
- Parameters:
fn (callable) – Function to apply. The tibble is passed as the first argument.
*args (any) – Additional positional arguments passed to fn.
**kwargs (any) – Additional keyword arguments passed to fn.
- Returns:
Result of fn(self, *args, **kwargs).
- Return type:
any
Examples
>>> def add_column(df, name, value):
...     return df.mutate(**{name: value})
>>> df.pipe(add_column, 'new_col', 1)
- pivot_longer(cols=None, names_to='name', values_to='value')[source]¶
Pivot data from wide to long
- Parameters:
cols (Expr) – List of the columns to pivot. Defaults to all columns.
names_to (str) – Name of the new “names” column.
values_to (str) – Name of the new “values” column
- Returns:
Original tibble, but in long format.
- Return type:
Examples
>>> df = tp.tibble({'id': ['id1', 'id2'], 'a': [1, 2], 'b': [1, 2]})
>>> df.pivot_longer(cols = ['a', 'b'])
>>> df.pivot_longer(cols = ['a', 'b'], names_to = 'stuff', values_to = 'things')
- pivot_wider(names_from='name', values_from='value', id_cols=None, values_fn='first', values_fill=None)[source]¶
Pivot data from long to wide
- Parameters:
names_from (str) – Column to get the new column names from.
values_from (str) – Column to get the new column values from
id_cols (str, list) – A set of columns that uniquely identifies each observation. Defaults to all columns in the data table except for the columns specified in names_from and values_from.
values_fn (str) – Function for how multiple entries per group should be dealt with. Any of ‘first’, ‘count’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’
values_fill (str) – If values are missing/null, what value should be filled in. Can use: “backward”, “forward”, “mean”, “min”, “max”, “zero”, “one”
- Returns:
Original tibble, but in wide format.
- Return type:
Examples
>>> df = tp.tibble({'id': [1, 1], 'variable': ['a', 'b'], 'value': [1, 2]})
>>> df.pivot_wider(names_from = 'variable', values_from = 'value')
- print(n=1000, ncols=1000, str_length=1000, digits=2)[source]¶
Print the DataFrame
- Parameters:
n (int, default=1000) – Number of rows to print
ncols (int, default=1000) – Number of columns to print
str_length (int, default=1000) – Maximum length of the strings.
- Return type:
None
- pull(var=None)[source]¶
Extract a column as a series
- Parameters:
var (str) – Name of the column to extract. Defaults to the last column.
- Returns:
The series will contain the values of the column from var.
- Return type:
Series
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3)})
>>> df.pull('a')
- relevel(x, ref)[source]¶
Change the reference level of a string or factor column and convert it to factor
- Parameters:
x (str) – Variable name
ref (str) – Reference level
- Returns:
The original tibble with the column specified in x as an ordered factor, with the first category specified in ref.
- Return type:
tibble
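The reordering relevel() performs amounts to moving one category to the front of the level list. A minimal sketch with plain lists (hypothetical level names, not the library's implementation):

```python
# Hypothetical factor levels in their current order.
levels = ['low', 'medium', 'high']
ref = 'medium'

# Move the reference level first, keeping the remaining order intact.
releveled = [ref] + [lvl for lvl in levels if lvl != ref]
print(releveled)  # ['medium', 'low', 'high']
```

A reference level placed first is what makes it the baseline in downstream modeling, mirroring R's relevel().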
- relocate(*args, before=None, after=None)[source]¶
Move a column or columns to a new position
- Parameters:
*args (str, Expr) – Columns to move
before (str) – Column before which the moved columns are placed.
after (str) – Column after which the moved columns are placed.
- Returns:
Original tibble with columns relocated.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.relocate('a', before = 'c')
>>> df.relocate('b', after = 'c')
- rename(*args, regex=False, tolower=False, strict=False, **kwargs)[source]¶
Rename columns
- Parameters:
*args (str or dict) – If a single dict is provided, it is used as {old_name: new_name}. If strings are provided, they are treated as pairs: new_name, old_name, …
regex (bool, default False) – If True, uses regular expression replacement {<matched from>:<matched to>}
tolower (bool, default False) – If True, convert all to lower case
**kwargs (str) – Keyword arguments in the form new_name=’old_name’
- Returns:
Original tibble with columns renamed.
- Return type:
Examples
>>> df = tp.tibble({'x': range(3), 't': range(3), 'z': ['a', 'a', 'b']})
>>> df.rename({'x': 'new_x'})
>>> df.rename(new_x = 'x')
>>> df.rename('new_x', 'x')
- replace(rep, regex=False)[source]¶
Replace values of a column, using either polars' or pandas' replace() format.
- Parameters:
rep (dict) –
- Format to use polars’ replace:
{<varname>:{<old value>:<new value>, …}}
- Format to use pandas’ replace:
{<old value>:<new value>, …}
regex (bool) – If true, replace using regular expression. It uses pandas replace()
- Returns:
Original tibble with values of columns replaced based on rep.
- Return type:
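The polars-style rep format can be pictured as a nested dict lookup. This sketch (hypothetical data, not the library's implementation) applies {varname: {old: new}} column by column:

```python
# Hypothetical columns of a tibble.
data = {'x': [1, 2, 3], 'y': ['a', 'b', 'a']}
rep = {'y': {'a': 'z'}}  # polars-style: {<varname>: {<old value>: <new value>}}

# Values without a mapping pass through unchanged.
replaced = {
    col: [rep.get(col, {}).get(v, v) for v in vals]
    for col, vals in data.items()
}
print(replaced)  # {'x': [1, 2, 3], 'y': ['z', 'b', 'z']}
```

The pandas-style flat format {old: new} would apply the same lookup to every column instead.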
- replace_na(replace=None)[source]¶
Replace null values in specified columns
- Parameters:
replace (dict) – Dictionary mapping column names to replacement values.
- Returns:
A tibble with nulls replaced.
- Return type:
Examples
>>> df.replace_na({'x': 0, 'y': 'missing'})
- replace_null(replace=None)[source]¶
Replace null values
- Parameters:
replace (dict, str, int, or float) – Dictionary of column/replacement pairs, or a value to replace nulls with. If not a dict, replace in all columns of the matching type: a string replaces nulls in all string columns, and so on.
- Returns:
Original tibble with missing/null values replaced.
- Return type:
Examples
>>> df = tp.tibble({'a': [None, 'abc', 'cde'], 'b': [None, 1, 2], 'c': [None, 1.1, 2.2]})
>>> df.replace_null({'a': 'New value'})
>>> df.replace_null({'a': 1})
>>> df.replace_null({'b': 1})
>>> df.replace_null({'b': 1.1})
>>> df.replace_null({'c': 1})
>>> df.replace_null('a')
>>> df.replace_null(1)
>>> df.replace_null(1.1)
- right_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a right join
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Every row of df with matching columns from self. Unmatched rows on the left side receive null values.
- Return type:
tibble
Examples
>>> df1.right_join(df2)
>>> df1.right_join(df2, on = 'x')
>>> df1.right_join(df2, left_on = 'left_x', right_on = 'x')
- sample_frac(fraction, seed=None, with_replacement=False)[source]¶
Randomly sample a fraction of rows
- Parameters:
fraction (float) – Fraction of rows to sample (between 0 and 1).
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with a random fraction of rows.
- Return type:
Examples
>>> df.sample_frac(0.5, seed = 42)
- sample_n(n, seed=None, with_replacement=False)[source]¶
Randomly sample n rows
- Parameters:
n (int) – Number of rows to sample.
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with n randomly sampled rows.
- Return type:
Examples
>>> df.sample_n(5, seed = 42)
- save_data(fn, copies=None, sep=';', kws_latex=None, *args, **kws)[source]¶
Save data based on the filename.
- Parameters:
fn (callable, str) – Path and filename
copies (list of str) – List of file extensions. Copies of the file are saved based on each extension, using the same filename and path used in fn.
sep (str, optional) – Column separator used when exporting to text-like files (.csv, .tsv, .txt, etc.)
kws_latex (dict) – Arguments of to_latex(). See tibble.to_latex()
Notes
Additional positional and keyword arguments are passed to the underlying method used to save the file, which is based on the file extension.
.tex => tidypolars_extra.tibble.to_latex
.csv => polars.write_csv (uses sep=';' as default)
.tsv => polars.write_csv (uses sep=' ' as default)
.dat => polars.write_csv (uses sep=' ' as default)
.txt => polars.write_csv (uses sep=' ' as default)
.xls => polars.write_excel
.xlsx => polars.write_excel
.dta => pandas.DataFrame.to_stata
.parquet => polars.write_parquet
Use silently=True to save quietly (Default False).
- select(*args)[source]¶
Select or drop columns
- Parameters:
*args (str, list, dict, or combinations of them) – Columns to select. It can combine names, list of names, and a dict. If dict, it will rename the columns based on the dict. It also accepts helper functions:
tp.matches(<regex>), tp.contains(<str>), tp.where(<str>).
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'abcba': ['a', 'a', 'b']})
>>> df.select('a', 'b')
>>> df.select(col('a'), col('b'))
>>> df.select({'a': 'new name'}, tp.matches("c"))
>>> df.select(tp.where('numeric'))
- semi_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform a semi join (keep rows with a match in df, no columns added)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that have a match in df.
- Return type:
Examples
>>> df1.semi_join(df2, on = 'x')
- separate(sep_col, into, sep='_', remove=True)[source]¶
Separate a character column into multiple columns
- Parameters:
sep_col (str) – Column to split into multiple columns
into (list) – List of new column names
sep (str) – Separator to split on. Default to ‘_’
remove (bool) – If True removes the input column from the output data frame
- Returns:
Original tibble with the column split based on sep.
- Return type:
Examples
>>> df = tp.tibble(x = ['a_a', 'b_b', 'c_c'])
>>> df.separate('x', into = ['left', 'right'])
- separate_longer_delim(sep_col, delim)[source]¶
Split a string column by delim into longer rows.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
Examples
>>> df = tp.tibble(x = ['a,b', 'c'])
>>> df.separate_longer_delim('x', ',')
- separate_longer_position(sep_col, width)[source]¶
Split each string into chunks of width characters and convert into longer rows.
- Parameters:
sep_col (str) – Column to split.
width (int) – Width of each chunk in characters.
Examples
>>> df = tp.tibble(x = ['abcd', 'efgh'])
>>> df.separate_longer_position('x', 2)
- separate_rows(*cols, sep=',')[source]¶
Split the given columns on sep and explode them into longer rows. Superseded by separate_longer_delim() but kept for tidyr parity.
- Parameters:
*cols (str) – Columns to split and explode.
sep (str) – Delimiter to split on (default: ',').
Examples
>>> df = tp.tibble(x = ['a,b', 'c'], y = [1, 2])
>>> df.separate_rows('x', sep = ',')
- separate_wider_delim(sep_col, delim, names, *, remove=True, too_few='error', too_many='error')[source]¶
Split a string column into several columns using a delimiter.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
names (list) – Names of the resulting columns.
remove (bool) – If True (default) drop the original column.
too_few (str) – One of 'error' (default) or 'align_start'. When 'error', raises if a row produces fewer fields than len(names).
too_many (str) – One of 'error' (default) or 'drop'. When 'error', raises if a row produces more fields than len(names).
Examples
>>> df = tp.tibble(x = ['a_1', 'b_2'])
>>> df.separate_wider_delim('x', '_', names = ['letter', 'num'])
- separate_wider_position(sep_col, widths, *, remove=True)[source]¶
Split a string column into several columns by character positions.
- Parameters:
sep_col (str) – Column to split.
widths (dict) – Mapping of new column name → width in characters.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['2024Q1', '2025Q2'])
>>> df.separate_wider_position('x', widths = {'year': 4, 'q': 2})
- separate_wider_regex(sep_col, patterns, *, remove=True)[source]¶
Split a string column using a regular expression with named groups.
- Parameters:
sep_col (str) – Column to split.
patterns (str or dict) – Either a regex string containing named capturing groups, or a dict {name: sub_pattern} which is assembled into a single regex of named groups in the given order.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['id-001', 'id-002'])
>>> df.separate_wider_regex('x', {'prefix': '[a-z]+', '_sep': '-', 'num': r'\d+'})
- set_names(nm=None)[source]¶
Change the column names of the data frame
- Parameters:
nm (list) – A list of new names for the data frame
Examples
>>> df = tp.tibble(x = range(3), y = range(3))
>>> df.set_names(['a', 'b'])
- slice(*args, by=None)[source]¶
Grab rows from a data frame
- Parameters:
*args (int, list) – Rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice(0, 1)
>>> df.slice(0, by = 'c')
- slice_head(n=5, *, by=None)[source]¶
Grab top rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_head(2)
>>> df.slice_head(1, by = 'c')
- slice_max(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the largest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (descending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_max('x', n = 1)
>>> df.slice_max('x', n = 1, by = 'g')
- slice_min(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the smallest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (ascending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_min('x', n = 1)
>>> df.slice_min('x', n = 1, by = 'g')
- slice_sample(n=None, *, prop=None, replace=False, seed=None, by=None)[source]¶
Randomly sample rows. Modern replacement for sample_n() and sample_frac().
- Parameters:
n (int, optional) – Number of rows to sample. Provide exactly one of n or prop.
prop (float, optional) – Fraction of rows to sample (between 0 and 1).
replace (bool) – Whether to sample with replacement.
seed (int, optional) – Random seed for reproducibility.
by (str, list, optional) – Columns to group by; sampling happens within each group.
Examples
>>> df.slice_sample(n = 3, seed = 42)
>>> df.slice_sample(prop = 0.5, by = 'g', seed = 42)
- slice_tail(n=5, *, by=None)[source]¶
Grab bottom rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_tail(2)
>>> df.slice_tail(1, by = 'c')
- summarize(*args, by=None, **kwargs)[source]¶
Aggregate data with summary statistics
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with the summaries
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.summarize(avg_a = tp.mean(col('a')))
>>> df.summarize(avg_a = tp.mean(col('a')),
...              by = 'c')
>>> df.summarize(avg_a = tp.mean(col('a')),
...              max_b = tp.max(col('b')))
- tab(row, col, groups=None, margins=True, normalize='all', margins_name='Total', stat='both', na_rm=True, na_label='NA', digits=2)[source]¶
Create a two-way contingency table for two categorical variables, with optional grouping, margins, and normalization.
- Parameters:
row (str) – Name of the variable to be used for the rows of the table.
col (str) – Name of the variable to be used for the columns of the table.
groups (str or list of str, optional) – Variable name(s) to use as grouping variables. When provided, a separate 2x2 table is generated for each group.
margins (bool, default True) – If True, include row and column totals (margins) in the table.
normalize ({'all', 'row', 'columns'}, default 'all') –
- Specifies how to compute the marginal percentages in each cell:
'all': percentages computed over the entire table.
'row': percentages computed across each row.
'columns': percentages computed down each column.
margins_name (str, default 'Total') – Name to assign to the row and column totals.
stat ({'both', 'perc', 'n'}, default 'both') –
- Determines the statistic to display in each cell:
'both': returns both percentages and sample size.
'perc': returns percentages only.
'n': returns sample size only.
na_rm (bool, default True) – If True, remove rows with missing values in the row or col variables.
na_label (str, default 'NA') – Label to use for missing values when na_rm is False.
digits (int, default 2) – Number of digits to round the percentages to.
- Returns:
A contingency table as a tibble. The table contains counts and/or percentages as specified by the stat parameter, includes margins if requested, and is formatted with group headers when grouping variables are provided.
- Return type:
- to_csv(*args, **kws)[source]¶
Save tibble to csv.
Details¶
See polars write_csv() for details.
- rtype:
None
- to_dict(*, as_series=True)[source]¶
Convert the tibble to a dictionary mapping column names to their values
- Parameters:
as_series (bool) – If True (default), the dict values are returned as Series; if False, as lists.
Examples
>>> df.to_dict() >>> df.to_dict(as_series = False)
- to_dta(*args, **kws)[source]¶
Save tibble to dta.
Details¶
See polars write_dta() for details.
- rtype:
None
- to_excel(*args, **kws)[source]¶
Save tibble to excel.
Details¶
See polars write_excel() for details.
- rtype:
None
- to_latex(fn=None, header=None, digits=4, caption=None, label=None, align=None, na_rep='', position='!htb', group_rows_by=None, group_title_align='l', footnotes=None, footnotes_width='\\linewidth', index=False, escape=False, longtable=False, longtable_singlespace=True, rotate=False, scale=True, parse_linebreaks=True, tabular=False, *args, **kws)[source]¶
Convert the object to a LaTeX tabular representation.
- Parameters:
fn (str) – Path with filename
header (list of tuples, optional) –
The column headers for the LaTeX table. Each tuple corresponds to a column. Example creating upper level header with grouped columns:
[("", "col 1"), ("Group A", "col 2"), ("Group A", "col 3"), ("Group B", "col 4"), ("Group B", "col 5"), ]
Example creating two upper level headers with grouped columns:
[("Group 1", "" , "col 1"), ("Group 1", "Group A", "col 2"), ("Group 1", "Group A", "col 3"), ("" , "Group B", "col 4"), ("" , "Group B", "col 5"), ]
digits (int, default=4) – Number of decimal places to round the numerical values in the table.
caption (str, optional) – The caption for the LaTeX table.
label (str, optional) – The label for referencing the table in LaTeX.
align (str, optional) – Column alignment specifications (e.g., ‘lcr’).
na_rep (str, default='') – The representation for NaN values in the table.
position (str, default='!htb') – The placement option for the table in the LaTeX document.
footnotes (dict, optional) – A dictionary where keys are column alignments (‘c’, ‘r’, or ‘l’) and values are the respective footnote strings.
footnotes_width (str, optional) – Width of the footnotes. Example: 'linewidth', '40pt'. If None, no restriction is imposed on the width.
group_rows_by (str, default=None) – Name of the variable in the data with values to group the rows by.
group_title_align (str, default='l') – Alignment of the title of each row group.
index (bool, default=False) – Whether to include the index in the LaTeX table.
escape (bool, default=False) – Whether to escape LaTeX special characters.
longtable (bool, default=False) – If True, the table spans multiple pages
longtable_singlespace (bool) – Force single spacing in longtables
rotate (bool) – Whether to rotate the table to landscape orientation
scale (bool, default=True) – If True, scales the table to fit the linewidth when the table exceeds that size. Ignored when
longtable=True(LaTeX limitation because longtable does not use tabular).parse_linebreaks (bool, default=True) – If True, parse \n and replace it with \makecell to produce line breaks
tabular (bool, default=False) – Whether to use a tabular format for the output.
- Returns:
A LaTeX formatted string of the tibble.
- Return type:
str
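The shape of the string to_latex produces can be sketched with a minimal, hypothetical helper (`simple_tabular` is not part of the library; it only illustrates the tabular layout that options like `align` control):

```python
def simple_tabular(headers, rows, align=None):
    """Build a minimal LaTeX tabular string from headers and row tuples."""
    align = align or "l" * len(headers)
    lines = [r"\begin{tabular}{" + align + "}"]
    # Header row followed by a horizontal rule
    lines.append(" & ".join(headers) + r" \\ \hline")
    # One line per data row, cells joined with '&'
    for row in rows:
        lines.append(" & ".join(str(v) for v in row) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)

tex = simple_tabular(["x", "y"], [(1, "a"), (2, "b")], align="lc")
```

The real method layers rounding, captions, labels, grouped headers, row groups, and footnotes on top of this basic structure.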
- to_markdown()[source]¶
Render the tibble as a Markdown table string
- Returns:
A Markdown-formatted table string.
- Return type:
str
Examples
>>> print(df.to_markdown())
- to_parquet(file=str, compression='snappy', use_pyarrow=False, silently=False, *args, **kws)[source]¶
Write a data frame to a parquet file
- transmute(*args, by=None, **kwargs)[source]¶
Add or modify columns, keeping only the new columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with only the newly created columns (and grouping columns if by is used).
- Return type:
Examples
>>> df.transmute(double_a = col('a') * 2)
- unite(col='_united', unite_cols=[], sep='_', remove=True)[source]¶
Unite multiple columns by pasting strings together
- Parameters:
col (str) – Name of the new column
unite_cols (list) – List of columns to unite
sep (str) – Separator to use between values
remove (bool) – If True removes input columns from the data frame
Examples
>>> df = tp.tibble(a = ["a", "a", "a"], b = ["b", "b", "b"], c = range(3)) >>> df.unite("united_col", unite_cols = ["a", "b"])
- unnest(col)[source]¶
Unnest a nested tibble
- Parameters:
col (str) – Column to unnest
- Returns:
The nested column is expanded so that its contents become unnested rows of the original tibble.
- Return type:
- unnest_longer(col_name, *, values_to=None, indices_to=None)[source]¶
Turn each element of a list- or struct-column into its own row.
For list columns, this behaves like
DataFrame.explode. For struct columns, each row is expanded into one row per field, with the field name going intoindices_toand the field value intovalues_to.- Parameters:
col_name (str) – Name of the list or struct column to unnest.
values_to (str, optional) – Name of the output value column. For list columns this renames the exploded column. For struct columns this names the value column; defaults to
col_name.indices_to (str, optional) – For struct columns, the name of the field-name column. Defaults to
f"{col_name}_id".
Examples
>>> df = tp.tibble(id = [1, 2], vals = [[10, 20], [30]]) >>> df.unnest_longer('vals')
- unnest_wider(col_name, *, names_sep=None)[source]¶
Turn each element of a struct- or list-column into its own column.
- Parameters:
col_name (str) – Name of the column to unnest.
names_sep (str, optional) – If provided, the output column names become
f"{col_name}{names_sep}{field}"to avoid collisions.
Examples
>>> df = tp.tibble(id = [1, 2], pt = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}]) >>> df.unnest_wider('pt')
- unpack(*cols)[source]¶
Unpack one or more struct columns into their component columns.
- Parameters:
*cols (str) – Names of the struct columns to unpack.
Examples
>>> df = tp.tibble(id = [1, 2]).pack(pt = ['id']) # contrived >>> df.unpack('pt')
- property names[source]¶
Get column names
- Returns:
Names of the columns
- Return type:
list
Examples
>>> df.names