tidypolars_extra.tibble_df¶
Classes¶
TibbleGroupBy — Starts a new GroupBy operation.
tibble — A data frame object that provides methods familiar to R tidyverse users.
Functions¶
Convert from pandas DataFrame to tibble
Convert from polars DataFrame to tibble
Module Contents¶
- class tidypolars_extra.tibble_df.TibbleGroupBy(df, by, *args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.dataframe.group_by.GroupBy
Starts a new GroupBy operation.
Utility class for performing a group by operation over the given DataFrame.
Generated by calling df.group_by(…).
- Parameters:
df – DataFrame to perform the group by operation over.
*by – Column or columns to group by. Accepts expression input. Strings are parsed as column names.
maintain_order – Ensure that the order of the groups is consistent with the input data. This is slower than a default group by.
predicates – Predicate expressions to filter groups after aggregation.
**named_by – Additional column(s) to group by, specified as keyword arguments. The columns will be named as the keyword used.
- class tidypolars_extra.tibble_df.tibble(*args, **kwargs)[source]¶
Bases:
tidypolars_extra.type_conversion.pl.DataFrame
A data frame object that provides methods familiar to R tidyverse users.
- anti_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform an anti join (keep rows without a match in df)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that do not have a match in df.
- Return type:
Examples
>>> df1.anti_join(df2, on = 'x')
- arrange(*args)[source]¶
Arrange/sort rows
- Parameters:
*args (str) – Columns to sort by
Examples
>>> df = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> # Arrange in ascending order
>>> df.arrange('x', 'y')
>>> # Arrange some columns descending
>>> df.arrange(tp.desc('x'), 'y')
- Returns:
Original tibble ordered by args
- Return type:
tibble
- assert_no_nulls(*cols)[source]¶
Assert that specified columns contain no null values
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If any null values are found.
Examples
>>> df.assert_no_nulls('x', 'y')
- assert_unique(*cols)[source]¶
Assert that specified columns have unique combinations
- Parameters:
*cols (str) – Column names to check. If empty, checks all columns.
- Returns:
Returns self if assertion passes.
- Return type:
- Raises:
AssertionError – If duplicate combinations are found.
Examples
>>> df.assert_unique('id')
- bind_cols(*args)[source]¶
Bind data frames by columns
- Parameters:
*args (tibble) – Data frame to bind
- Returns:
The original tibble with added columns from the other tibble specified in args
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'a': ['c', 'c', 'c'], 'b': range(4, 7)})
>>> df1.bind_cols(df2)
- bind_rows(*args)[source]¶
Bind data frames by row
- Parameters:
*args (tibble, list) – Data frames to bind by row
- Returns:
The original tibble with added rows from the other tibble specified in args
- Return type:
tibble
Examples
>>> df1 = tp.tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
>>> df2 = tp.tibble({'x': ['c', 'c', 'c'], 'y': range(4, 7)})
>>> df1.bind_rows(df2)
- clean_names(case='snake')[source]¶
Standardize column names
- Parameters:
case (str) – Case style for column names. Options: ‘snake’ (default), ‘lower’, ‘upper’.
- Returns:
A tibble with cleaned column names.
- Return type:
Examples
>>> df = tp.tibble(**{"First Name": [1], "Last.Name": [2], "AGE (years)": [30]})
>>> df.clean_names()
- colnames(regex='.', type=None, include_factor=True)[source]¶
Return the names of the columns in self that match regex.
- Parameters:
regex (str) – Regular expression to match column names against.
type (str) – Column type to filter by.
include_factor (bool) – When type is string, whether to include factor columns.
- complete(*cols, fill=None)[source]¶
Complete a DataFrame with all combinations of specified columns
- Parameters:
*cols (str) – Column names to find all combinations of.
fill (dict, optional) – Dictionary of column names to fill values for missing combinations.
- Returns:
A tibble with all combinations of the specified columns, with missing values filled according to fill parameter.
- Return type:
Examples
>>> df = tp.tibble(x = [1, 1, 2], y = ['a', 'b', 'a'], val = [10, 20, 30])
>>> df.complete('x', 'y')
- count(*args, sort=False, name='n')[source]¶
Returns row counts of the dataset. If bare column names are provided, count() returns counts by group.
- Parameters:
*args (str, Expr) – Columns to group by
sort (bool) – Should columns be ordered in descending order by count
name (str) – The name of the new column in the output. If omitted, it will default to “n”.
- Returns:
If no argument is provided, returns the number of rows. If column names are provided, counts the unique values across those columns.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 1, 2, 3],
...                 'b': ['a', 'a', 'b', 'b']})
>>> df.count()
shape: (1, 1)
┌─────┐
│ n   │
│ u32 │
╞═════╡
│ 4   │
└─────┘
>>> df.count('a', 'b')
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   │ b   │ n   │
│ i64 │ str │ u32 │
╞═════╪═════╪═════╡
│ 1   │ a   │ 2   │
│ 2   │ b   │ 1   │
│ 3   │ b   │ 1   │
└─────┴─────┴─────┘
- cross_join(df, suffix='_right')[source]¶
Perform a cross join (Cartesian product)
- Parameters:
df (tibble) – DataFrame to join with.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
All combinations of rows from both tibbles.
- Return type:
Examples
>>> df1.cross_join(df2)
- crossing(*args, **kwargs)[source]¶
Expands the existing tibble for each value of the variables used in the crossing() argument. See Returns.
- Parameters:
*args (list) – One unnamed list is accepted.
**kwargs (list) – Keyword will be the variable name, and the values in the list will be in the expanded tibble
- Returns:
A tibble with variables containing all combinations of the values in the arguments passed to crossing(). The original tibble will be replicated for each unique combination.
- Return type:
Examples
>>> df = tp.tibble({'a': [1, 2], "b": [3, 5]})
>>> df
shape: (2, 2)
┌─────┬─────┐
│ a   │ b   │
│ i64 │ i64 │
╞═════╪═════╡
│ 1   │ 3   │
│ 2   │ 5   │
└─────┴─────┘
>>> df.crossing(c = ['a', 'b', 'c'])
shape: (6, 3)
┌─────┬─────┬─────┐
│ a   │ b   │ c   │
│ i64 │ i64 │ str │
╞═════╪═════╪═════╡
│ 1   │ 3   │ a   │
│ 1   │ 3   │ b   │
│ 1   │ 3   │ c   │
│ 2   │ 5   │ a   │
│ 2   │ 5   │ b   │
│ 2   │ 5   │ c   │
└─────┴─────┴─────┘
- describe()[source]¶
Generate summary statistics for all columns
- Returns:
A tibble with summary statistics including column name, type, count of non-null values, null count, unique count, and for numeric columns: mean, std, min, 25%, 50%, 75%, max.
- Return type:
Examples
>>> df.describe()
- descriptive_statistics(vars=None, groups=None, include_categorical=True, include_type=False)[source]¶
Compute descriptive statistics for numerical variables and optionally frequency statistics for categorical variables, with support for grouping.
- Parameters:
vars (str, list, dict, or None, default None) – The variables for which to compute statistics. - If None, all variables in the dataset (as given by self.names) are used. - If a string, it is interpreted as a single variable name. - If a list, each element is treated as a variable name. - If a dict, keys are variable names and values are their labels.
groups (str, list, dict, or None, default None) – Variable(s) to group by when computing statistics. - If None, overall statistics are computed. - If a string, it is interpreted as a single grouping variable. - If a list, each element is treated as a grouping variable. - If a dict, keys are grouping variable names and values are their labels.
include_categorical (bool, default True) – Whether to include frequency statistics for categorical variables in the output.
include_type (bool, default False) – If True, adds a column indicating the variable type (“Num” for numerical, “Cat” for categorical).
- Returns:
A tibble containing the descriptive statistics. For numerical variables, the statistics include N (count of non-missing values), Missing (percentage of missing values), Mean (average value), Std.Dev. (standard deviation), Min (minimum value), and Max (maximum value). If grouping is specified, these statistics are computed for each group. When include_categorical is True, frequency statistics for categorical variables are appended to the result.
- Return type:
tibble
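To make the numerical summaries concrete, here is a stdlib-only sketch (hypothetical data; plain lists stand in for a tibble column, not the library's implementation) computing the same N / Missing / Mean / Std.Dev. / Min / Max quantities for one variable:

```python
import statistics

# Hypothetical variable with one missing value (None).
values = [1.0, 2.0, None, 4.0]

non_missing = [v for v in values if v is not None]
stats = {
    'N': len(non_missing),
    'Missing': 100 * (len(values) - len(non_missing)) / len(values),
    'Mean': statistics.mean(non_missing),
    'Std.Dev.': statistics.stdev(non_missing),
    'Min': min(non_missing),
    'Max': max(non_missing),
}
```

With grouping, the same quantities would be computed once per group.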
- distinct(*args, keep_all=False)[source]¶
Select distinct/unique rows
- Parameters:
*args (str, Expr) – Columns to find distinct/unique rows
keep_all (bool) – If True, keep all columns. Otherwise, return only the ones used to select the distinct rows.
- Returns:
Tibble after removing the repeated rows based on args
- Return type:
tibble
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.distinct()
>>> df.distinct('b')
- drop(*args)[source]¶
Drop unwanted columns
- Parameters:
*args (str) – Columns to drop
- Returns:
Tibble with columns in args dropped
- Return type:
tibble
Examples
>>> df.drop('x', 'y')
- drop_na(*args)[source]¶
Drop rows containing missing values. Alias for drop_null(), matching tidyr's drop_na() spelling.
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows containing nulls in args removed.
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c'])
>>> df.drop_na()
>>> df.drop_na('x')
- drop_null(*args)[source]¶
Drop rows containing missing values
- Parameters:
*args (str) – Columns to drop nulls from (defaults to all)
- Returns:
Tibble with rows in args with missing values dropped
- Return type:
tibble
Examples
>>> df = tp.tibble(x = [1, None, 3], y = [None, 'b', 'c'], z = range(3))
>>> df.drop_null()
>>> df.drop_null('x', 'y')
- fill(*args, direction='down', by=None)[source]¶
Fill in missing values with previous or next value
- Parameters:
*args (str) – Columns to fill
direction (str) – Direction to fill. One of [‘down’, ‘up’, ‘downup’, ‘updown’]
by (str, list) – Columns to group by
- Returns:
Tibble with missing values filled
- Return type:
Examples
>>> df = tp.tibble({'a': [1, None, 3, 4, 5],
...                 'b': [None, 2, None, None, 5],
...                 'groups': ['a', 'a', 'a', 'b', 'b']})
>>> df.fill('a', 'b')
>>> df.fill('a', 'b', by = 'groups')
>>> df.fill('a', 'b', direction = 'downup')
- filter(*args, by=None)[source]¶
Filter rows on one or more conditions
- Parameters:
*args (Expr) – Conditions to filter by
by (str, list) – Columns to group by
- Returns:
A tibble with rows that match condition.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': ['a', 'a', 'b']})
>>> df.filter(col('a') < 2, col('b') == 'a')
>>> df.filter((col('a') < 2) & (col('b') == 'a'))
>>> df.filter(col('a') <= tp.mean(col('a')), by = 'b')
- freq(vars=None, groups=None, na_rm=False, na_label=None)[source]¶
Compute frequency table.
- Parameters:
vars (str, list, or dict) – Variables to return value frequencies for. If a dict is provided, keys should be variable names and values the variable labels for the output.
groups (str, list, dict, or None, optional) – Variable names to condition marginal frequencies on. If a dict is provided, keys should be variable names and values the variable labels for the output. Defaults to None (no grouping).
na_rm (bool, optional) – Whether to remove NAs from the calculation. Defaults to False.
na_label (str) – Label to use for the NA values
- Returns:
A tibble with relative frequencies and counts.
- Return type:
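As an illustration of the frequencies freq() reports, here is a plain-Python sketch of the same computation (hypothetical data; plain lists stand in for a tibble column, and this is not the library's implementation):

```python
from collections import Counter

# Hypothetical column values; None plays the role of a missing value.
values = ['a', 'a', 'b', None, 'b', 'a']

# Dropping missing values mirrors na_rm=True.
non_missing = [v for v in values if v is not None]
counts = Counter(non_missing)
total = sum(counts.values())

# Counts and relative frequencies, the two quantities freq() returns.
freq_table = {k: {'n': n, 'freq': n / total} for k, n in counts.items()}
print(freq_table)  # {'a': {'n': 3, 'freq': 0.6}, 'b': {'n': 2, 'freq': 0.4}}
```

With groups given, the same table would be computed within each group.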
- full_join(df, left_on=None, right_on=None, on=None, suffix: str = '_right')[source]¶
Perform a full join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Union between the original and the df tibbles. The rows that don’t match in one of the tibbles will be completed with missing values.
- Return type:
Examples
>>> df1.full_join(df2)
>>> df1.full_join(df2, on = 'x')
>>> df1.full_join(df2, left_on = 'left_x', right_on = 'x')
- get_dupes(*cols)[source]¶
Find duplicate rows
- Parameters:
*cols (str) – Column names to check for duplicates. If empty, checks all columns.
- Returns:
A tibble containing duplicate rows with a ‘dupe_count’ column.
- Return type:
Examples
>>> df.get_dupes('x', 'y')
- glimpse(regex='.')[source]¶
Print compact information about the data
- Parameters:
regex (str, list, dict) – Return information of the variables that match the regular expression, the list, or the dictionary. If dictionary is used, the variable names must be the dictionary keys.
- Return type:
None
- group_by(group, *args, **kwargs)[source]¶
Takes an existing tibble and converts it into a grouped tibble where operations are performed “by group”. ungroup() happens automatically after the operation is performed.
- Parameters:
group (str, list) – Variable names to group by.
- Returns:
A tibble with values grouped by one or more columns.
- Return type:
Grouped tibble
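The "by group" idea can be pictured with plain dictionaries. This sketch (hypothetical data, not the actual implementation) groups rows by a key and aggregates each group; the flat result plays the role of the automatic ungroup():

```python
# Hypothetical rows of a tibble, with 'g' as the grouping column.
rows = [{'g': 'a', 'x': 1}, {'g': 'a', 'x': 2}, {'g': 'b', 'x': 3}]

# Collect each group's values ...
groups = {}
for r in rows:
    groups.setdefault(r['g'], []).append(r['x'])

# ... then aggregate per group; the flat dict mirrors the ungrouped result.
means = {g: sum(xs) / len(xs) for g, xs in groups.items()}
print(means)  # {'a': 1.5, 'b': 3.0}
```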
- hoist(col_name, *, remove=False, **fields)[source]¶
Pull named elements out of a list- or struct-column into top-level columns.
- Parameters:
col_name (str) – Name of the list or struct column to reach into.
remove (bool) – If True, drop the original column after hoisting.
**fields (str, int, or list) – Each keyword defines a new top-level column. The value is a path into the list/struct column: a field name, an integer list index, or a list of such steps for nested access.
Examples
>>> df = tp.tibble(meta = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
>>> df.hoist('meta', a = 'a')
- inner_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform an inner join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
A tibble with intersection of cases in the original and df tibbles.
- Return type:
Examples
>>> df1.inner_join(df2)
>>> df1.inner_join(df2, on = 'x')
>>> df1.inner_join(df2, left_on = 'left_x', right_on = 'x')
- left_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a left join
- Parameters:
df (tibble) – Lazy DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
All rows from the original tibble, with matching columns added from the df tibble. Columns to match on are given in the function parameters.
- Return type:
Examples
>>> df1.left_join(df2)
>>> df1.left_join(df2, on = 'x')
>>> df1.left_join(df2, left_on = 'left_x', right_on = 'x')
- mutate(*args, by=None, **kwargs)[source]¶
Add or modify columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
Original tibble with new column created.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.mutate(double_a = col('a') * 2,
...           a_plus_b = col('a') + col('b'))
>>> df.mutate(row_num = row_number(), by = 'c')
- nest(by, *args, **kwargs)[source]¶
Creates a nested tibble
- Parameters:
by (list, str) – Columns to nest on
**kwargs –
- data (list of column names) – Columns to include in the nested data. If not provided, include all columns except the ones used in 'by'.
- key (str) – Name of the resulting nested column.
- names_sep (str) – If not provided (default), the names in the nested data will come from the former names. If a string, the new inner names in the nested dataframe will use the outer names with names_sep automatically stripped. This makes names_sep roughly symmetric between nesting and unnesting.
- Returns:
The resulting tibble will have a column that contains nested tibbles
- Return type:
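The nesting idea can be sketched with itertools.groupby (hypothetical data; inner dicts stand in for the nested tibbles, and this is not the library's implementation):

```python
from itertools import groupby

# Hypothetical rows; 'g' plays the role of the `by` column.
rows = [{'g': 'a', 'x': 1}, {'g': 'b', 'x': 3}, {'g': 'a', 'x': 2}]

# groupby needs sorted input.
rows_sorted = sorted(rows, key=lambda r: r['g'])
nested = {
    key: [{'x': r['x']} for r in grp]  # all columns except 'g' go inside
    for key, grp in groupby(rows_sorted, key=lambda r: r['g'])
}
print(nested)  # {'a': [{'x': 1}, {'x': 2}], 'b': [{'x': 3}]}
```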
- pack(**groups)[source]¶
Pack several columns into one or more struct columns.
- Parameters:
**groups (list or str) – Each keyword defines a new struct column. The value is a list of existing column names to pack into that struct.
Examples
>>> df = tp.tibble(x = [1, 2], y = [3, 4], z = ['a', 'b'])
>>> df.pack(position = ['x', 'y'])
- pipe(fn, *args, **kwargs)[source]¶
Apply a function to the entire DataFrame
- Parameters:
fn (callable) – Function to apply. The tibble is passed as the first argument.
*args (any) – Additional positional arguments passed to fn.
**kwargs (any) – Additional keyword arguments passed to fn.
- Returns:
Result of fn(self, *args, **kwargs).
- Return type:
any
Examples
>>> def add_column(df, name, value):
...     return df.mutate(**{name: value})
>>> df.pipe(add_column, 'new_col', 1)
- pivot_longer(cols=None, names_to='name', values_to='value')[source]¶
Pivot data from wide to long
- Parameters:
cols (Expr) – List of the columns to pivot. Defaults to all columns.
names_to (str) – Name of the new “names” column.
values_to (str) – Name of the new “values” column
- Returns:
Original tibble, but in long format.
- Return type:
Examples
>>> df = tp.tibble({'id': ['id1', 'id2'], 'a': [1, 2], 'b': [1, 2]})
>>> df.pivot_longer(cols = ['a', 'b'])
>>> df.pivot_longer(cols = ['a', 'b'], names_to = 'stuff', values_to = 'things')
- pivot_wider(names_from='name', values_from='value', id_cols=None, values_fn='first', values_fill=None)[source]¶
Pivot data from long to wide
- Parameters:
names_from (str) – Column to get the new column names from.
values_from (str) – Column to get the new column values from
id_cols (str, list) – A set of columns that uniquely identifies each observation. Defaults to all columns in the data table except for the columns specified in names_from and values_from.
values_fn (str) – Function for how multiple entries per group should be dealt with. Any of ‘first’, ‘count’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’
values_fill (str) – If values are missing/null, what value should be filled in. Can use: “backward”, “forward”, “mean”, “min”, “max”, “zero”, “one”
- Returns:
Original tibble, but in wide format.
- Return type:
Examples
>>> df = tp.tibble({'id': [1, 1], 'variable': ['a', 'b'], 'value': [1, 2]})
>>> df.pivot_wider(names_from = 'variable', values_from = 'value')
- print(n=1000, ncols=1000, str_length=1000, digits=2)[source]¶
Print the DataFrame
- Parameters:
n (int, default=1000) – Number of rows to print
ncols (int, default=1000) – Number of columns to print
str_length (int, default=1000) – Maximum length of the strings.
- Return type:
None
- pull(var=None)[source]¶
Extract a column as a series
- Parameters:
var (str) – Name of the column to extract. Defaults to the last column.
- Returns:
The series will contain the values of the column from var.
- Return type:
Series
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3)})
>>> df.pull('a')
- relevel(x, ref)[source]¶
Change the reference level of a string or factor column and convert it to factor
- Parameters:
x (str) – Variable name
ref (str) – Reference level
- Returns:
The original tibble with the column specified in x as an ordered factor, with the first category specified in ref.
- Return type:
tibble
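The reordering relevel() performs amounts to moving one category to the front of the level list. A minimal sketch with plain lists (hypothetical level names, not the library's implementation):

```python
# Hypothetical factor levels in their current order.
levels = ['low', 'medium', 'high']
ref = 'medium'

# Move the reference level first, keeping the remaining order intact.
releveled = [ref] + [lvl for lvl in levels if lvl != ref]
print(releveled)  # ['medium', 'low', 'high']
```

A reference level placed first is what makes it the baseline in downstream modeling, mirroring R's relevel().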
- relocate(*args, before=None, after=None)[source]¶
Move a column or columns to a new position
- Parameters:
*args (str, Expr) – Columns to move
before (str) – Column before which the moved columns are placed.
after (str) – Column after which the moved columns are placed.
- Returns:
Original tibble with columns relocated.
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.relocate('a', before = 'c')
>>> df.relocate('b', after = 'c')
- rename(*args, regex=False, tolower=False, strict=False, **kwargs)[source]¶
Rename columns
- Parameters:
*args (str or dict) – If a single dict is provided, it is used as {old_name: new_name}. If strings are provided, they are treated as pairs: new_name, old_name, …
regex (bool, default False) – If True, uses regular expression replacement {<matched from>:<matched to>}
tolower (bool, default False) – If True, convert all to lower case
**kwargs (str) – Keyword arguments in the form new_name=’old_name’
- Returns:
Original tibble with columns renamed.
- Return type:
Examples
>>> df = tp.tibble({'x': range(3), 't': range(3), 'z': ['a', 'a', 'b']})
>>> df.rename({'x': 'new_x'})
>>> df.rename(new_x = 'x')
>>> df.rename('new_x', 'x')
- replace(rep, regex=False)[source]¶
Replace values of a column, using either polars' or pandas' replace() format.
- Parameters:
rep (dict) –
- Format to use polars’ replace:
{<varname>:{<old value>:<new value>, …}}
- Format to use pandas’ replace:
{<old value>:<new value>, …}
regex (bool) – If true, replace using regular expression. It uses pandas replace()
- Returns:
Original tibble with values of columns replaced based on rep.
- Return type:
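The polars-style rep format can be pictured as a nested dict lookup. This sketch (hypothetical data, not the library's implementation) applies {varname: {old: new}} column by column:

```python
# Hypothetical columns of a tibble.
data = {'x': [1, 2, 3], 'y': ['a', 'b', 'a']}
rep = {'y': {'a': 'z'}}  # polars-style: {<varname>: {<old value>: <new value>}}

# Values without a mapping pass through unchanged.
replaced = {
    col: [rep.get(col, {}).get(v, v) for v in vals]
    for col, vals in data.items()
}
print(replaced)  # {'x': [1, 2, 3], 'y': ['z', 'b', 'z']}
```

The pandas-style flat format {old: new} would apply the same lookup to every column instead.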
- replace_na(replace=None)[source]¶
Replace null values in specified columns
- Parameters:
replace (dict) – Dictionary mapping column names to replacement values.
- Returns:
A tibble with nulls replaced.
- Return type:
Examples
>>> df.replace_na({'x': 0, 'y': 'missing'})
- replace_null(replace=None)[source]¶
Replace null values
- Parameters:
replace (dict, str, int, or float) – Dictionary of column/replacement pairs, or a value to replace nulls with. If not a dict, replace in all columns of the matching type: a string replaces nulls in all string columns, and so on.
- Returns:
Original tibble with missing/null values replaced.
- Return type:
Examples
>>> df = tp.tibble({'a': [None, 'abc', 'cde'], 'b': [None, 1, 2], 'c': [None, 1.1, 2.2]})
>>> df.replace_null({'a': 'New value'})
>>> df.replace_null({'a': 1})
>>> df.replace_null({'b': 1})
>>> df.replace_null({'b': 1.1})
>>> df.replace_null({'c': 1})
>>> df.replace_null('a')
>>> df.replace_null(1)
>>> df.replace_null(1.1)
- right_join(df, left_on=None, right_on=None, on=None, suffix='_right')[source]¶
Perform a right join
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
suffix (str) – Suffix to append to columns with a duplicate name.
- Returns:
Every row of df with matching columns from self. Unmatched rows on the left side receive null values.
- Return type:
tibble
Examples
>>> df1.right_join(df2)
>>> df1.right_join(df2, on = 'x')
>>> df1.right_join(df2, left_on = 'left_x', right_on = 'x')
- sample_frac(fraction, seed=None, with_replacement=False)[source]¶
Randomly sample a fraction of rows
- Parameters:
fraction (float) – Fraction of rows to sample (between 0 and 1).
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with a random fraction of rows.
- Return type:
Examples
>>> df.sample_frac(0.5, seed = 42)
- sample_n(n, seed=None, with_replacement=False)[source]¶
Randomly sample n rows
- Parameters:
n (int) – Number of rows to sample.
seed (int, optional) – Random seed for reproducibility.
with_replacement (bool) – Whether to sample with replacement.
- Returns:
A tibble with n randomly sampled rows.
- Return type:
Examples
>>> df.sample_n(5, seed = 42)
- save_data(fn, copies=None, sep=';', kws_latex=None, *args, **kws)[source]¶
Save data based on the filename.
- Parameters:
fn (callable, str) – Path and filename
copies (list of str) – List of file extensions. Copies of the file are saved based on each extension, using the same filename and path used in fn.
sep (str, optional) – Column separator used when exporting to text-like files (.csv, .tsv, .txt, etc.)
kws_latex (dict) – Arguments of to_latex(). See tibble.to_latex()
Notes
Additional positional and keyword arguments are passed to the underlying method used to save the file, which is based on the file extension.
.tex => tidypolars_extra.tibble.to_latex
.csv => polars.write_csv (uses sep=';' as default)
.tsv => polars.write_csv (uses sep=' ' as default)
.dat => polars.write_csv (uses sep=' ' as default)
.txt => polars.write_csv (uses sep=' ' as default)
.xls => polars.write_excel
.xlsx => polars.write_excel
.dta => pandas.DataFrame.to_stata
.parquet => polars.write_parquet
Use silently=True to save quietly (Default False).
- select(*args)[source]¶
Select or drop columns
- Parameters:
*args (str, list, dict, or combinations of them) – Columns to select. It can combine names, list of names, and a dict. If dict, it will rename the columns based on the dict. It also accepts helper functions:
tp.matches(<regex>), tp.contains(<str>), tp.where(<str>).
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'abcba': ['a', 'a', 'b']})
>>> df.select('a', 'b')
>>> df.select(col('a'), col('b'))
>>> df.select({'a': 'new name'}, tp.matches("c"))
>>> df.select(tp.where('numeric'))
- semi_join(df, left_on=None, right_on=None, on=None)[source]¶
Perform a semi join (keep rows with a match in df, no columns added)
- Parameters:
df (tibble) – DataFrame to join with.
left_on (str, list) – Join column(s) of the left DataFrame.
right_on (str, list) – Join column(s) of the right DataFrame.
on (str, list) – Join column(s) of both DataFrames. If set, left_on and right_on should be None.
- Returns:
Rows from the original tibble that have a match in df.
- Return type:
Examples
>>> df1.semi_join(df2, on = 'x')
- separate(sep_col, into, sep='_', remove=True)[source]¶
Separate a character column into multiple columns
- Parameters:
sep_col (str) – Column to split into multiple columns
into (list) – List of new column names
sep (str) – Separator to split on. Default to ‘_’
remove (bool) – If True removes the input column from the output data frame
- Returns:
Original tibble with the column split based on sep.
- Return type:
Examples
>>> df = tp.tibble(x = ['a_a', 'b_b', 'c_c'])
>>> df.separate('x', into = ['left', 'right'])
- separate_longer_delim(sep_col, delim)[source]¶
Split a string column by delim into longer rows.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
Examples
>>> df = tp.tibble(x = ['a,b', 'c'])
>>> df.separate_longer_delim('x', ',')
- separate_longer_position(sep_col, width)[source]¶
Split each string into chunks of width characters and convert into longer rows.
- Parameters:
sep_col (str) – Column to split.
width (int) – Width of each chunk in characters.
Examples
>>> df = tp.tibble(x = ['abcd', 'efgh'])
>>> df.separate_longer_position('x', 2)
- separate_rows(*cols, sep=',')[source]¶
Split the given columns on sep and explode them into longer rows. Superseded by separate_longer_delim() but kept for tidyr parity.
- Parameters:
*cols (str) – Columns to split and explode.
sep (str) – Delimiter to split on (default: ',').
Examples
>>> df = tp.tibble(x = ['a,b', 'c'], y = [1, 2])
>>> df.separate_rows('x', sep = ',')
- separate_wider_delim(sep_col, delim, names, *, remove=True, too_few='error', too_many='error')[source]¶
Split a string column into several columns using a delimiter.
- Parameters:
sep_col (str) – Column to split.
delim (str) – Delimiter to split on.
names (list) – Names of the resulting columns.
remove (bool) – If True (default) drop the original column.
too_few (str) – One of 'error' (default) or 'align_start'. When 'error', raises if a row produces fewer fields than len(names).
too_many (str) – One of 'error' (default) or 'drop'. When 'error', raises if a row produces more fields than len(names).
Examples
>>> df = tp.tibble(x = ['a_1', 'b_2'])
>>> df.separate_wider_delim('x', '_', names = ['letter', 'num'])
- separate_wider_position(sep_col, widths, *, remove=True)[source]¶
Split a string column into several columns by character positions.
- Parameters:
sep_col (str) – Column to split.
widths (dict) – Mapping of new column name → width in characters.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['2024Q1', '2025Q2'])
>>> df.separate_wider_position('x', widths = {'year': 4, 'q': 2})
- separate_wider_regex(sep_col, patterns, *, remove=True)[source]¶
Split a string column using a regular expression with named groups.
- Parameters:
sep_col (str) – Column to split.
patterns (str or dict) – Either a regex string containing named capturing groups, or a dict {name: sub_pattern} which is assembled into a single regex of named groups in the given order.
remove (bool) – If True (default) drop the original column.
Examples
>>> df = tp.tibble(x = ['id-001', 'id-002'])
>>> df.separate_wider_regex('x', {'prefix': '[a-z]+', '_sep': '-', 'num': r'\d+'})
- set_names(nm=None)[source]¶
Change the column names of the data frame
- Parameters:
nm (list) – A list of new names for the data frame
Examples
>>> df = tp.tibble(x = range(3), y = range(3))
>>> df.set_names(['a', 'b'])
- slice(*args, by=None)[source]¶
Grab rows from a data frame
- Parameters:
*args (int, list) – Rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice(0, 1)
>>> df.slice(0, by = 'c')
- slice_head(n=5, *, by=None)[source]¶
Grab top rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_head(2)
>>> df.slice_head(1, by = 'c')
- slice_max(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the largest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (descending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_max('x', n = 1)
>>> df.slice_max('x', n = 1, by = 'g')
- slice_min(order_by, n=1, *, with_ties=True, by=None)[source]¶
Select rows with the smallest values of order_by.
- Parameters:
order_by (str, list) – Column(s) to order by (ascending).
n (int) – Number of rows to return per group.
with_ties (bool) – If True (default), include tied rows even if that exceeds n.
by (str, list, optional) – Columns to group by.
Examples
>>> df = tp.tibble(x = [1, 2, 2, 3], g = ['a', 'a', 'b', 'b'])
>>> df.slice_min('x', n = 1)
>>> df.slice_min('x', n = 1, by = 'g')
- slice_sample(n=None, *, prop=None, replace=False, seed=None, by=None)[source]¶
Randomly sample rows. Modern replacement for sample_n() and sample_frac().
- Parameters:
n (int, optional) – Number of rows to sample. Provide exactly one of n or prop.
prop (float, optional) – Fraction of rows to sample (between 0 and 1).
replace (bool) – Whether to sample with replacement.
seed (int, optional) – Random seed for reproducibility.
by (str, list, optional) – Columns to group by; sampling happens within each group.
Examples
>>> df.slice_sample(n = 3, seed = 42)
>>> df.slice_sample(prop = 0.5, by = 'g', seed = 42)
- slice_tail(n=5, *, by=None)[source]¶
Grab bottom rows from a data frame
- Parameters:
n (int) – Number of rows to grab
by (str, list) – Columns to group by
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.slice_tail(2)
>>> df.slice_tail(1, by = 'c')
- summarize(*args, by=None, **kwargs)[source]¶
Aggregate data with summary statistics
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with the summaries
- Return type:
Examples
>>> df = tp.tibble({'a': range(3), 'b': range(3), 'c': ['a', 'a', 'b']})
>>> df.summarize(avg_a = tp.mean(col('a')))
>>> df.summarize(avg_a = tp.mean(col('a')),
...              by = 'c')
>>> df.summarize(avg_a = tp.mean(col('a')),
...              max_b = tp.max(col('b')))
- tab(row, col, groups=None, margins=True, normalize='all', margins_name='Total', stat='both', na_rm=True, na_label='NA', digits=2)[source]¶
Create a two-way contingency table for two categorical variables, with optional grouping, margins, and normalization.
- Parameters:
row (str) – Name of the variable to be used for the rows of the table.
col (str) – Name of the variable to be used for the columns of the table.
groups (str or list of str, optional) – Variable name(s) to use as grouping variables. When provided, a separate 2x2 table is generated for each group.
margins (bool, default True) – If True, include row and column totals (margins) in the table.
normalize ({'all', 'row', 'columns'}, default 'all') –
- Specifies how to compute the marginal percentages in each cell:
'all': percentages computed over the entire table.
'row': percentages computed across each row.
'columns': percentages computed down each column.
margins_name (str, default 'Total') – Name to assign to the row and column totals.
stat ({'both', 'perc', 'n'}, default 'both') –
- Determines the statistic to display in each cell:
'both': returns both percentages and sample size.
'perc': returns percentages only.
'n': returns sample size only.
na_rm (bool, default True) – If True, remove rows with missing values in the row or col variables.
na_label (str, default 'NA') – Label to use for missing values when na_rm is False.
digits (int, default 2) – Number of digits to round the percentages to.
- Returns:
A contingency table as a tibble. The table contains counts and/or percentages as specified by the stat parameter, includes margins if requested, and is formatted with group headers when grouping variables are provided.
- Return type:
- to_csv(*args, **kws)[source]¶
Save tibble to csv.
Details¶
See polars write_csv() for details.
- rtype:
None
- to_dict(*, as_series=True)[source]¶
Convert the tibble to a dictionary mapping column names to their values
- Parameters:
as_series (bool) – If True (default), the dict values are returned as Series; if False, as lists.
Examples
>>> df.to_dict() >>> df.to_dict(as_series = False)
- to_dta(*args, **kws)[source]¶
Save tibble to dta.
Details¶
See polars write_dta() for details.
- rtype:
None
- to_excel(*args, **kws)[source]¶
Save tibble to excel.
Details¶
See polars write_excel() for details.
- rtype:
None
- to_latex(fn=None, header=None, digits=4, caption=None, label=None, align=None, na_rep='', position='!htb', group_rows_by=None, group_title_align='l', footnotes=None, footnotes_width='\\linewidth', index=False, escape=False, longtable=False, longtable_singlespace=True, rotate=False, scale=True, parse_linebreaks=True, tabular=False, *args, **kws)[source]¶
Convert the object to a LaTeX tabular representation.
- Parameters:
fn (str) – Path with filename
header (list of tuples, optional) –
The column headers for the LaTeX table. Each tuple corresponds to a column. Example creating upper level header with grouped columns:
[("", "col 1"), ("Group A", "col 2"), ("Group A", "col 3"), ("Group B", "col 4"), ("Group B", "col 5"), ]
Example creating two upper level headers with grouped columns:
[("Group 1", "" , "col 1"), ("Group 1", "Group A", "col 2"), ("Group 1", "Group A", "col 3"), ("" , "Group B", "col 4"), ("" , "Group B", "col 5"), ]
digits (int, default=4) – Number of decimal places to round the numerical values in the table.
caption (str, optional) – The caption for the LaTeX table.
label (str, optional) – The label for referencing the table in LaTeX.
align (str, optional) – Column alignment specifications (e.g., ‘lcr’).
na_rep (str, default='') – The representation for NaN values in the table.
position (str, default='!htb') – The placement option for the table in the LaTeX document.
footnotes (dict, optional) – A dictionary where keys are column alignments (‘c’, ‘r’, or ‘l’) and values are the respective footnote strings.
footnotes_width (str, optional) – Width of the footnotes. Example: 'linewidth', '40pt'. If None, no restriction is imposed on the width.
group_rows_by (str, default=None) – Name of the variable in the data with values to group the rows by.
group_title_align (str, default='l') – Alignment of the title of each row group.
index (bool, default=False) – Whether to include the index in the LaTeX table.
escape (bool, default=False) – Whether to escape LaTeX special characters.
longtable (bool, default=False) – If True, the table spans multiple pages
longtable_singlespace (bool) – Force single spacing in longtables
rotate (bool) – Whether to rotate the table to landscape orientation
scale (bool, default=True) – If True, scales the table to fit the linewidth when the table exceeds that size. Ignored when
longtable=True(LaTeX limitation because longtable does not use tabular).parse_linebreaks (bool, default=True) – If True, parse \n and replace it with \makecell to produce line breaks
tabular (bool, default=False) – Whether to use a tabular format for the output.
- Returns:
A LaTeX formatted string of the tibble.
- Return type:
str
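The shape of the string to_latex produces can be sketched with a minimal, hypothetical helper (`simple_tabular` is not part of the library; it only illustrates the tabular layout that options like `align` control):

```python
def simple_tabular(headers, rows, align=None):
    """Build a minimal LaTeX tabular string from headers and row tuples."""
    align = align or "l" * len(headers)
    lines = [r"\begin{tabular}{" + align + "}"]
    # Header row followed by a horizontal rule
    lines.append(" & ".join(headers) + r" \\ \hline")
    # One line per data row, cells joined with '&'
    for row in rows:
        lines.append(" & ".join(str(v) for v in row) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)

tex = simple_tabular(["x", "y"], [(1, "a"), (2, "b")], align="lc")
```

The real method layers rounding, captions, labels, grouped headers, row groups, and footnotes on top of this basic structure.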
- to_markdown()[source]¶
Render the tibble as a Markdown table string
- Returns:
A Markdown-formatted table string.
- Return type:
str
Examples
>>> print(df.to_markdown())
- to_parquet(file=str, compression='snappy', use_pyarrow=False, silently=False, *args, **kws)[source]¶
Write a data frame to a parquet file
- transmute(*args, by=None, **kwargs)[source]¶
Add or modify columns, keeping only the new columns
- Parameters:
*args (Expr) – Column expressions to add or modify
by (str, list) – Columns to group by
**kwargs (Expr) – Column expressions to add or modify
- Returns:
A tibble with only the newly created columns (and grouping columns if by is used).
- Return type:
Examples
>>> df.transmute(double_a = col('a') * 2)
- unite(col='_united', unite_cols=[], sep='_', remove=True)[source]¶
Unite multiple columns by pasting strings together
- Parameters:
col (str) – Name of the new column
unite_cols (list) – List of columns to unite
sep (str) – Separator to use between values
remove (bool) – If True removes input columns from the data frame
Examples
>>> df = tp.tibble(a = ["a", "a", "a"], b = ["b", "b", "b"], c = range(3)) >>> df.unite("united_col", unite_cols = ["a", "b"])
- unnest(col)[source]¶
Unnest a nested tibble
- Parameters:
col (str) – Column to unnest
- Returns:
The nested column is expanded so that its contents become unnested rows of the original tibble.
- Return type:
- unnest_longer(col_name, *, values_to=None, indices_to=None)[source]¶
Turn each element of a list- or struct-column into its own row.
For list columns, this behaves like
DataFrame.explode. For struct columns, each row is expanded into one row per field, with the field name going intoindices_toand the field value intovalues_to.- Parameters:
col_name (str) – Name of the list or struct column to unnest.
values_to (str, optional) – Name of the output value column. For list columns this renames the exploded column. For struct columns this names the value column; defaults to
col_name.indices_to (str, optional) – For struct columns, the name of the field-name column. Defaults to
f"{col_name}_id".
Examples
>>> df = tp.tibble(id = [1, 2], vals = [[10, 20], [30]]) >>> df.unnest_longer('vals')
- unnest_wider(col_name, *, names_sep=None)[source]¶
Turn each element of a struct- or list-column into its own column.
- Parameters:
col_name (str) – Name of the column to unnest.
names_sep (str, optional) – If provided, the output column names become
f"{col_name}{names_sep}{field}"to avoid collisions.
Examples
>>> df = tp.tibble(id = [1, 2], pt = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}]) >>> df.unnest_wider('pt')
- unpack(*cols)[source]¶
Unpack one or more struct columns into their component columns.
- Parameters:
*cols (str) – Names of the struct columns to unpack.
Examples
>>> df = tp.tibble(id = [1, 2]).pack(pt = ['id']) # contrived >>> df.unpack('pt')
- property names[source]¶
Get column names
- Returns:
Names of the columns
- Return type:
list
Examples
>>> df.names