Overview¶
tidypolars-extra brings R’s Tidyverse-style
data manipulation to Python on top of the fast Polars
engine. If you already know dplyr, tidyr, stringr, or lubridate from
R, you should feel right at home.
This vignette is a quick end-to-end tour of the core verbs — filter,
arrange, select, mutate, group_by/summarize, and joins — using
the Palmer Penguins dataset.
Each per-verb vignette that follows covers one verb in depth; this page
shows how they all fit together.
Setup¶
Every tidypolars-extra function lives under a single top-level namespace, so
the convention is to import the package as tp:
import tidypolars_extra as tp
penguins = tp.tibble(
tp.read_data(fn="tidypolars_extra/data/penguins.csv", sep=",", silently=True)
)
The result is a tibble — a thin wrapper around polars.DataFrame
that exposes tidyverse-style verbs like filter, mutate, and summarize
while keeping all of Polars’ speed underneath.
A first look at the data¶
A tibble has R-style nrow / ncol / names properties and the usual
head() for a quick peek:
penguins.nrow, penguins.ncol
(344, 8)
penguins.names
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']
penguins.head()
shape: (5, 8)
┌─────────┬───────────┬───────────────┬───────────────┬──────────────┬─────────────┬────────┬──────┐
│ species ┆ island ┆ bill_length_m ┆ bill_depth_mm ┆ flipper_leng ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ m ┆ --- ┆ th_mm ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ --- ┆ f64 ┆ --- ┆ f64 ┆ str ┆ i64 │
│ ┆ ┆ f64 ┆ ┆ f64 ┆ ┆ ┆ │
╞═════════╪═══════════╪═══════════════╪═══════════════╪══════════════╪═════════════╪════════╪══════╡
│ Adelie ┆ Torgersen ┆ 39.1 ┆ 18.7 ┆ 181.0 ┆ 3750.0 ┆ male ┆ 2007 │
│ Adelie ┆ Torgersen ┆ 39.5 ┆ 17.4 ┆ 186.0 ┆ 3800.0 ┆ female ┆ 2007 │
│ Adelie ┆ Torgersen ┆ 40.3 ┆ 18.0 ┆ 195.0 ┆ 3250.0 ┆ female ┆ 2007 │
│ Adelie ┆ Torgersen ┆ null ┆ null ┆ null ┆ null ┆ null ┆ 2007 │
│ Adelie ┆ Torgersen ┆ 36.7 ┆ 19.3 ┆ 193.0 ┆ 3450.0 ┆ female ┆ 2007 │
└─────────┴───────────┴───────────────┴───────────────┴──────────────┴─────────────┴────────┴──────┘
Palmer Penguins contains body measurements for three species
(Adelie, Chinstrap, Gentoo) across three islands. A handful of rows
have missing measurements, which is realistic — most of the verbs below
handle nulls gracefully, and we’ll drop them explicitly when we need
complete cases.
filter — keep rows of interest¶
filter keeps rows where every condition is True. Rows where a
condition evaluates to null are dropped automatically. Reference columns
with tp.col("name"):
penguins.filter(tp.col("species") == "Gentoo", tp.col("sex") == "female")
shape: (58, 8)
┌─────────┬────────┬────────────────┬───────────────┬────────────────┬─────────────┬────────┬──────┐
│ species ┆ island ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ _mm ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ --- ┆ f64 ┆ str ┆ i64 │
│ ┆ ┆ ┆ ┆ f64 ┆ ┆ ┆ │
╞═════════╪════════╪════════════════╪═══════════════╪════════════════╪═════════════╪════════╪══════╡
│ Gentoo ┆ Biscoe ┆ 46.1 ┆ 13.2 ┆ 211.0 ┆ 4500.0 ┆ female ┆ 2007 │
│ Gentoo ┆ Biscoe ┆ 48.7 ┆ 14.1 ┆ 210.0 ┆ 4450.0 ┆ female ┆ 2007 │
│ Gentoo ┆ Biscoe ┆ 46.5 ┆ 13.5 ┆ 210.0 ┆ 4550.0 ┆ female ┆ 2007 │
│ Gentoo ┆ Biscoe ┆ 45.4 ┆ 14.6 ┆ 211.0 ┆ 4800.0 ┆ female ┆ 2007 │
│ Gentoo ┆ Biscoe ┆ 43.3 ┆ 13.4 ┆ 209.0 ┆ 4400.0 ┆ female ┆ 2007 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ Gentoo ┆ Biscoe ┆ 43.5 ┆ 15.2 ┆ 213.0 ┆ 4650.0 ┆ female ┆ 2009 │
│ Gentoo ┆ Biscoe ┆ 46.2 ┆ 14.1 ┆ 217.0 ┆ 4375.0 ┆ female ┆ 2009 │
│ Gentoo ┆ Biscoe ┆ 47.2 ┆ 13.7 ┆ 214.0 ┆ 4925.0 ┆ female ┆ 2009 │
│ Gentoo ┆ Biscoe ┆ 46.8 ┆ 14.3 ┆ 215.0 ┆ 4850.0 ┆ female ┆ 2009 │
│ Gentoo ┆ Biscoe ┆ 45.2 ┆ 14.8 ┆ 212.0 ┆ 5200.0 ┆ female ┆ 2009 │
└─────────┴────────┴────────────────┴───────────────┴────────────────┴─────────────┴────────┴──────┘
Use the | operator (wrapped in parentheses) for OR conditions,
and & for explicit AND. To drop rows with nulls in specific columns,
use drop_na:
complete = penguins.drop_na()
complete.nrow
333
arrange — sort rows¶
arrange sorts ascending by default. Wrap a column in tp.desc(...) to
sort it in descending order; later columns break ties:
complete.arrange(tp.desc("body_mass_g"), "bill_length_mm").head()
shape: (5, 8)
┌─────────┬────────┬────────────────┬───────────────┬──────────────────┬─────────────┬──────┬──────┐
│ species ┆ island ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_m ┆ body_mass_g ┆ sex ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ m ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ --- ┆ f64 ┆ str ┆ i64 │
│ ┆ ┆ ┆ ┆ f64 ┆ ┆ ┆ │
╞═════════╪════════╪════════════════╪═══════════════╪══════════════════╪═════════════╪══════╪══════╡
│ Gentoo ┆ Biscoe ┆ 49.2 ┆ 15.2 ┆ 221.0 ┆ 6300.0 ┆ male ┆ 2007 │
│ Gentoo ┆ Biscoe ┆ 59.6 ┆ 17.0 ┆ 230.0 ┆ 6050.0 ┆ male ┆ 2007 │
│ Gentoo ┆ Biscoe ┆ 48.8 ┆ 16.2 ┆ 222.0 ┆ 6000.0 ┆ male ┆ 2009 │
│ Gentoo ┆ Biscoe ┆ 51.1 ┆ 16.3 ┆ 220.0 ┆ 6000.0 ┆ male ┆ 2008 │
│ Gentoo ┆ Biscoe ┆ 45.2 ┆ 16.4 ┆ 223.0 ┆ 5950.0 ┆ male ┆ 2008 │
└─────────┴────────┴────────────────┴───────────────┴──────────────────┴─────────────┴──────┴──────┘
select — pick columns¶
select accepts bare column names, negative selections ("-year"), and
the tidyselect helpers from tp.starts_with, tp.ends_with, tp.contains,
tp.matches, and tp.everything:
complete.select("species", "island", "sex", "body_mass_g").head()
shape: (5, 4)
┌─────────┬───────────┬────────┬─────────────┐
│ species ┆ island ┆ sex ┆ body_mass_g │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 │
╞═════════╪═══════════╪════════╪═════════════╡
│ Adelie ┆ Torgersen ┆ male ┆ 3750.0 │
│ Adelie ┆ Torgersen ┆ female ┆ 3800.0 │
│ Adelie ┆ Torgersen ┆ female ┆ 3250.0 │
│ Adelie ┆ Torgersen ┆ female ┆ 3450.0 │
│ Adelie ┆ Torgersen ┆ male ┆ 3650.0 │
└─────────┴───────────┴────────┴─────────────┘
complete.select("species", tp.starts_with("bill_")).head()
shape: (5, 3)
┌─────────┬────────────────┬───────────────┐
│ species ┆ bill_length_mm ┆ bill_depth_mm │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════════╪════════════════╪═══════════════╡
│ Adelie ┆ 39.1 ┆ 18.7 │
│ Adelie ┆ 39.5 ┆ 17.4 │
│ Adelie ┆ 40.3 ┆ 18.0 │
│ Adelie ┆ 36.7 ┆ 19.3 │
│ Adelie ┆ 39.3 ┆ 20.6 │
└─────────┴────────────────┴───────────────┘
complete.select("species", tp.contains("_mm")).head()
shape: (5, 4)
┌─────────┬────────────────┬───────────────┬───────────────────┐
│ species ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_mm │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪════════════════╪═══════════════╪═══════════════════╡
│ Adelie ┆ 39.1 ┆ 18.7 ┆ 181.0 │
│ Adelie ┆ 39.5 ┆ 17.4 ┆ 186.0 │
│ Adelie ┆ 40.3 ┆ 18.0 ┆ 195.0 │
│ Adelie ┆ 36.7 ┆ 19.3 ┆ 193.0 │
│ Adelie ┆ 39.3 ┆ 20.6 ┆ 190.0 │
└─────────┴────────────────┴───────────────┴───────────────────┘
mutate — add or modify columns¶
mutate creates new columns as keyword arguments. Expressions are built
from tp.col(...) and regular Python operators. Use tp.case_when for
multi-branch logic — conditions are checked top-to-bottom and _default
catches everything that falls through:
(
complete
.mutate(
bill_ratio=tp.col("bill_length_mm") / tp.col("bill_depth_mm"),
size_class=tp.case_when(
tp.col("body_mass_g") < 3500, "small",
tp.col("body_mass_g") < 4500, "medium",
_default="large",
),
)
.select("species", "sex", "body_mass_g", "bill_ratio", "size_class")
.head()
)
shape: (5, 5)
┌─────────┬────────┬─────────────┬────────────┬────────────┐
│ species ┆ sex ┆ body_mass_g ┆ bill_ratio ┆ size_class │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ str │
╞═════════╪════════╪═════════════╪════════════╪════════════╡
│ Adelie ┆ male ┆ 3750.0 ┆ 2.090909 ┆ medium │
│ Adelie ┆ female ┆ 3800.0 ┆ 2.270115 ┆ medium │
│ Adelie ┆ female ┆ 3250.0 ┆ 2.238889 ┆ small │
│ Adelie ┆ female ┆ 3450.0 ┆ 1.901554 ┆ small │
│ Adelie ┆ male ┆ 3650.0 ┆ 1.907767 ┆ medium │
└─────────┴────────┴─────────────┴────────────┴────────────┘
group_by + summarize — aggregate by group¶
summarize collapses rows into one row per group. Pass the grouping
columns via by=; inside, use aggregators like tp.mean, tp.sd,
tp.median, or tp.n() (which counts rows in each group):
complete.summarize(
n=tp.n(),
mean_mass=tp.mean("body_mass_g"),
sd_mass=tp.sd("body_mass_g"),
mean_bill=tp.mean("bill_length_mm"),
by="species",
)
shape: (3, 5)
┌───────────┬─────┬─────────────┬────────────┬───────────┐
│ species ┆ n ┆ mean_mass ┆ sd_mass ┆ mean_bill │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪═════╪═════════════╪════════════╪═══════════╡
│ Adelie ┆ 146 ┆ 3706.164384 ┆ 458.620135 ┆ 38.823973 │
│ Chinstrap ┆ 68 ┆ 3733.088235 ┆ 384.335081 ┆ 48.833824 │
│ Gentoo ┆ 119 ┆ 5092.436975 ┆ 501.476154 ┆ 47.568067 │
└───────────┴─────┴─────────────┴────────────┴───────────┘
Pass a list to by= to group by multiple columns:
complete.summarize(
n=tp.n(),
mean_mass=tp.mean("body_mass_g"),
by=["species", "sex"],
)
shape: (6, 4)
┌───────────┬────────┬─────┬─────────────┐
│ species ┆ sex ┆ n ┆ mean_mass │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 ┆ f64 │
╞═══════════╪════════╪═════╪═════════════╡
│ Chinstrap ┆ male ┆ 34 ┆ 3938.970588 │
│ Gentoo ┆ male ┆ 61 ┆ 5484.836066 │
│ Adelie ┆ female ┆ 73 ┆ 3368.835616 │
│ Chinstrap ┆ female ┆ 34 ┆ 3527.205882 │
│ Gentoo ┆ female ┆ 58 ┆ 4679.741379 │
│ Adelie ┆ male ┆ 73 ┆ 4043.493151 │
└───────────┴────────┴─────┴─────────────┘
The by= argument is also available on filter and mutate for
grouped row-filtering and grouped window functions — see the
Group By vignette for the full story.
Joining tables¶
Joins combine two tibbles on shared key columns. left_join keeps every
row from the left table and attaches matching rows from the right. Let’s
build a small lookup table of scientific names and attach it to a per-species
summary:
scientific = tp.tibble(
species=["Adelie", "Chinstrap", "Gentoo"],
scientific_name=[
"Pygoscelis adeliae",
"Pygoscelis antarcticus",
"Pygoscelis papua",
],
)
summary = complete.summarize(
n=tp.n(),
mean_mass=tp.mean("body_mass_g"),
by="species",
)
summary.left_join(scientific, on="species")
shape: (3, 4)
┌───────────┬─────┬─────────────┬────────────────────────┐
│ species ┆ n ┆ mean_mass ┆ scientific_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 ┆ str │
╞═══════════╪═════╪═════════════╪════════════════════════╡
│ Adelie ┆ 146 ┆ 3706.164384 ┆ Pygoscelis adeliae │
│ Chinstrap ┆ 68 ┆ 3733.088235 ┆ Pygoscelis antarcticus │
│ Gentoo ┆ 119 ┆ 5092.436975 ┆ Pygoscelis papua │
└───────────┴─────┴─────────────┴────────────────────────┘
inner_join, right_join, and full_join work the same way.
When the key columns have different names in each table, use
left_on= and right_on= instead of on=.
Putting it all together¶
Because every verb returns a new tibble, you can chain them into a
single pipeline that reads top-to-bottom like a recipe — filter,
transform, group, summarize, sort:
(
penguins
.drop_na()
.filter(tp.col("body_mass_g") > 3000)
.mutate(bill_ratio=tp.col("bill_length_mm") / tp.col("bill_depth_mm"))
.summarize(
n=tp.n(),
mean_ratio=tp.mean("bill_ratio"),
mean_mass=tp.mean("body_mass_g"),
by=["species", "sex"],
)
.arrange("species", tp.desc("mean_mass"))
)
shape: (6, 5)
┌───────────┬────────┬─────┬────────────┬─────────────┐
│ species ┆ sex ┆ n ┆ mean_ratio ┆ mean_mass │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │
╞═══════════╪════════╪═════╪════════════╪═════════════╡
│ Adelie ┆ male ┆ 73 ┆ 2.123835 ┆ 4043.493151 │
│ Adelie ┆ female ┆ 65 ┆ 2.118286 ┆ 3424.615385 │
│ Chinstrap ┆ male ┆ 34 ┆ 2.656501 ┆ 3938.970588 │
│ Chinstrap ┆ female ┆ 32 ┆ 2.647082 ┆ 3572.65625 │
│ Gentoo ┆ male ┆ 61 ┆ 3.152081 ┆ 5484.836066 │
│ Gentoo ┆ female ┆ 58 ┆ 3.202391 ┆ 4679.741379 │
└───────────┴────────┴─────┴────────────┴─────────────┘
Where to go next¶
Each core verb has its own vignette with more worked examples and edge cases:
Filter — row selection, grouped filters, helper functions
Arrange — sorting, ties, nulls
Select — tidyselect helpers, negation, reordering
Rename — renaming columns
Mutate — new columns,
case_when,if_else, grouped mutatesTransmute — mutate + select in one step
Summarize — aggregations and helper stats
Group By — the
by=argument and explicit groups
For the full list of string, date, factor, and statistical helpers, see the API reference.