Overview

tidypolars-extra brings R’s Tidyverse-style data manipulation to Python on top of the fast Polars engine. If you already know dplyr, tidyr, stringr, or lubridate from R, you should feel right at home.

This vignette is a quick end-to-end tour of the core verbs — filter, arrange, select, mutate, group_by/summarize, and joins — using the Palmer Penguins dataset. Each per-verb vignette that follows covers one verb in depth; this page shows how they all fit together.

Setup

Every tidypolars-extra function lives under a single top-level namespace, so the convention is to import the package as tp:

import tidypolars_extra as tp

penguins = tp.tibble(
    tp.read_data(fn="tidypolars_extra/data/penguins.csv", sep=",", silently=True)
)

The result is a tibble — a thin wrapper around polars.DataFrame that exposes tidyverse-style verbs like filter, mutate, and summarize while keeping all of Polars’ speed underneath.

A first look at the data

A tibble has R-style nrow / ncol / names properties and the usual head() for a quick peek:

penguins.nrow, penguins.ncol
(344, 8)
penguins.names
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']
penguins.head()
shape: (5, 8)
┌─────────┬───────────┬───────────────┬───────────────┬──────────────┬─────────────┬────────┬──────┐
│ species ┆ island    ┆ bill_length_m ┆ bill_depth_mm ┆ flipper_leng ┆ body_mass_g ┆ sex    ┆ year │
│ ---     ┆ ---       ┆ m             ┆ ---           ┆ th_mm        ┆ ---         ┆ ---    ┆ ---  │
│ str     ┆ str       ┆ ---           ┆ f64           ┆ ---          ┆ f64         ┆ str    ┆ i64  │
│         ┆           ┆ f64           ┆               ┆ f64          ┆             ┆        ┆      │
╞═════════╪═══════════╪═══════════════╪═══════════════╪══════════════╪═════════════╪════════╪══════╡
│ Adelie  ┆ Torgersen ┆ 39.1          ┆ 18.7          ┆ 181.0        ┆ 3750.0      ┆ male   ┆ 2007 │
│ Adelie  ┆ Torgersen ┆ 39.5          ┆ 17.4          ┆ 186.0        ┆ 3800.0      ┆ female ┆ 2007 │
│ Adelie  ┆ Torgersen ┆ 40.3          ┆ 18.0          ┆ 195.0        ┆ 3250.0      ┆ female ┆ 2007 │
│ Adelie  ┆ Torgersen ┆ null          ┆ null          ┆ null         ┆ null        ┆ null   ┆ 2007 │
│ Adelie  ┆ Torgersen ┆ 36.7          ┆ 19.3          ┆ 193.0        ┆ 3450.0      ┆ female ┆ 2007 │
└─────────┴───────────┴───────────────┴───────────────┴──────────────┴─────────────┴────────┴──────┘

Palmer Penguins contains body measurements for three species (Adelie, Chinstrap, Gentoo) across three islands. A handful of rows have missing measurements, which is realistic — most of the verbs below handle nulls gracefully, and we’ll drop them explicitly when we need complete cases.

filter — keep rows of interest

filter keeps rows where every condition is True. Rows where a condition evaluates to null are dropped automatically. Reference columns with tp.col("name"):

penguins.filter(tp.col("species") == "Gentoo", tp.col("sex") == "female")
shape: (58, 8)
┌─────────┬────────┬────────────────┬───────────────┬────────────────┬─────────────┬────────┬──────┐
│ species ┆ island ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length ┆ body_mass_g ┆ sex    ┆ year │
│ ---     ┆ ---    ┆ ---            ┆ ---           ┆ _mm            ┆ ---         ┆ ---    ┆ ---  │
│ str     ┆ str    ┆ f64            ┆ f64           ┆ ---            ┆ f64         ┆ str    ┆ i64  │
│         ┆        ┆                ┆               ┆ f64            ┆             ┆        ┆      │
╞═════════╪════════╪════════════════╪═══════════════╪════════════════╪═════════════╪════════╪══════╡
│ Gentoo  ┆ Biscoe ┆ 46.1           ┆ 13.2          ┆ 211.0          ┆ 4500.0      ┆ female ┆ 2007 │
│ Gentoo  ┆ Biscoe ┆ 48.7           ┆ 14.1          ┆ 210.0          ┆ 4450.0      ┆ female ┆ 2007 │
│ Gentoo  ┆ Biscoe ┆ 46.5           ┆ 13.5          ┆ 210.0          ┆ 4550.0      ┆ female ┆ 2007 │
│ Gentoo  ┆ Biscoe ┆ 45.4           ┆ 14.6          ┆ 211.0          ┆ 4800.0      ┆ female ┆ 2007 │
│ Gentoo  ┆ Biscoe ┆ 43.3           ┆ 13.4          ┆ 209.0          ┆ 4400.0      ┆ female ┆ 2007 │
│ …       ┆ …      ┆ …              ┆ …             ┆ …              ┆ …           ┆ …      ┆ …    │
│ Gentoo  ┆ Biscoe ┆ 43.5           ┆ 15.2          ┆ 213.0          ┆ 4650.0      ┆ female ┆ 2009 │
│ Gentoo  ┆ Biscoe ┆ 46.2           ┆ 14.1          ┆ 217.0          ┆ 4375.0      ┆ female ┆ 2009 │
│ Gentoo  ┆ Biscoe ┆ 47.2           ┆ 13.7          ┆ 214.0          ┆ 4925.0      ┆ female ┆ 2009 │
│ Gentoo  ┆ Biscoe ┆ 46.8           ┆ 14.3          ┆ 215.0          ┆ 4850.0      ┆ female ┆ 2009 │
│ Gentoo  ┆ Biscoe ┆ 45.2           ┆ 14.8          ┆ 212.0          ┆ 5200.0      ┆ female ┆ 2009 │
└─────────┴────────┴────────────────┴───────────────┴────────────────┴─────────────┴────────┴──────┘

Use the | operator (wrapped in parentheses) for OR conditions, and & for explicit AND. To drop rows with nulls in specific columns, use drop_na:

complete = penguins.drop_na()
complete.nrow
333

arrange — sort rows

arrange sorts ascending by default. Wrap a column in tp.desc(...) to sort it in descending order; later columns break ties:

complete.arrange(tp.desc("body_mass_g"), "bill_length_mm").head()
shape: (5, 8)
┌─────────┬────────┬────────────────┬───────────────┬──────────────────┬─────────────┬──────┬──────┐
│ species ┆ island ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_m ┆ body_mass_g ┆ sex  ┆ year │
│ ---     ┆ ---    ┆ ---            ┆ ---           ┆ m                ┆ ---         ┆ ---  ┆ ---  │
│ str     ┆ str    ┆ f64            ┆ f64           ┆ ---              ┆ f64         ┆ str  ┆ i64  │
│         ┆        ┆                ┆               ┆ f64              ┆             ┆      ┆      │
╞═════════╪════════╪════════════════╪═══════════════╪══════════════════╪═════════════╪══════╪══════╡
│ Gentoo  ┆ Biscoe ┆ 49.2           ┆ 15.2          ┆ 221.0            ┆ 6300.0      ┆ male ┆ 2007 │
│ Gentoo  ┆ Biscoe ┆ 59.6           ┆ 17.0          ┆ 230.0            ┆ 6050.0      ┆ male ┆ 2007 │
│ Gentoo  ┆ Biscoe ┆ 48.8           ┆ 16.2          ┆ 222.0            ┆ 6000.0      ┆ male ┆ 2009 │
│ Gentoo  ┆ Biscoe ┆ 51.1           ┆ 16.3          ┆ 220.0            ┆ 6000.0      ┆ male ┆ 2008 │
│ Gentoo  ┆ Biscoe ┆ 45.2           ┆ 16.4          ┆ 223.0            ┆ 5950.0      ┆ male ┆ 2008 │
└─────────┴────────┴────────────────┴───────────────┴──────────────────┴─────────────┴──────┴──────┘

select — pick columns

select accepts bare column names, negative selections ("-year"), and the tidyselect helpers from tp.starts_with, tp.ends_with, tp.contains, tp.matches, and tp.everything:

complete.select("species", "island", "sex", "body_mass_g").head()
shape: (5, 4)
┌─────────┬───────────┬────────┬─────────────┐
│ species ┆ island    ┆ sex    ┆ body_mass_g │
│ ---     ┆ ---       ┆ ---    ┆ ---         │
│ str     ┆ str       ┆ str    ┆ f64         │
╞═════════╪═══════════╪════════╪═════════════╡
│ Adelie  ┆ Torgersen ┆ male   ┆ 3750.0      │
│ Adelie  ┆ Torgersen ┆ female ┆ 3800.0      │
│ Adelie  ┆ Torgersen ┆ female ┆ 3250.0      │
│ Adelie  ┆ Torgersen ┆ female ┆ 3450.0      │
│ Adelie  ┆ Torgersen ┆ male   ┆ 3650.0      │
└─────────┴───────────┴────────┴─────────────┘
complete.select("species", tp.starts_with("bill_")).head()
shape: (5, 3)
┌─────────┬────────────────┬───────────────┐
│ species ┆ bill_length_mm ┆ bill_depth_mm │
│ ---     ┆ ---            ┆ ---           │
│ str     ┆ f64            ┆ f64           │
╞═════════╪════════════════╪═══════════════╡
│ Adelie  ┆ 39.1           ┆ 18.7          │
│ Adelie  ┆ 39.5           ┆ 17.4          │
│ Adelie  ┆ 40.3           ┆ 18.0          │
│ Adelie  ┆ 36.7           ┆ 19.3          │
│ Adelie  ┆ 39.3           ┆ 20.6          │
└─────────┴────────────────┴───────────────┘
complete.select("species", tp.contains("_mm")).head()
shape: (5, 4)
┌─────────┬────────────────┬───────────────┬───────────────────┐
│ species ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_mm │
│ ---     ┆ ---            ┆ ---           ┆ ---               │
│ str     ┆ f64            ┆ f64           ┆ f64               │
╞═════════╪════════════════╪═══════════════╪═══════════════════╡
│ Adelie  ┆ 39.1           ┆ 18.7          ┆ 181.0             │
│ Adelie  ┆ 39.5           ┆ 17.4          ┆ 186.0             │
│ Adelie  ┆ 40.3           ┆ 18.0          ┆ 195.0             │
│ Adelie  ┆ 36.7           ┆ 19.3          ┆ 193.0             │
│ Adelie  ┆ 39.3           ┆ 20.6          ┆ 190.0             │
└─────────┴────────────────┴───────────────┴───────────────────┘

mutate — add or modify columns

mutate creates new columns as keyword arguments. Expressions are built from tp.col(...) and regular Python operators. Use tp.case_when for multi-branch logic — conditions are checked top-to-bottom and _default catches everything that falls through:

(
    complete
    .mutate(
        bill_ratio=tp.col("bill_length_mm") / tp.col("bill_depth_mm"),
        size_class=tp.case_when(
            tp.col("body_mass_g") < 3500, "small",
            tp.col("body_mass_g") < 4500, "medium",
            _default="large",
        ),
    )
    .select("species", "sex", "body_mass_g", "bill_ratio", "size_class")
    .head()
)
shape: (5, 5)
┌─────────┬────────┬─────────────┬────────────┬────────────┐
│ species ┆ sex    ┆ body_mass_g ┆ bill_ratio ┆ size_class │
│ ---     ┆ ---    ┆ ---         ┆ ---        ┆ ---        │
│ str     ┆ str    ┆ f64         ┆ f64        ┆ str        │
╞═════════╪════════╪═════════════╪════════════╪════════════╡
│ Adelie  ┆ male   ┆ 3750.0      ┆ 2.090909   ┆ medium     │
│ Adelie  ┆ female ┆ 3800.0      ┆ 2.270115   ┆ medium     │
│ Adelie  ┆ female ┆ 3250.0      ┆ 2.238889   ┆ small      │
│ Adelie  ┆ female ┆ 3450.0      ┆ 1.901554   ┆ small      │
│ Adelie  ┆ male   ┆ 3650.0      ┆ 1.907767   ┆ medium     │
└─────────┴────────┴─────────────┴────────────┴────────────┘

group_by + summarize — aggregate by group

summarize collapses rows into one row per group. Pass the grouping columns via by=; inside, use aggregators like tp.mean, tp.sd, tp.median, or tp.n() (which counts rows in each group):

complete.summarize(
    n=tp.n(),
    mean_mass=tp.mean("body_mass_g"),
    sd_mass=tp.sd("body_mass_g"),
    mean_bill=tp.mean("bill_length_mm"),
    by="species",
)
shape: (3, 5)
┌───────────┬─────┬─────────────┬────────────┬───────────┐
│ species   ┆ n   ┆ mean_mass   ┆ sd_mass    ┆ mean_bill │
│ ---       ┆ --- ┆ ---         ┆ ---        ┆ ---       │
│ str       ┆ u32 ┆ f64         ┆ f64        ┆ f64       │
╞═══════════╪═════╪═════════════╪════════════╪═══════════╡
│ Adelie    ┆ 146 ┆ 3706.164384 ┆ 458.620135 ┆ 38.823973 │
│ Chinstrap ┆ 68  ┆ 3733.088235 ┆ 384.335081 ┆ 48.833824 │
│ Gentoo    ┆ 119 ┆ 5092.436975 ┆ 501.476154 ┆ 47.568067 │
└───────────┴─────┴─────────────┴────────────┴───────────┘

Pass a list to by= to group by multiple columns:

complete.summarize(
    n=tp.n(),
    mean_mass=tp.mean("body_mass_g"),
    by=["species", "sex"],
)
shape: (6, 4)
┌───────────┬────────┬─────┬─────────────┐
│ species   ┆ sex    ┆ n   ┆ mean_mass   │
│ ---       ┆ ---    ┆ --- ┆ ---         │
│ str       ┆ str    ┆ u32 ┆ f64         │
╞═══════════╪════════╪═════╪═════════════╡
│ Chinstrap ┆ male   ┆ 34  ┆ 3938.970588 │
│ Gentoo    ┆ male   ┆ 61  ┆ 5484.836066 │
│ Adelie    ┆ female ┆ 73  ┆ 3368.835616 │
│ Chinstrap ┆ female ┆ 34  ┆ 3527.205882 │
│ Gentoo    ┆ female ┆ 58  ┆ 4679.741379 │
│ Adelie    ┆ male   ┆ 73  ┆ 4043.493151 │
└───────────┴────────┴─────┴─────────────┘

The by= argument is also available on filter and mutate for grouped row-filtering and grouped window functions — see the Group By vignette for the full story.

Joining tables

Joins combine two tibbles on shared key columns. left_join keeps every row from the left table and attaches matching rows from the right. Let’s build a small lookup table of scientific names and attach it to a per-species summary:

scientific = tp.tibble(
    species=["Adelie", "Chinstrap", "Gentoo"],
    scientific_name=[
        "Pygoscelis adeliae",
        "Pygoscelis antarcticus",
        "Pygoscelis papua",
    ],
)

summary = complete.summarize(
    n=tp.n(),
    mean_mass=tp.mean("body_mass_g"),
    by="species",
)

summary.left_join(scientific, on="species")
shape: (3, 4)
┌───────────┬─────┬─────────────┬────────────────────────┐
│ species   ┆ n   ┆ mean_mass   ┆ scientific_name        │
│ ---       ┆ --- ┆ ---         ┆ ---                    │
│ str       ┆ u32 ┆ f64         ┆ str                    │
╞═══════════╪═════╪═════════════╪════════════════════════╡
│ Adelie    ┆ 146 ┆ 3706.164384 ┆ Pygoscelis adeliae     │
│ Chinstrap ┆ 68  ┆ 3733.088235 ┆ Pygoscelis antarcticus │
│ Gentoo    ┆ 119 ┆ 5092.436975 ┆ Pygoscelis papua       │
└───────────┴─────┴─────────────┴────────────────────────┘

inner_join, right_join, and full_join work the same way. When the key columns have different names in each table, use left_on= and right_on= instead of on=.

Putting it all together

Because every verb returns a new tibble, you can chain them into a single pipeline that reads top-to-bottom like a recipe — filter, transform, group, summarize, sort:

(
    penguins
    .drop_na()
    .filter(tp.col("body_mass_g") > 3000)
    .mutate(bill_ratio=tp.col("bill_length_mm") / tp.col("bill_depth_mm"))
    .summarize(
        n=tp.n(),
        mean_ratio=tp.mean("bill_ratio"),
        mean_mass=tp.mean("body_mass_g"),
        by=["species", "sex"],
    )
    .arrange("species", tp.desc("mean_mass"))
)
shape: (6, 5)
┌───────────┬────────┬─────┬────────────┬─────────────┐
│ species   ┆ sex    ┆ n   ┆ mean_ratio ┆ mean_mass   │
│ ---       ┆ ---    ┆ --- ┆ ---        ┆ ---         │
│ str       ┆ str    ┆ u32 ┆ f64        ┆ f64         │
╞═══════════╪════════╪═════╪════════════╪═════════════╡
│ Adelie    ┆ male   ┆ 73  ┆ 2.123835   ┆ 4043.493151 │
│ Adelie    ┆ female ┆ 65  ┆ 2.118286   ┆ 3424.615385 │
│ Chinstrap ┆ male   ┆ 34  ┆ 2.656501   ┆ 3938.970588 │
│ Chinstrap ┆ female ┆ 32  ┆ 2.647082   ┆ 3572.65625  │
│ Gentoo    ┆ male   ┆ 61  ┆ 3.152081   ┆ 5484.836066 │
│ Gentoo    ┆ female ┆ 58  ┆ 3.202391   ┆ 4679.741379 │
└───────────┴────────┴─────┴────────────┴─────────────┘

Where to go next

Each core verb has its own vignette with more worked examples and edge cases:

  • Filter — row selection, grouped filters, helper functions

  • Arrange — sorting, ties, nulls

  • Select — tidyselect helpers, negation, reordering

  • Rename — renaming columns

  • Mutate — new columns, case_when, if_else, grouped mutates

  • Transmute — mutate + select in one step

  • Summarize — aggregations and helper stats

  • Group By — the by= argument and explicit groups

For the full list of string, date, factor, and statistical helpers, see the API reference.