Group By

Grouping is a fundamental concept when you need to perform operations separately for different subsets of your data. In tidypolars-extra, grouping is done via the by parameter available on filter, mutate, and summarize.

For example, in the mtcars dataset there are 3 possible values for cylinders (cyl). You can group by cyl to perform operations separately for each of these 3 groups of rows.

import tidypolars_extra as tp

mtcars = tp.tibble(tp.read_data(fn="tidypolars_extra/data/mtcars.csv", sep=",", silently=True))

small_cars = mtcars.select("name", "cyl", "gear", "hp")

small_cars

Grouping with by

The by parameter is supported directly on filter, mutate, and summarize. This is often more concise than a separate group_by step.

Grouped filter

Keep rows where hp is greater than the mean hp within each cyl group:

small_cars.filter(tp.col("hp") > tp.col("hp").mean(), by="cyl")
shape: (15, 4)
┌───────────────────┬─────┬──────┬─────┐
│ name              ┆ cyl ┆ gear ┆ hp  │
╞═══════════════════╪═════╪══════╪═════╡
│ Datsun 710        ┆ 4   ┆ 4    ┆ 93  │
│ Merc 230          ┆ 4   ┆ 4    ┆ 95  │
│ …                 ┆ …   ┆ …   ┆ …   │
└───────────────────┴─────┴──────┴─────┘

Grouped mutate

Compute the average hp within each cyl group and attach it to every row:

small_cars.mutate(avg_hp=tp.col("hp").mean(), by="cyl")
shape: (32, 5)
┌────────────────┬─────┬──────┬─────┬────────────┐
│ name           ┆ cyl ┆ gear ┆ hp  ┆ avg_hp     │
╞════════════════╪═════╪══════╪═════╪════════════╡
│ Mazda RX4      ┆ 6   ┆ 4    ┆ 110 ┆ 122.285714 │
│ Mazda RX4 Wag  ┆ 6   ┆ 4    ┆ 110 ┆ 122.285714 │
│ …              ┆ …   ┆ …   ┆ …   ┆ …          │
└────────────────┴─────┴──────┴─────┴────────────┘

Grouped summarize

Compute summary statistics per group:

small_cars.summarize(avg_hp=tp.col("hp").mean(), by="cyl")
shape: (3, 2)
┌─────┬────────────┐
│ cyl ┆ avg_hp     │
╞═════╪════════════╡
│ 8   ┆ 209.214286 │
│ 6   ┆ 122.285714 │
│ 4   ┆ 82.636364  │
└─────┴────────────┘

Grouping by multiple columns

Pass a list of column names to by to group by multiple columns:

small_cars.summarize(avg_hp=tp.col("hp").mean(), by=["cyl", "gear"])
shape: (8, 3)
┌─────┬──────┬────────────┐
│ cyl ┆ gear ┆ avg_hp     │
╞═════╪══════╪════════════╡
│ 4   ┆ 4    ┆ 76.0       │
│ 6   ┆ 4    ┆ 116.5      │
│ 8   ┆ 3    ┆ 194.166667 │
│ …   ┆ …   ┆ …          │
└─────┴──────┴────────────┘

Using group_by explicitly

You can also use the .group_by() method, which returns a TibbleGroupBy object. This object supports filter, mutate, and summarize:

g_cyl = mtcars.group_by("cyl")

# Summarize per group
g_cyl.summarize(avg_hp=tp.col("hp").mean())

Note

The by parameter on filter, mutate, and summarize is typically preferred for its conciseness and clarity. Both approaches produce the same result.

Practical example: top N per group

A common pattern is to find the “top N” rows within each group. For example, the 2 cars with the lowest hp in each cyl group:

(
    small_cars
    .arrange("hp")
    .mutate(row_num=tp.row_number(), by="cyl")
    .filter(tp.col("row_num") <= 2)
    .drop("row_num")
)
shape: (6, 4)
┌────────────────┬─────┬──────┬─────┐
│ name           ┆ cyl ┆ gear ┆ hp  │
╞════════════════╪═════╪══════╪═════╡
│ Honda Civic    ┆ 4   ┆ 4    ┆ 52  │
│ Merc 240D      ┆ 4   ┆ 4    ┆ 62  │
│ Valiant        ┆ 6   ┆ 3    ┆ 105 │
│ Mazda RX4      ┆ 6   ┆ 4    ┆ 110 │
│ Dodge Challenger┆ 8   ┆ 3    ┆ 150 │
│ AMC Javelin    ┆ 8   ┆ 3    ┆ 150 │
└────────────────┴─────┴──────┴─────┘