Polars

Misc

  • API Reference

  • Data Types

    • A signed integer can represent both positive and negative values (e.g. Int16), while an unsigned integer can only represent non-negative values (e.g. UInt16)
    • A Struct is a composite data type that can store multiple fields in a single column
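    • A minimal sketch of both ideas (integer dtype bounds and a Struct column); the column names are made up for illustration:

      import polars as pl

      df = pl.DataFrame({
          "a": pl.Series([-3, 7], dtype=pl.Int16),   # signed: negatives allowed
          "b": pl.Series([3, 7], dtype=pl.UInt16),   # unsigned: non-negative only
      }).with_columns(
          pl.struct("a", "b").alias("ab")            # pack both fields into one Struct column
      )
      df.schema  # a: Int16, b: UInt16, ab: Struct(a: Int16, b: UInt16)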
  • Notes from

  • Packages

  • Resources

  • Polars Cloud

    • Docs, Intro
    • Seems to be going after Apache Spark workloads
  • Read csv:

    import polars as pl
    
    df = pl.read_csv("https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv", null_values="NA")
  • Chaining operations:

    df.filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$"))
  • Using with a GPU

    • Can see up to a 10x decrease in execution time

    • Notes from Polars + NVIDIA GPU Tutorial

    • Installation

      pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
    • Example

      gpu_engine = pl.GPUEngine(
          device=0,
          raise_on_fail=True,
      )
      
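      # NOTE (assumption): transactions is a LazyFrame (e.g. built with pl.scan_csv / pl.scan_parquet),
      # since the chain ends with .collect(engine=gpu_engine)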
      high_spenders_gpu = (
          transactions
          .group_by("CUST_ID")
          .agg(pl.col("AMOUNT").sum())
          .sort(by="AMOUNT", descending=True)
          .head()
          .collect(engine=gpu_engine)
      )

Columns

  • Quick Reference

    dplyr → polars
    Select column: select(df, x) → df.select(pl.col("x"))
    Select multiple columns: select(df, x, y) → df.select(pl.col("x", "y"))
    Rename column: rename(df, new_x = x) → df.rename({"x": "new_x"})
    Select all columns except x: select(df, -x) → df.select(pl.exclude("x"))
    Select columns that start with "str": select(df, starts_with("str")) → df.select(pl.col("^str.*$")) or df.select(cs.starts_with("str"))
    Select numeric columns: select(df, where(is.numeric)) → df.select(cs.float(), cs.integer())
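    • The cs.* selector forms assume Polars' selectors module has been imported; a minimal sketch (using the penguins columns from above):

      import polars.selectors as cs

      df.select(cs.starts_with("bill"))    # columns whose names start with "bill"
      df.select(cs.float(), cs.integer())  # all float and integer columns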
  • Example: Tidyselect

    df \
      .select([
          pl.col('^name|.*name$'),
          pl.when(
              pl.col('name') == pl.col('email_name')).then(pl.lit('yes'))
              .otherwise(pl.lit('no'))
              .alias('same?')
      ])
    • Apparently Polars does not currently support lookahead or lookbehind in regexes, e.g. .*(?=@) to capture the email_name; a capture group with str.extract is the workaround (see Mutate >> Example 3)
    df |> 
      select(starts_with("name") | ends_with("name")) |>
      mutate(`same?` = case_when(
        name == email_name ~ "yes",
        TRUE ~ "no"))    

Rows

  • Quick Reference

    dplyr → polars
    First n rows: head(df, n) → df.head(n)
    Filter range of values: filter(df, between(x, lo, hi)) → df.filter(pl.col("x").is_between(lo, hi))
    Conditional filter: filter(df, x > 4) → df.filter(pl.col("x") > 4)
    Sort rows: arrange(df, x) → df.sort("x")
  • Example: Basic

    filtered_df = df_pl.filter(pl.col("region") == "Europe")

Mutate

  • Quick Reference

    dplyr → polars
    Create new column: mutate(df, x_mean = mean(x)) → df.with_columns(pl.col("x").mean().alias("x_mean"))
  • map_batches – For applying non-polars vectorized functions

  • Example 1: Ratio variables

    df.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio"))
    
    #> shape: (344, 10)
    #> ┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐
    #> │ rowid ┆ species   ┆ island    ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex    ┆ year ┆ bill_flipper_ratio │
    #> │ ---   ┆ ---       ┆ ---       ┆ ---            ┆   ┆ ---         ┆ ---    ┆ ---  ┆ ---                │
    #> │ i64   ┆ str       ┆ str       ┆ f64            ┆   ┆ i64         ┆ str    ┆ i64  ┆ f64                │
    #> ╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡
    #> │ 1     ┆ Adelie    ┆ Torgersen ┆ 39.1           ┆ … ┆ 3750        ┆ male   ┆ 2007 ┆ 0.216022           │
    #> │ 2     ┆ Adelie    ┆ Torgersen ┆ 39.5           ┆ … ┆ 3800        ┆ female ┆ 2007 ┆ 0.212366           │
    #> │ 3     ┆ Adelie    ┆ Torgersen ┆ 40.3           ┆ … ┆ 3250        ┆ female ┆ 2007 ┆ 0.206667           │
    #> │ 4     ┆ Adelie    ┆ Torgersen ┆ null           ┆ … ┆ null        ┆ null   ┆ 2007 ┆ null               │
    #> │ …     ┆ …         ┆ …         ┆ …              ┆ … ┆ …           ┆ …      ┆ …    ┆ …                  │
    #> │ 341   ┆ Chinstrap ┆ Dream     ┆ 43.5           ┆ … ┆ 3400        ┆ female ┆ 2009 ┆ 0.215347           │
    #> │ 342   ┆ Chinstrap ┆ Dream     ┆ 49.6           ┆ … ┆ 3775        ┆ male   ┆ 2009 ┆ 0.256995           │
    #> │ 343   ┆ Chinstrap ┆ Dream     ┆ 50.8           ┆ … ┆ 4100        ┆ male   ┆ 2009 ┆ 0.241905           │
    #> │ 344   ┆ Chinstrap ┆ Dream     ┆ 50.2           ┆ … ┆ 3775        ┆ female ┆ 2009 ┆ 0.253535           │
    #> └───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘
    df |> 
      mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm)
  • Example 2: Case-When

    df \
      .with_columns([  
          pl.when(pl.col('address').str.contains('OH'))
          .then(pl.lit('local'))
          .when(pl.col('address').str.contains('NY'))
          .then(pl.lit('foodie'))
          .otherwise(pl.lit('elsewhere'))
          .alias('familiarity')
        ])    
    df |>
     mutate(familiarity = case_when(
       str_detect(address, "OH") ~ "local",
       str_detect(address, "NY") ~ "foodie",
       TRUE ~ "elsewhere"
     ))
  • Example 3: String Extraction

    df \
      .with_columns(
          [
          pl.col('name').str.to_lowercase(),    
          pl.col('email').str.extract(r'^([\d\w]+)@', group_index = 1)
          .alias('email_name')
          ]
      )
    df |>
      mutate(
        name = str_to_lower(name),
        email_name = str_extract(email, "^([\\d\\w]+)@", group = 1)
      )
  • Example 4: Basic map_batches (source)

    import numpy as np
    from numpy.random import binomial

    data_dict = {
      'group': ['a']*4 + ['b']*4,
      'x': np.arange(1,9,1),
      'y': np.arange(8,0,-1), 
      'p': np.arange(1,9,1)/10
    }
    df = pl.DataFrame(data_dict)
    df.glimpse()
    #> Rows: 8
    #> Columns: 4
    #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
    #> $ x     <i64> 1, 2, 3, 4, 5, 6, 7, 8
    #> $ y     <i64> 8, 7, 6, 5, 4, 3, 2, 1
    #> $ p     <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
    
    # one column in, one value out
    df.with_columns(
        coin_flip = pl.col('p')
            .map_batches(function = lambda p: binomial(n = 1, p = p),
                         returns_scalar = True,
                         return_dtype = pl.UInt16)
    )
    #> group      x   y   p coin_flip
    #>   str    i64 i64 f64       u16
    #>   "a"      1   8 0.1         0
    #>   "a"      2   7 0.2         1
    #>   "a"      3   6 0.3         0
    #>   "a"      4   5 0.4         0
    #>   "b"      5   4 0.5         1
    #>   "b"      6   3 0.6         0
    #>   "b"      7   2 0.7         1
    #>   "b"      8   1 0.8         1
    • A simple lambda function that takes one column (p) as input to create the column coin_flip
    • The entire p column is passed to binomial in a single batch. This fits the conditions for map_batches since binomial is a non-Polars ({numpy}) vectorized function.
  • Example 5: map_batches multiple input columns (source)

    # two columns in, one value out - with structs
    df.with_columns(
        coin_flip = pl.struct('x','p')    # (1)
            .map_batches(function = lambda z: binomial(n = z.struct['x'], p = z.struct['p']),
                         returns_scalar = True,
                         return_dtype = pl.UInt16)
    )

    # two columns in, one value out - with exprs
    df.with_columns(
        coin_flip = pl.map_batches(exprs = ['x', 'p'],    # (2)
                                   function = lambda z: binomial(n = z[0], p = z[1]),
                                   returns_scalar = True,
                                   return_dtype = pl.UInt16)
    )
    #> group      x   y   p coin_flip
    #>   str    i64 i64 f64       u16
    #>   "a"      1   8 0.1         0
    #>   "a"      2   7 0.2         0
    #>   "a"      3   6 0.3         1
    #>   "a"      4   5 0.4         2
    #>   "b"      5   4 0.5         2
    #>   "b"      6   3 0.6         5
    #>   "b"      7   2 0.7         5
    #>   "b"      8   1 0.8         5
    • (1) The struct is a composite data type built from multiple columns. After creating the struct from x and p, it is subsetted inside the lambda function via z.struct[...] (see Misc >> Data Types)
    • (2) This version just uses the exprs argument to supply x and p to the lambda function, where they are subsetted by index
  • Example 6: map_batches w/array output (source)

    df.with_columns(
        coin_flip = pl.struct('x','p')
            .map_batches(function = lambda z: binomial(n = z.struct['x'], 
                                                       p = z.struct['p'],
                                                       size = (100,z.shape[0]))
                                              .transpose(), 
                         return_dtype = pl.Array(pl.UInt16, 100)
            )
    ).with_columns( 
        avg_outcome = pl.col('coin_flip').arr.mean(),
        exp_value = pl.col('x') * pl.col('p')
    )
    #> group      x   y   p       coin_flip avg_outcome exp_value
    #>   str    i64 i64 f64 array[u16, 100]         f64       f64
    #>   "a"      1   8 0.1     [0, 0, … 0]        0.06       0.1
    #>   "a"      2   7 0.2     [0, 0, … 1]        0.48       0.4
    #>   "a"      3   6 0.3     [1, 2, … 0]        0.85       0.9
    #>   "a"      4   5 0.4     [2, 2, … 4]        1.69       1.6
    #>   "b"      5   4 0.5     [2, 2, … 2]        2.53       2.5
    #>   "b"      6   3 0.6     [5, 4, … 3]        3.62       3.6
    #>   "b"      7   2 0.7     [6, 5, … 7]        4.81       4.9
    #>   "b"      8   1 0.8     [5, 8, … 5]        6.5        6.4
    • coin_flip is an array of counts of successful events. size = 100 draws are made per row, and each draw is a binomial experiment with n = x trials at probability p, so no count in an array should exceed its row's x value
    • (100, z.shape[0]) is the shape of the numpy output (100 draws × number of rows); the .transpose() turns it into one length-100 array per row

Summarize

  • Quick Reference

    dplyr → polars
    Group: group_by(df, x) → df.group_by("x")
    Summarize: summarize(df, x_n = n()) → df.agg(pl.count().alias("x_n")) (newer Polars versions prefer pl.len())
  • map_groups - Similar to map_batches (non-Polars, vectorized functions) but for group aggregations (See Mutate >> Examples 4-6)

  • Example 1: Group_By, Count, Sort

    df \
      .group_by("species") \
      .agg(pl.count() \
           .alias("counts")) \
      .sort("counts", descending=True)
    
    #> shape: (3, 2)
    #> ┌───────────┬────────┐
    #> │ species   ┆ counts │
    #> │ ---       ┆ ---    │
    #> │ str       ┆ u32    │
    #> ╞═══════════╪════════╡
    #> │ Adelie    ┆ 152    │
    #> │ Gentoo    ┆ 124    │
    #> │ Chinstrap ┆ 68     │
    #> └───────────┴────────┘
  • Example 2: Group-By, Arrange, Summarize

    For each species, find the penguin with the lowest body mass

    df \
      .group_by("species") \
      .agg(pl.all() \
        .sort_by("body_mass_g") \
        .drop_nulls() \
        .first())
    
    #> shape: (3, 9)
    #> ┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
    #> │ species   ┆ rowid ┆ island    ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex    ┆ year │
    #> │ ---       ┆ ---   ┆ ---       ┆ ---            ┆   ┆ ---               ┆ ---         ┆ ---    ┆ ---  │
    #> │ str       ┆ i64   ┆ str       ┆ f64            ┆   ┆ i64               ┆ i64         ┆ str    ┆ i64  │
    #> ╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
    #> │ Adelie    ┆ 4     ┆ Torgersen ┆ 36.5           ┆ … ┆ 181               ┆ 2850        ┆ female ┆ 2007 │
    #> │ Chinstrap ┆ 315   ┆ Dream     ┆ 46.9           ┆ … ┆ 192               ┆ 2700        ┆ female ┆ 2008 │
    #> │ Gentoo    ┆ 272   ┆ Biscoe    ┆ 42.7           ┆ … ┆ 208               ┆ 3950        ┆ female ┆ 2009 │
    #> └───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
    df |>
      group_by(species) |>
      arrange(body_mass_g) |>
      summarize(body_mass_g = first(body_mass_g))
    
    #> # A tibble: 3 × 2
    #>   species   body_mass_g
    #>   <fct>           <int>
    #> 1 Adelie           2850
    #> 2 Chinstrap        2700
    #> 3 Gentoo           3950
  • Example 3: Select, Filter, Summarize, Across

    df \
     .select(df.columns[0:4]+['^performance.*$','^salary.*$']) \
     .filter(pl.col('age') >= 30) \
     .select(pl.col(pl.Int64).mean().name.prefix('mean_'))
    
    #> shape: (1, 5)
    #> mean_age mean_performance_rating mean_performance_reviews_count  mean_salary mean_salary_increase_percentage
    #>      f64                     f64                            f64          f64                             f64
    #>     46.6                     4.4                            3.4      40000.0                            10.0
    • You have to sandwich the pattern between ^ and $ because Polars only treats a string passed to pl.col()/select() as a regex when it is wrapped in ^...$; otherwise it is interpreted as a literal column name (see the short sketch after this example)
    df |>
      select(1:3, 
             starts_with("performance"), 
             starts_with("salary")) |> 
      filter(age >= 30) |>
      summarize(across(
        .cols = where(is.numeric), 
        .fns = mean, 
        .names = "mean_{.col}"))
    
    #> # A tibble: 1 × 5
    #>   mean_age mean_performance_rating mean_performance_reviews_count mean_salary
    #>      <dbl>                   <dbl>                          <dbl>       <dbl>
    #> 1     46.6                     4.4                            3.4       40000
    #> # ℹ 1 more variable: mean_salary_increase_percentage <dbl>
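    • A short sketch of the literal-vs-regex rule (column names from the example above; assumes import polars.selectors as cs):

      df.select(pl.col("age"))                   # plain string → treated as a literal column name
      df.select(pl.col("^performance.*$"))       # wrapped in ^...$ → treated as a regex over column names
      df.select(cs.starts_with("performance"))   # regex-free alternative using selectors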
  • Example 4: Basic map_groups (source)

    data_dict = {
      'group': ['a']*4 + ['b']*4,
      'truth': [1,1,0,0]*2,
      'mod_bad': [0.25,0.25,0.75,0.75]*2, 
      'mod_bst': [0.99,0.75,0.25,0.01]*2,
      'mod_rnd': [0.5]*8,
      'mod_mix': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6]
    }
    df = pl.DataFrame(data_dict)
    df.glimpse()
    #> Rows: 8
    #> Columns: 6
    #> $ group   <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
    #> $ truth   <i64> 1, 1, 0, 0, 1, 1, 0, 0
    #> $ mod_bad <f64> 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 0.75, 0.75
    #> $ mod_bst <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
    #> $ mod_rnd <f64> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
    #> $ mod_mix <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
    
    from sklearn.metrics import roc_auc_score  # assumed source of roc_auc_score

    df.group_by('group').agg(
        pl.map_groups(
            exprs = ['truth', 'mod_mix'],
            function = lambda x: roc_auc_score(x[0], x[1]),
            return_dtype = pl.Float64,
            returns_scalar = True
        )
    )
    
    #> group    truth
    #>   str      f64
    #>   "a"     1.0
    #>   "b"     0.25
    
    # import polars_ds as pds
    # 
    # df.group_by('group').agg (
    #     auroc = pds.query_roc_auc('truth', 'mod_bst')
    # )
    • Simulates calculating an AUC score per group for binary classification models
    • Each model has 8 probability predictions along with the true classification labels and a grouping variable
    • exprs specifies columns for the lambda function which get subsetted by indices.
    • {polars-ds} has a one-liner function for this
  • Example 5: Multiple Column Aggregation (source)

    import polars.selectors as cs
    import polars_ds as pds

    def auroc_expressions(models):    # (1)
        for m in models:
            yield pds.query_roc_auc('truth', m).alias(m)

    # could also do: [c for c in df.columns if c[:4] == 'mod_']
    mods = cs.expand_selector(df, cs.starts_with('mod_'))    # (2)

    df.group_by('group').agg(auroc_expressions(mods))    # (3)
    #> group    mod_bad mod_bst mod_rnd mod_mix
    #>   str        f64     f64     f64     f64
    #>   "b"       -0.0     1.0     0.5    0.25
    #>   "a"       -0.0     1.0     0.5    1.0
    • (1) Write a wrapper function that handles the iteration and acts as a generator, yielding the expression of interest (also see Python, General >> Loops >> Generators)
    • (2) Obtain the relevant column names to pass into the function. Here you can use the cs.expand_selector helper or any raw parsing of the column names (e.g. a list comprehension)
    • (3) Pass the generator into the standard df.group_by(...).agg(...) flow

Joins

  • Quick Reference

    dplyr → polars
    Join dataframes: left_join(df1, df2, by = "x") → df1.join(df2, on="x", how="left")
  • Example: Basic Left Join

    merged_df = df_pl.join(pop_df_pl, on="country", how="left")
  • Example: Left Join

    import random
    
    df \
      .join(df_dept, on="department", how="left") \
      .select(['name','dept_id']) \
      .with_columns(
          employee_id = pl.format(
            '{}-{}',
            'dept_id',
            pl.Series([
              random.randint(100000, 999999) for _ in range(len(df))
              ])
              )
      )    
    df |>
      left_join(df_dept, by = "department") |>
      select(name, dept_id) |>
      mutate(employee_id = map_chr(dept_id, ~ paste0(.x, "-", sample(1:9, 1))))    

tidyr

  • Example: Unite

    df \
      .with_columns(
          pl.concat_str(
              pl.col(pl.String), separator=" "
          ).alias('combination_of_character')
      ) \
      .select(pl.col('combination_of_character'))
    
    #> shape: (6, 1)
    #> combination_of_character
    #> str
    #> "Alice New York 123 Main St, On…
    #> "Bob San Francisco 123 Main St,…
    #> "Charlie Tokyo 456-7890, Tokyo,…
    #> "Ken Toronto 49494 Exchange St,…
    #> "Steven Lima 1010 Gb st, Lima, …
    #> "Carlos Cleveland 666 Heaven dr…
    df |> 
      unite(col = "combination_of_character", 
            where(is.character), 
            sep = " ", ) |> 
      select(combination_of_character)
    
    #> # A tibble: 6 × 1
    #>   combination_of_character                                                                                         
    #>   <chr>                                                                                                            
    #> 1 Alice New York 123 Main St, Ontario, OH 123-456-7890 alice@example.com Engineering 2010-01-01 Active 2022-05-01  
    #> 2 Bob San Francisco 123 Main St, Calgary, AB 987-654-3210 bob@example.com Marketing 2012-05-15 Inactive 2021-07-15 
    #> 3 Charlie Tokyo 456-7890, Tokyo, NY 098-765-4332 charlie@example.com Finance 2015-10-01 Active 2022-08-31          
    #> 4 Ken Toronto 49494 Exchange St, Toronto, ON 111-232-4141 ken@stats.org Marketing 2010-04-01 Inactive 2024-10-30   
    #> 5 Steven Lima 1010 Gb st, Lima, OH 505-402-6060 stephencurry@nba.com Marketing 2009-10-30 Active 2023-01-02        
    #> 6 Carlos Cleveland 666 Heaven dr, Cleveland, OH 909-435-1000 carlos@genius.edu Finance 2005-11-12 Active 2024-12-12
  • Example: Pivot Wider

    df \
      .select(['name','address']) \
      .with_columns(
          state = pl.col('address').str.extract(r'([A-Z]{2})$')
      ) \
      .select('name','state') \
      .pivot(on = 'state', index = 'name', values='state', aggregate_function='len') \
      .with_columns(
          pl.col(pl.UInt32).fill_null(0)
      ) 
    
    # Or since this just creates dummies
    df \
      .select(['name','department']) \
      .to_dummies(columns = 'department') 
    df |>
      select(name, department) |>
      pivot_wider(id_cols = "name", 
                  names_from = "department", 
                  values_from = "department", 
                  values_fill = 0, 
                  values_fn = length, 
                  names_prefix = "department_")
    
    #> # A tibble: 6 × 4
    #>   name    department_Engineering department_Marketing department_Finance
    #>   <chr>                    <int>                <int>              <int>
    #> 1 Alice                        1                    0                  0
    #> 2 Bob                          0                    1                  0
    #> 3 Charlie                      0                    0                  1
    #> 4 Ken                          0                    1                  0
    #> 5 Steven                       0                    1                  0
    #> 6 Carlos                       0                    0                  1

UDFs

  • Misc

  • Example 1: Basic pipe at column-level

    import numpy as np

    data_dict = {
      'group': ['a']*4 + ['b']*4,
      'x': np.arange(1,9,1),
      'y': np.arange(8,0,-1), 
      'p': np.arange(1,9,1)/10
    }
    df = pl.DataFrame(data_dict)
    df.glimpse()
    #> Rows: 8
    #> Columns: 4
    #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
    #> $ x     <i64> 1, 2, 3, 4, 5, 6, 7, 8
    #> $ y     <i64> 8, 7, 6, 5, 4, 3, 2, 1
    #> $ p     <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
    
    
    def cap(c:pl.Expr, ceil:int = 5) -> pl.Expr: 
      return pl.when(c > ceil).then(ceil).otherwise(c)
    
    df.with_columns(pl.col('x').pipe(cap))
    #> shape: (8, 5)
    #> group    x     y     p     literal
    #>   str    i64 i64   f64       i64
    #>   "a"    1       8     0.1         1
    #>   "a"    2       7     0.2         2
    #>   "a"    3       6     0.3         3
    #>   "a"    4       5     0.4         4
    #>   "b"    5       4     0.5         5
    #>   "b"    6       3     0.6         5
    #>   "b"    7       2     0.7         5
    #>   "b"    8       1     0.8         5
    
    df.with_columns(cs.numeric().pipe(cap).name.keep())
    #>  shape: (8, 4)
    #> group    x     y   p
    #>   str    i64 i64 f64
    #>   "a"      1   5 0.1
    #>   "a"      2   5 0.2
    #>   "a"      3   5 0.3
    #>   "a"      4   5 0.4
    #>   "b"      5   4 0.5
    #>   "b"      5   3 0.6
    #>   "b"      5   2 0.7
    #>   "b"      5   1 0.8
    • The UDF is essentially a case_when: it caps any values greater than 5 at 5
    • pipe takes the unquoted name of the UDF as its argument
    • Use name.keep when piping more than one column (e.g. cs.numeric()); it keeps each column's original name, which prevents the outputs from all sharing the same name (e.g. literal)
  • Example 2: Basic pipe at df-level

    def calc_diffs(df:pl.DataFrame, threshold:int = 5) -> pl.DataFrame:

        df_out = (
            df
            .with_columns(
                abs = (pl.col('x') - pl.col('y')).abs(),
                abs_gt_t = (pl.col('x') - pl.col('y')).abs() > threshold,
            )
        )
        return df_out
    
    df.pipe(calc_diffs)
    #> group      x   y   p abs abs_gt_t
    #>   str    i64 i64 f64 i64     bool
    #>   "a"      1   8 0.1   7     true
    #>   "a"      2   7 0.2   5    false
    #>   "a"      3   6 0.3   3    false
    #>   "a"      4   5 0.4   1    false
    #>   "b"      5   4 0.5   1    false
    #>   "b"      6   3 0.6   3    false
    #>   "b"      7   2 0.7   5    false
    #>   "b"      8   1 0.8   7     true
    • calc_diffs calculates the absolute difference between x and y and stores it in abs. Then tests whether abs is greater than some threshold (abs_gt_t).
  • Example 3: pipe at df-level w/args

    def calc_diffs(
      df:pl.DataFrame,
      var1:str = 'x',
      var2:str = 'y',
      threshold:int = 5) -> pl.DataFrame:

        df_out = (
            df
            .with_columns(
                abs = (pl.col(var1) - pl.col(var2)).abs(),
                abs_gt_t = (pl.col(var1) - pl.col(var2)).abs() > threshold,
            )
        )
        return df_out

    df.pipe(
        calc_diffs,
        var1 = 'y',
        var2 = 'x',
        threshold = 3
    )
    #> group      x   y   p abs abs_gt_t
    #>   str    i64 i64 f64 i64     bool
    #>   "a"      1   8 0.1   7     true
    #>   "a"      2   7 0.2   5     true
    #>   "a"      3   6 0.3   3    false
    #>   "a"      4   5 0.4   1    false
    #>   "b"      5   4 0.5   1    false
    #>   "b"      6   3 0.6   3    false
    #>   "b"      7   2 0.7   5     true
    #>   "b"      8   1 0.8   7     true
    • Column names are quoted in the function arguments, but not in the function code.
  • Example 4: Basic map_columns

    df.map_columns(cs.numeric(), cap)
    • See Example 1 for the output and cap udf definition
    • More concise alternative to pipe at the column-level

Snippets

  • Example: Lead and Forward Fill

    • If the department has only 1 row of salary data, do not shift salary
    • If the department has more than 1 row of salary data, shift the salary column by -1
    • Reason: there was a data-entry mistake for departments with more than 1 row; the actual salary belongs one row further down
    df \
      .group_by('department') \
      .agg(
          pl.col('name'),
          pl.when(pl.col('salary').len()==1).then(pl.col('salary'))
          .otherwise(pl.col('salary').shift(-1))
          .alias('salary_shift')) \
      .explode('name','salary_shift') \
      .with_columns(
          pl.col('salary_shift').forward_fill()) 
    df |>
      group_by(department) |>
      summarize(
        name = name,
        salary_shift = case_when(
          n() == 1 ~ salary,
          TRUE ~ lead(salary)
        )
      ) |>
     fill(salary_shift, .direction = "down")
    
    #> # A tibble: 6 × 3
    #> # Groups:   department [3]
    #>   department  name    salary_shift
    #>   <chr>       <chr>          <dbl>
    #> 1 Engineering Alice          50000
    #> 2 Finance     Charlie        30000
    #> 3 Finance     Carlos         30000
    #> 4 Marketing   Bob            20000
    #> 5 Marketing   Ken            40000
    #> 6 Marketing   Steven         40000
  • Example: Map

    import random
    
    df \
      .select(['name','dept_id']) \
      .with_columns(
          employee_id = pl.format(
            '{}-{}',
            'dept_id',
            pl.Series([
              random.randint(100000, 999999) for _ in range(len(df))
              ])
              )
      ) 
    df |>
      select(name, dept_id) |>
      mutate(employee_id = map_chr(dept_id, ~paste0(.x, "-", sample(1000000:9999999, 1))))
    #> # A tibble: 6 × 3
    #>   name    dept_id employee_id
    #>   <chr>     <dbl> <chr>      
    #> 1 Alice        30 30-1694470 
    #> 2 Bob          25 25-1696036 
    #> 3 Charlie      20 20-4463080 
    #> 4 Ken          25 25-6942432 
    #> 5 Steven       25 25-3012223 
    #> 6 Carlos       20 20-8705991
  • Example: Split Dataframe and Loop (source)

    dfs = df.partition_by('group', as_dict = True, include_key = True)
    for k,v in dfs.items():
        print(f"{k}  : {v}")
    #> ('a',)  : shape: (4, 3)
    #> ┌───────┬──────┬──────┐
    #> │ group ┆ x    ┆ y    │
    #> │ ---   ┆ ---  ┆ ---  │
    #> │ str   ┆ f64  ┆ f64  │
    #> ╞═══════╪══════╪══════╡
    #> │ a     ┆ 0.99 ┆ 0.99 │
    #> │ a     ┆ 0.75 ┆ 0.75 │
    #> │ a     ┆ 0.25 ┆ 0.25 │
    #> │ a     ┆ 0.01 ┆ 0.01 │
    #> └───────┴──────┴──────┘
    #> ('b',)  : shape: (4, 3)
    #> ┌───────┬──────┬─────┐
    #> │ group ┆ x    ┆ y   │
    #> │ ---   ┆ ---  ┆ --- │
    #> │ str   ┆ f64  ┆ f64 │
    #> ╞═══════╪══════╪═════╡
    #> │ b     ┆ 0.99 ┆ 0.5 │
    #> │ b     ┆ 0.75 ┆ 0.5 │
    #> │ b     ┆ 0.25 ┆ 0.5 │
    #> │ b     ┆ 0.01 ┆ 0.6 │
    #> └───────┴──────┴─────┘
    
    import statsmodels.api as sm

    grps = [k[0] for k in dfs.keys()]  # turn tuple into scalar bcs only one grouping var in key
    mods = [sm.OLS(d['x'].to_numpy(),
                   sm.add_constant(d['y'].to_numpy())
            ).fit() for k,d in dfs.items()]
    coef = [m.params[1] for m in mods]
    dict(zip(grps, coef))
    #> {'a': np.float64(1.0000000000000004), 'b': np.float64(-6.533333333333339)}
    • partition_by splits the dataframe into separate dataframes per group (like split in base R or group_split in {dplyr}). Since as_dict = True, it actually returns a dictionary of dataframes keyed by group
    • This can also be done within the unsplit dataframe using map_batches and map_elements, but it's less efficient (see the previous section in the article for details); one in-frame alternative is sketched below
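    • A minimal sketch of one in-frame alternative (per-group OLS slope of x on y via pl.map_groups; not necessarily the article's approach):

      import statsmodels.api as sm

      df.group_by('group').agg(
          pl.map_groups(
              exprs = ['x', 'y'],
              function = lambda s: sm.OLS(s[0].to_numpy(),
                                          sm.add_constant(s[1].to_numpy())).fit().params[1],
              return_dtype = pl.Float64,
              returns_scalar = True
          ).alias('coef')
      )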

Lazy Evaluation

  • Example: Basic

    import polars as pl
    
    df_lazy = (
        pl.scan_csv("sales.csv")
          .filter(pl.col("amount") > 100)
          .group_by("segment")
          .agg(pl.col("amount").mean())
          .sort("amount")
    )
    
    result = df_lazy.collect()
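    • Nothing executes until collect(); as a quick check, the optimized query plan can be inspected beforehand:

      print(df_lazy.explain())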
  • Example: Lazily aggregate Delta lake files in AWS S3 (source)

    import polars as pl
    from datetime import datetime
    import os
    
    t1 = datetime.now()
    
    table_path = "s3://confessions-of-a-data-guy/BigTest/social_media_rots_brains"
    
    os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxx"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
    
    lazy_df = pl.scan_delta(table_path)
    
    agg_lazy = (
        lazy_df
          .with_columns(pl.col("created_at").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias("created_at_ts"))
          .with_columns(pl.col("created_at_ts").dt.date().alias("post_date"))
          .group_by("post_date")
          .agg(pl.count().alias("num_posts"))
          .sort("post_date")
    )
    
    agg_lazy.sink_csv("posts_per_day.csv")
    
    t2 = datetime.now()
    print(f"Elapsed time: {t2 - t1}")
    • Uses Polars’ lazy engine with scan_delta() to load the Delta Lake table.
    • Parses timestamps using str.strptime() into a Polars datetime.
    • Extracts date for grouping using .dt.date().
    • Performs lazy aggregation and sorting for efficiency.
    • Writes out the result via sink_csv().