Polars
Misc
Notes from Comparing dplyr with polars
Packages
- {{tidypolars}}
Read csv:
import polars as pl = pl.read_csv("https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv", null_values="NA") df
Chaining operations:
filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$")) df.
Using with a GPU
Can see up to 10x execution time decrease
Notes from Polars + NVIDIA GPU Tutorial
Installation
pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
Example
= pl.GPUEngine( gpu_engine =0, device=True, raise_on_fail ) = ( high_spenders_gpu transactions"CUST_ID") .group_by("AMOUNT").sum()) .agg(pl.col(="AMOUNT", descending=True) .sort(by .head()=gpu_engine) .collect(engine )
Columns
Quick Reference
dplyr polars Select column select(df, x)
df.select(pl.col("x"))
Select multiple columns select(df, x, y)
df.select(pl.col("x", "y"))
Rename column rename(df, new_x = x)
df.rename({"x": "new_x"})
Select all columns except x select(df, -x)
df.select(pl.exclude("x"))
Select all columns that start with “str” select(df, starts_with("str"))
df.select(pl.col("^bill.*$"))
ordf.select(cs.starts_with("str"))
[1]Select numeric columns select(df, where(is.numeric))
df.select(cs.float(), cs.integer())
[1,2]
Rows
Quick Reference
dplyr polars First n
rowshead(df, n)
df.head(n)
Filter range of values filter(df, between(x, lo, hi))
df.filter(pl.col("x").is_between(lo, hi))
Conditional filter(df, x > 4)
df.filter(pl.col("x") > 4 )
Sort rows arrange(df, x)
df.sort("x")
Mutate
Quick Reference
dplyr polars Create new column mutate(df, x_mean = mean(x))
df.with_columns(pl.col("x").mean().alias("x_mean"))
Example: Ratio variables
"bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio")) df.with_columns((pl.col( #> shape: (344, 10) #> ┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐ #> │ rowid ┆ species ┆ island ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex ┆ year ┆ bill_flipper_ratio │ #> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ #> │ i64 ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ str ┆ i64 ┆ f64 │ #> ╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡ #> │ 1 ┆ Adelie ┆ Torgersen ┆ 39.1 ┆ … ┆ 3750 ┆ male ┆ 2007 ┆ 0.216022 │ #> │ 2 ┆ Adelie ┆ Torgersen ┆ 39.5 ┆ … ┆ 3800 ┆ female ┆ 2007 ┆ 0.212366 │ #> │ 3 ┆ Adelie ┆ Torgersen ┆ 40.3 ┆ … ┆ 3250 ┆ female ┆ 2007 ┆ 0.206667 │ #> │ 4 ┆ Adelie ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ 2007 ┆ null │ #> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ #> │ 341 ┆ Chinstrap ┆ Dream ┆ 43.5 ┆ … ┆ 3400 ┆ female ┆ 2009 ┆ 0.215347 │ #> │ 342 ┆ Chinstrap ┆ Dream ┆ 49.6 ┆ … ┆ 3775 ┆ male ┆ 2009 ┆ 0.256995 │ #> │ 343 ┆ Chinstrap ┆ Dream ┆ 50.8 ┆ … ┆ 4100 ┆ male ┆ 2009 ┆ 0.241905 │ #> │ 344 ┆ Chinstrap ┆ Dream ┆ 50.2 ┆ … ┆ 3775 ┆ female ┆ 2009 ┆ 0.253535 │ #> └───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘
|> df mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm)
Summarize
Quick Reference
dplyr polars Group group_by(df, x)
df.group_by("x")
Summarize summarize(df, x_n = n())
df.agg(pl.count().alias("x_n"))
Example: Group_By, Count, Sort
"species").agg(pl.count().alias("counts")).sort("counts", descending=True) df.group_by( #> shape: (3, 2) #> ┌───────────┬────────┐ #> │ species ┆ counts │ #> │ --- ┆ --- │ #> │ str ┆ u32 │ #> ╞═══════════╪════════╡ #> │ Adelie ┆ 152 │ #> │ Gentoo ┆ 124 │ #> │ Chinstrap ┆ 68 │ #> └───────────┴────────┘
Example: For each species, find the penguin with the lowest body mass
"species").agg(pl.all().sort_by("body_mass_g").drop_nulls().first()) df.group_by( #> shape: (3, 9) #> ┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐ #> │ species ┆ rowid ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │ #> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ #> │ str ┆ i64 ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │ #> ╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡ #> │ Adelie ┆ 4 ┆ Torgersen ┆ 36.5 ┆ … ┆ 181 ┆ 2850 ┆ female ┆ 2007 │ #> │ Chinstrap ┆ 315 ┆ Dream ┆ 46.9 ┆ … ┆ 192 ┆ 2700 ┆ female ┆ 2008 │ #> │ Gentoo ┆ 272 ┆ Biscoe ┆ 42.7 ┆ … ┆ 208 ┆ 3950 ┆ female ┆ 2009 │ #> └───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
|> df group_by(species) |> arrange(body_mass_g) |> summarize(body_mass_g = first(body_mass_g)) #> # A tibble: 3 × 2 #> species body_mass_g #> <fct> <int> #> 1 Adelie 2850 #> 2 Chinstrap 2700 #> 3 Gentoo 3950
Joins
Quick Reference
dplyr polars Join dataframes left_join(df1, df2, by=x)
df1.join(df2, on="x", how="left")