Polars

Misc

  • Notes from Comparing dplyr with polars

  • Packages

  • Read csv:

    import polars as pl
    
    df = pl.read_csv("https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv", null_values="NA")
  • Chaining operations:

    df.filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$"))
  • Using with a GPU

    • Can see up to 10x execution time decrease

    • Notes from Polars + NVIDIA GPU Tutorial

    • Installation

      pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
    • Example

      gpu_engine = pl.GPUEngine(
          device=0,
          raise_on_fail=True,
      )
      
      high_spenders_gpu = (
          transactions
          .group_by("CUST_ID")
          .agg(pl.col("AMOUNT").sum())
          .sort(by="AMOUNT", descending=True)
          .head()
          .collect(engine=gpu_engine)
      )

Columns

  • Quick Reference

    dplyr polars
    Select column select(df, x) df.select(pl.col("x"))
    Select multiple columns select(df, x, y) df.select(pl.col("x", "y"))
    Rename column rename(df, new_x = x) df.rename({"x": "new_x"})
    Select all columns except x select(df, -x) df.select(pl.exclude("x"))
    Select all columns that start with “str” select(df, starts_with("str")) df.select(pl.col("^bill.*$")) or df.select(cs.starts_with("str"))[1]
    Select numeric columns select(df, where(is.numeric)) df.select(cs.float(), cs.integer())[1,2]

Rows

  • Quick Reference

    dplyr polars
    First n rows head(df, n) df.head(n)
    Filter range of values filter(df, between(x, lo, hi)) df.filter(pl.col("x").is_between(lo, hi))
    Conditional filter(df, x > 4) df.filter(pl.col("x") > 4 )
    Sort rows arrange(df, x) df.sort("x")

Mutate

  • Quick Reference

    dplyr polars
    Create new column mutate(df, x_mean = mean(x)) df.with_columns(pl.col("x").mean().alias("x_mean"))
  • Example: Ratio variables

    df.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio"))
    
    #> shape: (344, 10)
    #> ┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐
    #> │ rowid ┆ species   ┆ island    ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex    ┆ year ┆ bill_flipper_ratio │
    #> │ ---   ┆ ---       ┆ ---       ┆ ---            ┆   ┆ ---         ┆ ---    ┆ ---  ┆ ---                │
    #> │ i64   ┆ str       ┆ str       ┆ f64            ┆   ┆ i64         ┆ str    ┆ i64  ┆ f64                │
    #> ╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡
    #> │ 1     ┆ Adelie    ┆ Torgersen ┆ 39.1           ┆ … ┆ 3750        ┆ male   ┆ 2007 ┆ 0.216022           │
    #> │ 2     ┆ Adelie    ┆ Torgersen ┆ 39.5           ┆ … ┆ 3800        ┆ female ┆ 2007 ┆ 0.212366           │
    #> │ 3     ┆ Adelie    ┆ Torgersen ┆ 40.3           ┆ … ┆ 3250        ┆ female ┆ 2007 ┆ 0.206667           │
    #> │ 4     ┆ Adelie    ┆ Torgersen ┆ null           ┆ … ┆ null        ┆ null   ┆ 2007 ┆ null               │
    #> │ …     ┆ …         ┆ …         ┆ …              ┆ … ┆ …           ┆ …      ┆ …    ┆ …                  │
    #> │ 341   ┆ Chinstrap ┆ Dream     ┆ 43.5           ┆ … ┆ 3400        ┆ female ┆ 2009 ┆ 0.215347           │
    #> │ 342   ┆ Chinstrap ┆ Dream     ┆ 49.6           ┆ … ┆ 3775        ┆ male   ┆ 2009 ┆ 0.256995           │
    #> │ 343   ┆ Chinstrap ┆ Dream     ┆ 50.8           ┆ … ┆ 4100        ┆ male   ┆ 2009 ┆ 0.241905           │
    #> │ 344   ┆ Chinstrap ┆ Dream     ┆ 50.2           ┆ … ┆ 3775        ┆ female ┆ 2009 ┆ 0.253535           │
    #> └───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘
    df |> 
      mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm)

Summarize

  • Quick Reference

    dplyr polars
    Group group_by(df, x) df.group_by("x")
    Summarize summarize(df, x_n = n()) df.agg(pl.count().alias("x_n"))
  • Example: Group_By, Count, Sort

    df.group_by("species").agg(pl.count().alias("counts")).sort("counts", descending=True)
    
    #> shape: (3, 2)
    #> ┌───────────┬────────┐
    #> │ species   ┆ counts │
    #> │ ---       ┆ ---    │
    #> │ str       ┆ u32    │
    #> ╞═══════════╪════════╡
    #> │ Adelie    ┆ 152    │
    #> │ Gentoo    ┆ 124    │
    #> │ Chinstrap ┆ 68     │
    #> └───────────┴────────┘
  • Example: For each species, find the penguin with the lowest body mass

    df.group_by("species").agg(pl.all().sort_by("body_mass_g").drop_nulls().first())
    
    #> shape: (3, 9)
    #> ┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
    #> │ species   ┆ rowid ┆ island    ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex    ┆ year │
    #> │ ---       ┆ ---   ┆ ---       ┆ ---            ┆   ┆ ---               ┆ ---         ┆ ---    ┆ ---  │
    #> │ str       ┆ i64   ┆ str       ┆ f64            ┆   ┆ i64               ┆ i64         ┆ str    ┆ i64  │
    #> ╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
    #> │ Adelie    ┆ 4     ┆ Torgersen ┆ 36.5           ┆ … ┆ 181               ┆ 2850        ┆ female ┆ 2007 │
    #> │ Chinstrap ┆ 315   ┆ Dream     ┆ 46.9           ┆ … ┆ 192               ┆ 2700        ┆ female ┆ 2008 │
    #> │ Gentoo    ┆ 272   ┆ Biscoe    ┆ 42.7           ┆ … ┆ 208               ┆ 3950        ┆ female ┆ 2009 │
    #> └───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
    df |>
      group_by(species) |>
      arrange(body_mass_g) |>
      summarize(body_mass_g = first(body_mass_g))
    
    #> # A tibble: 3 × 2
    #>   species   body_mass_g
    #>   <fct>           <int>
    #> 1 Adelie           2850
    #> 2 Chinstrap        2700
    #> 3 Gentoo           3950

Joins

  • Quick Reference

    dplyr polars
    Join dataframes left_join(df1, df2, by=x) df1.join(df2, on="x", how="left")