Polars

Misc

  • Also see Python, Polars

  • Packages

    • {polars}
    • {tidypolars} - Allows one to use the tidyverse syntax while using the power of polars
    • {polarssql} - Provides a polars backend for DBI and dbplyr.
  • Resources

  • as.vector(<series>) strips attributes that may be useful, so prefer <series>$to_r_vector() for ordinary conversions to an R vector
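
    A minimal sketch of the difference, assuming a recent {polars} release that provides $to_r_vector(). The Date series is only an illustrative choice; any type that relies on R attributes (class, levels, time zone) makes the same point.

```r
library(polars)

# A Date series: the R representation relies on the "Date" class attribute
s <- as_polars_series(as.Date("2024-01-01"))

# Recommended conversion: returns an R vector with its class attribute intact
dates <- s$to_r_vector()
class(dates)

# as.vector() drops attributes, so compare its result with the one above
as.vector(s)
```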

  • Coerce an existing data frame into a polars DataFrame with as_polars_df(), or into a LazyFrame with as_polars_lf(). To pass an existing data frame to pl$DataFrame() directly, splice its columns with !!!

    pl$DataFrame(!!!data.frame(x = 1, y = "a"))
    #> shape: (1, 2)
    #> ┌─────┬─────┐
    #> │ x   ┆ y   │
    #> │ --- ┆ --- │
    #> │ f64 ┆ str │
    #> ╞═════╪═════╡
    #> │ 1.0 ┆ a   │
    #> └─────┴─────┘
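
    The coercion functions mentioned above can be sketched as follows. This is only an illustration on the built-in mtcars data; the object names are arbitrary.

```r
library(polars)

# Eager: copy an R data frame into a polars DataFrame
df <- as_polars_df(head(mtcars, 3))

# Lazy: build a LazyFrame; nothing is computed until $collect()
lf <- as_polars_lf(head(mtcars, 3))
collected <- lf$collect()

# Both routes hold the same 3 rows and 11 columns
dim(as.data.frame(df))
dim(as.data.frame(collected))
```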
  • If you previously passed a vector of column names or a list of expressions as a single argument, you now need to splice it with !!!

    dat <- as_polars_df(head(mtcars, 3))
    my_exprs <- list(pl$col("drat") + 1, "mpg", "cyl")
    dat$select(!!!my_exprs)
    #> shape: (3, 3)
    #> ┌──────┬──────┬─────┐
    #> │ drat ┆ mpg  ┆ cyl │
    #> │ ---  ┆ ---  ┆ --- │
    #> │ f64  ┆ f64  ┆ f64 │
    #> ╞══════╪══════╪═════╡
    #> │ 4.9  ┆ 21.0 ┆ 6.0 │
    #> │ 4.9  ┆ 21.0 ┆ 6.0 │
    #> │ 4.85 ┆ 22.8 ┆ 4.0 │
    #> └──────┴──────┴─────┘
    
    pl$col(!!!c("foo", "bar"), "baz")
    #> cols(["foo", "bar", "baz"])
    • The same applies to <lazyframe>$cast() and <lazyframe>$group_by()
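
    A small sketch of splicing into $group_by(), assuming the same dynamic-dots behavior shown for $select() above. The grouping columns and the aggregation are chosen only for illustration.

```r
library(polars)

lf <- as_polars_lf(mtcars)

# Column names stored in a character vector must be spliced with !!!
group_cols <- c("cyl", "gear")

out <- lf$
  group_by(!!!group_cols)$
  agg(pl$col("mpg")$mean()$alias("mean_mpg"))$
  collect()

# One row per cyl/gear combination present in mtcars
out
```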

Mutate

  • Example: 10-day and 50-day Moving Average (source)

    moving_average_pl <- 
      long_pl$with_columns(
        pl$
          col("Price")$
          rolling_mean(10)$
          over("Stock")$
          alias("Price_MA10"),
        pl$
          col("Price")$
          rolling_mean(50)$
          over("Stock")$
          alias("Price_MA50")
      )
    moving_average_pl
    
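    # Plotting helpers are namespaced explicitly; as_date() is assumed to come from {lubridate}
    moving_average_pl |> 
        tibble::as_tibble() |> 
        tidyr::pivot_longer(
            cols = Price:Price_MA50
        ) |>
        dplyr::group_by(Stock) |> 
        timetk::plot_time_series(lubridate::as_date(Date), value, .color_var = name, .facet_ncol = 4, .smooth = FALSE)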
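
    Because long_pl is not defined in this section, here is a self-contained sketch of the same pattern on made-up data. The toy frame, the window of 2, and the column name Price_MA2 are assumptions for illustration only.

```r
library(polars)

# Hypothetical toy data: two stocks with three prices each
toy <- pl$DataFrame(
  Stock = c("A", "A", "A", "B", "B", "B"),
  Price = c(1, 2, 3, 10, 20, 30)
)

# Rolling mean with a window of 2, computed separately within each stock
out <- toy$with_columns(
  pl$
    col("Price")$
    rolling_mean(2)$
    over("Stock")$
    alias("Price_MA2")
)

# The first row of each stock is null because its window is not yet full
out
```

    Using over("Stock") keeps the windows from running across the boundary between one stock and the next, which a plain rolling_mean() on the stacked column would do.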

Summarize

  • Example: min, max, mean by group

    df <- pl$scan_csv(file_name)$
        group_by("state")$
        agg(
            pl$
              col("measurement")$
              min()$
              alias("min_m"),
            pl$
              col("measurement")$
              max()$
              alias("max_m"),
            pl$
              col("measurement")$
              mean()$
              alias("mean_m")
        )$
        collect()
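
    The example above depends on a CSV file that is not shown, so the following is a sketch of the same aggregation on a small in-memory LazyFrame. The data values are invented, and the final sort() is added only to make the row order predictable.

```r
library(polars)

# Hypothetical measurements for two states
measurements <- pl$LazyFrame(
  state       = c("TX", "TX", "CA", "CA"),
  measurement = c(1, 3, 10, 20)
)

summary_df <- measurements$
  group_by("state")$
  agg(
    pl$col("measurement")$min()$alias("min_m"),
    pl$col("measurement")$max()$alias("max_m"),
    pl$col("measurement")$mean()$alias("mean_m")
  )$
  sort("state")$
  collect()

summary_df
```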

Pivoting

  • Example: pivot_longer (source)

    long_pl <- stock_data_pl$unpivot(
        index         = "Date",
        value_name    = "Price",
        variable_name = "Stock"
    )
    long_pl
    
    # as_date() is assumed to come from {lubridate}
    long_pl |>
        tibble::as_tibble() |> 
        dplyr::group_by(Stock) |> 
        timetk::plot_time_series(lubridate::as_date(Date), Price, .facet_ncol = 4, .smooth = FALSE)
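
    stock_data_pl is not defined here, so this sketch runs the same unpivot() call on a tiny wide table. The ticker columns and prices are made up for illustration.

```r
library(polars)

# Hypothetical wide table: one column per stock
wide <- pl$DataFrame(
  Date = c("2024-01-01", "2024-01-02"),
  AAPL = c(1, 2),
  MSFT = c(3, 4)
)

# Every non-index column becomes (Stock, Price) rows
long <- wide$unpivot(
  index         = "Date",
  value_name    = "Price",
  variable_name = "Stock"
)

# 2 dates x 2 stocks = 4 rows with columns Date, Stock, Price
long
```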

SQL

  • Example: min, max, mean by group

    # D is assumed to be an existing R data frame with state and measurement columns
    lf <- as_polars_lf(D)
    
    pl$
      SQLContext(frame = lf)$
      execute(
        "select min(measurement) as min_m, 
                max(measurement) as max_m, 
                avg(measurement) as mean_m 
        from frame 
        group by state"
      )$
      collect()