Polars

Misc

  • API Reference

  • Data Types

    • A signed integer can represent both positive and negative values (e.g. Int16), while an unsigned integer can only represent non-negative values (e.g. UInt16)
    • A Struct is a composite data type that can store multiple fields in a single column
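    • A minimal sketch of both ideas (integer dtype bounds and a Struct column); the column names are made up for illustration:

      import polars as pl

      df = pl.DataFrame({
          "a": pl.Series([-3, 7], dtype=pl.Int16),   # signed: negatives allowed
          "b": pl.Series([3, 7], dtype=pl.UInt16),   # unsigned: non-negative only
      }).with_columns(
          pl.struct("a", "b").alias("ab")            # pack both fields into one Struct column
      )
      df.schema  # a: Int16, b: UInt16, ab: Struct(a: Int16, b: UInt16)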
  • Notes from

  • Packages

  • Resources

  • Polars Cloud

    • Docs, Intro
    • Seems to be going after Apache Spark workloads
  • Read csv:

    import polars as pl
    
    df = pl.read_csv("https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv", null_values="NA")
  • Chaining operations:

    df.filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$"))
  • Using with a GPU

    • Can see up to a 10x decrease in execution time

    • Notes from Polars + NVIDIA GPU Tutorial

    • Installation

      pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
    • Example

      gpu_engine = pl.GPUEngine(
          device=0,
          raise_on_fail=True,
      )
      
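      # NOTE (assumption): transactions is a LazyFrame (e.g. built with pl.scan_csv / pl.scan_parquet),
      # since the chain ends with .collect(engine=gpu_engine)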
      high_spenders_gpu = (
          transactions
          .group_by("CUST_ID")
          .agg(pl.col("AMOUNT").sum())
          .sort(by="AMOUNT", descending=True)
          .head()
          .collect(engine=gpu_engine)
      )

Columns

  • Quick Reference

    dplyr → polars
    Select column: select(df, x) → df.select(pl.col("x"))
    Select multiple columns: select(df, x, y) → df.select(pl.col("x", "y"))
    Rename column: rename(df, new_x = x) → df.rename({"x": "new_x"})
    Select all columns except x: select(df, -x) → df.select(pl.exclude("x"))
    Select columns that start with "str": select(df, starts_with("str")) → df.select(pl.col("^str.*$")) or df.select(cs.starts_with("str"))
    Select numeric columns: select(df, where(is.numeric)) → df.select(cs.float(), cs.integer())
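    • The cs.* selector forms assume Polars' selectors module has been imported; a minimal sketch (using the penguins columns from above):

      import polars.selectors as cs

      df.select(cs.starts_with("bill"))    # columns whose names start with "bill"
      df.select(cs.float(), cs.integer())  # all float and integer columns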
  • Example: Tidyselect

    df \
      .select([
          pl.col('^name|.*name$'),
          pl.when(
              pl.col('name') == pl.col('email_name')).then(pl.lit('yes'))
              .otherwise(pl.lit('no'))
              .alias('same?')
      ])
    • Apparently Polars does not currently support lookahead or lookbehind in regexes, e.g. .*(?=@) to capture the email_name; a capture group with str.extract is the workaround (see Mutate >> Example 3)
    df |> 
      select(starts_with("name") | ends_with("name")) |>
      mutate(`same?` = case_when(
        name == email_name ~ "yes",
        TRUE ~ "no"))    

Rows

  • Quick Reference

    dplyr → polars
    First n rows: head(df, n) → df.head(n)
    Filter range of values: filter(df, between(x, lo, hi)) → df.filter(pl.col("x").is_between(lo, hi))
    Conditional filter: filter(df, x > 4) → df.filter(pl.col("x") > 4)
    Sort rows: arrange(df, x) → df.sort("x")
  • Example: Basic

    filtered_df = df_pl.filter(pl.col("region") == "Europe")

Mutate

  • Quick Reference

    dplyr → polars
    Create new column: mutate(df, x_mean = mean(x)) → df.with_columns(pl.col("x").mean().alias("x_mean"))
  • map_batches – For applying non-polars vectorized functions

  • Example 1: Ratio variables

    df.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio"))
    
    #> shape: (344, 10)
    #> ┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐
    #> │ rowid ┆ species   ┆ island    ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex    ┆ year ┆ bill_flipper_ratio │
    #> │ ---   ┆ ---       ┆ ---       ┆ ---            ┆   ┆ ---         ┆ ---    ┆ ---  ┆ ---                │
    #> │ i64   ┆ str       ┆ str       ┆ f64            ┆   ┆ i64         ┆ str    ┆ i64  ┆ f64                │
    #> ╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡
    #> │ 1     ┆ Adelie    ┆ Torgersen ┆ 39.1           ┆ … ┆ 3750        ┆ male   ┆ 2007 ┆ 0.216022           │
    #> │ 2     ┆ Adelie    ┆ Torgersen ┆ 39.5           ┆ … ┆ 3800        ┆ female ┆ 2007 ┆ 0.212366           │
    #> │ 3     ┆ Adelie    ┆ Torgersen ┆ 40.3           ┆ … ┆ 3250        ┆ female ┆ 2007 ┆ 0.206667           │
    #> │ 4     ┆ Adelie    ┆ Torgersen ┆ null           ┆ … ┆ null        ┆ null   ┆ 2007 ┆ null               │
    #> │ …     ┆ …         ┆ …         ┆ …              ┆ … ┆ …           ┆ …      ┆ …    ┆ …                  │
    #> │ 341   ┆ Chinstrap ┆ Dream     ┆ 43.5           ┆ … ┆ 3400        ┆ female ┆ 2009 ┆ 0.215347           │
    #> │ 342   ┆ Chinstrap ┆ Dream     ┆ 49.6           ┆ … ┆ 3775        ┆ male   ┆ 2009 ┆ 0.256995           │
    #> │ 343   ┆ Chinstrap ┆ Dream     ┆ 50.8           ┆ … ┆ 4100        ┆ male   ┆ 2009 ┆ 0.241905           │
    #> │ 344   ┆ Chinstrap ┆ Dream     ┆ 50.2           ┆ … ┆ 3775        ┆ female ┆ 2009 ┆ 0.253535           │
    #> └───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘
    df |> 
      mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm)
  • Example 2: Case-When

    df \
      .with_columns([  
          pl.when(pl.col('address').str.contains('OH'))
          .then(pl.lit('local'))
          .when(pl.col('address').str.contains('NY'))
          .then(pl.lit('foodie'))
          .otherwise(pl.lit('elsewhere'))
          .alias('familiarity')
        ])    
    df |>
     mutate(familiarity = case_when(
       str_detect(address, "OH") ~ "local",
       str_detect(address, "NY") ~ "foodie",
       TRUE ~ "elsewhere"
     ))
  • Example 3: String Extraction

    df \
      .with_columns(
          [
          pl.col('name').str.to_lowercase(),    
          pl.col('email').str.extract(r'^([\d\w]+)@', group_index = 1)
          .alias('email_name')
          ]
      )
    df |>
      mutate(
        name = str_to_lower(name),
        email_name = str_extract(email, "^([\\d\\w]+)@", group = 1)
      )
  • Example 4: Basic map_batches (source)

    import numpy as np
    from numpy.random import binomial

    data_dict = {
      'group': ['a']*4 + ['b']*4,
      'x': np.arange(1,9,1),
      'y': np.arange(8,0,-1), 
      'p': np.arange(1,9,1)/10
    }
    df = pl.DataFrame(data_dict)
    df.glimpse()
    #> Rows: 8
    #> Columns: 4
    #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
    #> $ x     <i64> 1, 2, 3, 4, 5, 6, 7, 8
    #> $ y     <i64> 8, 7, 6, 5, 4, 3, 2, 1
    #> $ p     <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
    
    # one column in, one value out
    df.with_columns(
        coin_flip = pl.col('p')
            .map_batches(function = lambda p: binomial(n = 1, p = p),
                         returns_scalar = True,
                         return_dtype = pl.UInt16)
    )
    #> group      x   y   p coin_flip
    #>   str    i64 i64 f64       u16
    #>   "a"      1   8 0.1         0
    #>   "a"      2   7 0.2         1
    #>   "a"      3   6 0.3         0
    #>   "a"      4   5 0.4         0
    #>   "b"      5   4 0.5         1
    #>   "b"      6   3 0.6         0
    #>   "b"      7   2 0.7         1
    #>   "b"      8   1 0.8         1
    • A simple lambda function that takes one column (p) as input to create the column coin_flip
    • The entire p column is passed to binomial in a single batch. This fits the conditions for map_batches since binomial is a non-Polars ({numpy}) vectorized function.
  • Example 5: map_batches multiple input columns (source)

    # two columns in, one value out - with structs
    df.with_columns(
        coin_flip = pl.struct('x','p')    # (1)
            .map_batches(function = lambda z: binomial(n = z.struct['x'], p = z.struct['p']),
                         returns_scalar = True,
                         return_dtype = pl.UInt16)
    )

    # two columns in, one value out - with exprs
    df.with_columns(
        coin_flip = pl.map_batches(exprs = ['x', 'p'],    # (2)
                                   function = lambda z: binomial(n = z[0], p = z[1]),
                                   returns_scalar = True,
                                   return_dtype = pl.UInt16)
    )
    #> group      x   y   p coin_flip
    #>   str    i64 i64 f64       u16
    #>   "a"      1   8 0.1         0
    #>   "a"      2   7 0.2         0
    #>   "a"      3   6 0.3         1
    #>   "a"      4   5 0.4         2
    #>   "b"      5   4 0.5         2
    #>   "b"      6   3 0.6         5
    #>   "b"      7   2 0.7         5
    #>   "b"      8   1 0.8         5
    • (1) The struct is a composite data type built from multiple columns. After creating the struct from x and p, it is subsetted inside the lambda function via z.struct[...] (see Misc >> Data Types)
    • (2) This version just uses the exprs argument to supply x and p to the lambda function, where they are subsetted by index
  • Example 6: map_batches w/array output (source)

    df.with_columns(
        coin_flip = pl.struct('x','p')
            .map_batches(function = lambda z: binomial(n = z.struct['x'], 
                                                       p = z.struct['p'],
                                                       size = (100,z.shape[0]))
                                              .transpose(), 
                         return_dtype = pl.Array(pl.UInt16, 100)
            )
    ).with_columns( 
        avg_outcome = pl.col('coin_flip').arr.mean(),
        exp_value = pl.col('x') * pl.col('p')
    )
    #> group      x   y   p       coin_flip avg_outcome exp_value
    #>   str    i64 i64 f64 array[u16, 100]         f64       f64
    #>   "a"      1   8 0.1     [0, 0, … 0]        0.06       0.1
    #>   "a"      2   7 0.2     [0, 0, … 1]        0.48       0.4
    #>   "a"      3   6 0.3     [1, 2, … 0]        0.85       0.9
    #>   "a"      4   5 0.4     [2, 2, … 4]        1.69       1.6
    #>   "b"      5   4 0.5     [2, 2, … 2]        2.53       2.5
    #>   "b"      6   3 0.6     [5, 4, … 3]        3.62       3.6
    #>   "b"      7   2 0.7     [6, 5, … 7]        4.81       4.9
    #>   "b"      8   1 0.8     [5, 8, … 5]        6.5        6.4
    • coin_flip is an array of counts of successful events. size = 100 draws are made per row, and each draw is a binomial experiment with n = x trials at probability p, so no count in an array should exceed its row's x value
    • (100, z.shape[0]) is the shape of the numpy output (100 draws × number of rows); the .transpose() turns it into one length-100 array per row

Summarize

  • Quick Reference

    dplyr → polars
    Group: group_by(df, x) → df.group_by("x")
    Summarize: summarize(df, x_n = n()) → df.agg(pl.count().alias("x_n")) (newer Polars versions prefer pl.len())
  • map_groups - Similar to map_batches (non-Polars, vectorized functions) but for group aggregations (See Mutate >> Examples 4-6)

  • Example 1: Group_By, Count, Sort

    df \
      .group_by("species") \
      .agg(pl.count() \
           .alias("counts")) \
      .sort("counts", descending=True)
    
    #> shape: (3, 2)
    #> ┌───────────┬────────┐
    #> │ species   ┆ counts │
    #> │ ---       ┆ ---    │
    #> │ str       ┆ u32    │
    #> ╞═══════════╪════════╡
    #> │ Adelie    ┆ 152    │
    #> │ Gentoo    ┆ 124    │
    #> │ Chinstrap ┆ 68     │
    #> └───────────┴────────┘
  • Example 2: Group-By, Arrange, Summarize

    For each species, find the penguin with the lowest body mass

    df \
      .group_by("species") \
      .agg(pl.all() \
        .sort_by("body_mass_g") \
        .drop_nulls() \
        .first())
    
    #> shape: (3, 9)
    #> ┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐
    #> │ species   ┆ rowid ┆ island    ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex    ┆ year │
    #> │ ---       ┆ ---   ┆ ---       ┆ ---            ┆   ┆ ---               ┆ ---         ┆ ---    ┆ ---  │
    #> │ str       ┆ i64   ┆ str       ┆ f64            ┆   ┆ i64               ┆ i64         ┆ str    ┆ i64  │
    #> ╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡
    #> │ Adelie    ┆ 4     ┆ Torgersen ┆ 36.5           ┆ … ┆ 181               ┆ 2850        ┆ female ┆ 2007 │
    #> │ Chinstrap ┆ 315   ┆ Dream     ┆ 46.9           ┆ … ┆ 192               ┆ 2700        ┆ female ┆ 2008 │
    #> │ Gentoo    ┆ 272   ┆ Biscoe    ┆ 42.7           ┆ … ┆ 208               ┆ 3950        ┆ female ┆ 2009 │
    #> └───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘
    df |>
      group_by(species) |>
      arrange(body_mass_g) |>
      summarize(body_mass_g = first(body_mass_g))
    
    #> # A tibble: 3 × 2
    #>   species   body_mass_g
    #>   <fct>           <int>
    #> 1 Adelie           2850
    #> 2 Chinstrap        2700
    #> 3 Gentoo           3950
  • Example 3: Select, Filter, Summarize, Across

    df \
     .select(df.columns[0:4]+['^performance.*$','^salary.*$']) \
     .filter(pl.col('age') >= 30) \
     .select(pl.col(pl.Int64).mean().name.prefix('mean_'))
    
    #> shape: (1, 5)
    #> mean_age mean_performance_rating mean_performance_reviews_count  mean_salary mean_salary_increase_percentage
    #>      f64                     f64                            f64          f64                             f64
    #>     46.6                     4.4                            3.4      40000.0                            10.0
    • You have to sandwich the pattern between ^ and $ because Polars only treats a string passed to pl.col()/select() as a regex when it is wrapped in ^...$; otherwise it is interpreted as a literal column name (see the short sketch after this example)
    df |>
      select(1:3, 
             starts_with("performance"), 
             starts_with("salary")) |> 
      filter(age >= 30) |>
      summarize(across(
        .cols = where(is.numeric), 
        .fns = mean, 
        .names = "mean_{.col}"))
    
    #> # A tibble: 1 × 5
    #>   mean_age mean_performance_rating mean_performance_reviews_count mean_salary
    #>      <dbl>                   <dbl>                          <dbl>       <dbl>
    #> 1     46.6                     4.4                            3.4       40000
    #> # ℹ 1 more variable: mean_salary_increase_percentage <dbl>
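    • A short sketch of the literal-vs-regex rule (column names from the example above; assumes import polars.selectors as cs):

      df.select(pl.col("age"))                   # plain string → treated as a literal column name
      df.select(pl.col("^performance.*$"))       # wrapped in ^...$ → treated as a regex over column names
      df.select(cs.starts_with("performance"))   # regex-free alternative using selectors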
  • Example 4: Basic map_groups (source)

    data_dict = {
      'group': ['a']*4 + ['b']*4,
      'truth': [1,1,0,0]*2,
      'mod_bad': [0.25,0.25,0.75,0.75]*2, 
      'mod_bst': [0.99,0.75,0.25,0.01]*2,
      'mod_rnd': [0.5]*8,
      'mod_mix': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6]
    }
    df = pl.DataFrame(data_dict)
    df.glimpse()
    #> Rows: 8
    #> Columns: 6
    #> $ group   <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
    #> $ truth   <i64> 1, 1, 0, 0, 1, 1, 0, 0
    #> $ mod_bad <f64> 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 0.75, 0.75
    #> $ mod_bst <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01
    #> $ mod_rnd <f64> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5
    #> $ mod_mix <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6
    
    from sklearn.metrics import roc_auc_score  # assumed source of roc_auc_score

    df.group_by('group').agg(
        pl.map_groups(
            exprs = ['truth', 'mod_mix'],
            function = lambda x: roc_auc_score(x[0], x[1]),
            return_dtype = pl.Float64,
            returns_scalar = True
        )
    )
    
    #> group    truth
    #>   str      f64
    #>   "a"     1.0
    #>   "b"     0.25
    
    # import polars_ds as pds
    # 
    # df.group_by('group').agg (
    #     auroc = pds.query_roc_auc('truth', 'mod_bst')
    # )
    • Simulates calculating an AUC score per group for binary classification models
    • Each model has 8 probability predictions along with the true classification labels and a grouping variable
    • exprs specifies columns for the lambda function which get subsetted by indices.
    • {polars-ds} has a one-liner function for this
  • Example 5: Multiple Column Aggregation (source)

    import polars.selectors as cs
    import polars_ds as pds

    def auroc_expressions(models):    # (1)
        for m in models:
            yield pds.query_roc_auc('truth', m).alias(m)

    # could also do: [c for c in df.columns if c[:4] == 'mod_']
    mods = cs.expand_selector(df, cs.starts_with('mod_'))    # (2)

    df.group_by('group').agg(auroc_expressions(mods))    # (3)
    #> group    mod_bad mod_bst mod_rnd mod_mix
    #>   str        f64     f64     f64     f64
    #>   "b"       -0.0     1.0     0.5    0.25
    #>   "a"       -0.0     1.0     0.5    1.0
    • (1) Write a wrapper function that handles the iteration and acts as a generator, yielding the expression of interest (also see Python, General >> Loops >> Generators)
    • (2) Obtain the relevant column names to pass into the function. Here you can use the cs.expand_selector helper or any raw parsing of the column names (e.g. a list comprehension)
    • (3) Pass the generator into the standard df.group_by(...).agg(...) flow

Joins

  • Quick Reference

    dplyr → polars
    Join dataframes: left_join(df1, df2, by = "x") → df1.join(df2, on="x", how="left")
  • Example: Basic Left Join

    merged_df = df_pl.join(pop_df_pl, on="country", how="left")
  • Example: Left Join

    import random
    
    df \
      .join(df_dept, on="department", how="left") \
      .select(['name','dept_id']) \
      .with_columns(
          employee_id = pl.format(
            '{}-{}',
            'dept_id',
            pl.Series([
              random.randint(100000, 999999) for _ in range(len(df))
              ])
              )
      )    
    df |>
      left_join(df_dept, by = "department") |>
      select(name, dept_id) |>
      mutate(employee_id = map_chr(dept_id, ~ paste0(.x, "-", sample(1:9, 1))))    

tidyr

  • Example: Unite

    df \
      .with_columns(
          pl.concat_str(
              pl.col(pl.String), separator=" "
          ).alias('combination_of_character')
      ) \
      .select(pl.col('combination_of_character'))
    
    #> shape: (6, 1)
    #> combination_of_character
    #> str
    #> "Alice New York 123 Main St, On…
    #> "Bob San Francisco 123 Main St,…
    #> "Charlie Tokyo 456-7890, Tokyo,…
    #> "Ken Toronto 49494 Exchange St,…
    #> "Steven Lima 1010 Gb st, Lima, …
    #> "Carlos Cleveland 666 Heaven dr…
    df |> 
      unite(col = "combination_of_character", 
            where(is.character), 
            sep = " ", ) |> 
      select(combination_of_character)
    
    #> # A tibble: 6 × 1
    #>   combination_of_character                                                                                         
    #>   <chr>                                                                                                            
    #> 1 Alice New York 123 Main St, Ontario, OH 123-456-7890 alice@example.com Engineering 2010-01-01 Active 2022-05-01  
    #> 2 Bob San Francisco 123 Main St, Calgary, AB 987-654-3210 bob@example.com Marketing 2012-05-15 Inactive 2021-07-15 
    #> 3 Charlie Tokyo 456-7890, Tokyo, NY 098-765-4332 charlie@example.com Finance 2015-10-01 Active 2022-08-31          
    #> 4 Ken Toronto 49494 Exchange St, Toronto, ON 111-232-4141 ken@stats.org Marketing 2010-04-01 Inactive 2024-10-30   
    #> 5 Steven Lima 1010 Gb st, Lima, OH 505-402-6060 stephencurry@nba.com Marketing 2009-10-30 Active 2023-01-02        
    #> 6 Carlos Cleveland 666 Heaven dr, Cleveland, OH 909-435-1000 carlos@genius.edu Finance 2005-11-12 Active 2024-12-12
  • Example: Pivot Wider

    df \
      .select(['name','address']) \
      .with_columns(
          state = pl.col('address').str.extract(r'([A-Z]{2})$')
      ) \
      .select('name','state') \
      .pivot(on = 'state', index = 'name', values='state', aggregate_function='len') \
      .with_columns(
          pl.col(pl.UInt32).fill_null(0)
      ) 
    
    # Or since this just creates dummies
    df \
      .select(['name','department']) \
      .to_dummies(columns = 'department') 
    df |>
      select(name, department) |>
      pivot_wider(id_cols = "name", 
                  names_from = "department", 
                  values_from = "department", 
                  values_fill = 0, 
                  values_fn = length, 
                  names_prefix = "department_")
    
    #> # A tibble: 6 × 4
    #>   name    department_Engineering department_Marketing department_Finance
    #>   <chr>                    <int>                <int>              <int>
    #> 1 Alice                        1                    0                  0
    #> 2 Bob                          0                    1                  0
    #> 3 Charlie                      0                    0                  1
    #> 4 Ken                          0                    1                  0
    #> 5 Steven                       0                    1                  0
    #> 6 Carlos                       0                    0                  1

UDFs

  • Misc

  • Example 1: Basic pipe at column-level

    import numpy as np

    data_dict = {
      'group': ['a']*4 + ['b']*4,
      'x': np.arange(1,9,1),
      'y': np.arange(8,0,-1), 
      'p': np.arange(1,9,1)/10
    }
    df = pl.DataFrame(data_dict)
    df.glimpse()
    #> Rows: 8
    #> Columns: 4
    #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'
    #> $ x     <i64> 1, 2, 3, 4, 5, 6, 7, 8
    #> $ y     <i64> 8, 7, 6, 5, 4, 3, 2, 1
    #> $ p     <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
    
    
    def cap(c:pl.Expr, ceil:int = 5) -> pl.Expr: 
      return pl.when(c > ceil).then(ceil).otherwise(c)
    
    df.with_columns(pl.col('x').pipe(cap))
    #> shape: (8, 5)
    #> group    x     y     p     literal
    #>   str    i64 i64   f64       i64
    #>   "a"    1       8     0.1         1
    #>   "a"    2       7     0.2         2
    #>   "a"    3       6     0.3         3
    #>   "a"    4       5     0.4         4
    #>   "b"    5       4     0.5         5
    #>   "b"    6       3     0.6         5
    #>   "b"    7       2     0.7         5
    #>   "b"    8       1     0.8         5
    
    df.with_columns(cs.numeric().pipe(cap).name.keep())
    #>  shape: (8, 4)
    #> group    x     y   p
    #>   str    i64 i64 f64
    #>   "a"      1   5 0.1
    #>   "a"      2   5 0.2
    #>   "a"      3   5 0.3
    #>   "a"      4   5 0.4
    #>   "b"      5   4 0.5
    #>   "b"      5   3 0.6
    #>   "b"      5   2 0.7
    #>   "b"      5   1 0.8
    • The UDF is essentially a case_when: it caps any values greater than 5 at 5
    • pipe takes the unquoted name of the UDF as its argument
    • Use name.keep when piping more than one column (e.g. cs.numeric()); it keeps each column's original name, which prevents the outputs from all sharing the same name (e.g. literal)
  • Example 2: Basic pipe at df-level

    def calc_diffs(df:pl.DataFrame, threshold:int = 5) -> pl.DataFrame:

        df_out = (
            df
            .with_columns(
                abs = (pl.col('x') - pl.col('y')).abs(),
                abs_gt_t = (pl.col('x') - pl.col('y')).abs() > threshold,
            )
        )
        return df_out
    
    df.pipe(calc_diffs)
    #> group      x   y   p abs abs_gt_t
    #>   str    i64 i64 f64 i64     bool
    #>   "a"      1   8 0.1   7     true
    #>   "a"      2   7 0.2   5    false
    #>   "a"      3   6 0.3   3    false
    #>   "a"      4   5 0.4   1    false
    #>   "b"      5   4 0.5   1    false
    #>   "b"      6   3 0.6   3    false
    #>   "b"      7   2 0.7   5    false
    #>   "b"      8   1 0.8   7     true
    • calc_diffs calculates the absolute difference between x and y and stores it in abs. Then tests whether abs is greater than some threshold (abs_gt_t).
  • Example 3: pipe at df-level w/args

    def calc_diffs(
      df:pl.DataFrame,
      var1:str = 'x',
      var2:str = 'y',
      threshold:int = 5) -> pl.DataFrame:

        df_out = (
            df
            .with_columns(
                abs = (pl.col(var1) - pl.col(var2)).abs(),
                abs_gt_t = (pl.col(var1) - pl.col(var2)).abs() > threshold,
            )
        )
        return df_out

    df.pipe(
        calc_diffs,
        var1 = 'y',
        var2 = 'x',
        threshold = 3
    )
    #> group      x   y   p abs abs_gt_t
    #>   str    i64 i64 f64 i64     bool
    #>   "a"      1   8 0.1   7     true
    #>   "a"      2   7 0.2   5     true
    #>   "a"      3   6 0.3   3    false
    #>   "a"      4   5 0.4   1    false
    #>   "b"      5   4 0.5   1    false
    #>   "b"      6   3 0.6   3    false
    #>   "b"      7   2 0.7   5     true
    #>   "b"      8   1 0.8   7     true
    • Column names are quoted in the function arguments, but not in the function code.
  • Example 4: Basic map_columns

    df.map_columns(cs.numeric(), cap)
    • See Example 1 for the output and cap udf definition
    • More concise alternative to pipe at the column-level

Snippets

  • Example: Lead and Forward Fill

    • If the department has only 1 row of salary data, do not shift salary
    • If the department has more than 1 row of salary data, shift the salary column by -1
    • Reason: there was a data-entry mistake for departments with more than 1 row; the actual salary belongs one row further down
    df \
      .group_by('department') \
      .agg(
          pl.col('name'),
          pl.when(pl.col('salary').len()==1).then(pl.col('salary'))
          .otherwise(pl.col('salary').shift(-1))
          .alias('salary_shift')) \
      .explode('name','salary_shift') \
      .with_columns(
          pl.col('salary_shift').forward_fill()) 
    df |>
      group_by(department) |>
      summarize(
        name = name,
        salary_shift = case_when(
          n() == 1 ~ salary,
          TRUE ~ lead(salary)
        )
      ) |>
     fill(salary_shift, .direction = "down")
    
    #> # A tibble: 6 × 3
    #> # Groups:   department [3]
    #>   department  name    salary_shift
    #>   <chr>       <chr>          <dbl>
    #> 1 Engineering Alice          50000
    #> 2 Finance     Charlie        30000
    #> 3 Finance     Carlos         30000
    #> 4 Marketing   Bob            20000
    #> 5 Marketing   Ken            40000
    #> 6 Marketing   Steven         40000
  • Example: Map

    import random
    
    df \
      .select(['name','dept_id']) \
      .with_columns(
          employee_id = pl.format(
            '{}-{}',
            'dept_id',
            pl.Series([
              random.randint(100000, 999999) for _ in range(len(df))
              ])
              )
      ) 
    df |>
      select(name, dept_id) |>
      mutate(employee_id = map_chr(dept_id, ~paste0(.x, "-", sample(1000000:9999999, 1))))
    #> # A tibble: 6 × 3
    #>   name    dept_id employee_id
    #>   <chr>     <dbl> <chr>      
    #> 1 Alice        30 30-1694470 
    #> 2 Bob          25 25-1696036 
    #> 3 Charlie      20 20-4463080 
    #> 4 Ken          25 25-6942432 
    #> 5 Steven       25 25-3012223 
    #> 6 Carlos       20 20-8705991
  • Example: Split Dataframe and Loop (source)

    dfs = df.partition_by('group', as_dict = True, include_key = True)
    for k,v in dfs.items():
        print(f"{k}  : {v}")
    #> ('a',)  : shape: (4, 3)
    #> ┌───────┬──────┬──────┐
    #> │ group ┆ x    ┆ y    │
    #> │ ---   ┆ ---  ┆ ---  │
    #> │ str   ┆ f64  ┆ f64  │
    #> ╞═══════╪══════╪══════╡
    #> │ a     ┆ 0.99 ┆ 0.99 │
    #> │ a     ┆ 0.75 ┆ 0.75 │
    #> │ a     ┆ 0.25 ┆ 0.25 │
    #> │ a     ┆ 0.01 ┆ 0.01 │
    #> └───────┴──────┴──────┘
    #> ('b',)  : shape: (4, 3)
    #> ┌───────┬──────┬─────┐
    #> │ group ┆ x    ┆ y   │
    #> │ ---   ┆ ---  ┆ --- │
    #> │ str   ┆ f64  ┆ f64 │
    #> ╞═══════╪══════╪═════╡
    #> │ b     ┆ 0.99 ┆ 0.5 │
    #> │ b     ┆ 0.75 ┆ 0.5 │
    #> │ b     ┆ 0.25 ┆ 0.5 │
    #> │ b     ┆ 0.01 ┆ 0.6 │
    #> └───────┴──────┴─────┘
    
    import statsmodels.api as sm

    grps = [k[0] for k in dfs.keys()]  # turn tuple into scalar bcs only one grouping var in key
    mods = [sm.OLS(d['x'].to_numpy(),
                   sm.add_constant(d['y'].to_numpy())
            ).fit() for k,d in dfs.items()]
    coef = [m.params[1] for m in mods]
    dict(zip(grps, coef))
    #> {'a': np.float64(1.0000000000000004), 'b': np.float64(-6.533333333333339)}
    • partition_by splits the dataframe into separate dataframes per group (like split in base R or group_split in {dplyr}). Since as_dict = True, it actually returns a dictionary of dataframes keyed by group
    • This can also be done within the unsplit dataframe using map_batches and map_elements, but it's less efficient (see the previous section in the article for details); one in-frame alternative is sketched below
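    • A minimal sketch of one in-frame alternative (per-group OLS slope of x on y via pl.map_groups; not necessarily the article's approach):

      import statsmodels.api as sm

      df.group_by('group').agg(
          pl.map_groups(
              exprs = ['x', 'y'],
              function = lambda s: sm.OLS(s[0].to_numpy(),
                                          sm.add_constant(s[1].to_numpy())).fit().params[1],
              return_dtype = pl.Float64,
              returns_scalar = True
          ).alias('coef')
      )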

Lazy Evaluation

  • Example: Basic

    import polars as pl
    
    df_lazy = (
        pl.scan_csv("sales.csv")
          .filter(pl.col("amount") > 100)
          .group_by("segment")
          .agg(pl.col("amount").mean())
          .sort("amount")
    )
    
    result = df_lazy.collect()
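    • Nothing executes until collect(); as a quick check, the optimized query plan can be inspected beforehand:

      print(df_lazy.explain())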
  • Example: Lazily aggregate Delta lake files in AWS S3 (source)

    import polars as pl
    from datetime import datetime
    import os
    
    t1 = datetime.now()
    
    table_path = "s3://confessions-of-a-data-guy/BigTest/social_media_rots_brains"
    
    os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxx"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
    
    lazy_df = pl.scan_delta(table_path)
    
    agg_lazy = (
        lazy_df
          .with_columns(pl.col("created_at").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias("created_at_ts"))
          .with_columns(pl.col("created_at_ts").dt.date().alias("post_date"))
          .group_by("post_date")
          .agg(pl.count().alias("num_posts"))
          .sort("post_date")
    )
    
    agg_lazy.sink_csv("posts_per_day.csv")
    
    t2 = datetime.now()
    print(f"Elapsed time: {t2 - t1}")
    • Uses Polars’ lazy engine with scan_delta() to load the Delta Lake table.
    • Parses timestamps using str.strptime() into a Polars datetime.
    • Extracts date for grouping using .dt.date().
    • Performs lazy aggregation and sorting for efficiency.
    • Writes out the result via sink_csv().