Polars
Misc
- A signed integer can represent both positive and negative values (e.g. Int16), while an unsigned integer can only represent non-negative values (e.g. UInt16)
- A Struct is a composite data type that can store multiple fields in a single column
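  As a quick illustration of both points (a minimal sketch, not from the original notes; the column names are made up):

  ```python
  import polars as pl

  # UInt16 can only hold non-negative values; Int16 can hold negatives
  s = pl.Series([1, 2, 3], dtype=pl.UInt16)

  # Pack two columns into a single Struct column, then pull a field back out
  df = pl.DataFrame({"first": ["Ada", "Grace"], "last": ["Lovelace", "Hopper"]})
  out = (
      df
      .with_columns(full_name = pl.struct("first", "last"))
      .with_columns(first_again = pl.col("full_name").struct.field("first"))
  )
  ```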
Notes from
Packages
Resources
Polars Cloud
Read csv:

```python
import polars as pl

df = pl.read_csv(
    "https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv",
    null_values="NA",
)
```

Chaining operations:

```python
df.filter(pl.col("species") == "Gentoo").select(pl.col("^bill.*$"))
```

Using with a GPU
Can see up to a 10x decrease in execution time
Notes from Polars + NVIDIA GPU Tutorial
Installation
```bash
pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
```

Example

```python
gpu_engine = pl.GPUEngine(
    device=0,
    raise_on_fail=True,
)

high_spenders_gpu = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum())
    .sort(by="AMOUNT", descending=True)
    .head()
    .collect(engine=gpu_engine)
)
```
Columns
Quick Reference
| Operation | dplyr | polars |
|---|---|---|
| Select column | `select(df, x)` | `df.select(pl.col("x"))` |
| Select multiple columns | `select(df, x, y)` | `df.select(pl.col("x", "y"))` |
| Rename column | `rename(df, new_x = x)` | `df.rename({"x": "new_x"})` |
| Select all columns except x | `select(df, -x)` | `df.select(pl.exclude("x"))` |
| Select all columns that start with "str" | `select(df, starts_with("str"))` | `df.select(pl.col("^str.*$"))` or `df.select(cs.starts_with("str"))` |
| Select numeric columns | `select(df, where(is.numeric))` | `df.select(cs.float(), cs.integer())` |
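Note (my addition): the `cs.*` calls in the table assume the selectors module has been imported, e.g. using the penguins df read in above:

```python
import polars.selectors as cs

df.select(cs.starts_with("bill"))   # bill_length_mm, bill_depth_mm
```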
Example: Tidyselect

```python
df \
    .select([
        pl.col('^name|.*name$'),
        pl.when(pl.col('name') == pl.col('email_name'))
          .then(pl.lit('yes'))
          .otherwise(pl.lit('no'))
          .alias('same?')
    ])
```

- Apparently we cannot currently use lookahead or lookbehind assertions in polars regexes, such as `.*(?=@)` to capture the email_name

```r
df |>
  select(starts_with("name") | ends_with("name")) |>
  mutate(`same?` = case_when(
    name == email_name ~ "yes",
    TRUE ~ "no"))
```
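Since the regex engine lacks lookarounds, a capture group in `str.extract` is the usual substitute. A sketch assuming an email column like the one in the string-extraction example further below:

```python
# Instead of a lookahead such as .*(?=@), capture everything before the @
df.with_columns(
    email_name = pl.col('email').str.extract(r'^([^@]+)@', group_index=1)
)
```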
Rows
Quick Reference
| Operation | dplyr | polars |
|---|---|---|
| First n rows | `head(df, n)` | `df.head(n)` |
| Filter range of values | `filter(df, between(x, lo, hi))` | `df.filter(pl.col("x").is_between(lo, hi))` |
| Conditional filter | `filter(df, x > 4)` | `df.filter(pl.col("x") > 4)` |
| Sort rows | `arrange(df, x)` | `df.sort("x")` |

Example: Basic

```python
filtered_df = df_pl.filter(pl.col("region") == "Europe")
```
Mutate
Quick Reference
| Operation | dplyr | polars |
|---|---|---|
| Create new column | `mutate(df, x_mean = mean(x))` | `df.with_columns(pl.col("x").mean().alias("x_mean"))` |

- map_batches – For applying non-polars vectorized functions

Example 1: Ratio variables
df.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("bill_flipper_ratio")) #> shape: (344, 10) #> ┌───────┬───────────┬───────────┬────────────────┬───┬─────────────┬────────┬──────┬────────────────────┐ #> │ rowid ┆ species ┆ island ┆ bill_length_mm ┆ … ┆ body_mass_g ┆ sex ┆ year ┆ bill_flipper_ratio │ #> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ #> │ i64 ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ str ┆ i64 ┆ f64 │ #> ╞═══════╪═══════════╪═══════════╪════════════════╪═══╪═════════════╪════════╪══════╪════════════════════╡ #> │ 1 ┆ Adelie ┆ Torgersen ┆ 39.1 ┆ … ┆ 3750 ┆ male ┆ 2007 ┆ 0.216022 │ #> │ 2 ┆ Adelie ┆ Torgersen ┆ 39.5 ┆ … ┆ 3800 ┆ female ┆ 2007 ┆ 0.212366 │ #> │ 3 ┆ Adelie ┆ Torgersen ┆ 40.3 ┆ … ┆ 3250 ┆ female ┆ 2007 ┆ 0.206667 │ #> │ 4 ┆ Adelie ┆ Torgersen ┆ null ┆ … ┆ null ┆ null ┆ 2007 ┆ null │ #> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ #> │ 341 ┆ Chinstrap ┆ Dream ┆ 43.5 ┆ … ┆ 3400 ┆ female ┆ 2009 ┆ 0.215347 │ #> │ 342 ┆ Chinstrap ┆ Dream ┆ 49.6 ┆ … ┆ 3775 ┆ male ┆ 2009 ┆ 0.256995 │ #> │ 343 ┆ Chinstrap ┆ Dream ┆ 50.8 ┆ … ┆ 4100 ┆ male ┆ 2009 ┆ 0.241905 │ #> │ 344 ┆ Chinstrap ┆ Dream ┆ 50.2 ┆ … ┆ 3775 ┆ female ┆ 2009 ┆ 0.253535 │ #> └───────┴───────────┴───────────┴────────────────┴───┴─────────────┴────────┴──────┴────────────────────┘df |> mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm)Example 2: Case-When
df \ .with_columns([ pl.when(pl.col('address').str.contains('OH')) .then(pl.lit('local')) .when(pl.col('address').str.contains('NY')) .then(pl.lit('foodie')) .otherwise(pl.lit('elsewhere')) .alias('familiarity') ])df |> mutate(familiarity = case_when( str_detect(address, "OH") ~ "local", str_detect(address, "NY") ~ "foodie", TRUE ~ "elsewhere" ))Example 3: String Extraction
```python
df \
    .with_columns([
        pl.col('name').str.to_lowercase(),
        pl.col('email').str.extract(r'^([\d\w]+)@', group_index = 1)
            .alias('email_name')
    ])
```

```r
df |>
  mutate(
    name = tolower(name),
    email_name = str_extract(email, "^([\\d\\w]+)@", group = 1)
  )
```

Example 4: Basic
map_batches(source)data_dict = { 'group': ['a']*4 + ['b']*4, 'x': np.arange(1,9,1), 'y': np.arange(8,0,-1), 'p': np.arange(1,9,1)/10 } df = pl.DataFrame(data_dict) df.glimpse() #> Rows: 8 #> Columns: 4 #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b' #> $ x <i64> 1, 2, 3, 4, 5, 6, 7, 8 #> $ y <i64> 8, 7, 6, 5, 4, 3, 2, 1 #> $ p <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 # one column in, one value out df.with_columns( coin_flip = pl.col('p') .map_batches(function = lambda p: binomial(n = 1, p = p), returns_scalar = True, return_dtype = pl.UInt16) ) #> group x y p coin_flip #> str i64 i64 f64 u16 #> "a" 1 8 0.1 0 #> "a" 2 7 0.2 1 #> "a" 3 6 0.3 0 #> "a" 4 5 0.4 0 #> "b" 5 4 0.5 1 #> "b" 6 3 0.6 0 #> "b" 7 2 0.7 1 #> "b" 8 1 0.8 1- Simple lambda function that takes 1 column (p) as input to create the column coin_flip
- Each value of p is fed through binomial. This fits the conditions for map_batches since binomial is non-polars ({numpy}) and vectorized.
Example 5: map_batches multiple input columns (source)

```python
# two columns in, one value out - with structs
df.with_columns(                                                       # (1)
    coin_flip = pl.struct('x', 'p')
        .map_batches(function = lambda z: binomial(n = z.struct['x'],
                                                   p = z.struct['p']),
                     returns_scalar = True,
                     return_dtype = pl.UInt16)
)

# two columns in, one value out - with exprs
df.with_columns(                                                       # (2)
    coin_flip = pl.map_batches(exprs = ['x', 'p'],
                               function = lambda z: binomial(n = z[0], p = z[1]),
                               returns_scalar = True,
                               return_dtype = pl.UInt16)
)
#> group  x    y    p    coin_flip
#> str    i64  i64  f64  u16
#> "a"    1    8    0.1  0
#> "a"    2    7    0.2  0
#> "a"    3    6    0.3  1
#> "a"    4    5    0.4  2
#> "b"    5    4    0.5  2
#> "b"    6    3    0.6  5
#> "b"    7    2    0.7  5
#> "b"    8    1    0.8  5
```

1. The struct is a composite data type that is created using multiple columns. After creating the struct from x and p, it's subsetted inside the lambda function (See Misc >> Data Types)
2. This version just uses the exprs argument to supply x and p to the lambda function, where they're subsetted by index.
Example 6:
map_batchesw/array output (source)df.with_columns( coin_flip = pl.struct('x','p') .map_batches(function = lambda z: binomial(n = z.struct['x'], p = z.struct['p'], size = (100,z.shape[0])) .transpose(), return_dtype = pl.Array(pl.UInt16, 100) ) ).with_columns( avg_outcome = pl.col('coin_flip').arr.mean(), exp_value = pl.col('x') * pl.col('p') ) #> group x y p coin_flip avg_outcome exp_value #> str i64 i64 f64 array[u16, 100] f64 f64 #> "a" 1 8 0.1 [0, 0, … 0] 0.06 0.1 #> "a" 2 7 0.2 [0, 0, … 1] 0.48 0.4 #> "a" 3 6 0.3 [1, 2, … 0] 0.85 0.9 #> "a" 4 5 0.4 [2, 2, … 4] 1.69 1.6 #> "b" 5 4 0.5 [2, 2, … 2] 2.53 2.5 #> "b" 6 3 0.6 [5, 4, … 3] 3.62 3.6 #> "b" 7 2 0.7 [6, 5, … 7] 4.81 4.9 #> "b" 8 1 0.8 [5, 8, … 5] 6.5 6.4coin_flipis an array of counts of successful events. The number of trials (size) is 100 and the number of experiments per trial is x. So, in each array, there shouldn’t be a count larger than it’s associated x value- (100, z.shape[0]) also indicates the shape of the array output
Summarize
Quick Reference
| Operation | dplyr | polars |
|---|---|---|
| Group | `group_by(df, x)` | `df.group_by("x")` |
| Summarize | `summarize(df, x_n = n())` | `df.agg(pl.count().alias("x_n"))` |

- map_groups – Similar to map_batches (non-polars, vectorized functions) but for group aggregations (See Mutate, Examples 4-6)

Example 1: Group_By, Count, Sort
df \ .group_by("species") \ .agg(pl.count() \ .alias("counts")) \ .sort("counts", descending=True) #> shape: (3, 2) #> ┌───────────┬────────┐ #> │ species ┆ counts │ #> │ --- ┆ --- │ #> │ str ┆ u32 │ #> ╞═══════════╪════════╡ #> │ Adelie ┆ 152 │ #> │ Gentoo ┆ 124 │ #> │ Chinstrap ┆ 68 │ #> └───────────┴────────┘Example 2: Group-By, Arrange, Summarize
For each species, find the penguin with the lowest body mass
df \ .group_by("species") \ .agg(pl.all() \ .sort_by("body_mass_g") \ .drop_nulls() \ .first()) #> shape: (3, 9) #> ┌───────────┬───────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬────────┬──────┐ #> │ species ┆ rowid ┆ island ┆ bill_length_mm ┆ … ┆ flipper_length_mm ┆ body_mass_g ┆ sex ┆ year │ #> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ #> │ str ┆ i64 ┆ str ┆ f64 ┆ ┆ i64 ┆ i64 ┆ str ┆ i64 │ #> ╞═══════════╪═══════╪═══════════╪════════════════╪═══╪═══════════════════╪═════════════╪════════╪══════╡ #> │ Adelie ┆ 4 ┆ Torgersen ┆ 36.5 ┆ … ┆ 181 ┆ 2850 ┆ female ┆ 2007 │ #> │ Chinstrap ┆ 315 ┆ Dream ┆ 46.9 ┆ … ┆ 192 ┆ 2700 ┆ female ┆ 2008 │ #> │ Gentoo ┆ 272 ┆ Biscoe ┆ 42.7 ┆ … ┆ 208 ┆ 3950 ┆ female ┆ 2009 │ #> └───────────┴───────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴────────┴──────┘df |> group_by(species) |> arrange(body_mass_g) |> summarize(body_mass_g = first(body_mass_g)) #> # A tibble: 3 × 2 #> species body_mass_g #> <fct> <int> #> 1 Adelie 2850 #> 2 Chinstrap 2700 #> 3 Gentoo 3950Example 3: Select, Filter, Summarize, Across
```python
df \
    .select(df.columns[0:4] + ['^performance.*$', '^salary.*$']) \
    .filter(pl.col('age') >= 30) \
    .select(pl.col(pl.Int64).mean().name.prefix('mean_'))
#> shape: (1, 5)
#> mean_age  mean_performance_rating  mean_performance_reviews_count  mean_salary  mean_salary_increase_percentage
#> f64       f64                      f64                             f64          f64
#> 46.6      4.4                      3.4                             40000.0      10.0
```

- For some reason, you have to sandwich ^ and $ around the regex in order to return those column names

```r
df |>
  select(1:3, starts_with("performance"), starts_with("salary")) |>
  filter(age >= 30) |>
  summarize(across(
    .cols = where(is.numeric),
    .fns = mean,
    .names = "mean_{.col}"))
#> # A tibble: 1 × 5
#>   mean_age mean_performance_rating mean_performance_reviews_count mean_salary
#>      <dbl>                   <dbl>                          <dbl>       <dbl>
#> 1     46.6                     4.4                            3.4       40000
#> # ℹ 1 more variable: mean_salary_increase_percentage <dbl>
```
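To make the anchoring rule concrete (my sketch, reusing the df above; `cs` is polars.selectors):

```python
df.select(pl.col("^performance.*$"))      # regex string must be wrapped in ^ and $
df.select(cs.starts_with("performance"))  # selectors avoid the anchor requirement
```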
Example 4: Basic
map_groups(source)data_dict = { 'group': ['a']*4 + ['b']*4, 'truth': [1,1,0,0]*2, 'mod_bad': [0.25,0.25,0.75,0.75]*2, 'mod_bst': [0.99,0.75,0.25,0.01]*2, 'mod_rnd': [0.5]*8, 'mod_mix': [0.99,0.75,0.25,0.01]+[0.5]*3+[0.6] } df = pl.DataFrame(data_dict) df.glimpse() #> Rows: 8 #> Columns: 6 #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b' #> $ truth <i64> 1, 1, 0, 0, 1, 1, 0, 0 #> $ mod_bad <f64> 0.25, 0.25, 0.75, 0.75, 0.25, 0.25, 0.75, 0.75 #> $ mod_bst <f64> 0.99, 0.75, 0.25, 0.01, 0.99, 0.75, 0.25, 0.01 #> $ mod_rnd <f64> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5 #> $ mod_mix <f64> 0.99, 0.75, 0.25, 0.01, 0.5, 0.5, 0.5, 0.6 df.group_by('group').agg ( pl.map_groups( exprs = ['truth', 'mod_mix'], function = lambda x: roc_auc_score(x[0], x[1]), return_dtype = pl.Float64, returns_scalar = True ) ) #> group truth #> str f64 #> "a" 1.0 #> "b" 0.25 # import polars_ds as pds # # df.group_by('group').agg ( # auroc = pds.query_roc_auc('truth', 'mod_bst') # )- Simulates calculating AUC scores for each group of a binary classification model
- Each model has 8 probability predictions along with true classification label and a grouping variable
- exprs specifies columns for the lambda function which get subsetted by indices.
- {polars-ds} has a one-liner function for this
Example 5: Multiple Column Aggregation (source)
```python
def auroc_expressions(models):                              # (1)
    for m in models:
        yield pds.query_roc_auc('truth', m).alias(m)

# could also do: [c for c in df.columns if c[:4] == 'mod_']
mods = cs.expand_selector(df, cs.starts_with('mod_'))       # (2)

df.group_by('group').agg(auroc_expressions(mods))           # (3)
#> group  mod_bad  mod_bst  mod_rnd  mod_mix
#> str    f64      f64      f64      f64
#> "b"    -0.0     1.0      0.5      0.25
#> "a"    -0.0     1.0      0.5      1.0
```

1. Write a wrapper function that handles the iteration and acts as a generator yielding the expression of interest (Also see Python, General >> Loops >> Generators)
2. Obtain the relevant selectors to pass into the function. Here, you can use the cs.expand_selector helper or any raw parsing of the column names (e.g., a list comprehension)
3. Pass the generator into the standard df.group_by(...).agg(...) flow
Joins
Quick Reference
| Operation | dplyr | polars |
|---|---|---|
| Join dataframes | `left_join(df1, df2, by=x)` | `df1.join(df2, on="x", how="left")` |

Example: Basic Left Join

```python
merged_df = df_pl.join(pop_df_pl, on="country", how="left")
```

Example: Left Join

```python
import random

df \
    .join(df_dept, on="department") \
    .select(['name', 'dept_id']) \
    .with_columns(
        employee_id = pl.format(
            '{}-{}',
            'dept_id',
            pl.Series([
                random.randint(100000, 999999) for _ in range(len(df))
            ])
        )
    )
```

```r
df |>
  left_join(df_dept, by = "department") |>
  select(name, dept_id) |>
  mutate(employee_id = map_chr(dept_id, ~ paste0(.x, "-", sample(1:9, 1))))
```
tidyr
Example: Unite
df \ .with_columns( pl.concat_str( pl.col(pl.String), separator=" " ).alias('combination_of_character') ) \ .select(pl.col('combination_of_character')) #> shape: (6, 1) #> combination_of_character #> str #> "Alice New York 123 Main St, On… #> "Bob San Francisco 123 Main St,… #> "Charlie Tokyo 456-7890, Tokyo,… #> "Ken Toronto 49494 Exchange St,… #> "Steven Lima 1010 Gb st, Lima, … #> "Carlos Cleveland 666 Heaven dr…df |> unite(col = "combination_of_character", where(is.character), sep = " ", ) |> select(combination_of_character) #> # A tibble: 6 × 1 #> combination_of_character #> <chr> #> 1 Alice New York 123 Main St, Ontario, OH 123-456-7890 alice@example.com Engineering 2010-01-01 Active 2022-05-01 #> 2 Bob San Francisco 123 Main St, Calgary, AB 987-654-3210 bob@example.com Marketing 2012-05-15 Inactive 2021-07-15 #> 3 Charlie Tokyo 456-7890, Tokyo, NY 098-765-4332 charlie@example.com Finance 2015-10-01 Active 2022-08-31 #> 4 Ken Toronto 49494 Exchange St, Toronto, ON 111-232-4141 ken@stats.org Marketing 2010-04-01 Inactive 2024-10-30 #> 5 Steven Lima 1010 Gb st, Lima, OH 505-402-6060 stephencurry@nba.com Marketing 2009-10-30 Active 2023-01-02 #> 6 Carlos Cleveland 666 Heaven dr, Cleveland, OH 909-435-1000 carlos@genius.edu Finance 2005-11-12 Active 2024-12-12Example: Pivot Wider
df \ .select(['name','address']) \ .with_columns( state = pl.col('address').str.extract(r'([A-Z]{2})$') ) \ .select('name','state') \ .pivot(on = 'state', index = 'name', values='state', aggregate_function='len') \ .with_columns( pl.col(pl.UInt32).fill_null(0) ) # Or since this just creates dummies df \ .select(['name','department']) \ .to_dummies(columns = 'department')df |> select(name, department) |> pivot_wider(id_cols = "name", names_from = "department", values_from = "department", values_fill = 0, values_fn = length, names_prefix = "department_") #> # A tibble: 6 × 4 #> name department_Engineering department_Marketing department_Finance #> <chr> <int> <int> <int> #> 1 Alice 1 0 0 #> 2 Bob 0 1 0 #> 3 Charlie 0 0 1 #> 4 Ken 0 1 0 #> 5 Steven 0 1 0 #> 6 Carlos 0 0 1
UDFs
Misc
- Notes from Python Rgonomics: User-defined functions in polars
- pipe – Allows piping of expressions and DataFrames into UDFs
- map_columns – Accepts a column selector and a single transformation to be applied to all passed columns
Example 1: Basic
pipeat column-leveldata_dict = { 'group': ['a']*4 + ['b']*4, 'x': np.arange(1,9,1), 'y': np.arange(8,0,-1), 'p': np.arange(1,9,1)/10 } df = pl.DataFrame(data_dict) df.glimpse() #> Rows: 8 #> Columns: 4 #> $ group <str> 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b' #> $ x <i64> 1, 2, 3, 4, 5, 6, 7, 8 #> $ y <i64> 8, 7, 6, 5, 4, 3, 2, 1 #> $ p <f64> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 def cap(c:pl.Expr, ceil:int = 5) -> pl.Expr: return pl.when(c > ceil).then(ceil).otherwise(c) df.with_columns(pl.col('x').pipe(cap)) #> shape: (8, 5) #> group x y p literal #> str i64 i64 f64 i64 #> "a" 1 8 0.1 1 #> "a" 2 7 0.2 2 #> "a" 3 6 0.3 3 #> "a" 4 5 0.4 4 #> "b" 5 4 0.5 5 #> "b" 6 3 0.6 5 #> "b" 7 2 0.7 5 #> "b" 8 1 0.8 5 df.with_columns(cs.numeric().pipe(cap).name.keep()) #> shape: (8, 4) #> group x y p #> str i64 i64 f64 #> "a" 1 5 0.1 #> "a" 2 5 0.2 #> "a" 3 5 0.3 #> "a" 4 5 0.4 #> "b" 5 4 0.5 #> "b" 5 3 0.6 #> "b" 5 2 0.7 #> "b" 5 1 0.8- The udf is essentially a
case_when: it truncates any values greater than 5 at 5.
- pipe takes the unquoted name of the udf as its argument.
- Use name.keep when selecting more than one column (e.g., cs.numeric()) with pipe. It prevents the outputs of the manipulated columns from all getting the same name.
Example 2: Basic pipe at df-level

```python
def calc_diffs(df:pl.DataFrame, threshhold:int = 5) -> pl.DataFrame:
    df_out = (
        df
        .with_columns(
            abs = (pl.col('x') - pl.col('y')).abs(),
            abs_gt_t = (pl.col('x') - pl.col('y')).abs() > threshhold,
        )
    )
    return df_out

df.pipe(calc_diffs)
#> group  x    y    p    abs  abs_gt_t
#> str    i64  i64  f64  i64  bool
#> "a"    1    8    0.1  7    true
#> "a"    2    7    0.2  5    false
#> "a"    3    6    0.3  3    false
#> "a"    4    5    0.4  1    false
#> "b"    5    4    0.5  1    false
#> "b"    6    3    0.6  3    false
#> "b"    7    2    0.7  5    false
#> "b"    8    1    0.8  7    true
```

- calc_diffs calculates the absolute difference between x and y and stores it in abs, then tests whether abs is greater than some threshold (abs_gt_t).
Example 3: pipe at df-level w/args

```python
def calc_diffs(
        df:pl.DataFrame,
        var1:str = 'x',
        var2:str = 'y',
        threshhold:int = 5) -> pl.DataFrame:
    df_out = (
        df
        .with_columns(
            abs = (pl.col(var1) - pl.col(var2)).abs(),
            abs_gt_t = (pl.col(var1) - pl.col(var2)).abs() > threshhold,
        )
    )
    return df_out

df.pipe(
    calc_diffs,
    var1 = 'y',
    var2 = 'x',
    threshhold = 3
)
#> group  x    y    p    abs  abs_gt_t
#> str    i64  i64  f64  i64  bool
#> "a"    1    8    0.1  7    true
#> "a"    2    7    0.2  5    true
#> "a"    3    6    0.3  3    false
#> "a"    4    5    0.4  1    false
#> "b"    5    4    0.5  1    false
#> "b"    6    3    0.6  3    false
#> "b"    7    2    0.7  5    true
#> "b"    8    1    0.8  7    true
```

- Column names are quoted in the function arguments, but not in the function code.
Example 4: Basic map_columns

```python
df.map_columns(cs.numeric(), cap)
```

- See Example 1 for the output and the cap udf definition
- A more concise alternative to pipe at the column level
Snippets
Example: Lead and Forward Fill
- If the department only has 1 row of salary data, do not shift salary
- If the department has more than 1 row of salary data, shift the salary column by -1
- Reason: there was a mistake entering data for departments with more than 1 row; apparently the actual salary value sits one row further down than it should
df \ .group_by('department') \ .agg( pl.col('name'), pl.when(pl.col('salary').len()==1).then(pl.col('salary')) .otherwise(pl.col('salary').shift(-1)) .alias('salary_shift')) \ .explode('name','salary_shift') \ .with_columns( pl.col('salary_shift').forward_fill())df |> group_by(department) |> summarize( name = name, salary_shift = case_when( n() == 1 ~ salary, TRUE ~ lead(salary) ) ) |> fill(salary_shift, .direction = "down") #> # A tibble: 6 × 3 #> # Groups: department [3] #> department name salary_shift #> <chr> <chr> <dbl> #> 1 Engineering Alice 50000 #> 2 Finance Charlie 30000 #> 3 Finance Carlos 30000 #> 4 Marketing Bob 20000 #> 5 Marketing Ken 40000 #> 6 Marketing Steven 40000Example: Map
import random df \ .select(['name','dept_id']) \ .with_columns( employee_id = pl.format( '{}-{}', 'dept_id', pl.Series([ random.randint(100000, 999999) for _ in range(len(df)) ]) ) )df |> select(name, dept_id) |> mutate(employee_id = map_chr(dept_id, ~paste0(.x, "-", sample(1000000:9999999, 1)))) #> # A tibble: 6 × 3 #> name dept_id employee_id #> <chr> <dbl> <chr> #> 1 Alice 30 30-1694470 #> 2 Bob 25 25-1696036 #> 3 Charlie 20 20-4463080 #> 4 Ken 25 25-6942432 #> 5 Steven 25 25-3012223 #> 6 Carlos 20 20-8705991Example: Split Dataframe and Loop (source)
dfs = df.partition_by('group', as_dict = True, include_key = True) for k,v in dfs.items(): print(f"{k} : {v}") #> ('a',) : shape: (4, 3) #> ┌───────┬──────┬──────┐ #> │ group ┆ x ┆ y │ #> │ --- ┆ --- ┆ --- │ #> │ str ┆ f64 ┆ f64 │ #> ╞═══════╪══════╪══════╡ #> │ a ┆ 0.99 ┆ 0.99 │ #> │ a ┆ 0.75 ┆ 0.75 │ #> │ a ┆ 0.25 ┆ 0.25 │ #> │ a ┆ 0.01 ┆ 0.01 │ #> └───────┴──────┴──────┘ #> ('b',) : shape: (4, 3) #> ┌───────┬──────┬─────┐ #> │ group ┆ x ┆ y │ #> │ --- ┆ --- ┆ --- │ #> │ str ┆ f64 ┆ f64 │ #> ╞═══════╪══════╪═════╡ #> │ b ┆ 0.99 ┆ 0.5 │ #> │ b ┆ 0.75 ┆ 0.5 │ #> │ b ┆ 0.25 ┆ 0.5 │ #> │ b ┆ 0.01 ┆ 0.6 │ #> └───────┴──────┴─────┘ grps = [k[0] for k in dfs.keys() ] # turn tuple to scalar bcs only one grouping var in key mods = [sm.OLS(d['x'].to_numpy(), sm.add_constant(d['y'].to_numpy()) ).fit() for k,d in dfs.items()] coef = [m.params[1] for m in mods] dict(zip(grps, coef)) #> {'a': np.float64(1.0000000000000004), 'b': np.float64(-6.533333333333339)}partition_bysplits the dataframe into a list of dataframes (likesplitin base R orgroup_splitin {dplyr}). Since as_dict = True, it’s actually a dictionary of dataframes- This can also be done within the unsplit dataframe using
map_batches and map_elements, but it's less efficient (see the previous section of the article for details).
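For reference, a rough sketch of that unsplit-dataframe alternative using map_groups (my addition, assuming the df from the example above; np.polyfit is swapped in for the statsmodels OLS just to keep it short):

```python
import numpy as np

# Per-group slope of x ~ y without splitting the dataframe
df.group_by('group').agg(
    slope = pl.map_groups(
        exprs = ['x', 'y'],
        function = lambda s: float(np.polyfit(s[1].to_numpy(), s[0].to_numpy(), 1)[0]),
        return_dtype = pl.Float64,
        returns_scalar = True,
    )
)
```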
Lazy Evaluation
Example: Basic
```python
import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("segment")
    .agg(pl.col("amount").mean())
    .sort("amount")
)

result = df_lazy.collect()
```
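As an aside (my addition): nothing executes until collect() or a sink is called, and the optimized plan can be inspected first:

```python
# Print the optimized query plan that .collect() would run
print(df_lazy.explain())
```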
Example: Lazily aggregate Delta lake files in AWS S3 (source)

```python
import polars as pl
from datetime import datetime
import os

t1 = datetime.now()

table_path = "s3://confessions-of-a-data-guy/BigTest/social_media_rots_brains"

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxxx"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

lazy_df = pl.scan_delta(table_path)

agg_lazy = (
    lazy_df
    .with_columns(pl.col("created_at").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S").alias("created_at_ts"))
    .with_columns(pl.col("created_at_ts").dt.date().alias("post_date"))
    .group_by("post_date")
    .agg(pl.count().alias("num_posts"))
    .sort("post_date")
)

agg_lazy.sink_csv("posts_per_day.csv")

t2 = datetime.now()
print(f"Elapsed time: {t2 - t1}")
```

- Uses Polars' lazy engine with scan_delta() to load the Delta Lake table
- Parses timestamps using str.strptime() into a Polars datetime
- Extracts the date for grouping using .dt.date()
- Performs the aggregation and sorting lazily for efficiency
- Writes out the result via sink_csv()