Pandas
Misc
- Examples of using numeric indexes to subset dfs
- indexes >> Syntax for using indexes
- select
- filtering >> .loc/.iloc
df.head(8)
to see the first 8 rows
df.info()
is like str in R
df.describe()
is like summary in R; include='all' to include string columns
Series
a vector with an index (ordered data object)
s = pd.Series([20, 21, 12], index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64
Misc
- Two Pandas Series with the same elements in a different order are not the same object
- List vs Series: list can only use numeric indexes, while the Pandas Series also allows textual indexes.
Args
- data: Different data structures can be used, such as a list, a dictionary, or even a single value.
- index: A labeled index for the elements in the series can be defined.
- If not set, the elements will be numbered automatically, starting at zero.
- dtype: Sets the data types of the series
- Useful if all data in the series are of the same data type
- name: Series can be named.
- Useful if the Series is to be part of a DataFrame. Then the name is the corresponding column name in the DataFrame.
- copy: True or False; specifies whether the passed data should be saved as a copy or not
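A minimal sketch putting these arguments together (values and labels are made up):

import pandas as pd

s = pd.Series(
    data=[1.5, 2.0, 3.25],     # the data
    index=["A", "B", "C"],     # labeled index
    dtype="float64",           # explicit dtype
    name="price",              # becomes the column name in a DataFrame
    copy=True,                 # store a copy of the passed data
)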
Subset:
series_1["A"]
or series_1[0]
Find the index of a value:
list(series_1).index(<value>)
Change value of an index:
series_1["A"] = 1
(1 is now the value for index A)
Add a value to a series:
series_1["D"] = "fourth_element"
(D is the next index in the sequence)
Filter by condition(s):
series_1[series_1 > 4]
or series_1[series_1 > 4][series_1 != 8]
dict to Series:
pd.Series(dict_1)
Series to df:
pd.DataFrame([series_1, series_2, series_3])
- Series objects should either all have the same index or no index. Otherwise, a separate column will be created for each different index, for which the other rows have no value
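A small sketch (made-up values) of the mismatched-index case described above — the unmatched label becomes its own column with NaNs for the other rows:

import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "c"])  # "c" doesn't match "b"
pd.DataFrame([s1, s2])
#      a    b    c
# 0  1.0  2.0  NaN
# 1  3.0  NaN  4.0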
Mutability
df = pd.DataFrame({"name": ["bert", "albert"]})

series = df["name"]  # shallow copy
series[0] = "roberta"  # <-- this changes the original DataFrame

series = df["name"].copy(deep=True)
series[0] = "roberta"  # <-- this does not change the original DataFrame

series = df["name"].str.title()  # not a copy whatsoever
series[0] = "roberta"  # <-- this does not change the original DataFrame
- Creating a deep copy will allocate new memory, so it’s good to reflect whether your script needs to be memory-efficient.
Categoricals
Misc - Also see - Conversions for converting between types - Optimizations >> Memory Optimizations >> Variable Type
Categorical (docs)
python version of factors in R
- R’s levels are always of type string, while categories in pandas can be of any dtype.
- R allows for missing values to be included in its levels (pandas’ categories). pandas does not allow NaN categories, but missing values can still be in the values.
- See
cat.codes
below
Create a categorical series:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
Create df of categoricals
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
df["A"]
0    a
1    b
2    c
3    a
Specify categories and add to dataframe (also see Set Categories below)
raw_cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
)
df["B"] = raw_cat
df
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN
- Note that categories not in the specification get NaNs
See categories, check if ordered:
cat.categories
,cat.ordered
s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')
s.cat.ordered
Out[62]: False
Set categories for a categorical:
s = s.cat.set_categories(["one", "two", "three", "four"])
Rename categories w/
cat.rename_categories
# with a list of new categories
new_categories = ["Group %s" % g for g in s.cat.categories]
s = s.cat.rename_categories(new_categories)
# with a dict
s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
s
0    Group a
1    Group b
2    Group c
3    Group a
Add a category (doesn’t have to be a string, e.g. 4):
s = s.cat.add_categories([4])
Remove categories
s = s.cat.remove_categories([4])
s.cat.remove_unused_categories()
Ordered
Create ordered categoricals
from pandas.api.types import CategoricalDtype
s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = s.astype(cat_type)
s_cat
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

# alt
s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
Reorder:
cat.reorder_categories
s = pd.Series([1, 2, 3, 1], dtype="category")
s = s.cat.reorder_categories([2, 3, 1], ordered=True)
s
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
Category codes:
.cat.codes
s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
s
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']

s.cat.codes
0    0
1    1
2   -1
3    0
dtype: int8
Operations
Read
Misc
- For large datasets, better to use data.table::fread (see Python, Misc >> Misc)
CSV
df = pd.read_csv('wb_data.csv', header=0)
- “usecols=[‘col1’, ‘col8’]” for only reading certain columns
Read and process data in chunks
for chunk in pd.read_csv("dummy_dataset.csv", chunksize=50000):
    print(type(chunk))
    # process data

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
- Issue: Cannot perform operations that need the entire DataFrame. For instance, say you want to perform a
groupby()
operation on a column. Here, it is possible that rows corresponding to a group may lie in different chunks.
Write
- CSV:
df.to_csv("file.csv", sep = "|", index = False)
- “sep” - column delimiter (assume comma is default)
- “index=False” - instructs Pandas to NOT write the index of the DataFrame in the CSV file
Create/Copy
- Dataframes
Syntaxes
df = pd.DataFrame({
    "first_name": ["John", "jane", "emily", "Matt", "Alex"],
    "last_name": ["Doe", "doe", "uth", "Dan", "mir"],
    "group": ["A-12", "B-15", "A-18", "A-12", "C-15"],
    "salary": ["$75000", "$72000", "£45000", "$77000", "£58,000"]
})

df = pd.DataFrame(
    [
        (1, 'A', 10.5, True),
        (2, 'B', 10.0, False),
        (3, 'A', 19.2, False)
    ],
    columns=['colA', 'colB', 'colC', 'colD']
)

data = pd.DataFrame([["A", 1], ["A", 2], ["B", 1],
                     ["C", 4], ["A", 10], ["B", 7]],
                    columns=["col1", "col2"])

df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
Create a copy of a df:
df2 = df1.copy()
- deep=True (default) means it’s a deep rather than shallow copy. A deep copy is its own object and manipulations to it don’t affect the original.
Indexes
Misc
- Row labels
- DataFrame indexes do NOT have to be unique
- The time it takes to search a dataframe is shorter with unique index
- By default, the index is a numeric row index
Set
df = df.set_index('col1')
df.set_index(column, inplace=True)
- previous index is discarded
- inplace=True says modify the original df
Change indexes but keep previous index
df.reset_index()  # converts index to a column, index now the default numeric
df = df.set_index('col2')
reset_index
- inplace=True says modify the original df (default = False)
- drop=True discards old index
Change to new indices or expand indices:
df.reindex
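A minimal sketch of reindex (labels hypothetical); labels not present in the current index get NaN rows unless fill_value is supplied:

df.reindex(["row_a", "row_b", "row_new"])           # "row_new" becomes a NaN row
df.reindex(columns=["col1", "col2"], fill_value=0)  # expand columns, fill gaps with 0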
See indexes:
df.index.names
Sort:
df.sort_index(ascending=False)
Syntax for using indexes
- Misc
- “:” is not mandatory as a placeholder for the column position
.loc
df.loc[one_row_label] # Series
df.loc[list_of_row_labels] # DataFrame
df.loc[:, one_column_label] # Series
df.loc[:, list_of_column_labels] # DataFrame
df.loc[first_row:last_row]
.iloc
df.iloc[one_row_position, :] # Series
df.iloc[list_of_row_positions, :] # DataFrame
Example:
df.iloc[0:3]
or df.iloc[:3]
- outputs rows 0, 1, 2 (end point NOT included)
Example:
df.iloc[75:]
- (e.g. 78 rows) outputs rows 75, 76, 77 (end point, i.e., last row, NOT included)
Example:
df.loc['The Fool' : 'The High Priestess']
- outputs all rows from “The Fool” to “The High Priestess” (end point included)
Example: Filtering cells
df.loc[
    ['King of Wands', 'The Fool', 'The Devil'],
    ['image', 'upright keywords']
]
- index values: ‘King of Wands’, etc.
- column names: ‘image’, etc.
Multi-indexes
Use list of variables to create multi-index:
df.set_index(['family', 'vowel_inventories'], inplace=True)
- “family” is the 0th index; “vowel_inventories” is the 1st index
See individual index values:
df.index.get_level_values(0)
- Gets the 0th index values (e.g. “family”)
Sort by specific level:
df.sort_index(level=1)
- Not specifying a level means df will be sorted first by level 0, then level 1, etc.
Replace index values:
df.rename(index={"Arawakan" : "Maipurean", "old_value" : "new_value"})
- Replaces instances in all levels of the multi-index
- For only a specific level, use level='level_name'
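e.g., a sketch assuming a multi-index level named "family":

df.rename(index={"Arawakan": "Maipurean"}, level="family")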
Subsetting via multi-index
- Basic
row_label = ('Indo-European', '2 Average (5-6)')
df.loc[row_label, :].tail(3)
Note: Using “:” or specifying columns may be necessary using a multi-index
With multiple multi-index values
row_labels = [
    ('Arawakan', '2 Average (5-6)'),
    ('Uralic', '2 Average (5-6)'),
]
row_labels = (['Arawakan', 'Uralic'], '2 Average (5-6)')  # alternate method
df.loc[row_labels, :]
With only 1 level of the multi-index
row_labels = pd.IndexSlice[:, '1 Small (2-4)']
df.loc[row_labels, :]

# drops other level of the multi-index
df.loc['1 Small (2-4)']

# more verbose
df.xs('1 Small (2-4)', level='vowel_inventories', drop_level=False)
- Ignores level 0 of the multi-index and filters only by level 1
- Could also use an alias, e.g.
idx = pd.IndexSlice
if writing the whole thing gets annoying
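e.g. (reusing the level value from above):

idx = pd.IndexSlice
df.loc[idx[:, '1 Small (2-4)'], :]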
Select
Also see Indexes
Select by name:
df[["name", "note"]]
df.filter(["Type", "Price"])
dat.loc[:, ["f1", "f2", "f3"]]

target_columns = df.columns.str.contains('\+')  # boolean array: Trues, Falses
X = df.iloc[:, ~target_columns]
Y = df.iloc[:, target_columns]
- “:” is a placeholder in this case
Subset
ts_df
     t-3   t-2   t-1   t-0
10  10.0  11.0  12.0  13.0
11  11.0  12.0  13.0  14.0
12  12.0  13.0  14.0  15.0
All but last column
ts_df.iloc[:, :-1]
     t-3   t-2   t-1
10  10.0  11.0  12.0
11  11.0  12.0  13.0
12  12.0  13.0  14.0
- For
[:, :-2]
, the last 2 columns would not be included
All but first column
ts_df.iloc[:, 1:]
     t-2   t-1   t-0
10  11.0  12.0  13.0
11  12.0  13.0  14.0
12  13.0  14.0  15.0
- i.e. the 0th indexed column is not included
- For
[:, 2:]
, the 0th and 1st indexed column would not be included
All columns between the 2nd col and 4th col
ts_df.iloc[:, 1:4]
     t-2   t-1   t-0
10  11.0  12.0  13.0
11  12.0  13.0  14.0
12  13.0  14.0  15.0
- left endpoint (1st index) is included but the right endpoint (4th index) is not
- i.e. the 1st, 2nd, and 3rd indexed columns are included
All columns except the first 2 cols and the last col
ts_df.iloc[:, 2:-1]
     t-1
10  12.0
11  13.0
12  14.0
- i.e. 0th and 1st indexed columns not included and “-1” says don’t include the last column
Rename columns
df.rename({'variable': 'Year', 'value': 'GDP'}, axis=1, inplace=True)
- "variable" is renamed to "Year" and "value" is renamed to "GDP"
Delete columns
df.drop(columns = ["col1"])
Filtering
Misc
Also see Indexes
Grouped and regular pandas data frames have different APIs, so it’s not a guarantee that a method with the same name will do the same thing. 🙃
On value:
df_filtered = data[data.col1 == "A"]
Via query:
df_filtered = data.query("col1 == 'A'")
More complicated
(purchases
    .query("amount <= amount.median() * 10"))
# or
(purchases
    .loc[lambda df: df["amount"] <= df["amount"].median() * 10])
By group
(purchases
    .groupby("country")
    .apply(lambda df: df[df["amount"] <= df["amount"].median() * 10])
    .reset_index(drop=True))
- Filters each country by the median of its amount times 10.
- The result of apply is a dataframe, but that dataframe has the index, country (i.e. rownames). The problem is that the result also has a country column. Therefore using reset_index would move the index to a column, but since there’s already a country column, it will give you a ValueError. Including drop=True fixes this by dropping the index entirely.
On multiple values:
df_filtered = data[(data.col1 == "A") | (data.col1 == "B")]
By pattern:
df[df.profession.str.contains("engineer")]
- Only rows of profession variable with values containing “engineer”
- Also available
str.startswith
:df_filtered = data[data.col1.str.startswith("Jo")]
str.endswith
:df_filtered = data[data.col1.str.endswith("n")]
- If the column has NAs or NaNs, specify the arg na=False to ignore them, else you’ll get a ValueError
- Methods are case-sensitive unless you specify case=False
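e.g., a sketch combining both arguments:

df_filtered = df[df.profession.str.contains("engineer", case=False, na=False)]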
By negated pattern:
df[~df.profession.str.contains("engineer")]
By character type:
df_filtered = data[data.col1.str.isnumeric()]
- Also available
- upper-case: isupper()
- lower-case: islower()
- alphabetic: isalpha()
- digits: isdigit()
- decimal: isdecimal()
- whitespace: isspace()
- titlecase: istitle()
- alphanumeric: isalnum()
By month of a datetime variable:
df[df.date_of_birth.dt.month==11]
- datetime variable has yy-mm-dd format; the dt.month accessor is used to filter rows of “date_of_birth” to the 11th month
Conditional:
df[df.note > 90]
- By string length:
df_filtered = data[data.col1.str.len() > 4]
Multi-conditional
df[(df.date_of_birth.dt.year > 2000) &
   (df.profession.str.contains("engineer"))]

df.query("Distance < 2 & Rooms > 2")
%in%:
df[df.group.isin(["A","C"])]
Smallest 2 values of note column:
df.nsmallest(2, "note")
nlargest
also available
Only rows in a column with NA values:
df[df.profession.isna()]
Only rows in a column that aren’t NA:
df[df.profession.notna()]
Find rows by index value:
df.loc['Tony']
loc
is for the value, iloc
is for position (numerical or logical)
- See Select columns Example for the iloc logical case
With groupby
data_grp = data.groupby("Company Name")
data_grp.get_group("Amazon")
Return 1st 3 rows (and 2 columns):
df.loc[:3, ["name","note"]]
Return 1st 3 rows and 3rd column:
df.iloc[:3, 2]
Using index to filter single cell or groups of cells
df.loc[
    ['King of Wands', 'The Fool', 'The Devil'],
    ['image', 'upright keywords']
]
- index values: ‘King of Wands’, etc. (aka row names)
- column names: ‘image’, etc.
Mutate
Looks like all these functions do the same damn thing
assign
e.g. Add columns
df["col3"] = df["col1"] + df["col2"]
df = df.assign(col3 = df["col1"] + df["col2"])
apply
an operation to column(s)
Function
df[['colA', 'colB']] = df[['colA', 'colB']].apply(np.sqrt)
df["col3"] = df.apply(my_func, axis=1)
- docs: Series, DataFrame
- axis - arg for applying a function across rowwise or columnwise (default is 0, which is for columnwise)
- The fact that mutated columns are assigned to the columns of the df are what keeps it from being like summarize
- If np.sum were the function, then the output would be two numbers and this probably wouldn’t work.
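A small sketch (column names hypothetical) of the axis behavior:

df[["colA", "colB"]].apply(np.sum, axis=0)                    # one value per column (columnwise)
df["row_total"] = df[["colA", "colB"]].apply(np.sum, axis=1)  # one value per row (rowwise)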
Within a chain
(purchases
    .groupby("country")
    .apply(lambda df: (df["amount"] - df["discount"]).sum())
    .reset_index()
    .rename(columns={0: "total"}))
- When the result of the apply function is coerced to a dataframe by reset_index, the column will be named “0”. rename is used to rename it to “total”.
map
df['colE'] = df['colE'].map({'Hello': 'Good Bye', 'Hey': 'Bye'})
- Each case of “Hello” is changed to “Good Bye”, etc.
- If you give it a dict, it acts like a case_when or plyr::mapvalues or maybe recode
- Values in the column but aren’t included in the dict get NaNs
- This also can take a lambda function
- Docs (only applies to Series, so I guess that means only 1 column at a time(?))
applymap
df[['colA', 'colD']] = df[['colA', 'colD']].applymap(lambda x: x**2)
- Docs
- Says it applies a function elementwise, so it’s probably performing a loop
- Therefore better to avoid if a vectorized version of the function is available
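e.g., the vectorized equivalent of the applymap example above:

df[['colA', 'colD']] = df[['colA', 'colD']] ** 2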
Pivot
pivot_longer
Example
year_list = list(df.iloc[:, 4:].columns)
df = pd.melt(df,
             id_vars=['Country Name', 'Series Name', 'Series Code', 'Country Code'],
             value_vars=year_list)
- year_list has the variable names of the columns you want to merge into 1 column
pivot_wider
Example
# Step 1: add a count column to be able to summarize when grouping
long_df['count'] = 1
# Step 2: group by date and type and sum
grouped_long_df = long_df.groupby(['date', 'type']).sum().reset_index()
# Step 3: build wide format from long format
wide_df = grouped_long_df.pivot(index='date', columns='type', values='count').reset_index()
- “long_df” has two columns: type and date
Bind_Rows
df1 = pd.DataFrame({'strata': 1, 'y': np.random.normal(loc=10, scale=1, size=size)})
df2 = pd.DataFrame({'strata': 2, 'y': np.random.normal(loc=15, scale=2, size=size)})
df3 = pd.DataFrame({'strata': 3, 'y': np.random.normal(loc=20, scale=3, size=size)})
df4 = pd.DataFrame({'strata': 4, 'y': np.random.normal(loc=25, scale=4, size=size)})
df = pd.concat([df1, df2, df3, df4])

# or from a list
df_ls = [df1, df2, df3, df4]
df_all = pd.concat([df_ls[i] for i in range(len(df_ls))], axis=0)
- 2 variables are created, “strata” and “y”, then merged into a df
- Make sure no rows are duplicates
df_loans = pd.concat([df, df_pdf], verify_integrity=True)
Bind_Cols
pd.concat([top_df, more_df], axis=1)
Count
value_counts:
df["col_name"].value_counts()
- output arranged in descending order of frequencies
- faster than groupby + size
- args
- normalize=False
- dropna=True
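e.g.:

df["col_name"].value_counts(normalize=True, dropna=False)  # proportions, with NaN counted as its own value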
groupby + count
df.groupby("Product_Category").size()
df.groupby("Product_Category").count()
- “size” includes null values
- “count” doesn’t include null values
- arranged by the index column
Arrange
df.sort_values(by = "col_name")
- args
- ascending=True
- ignore_index=False (True will reset the index)
Group_By
To get a dataframe, instead of a series, after performing a grouped calculation, apply
reset_index()
to the result.
Find number of groups
df_group = df.groupby("Product_Category")
df_group.ngroups
groupby + count
df.groupby("Product_Category").size()
df.groupby("Product_Category").count()
- “size” includes null values
- “count” doesn’t include null values
Only groupby observed categories
df.groupby("col1", observed=True)["col2"].sum()
col1
A    49
B    43
Name: col2, dtype: int64
- col1 is a category type and has 3 levels specified (A, B, C) but the column only has As and Bs
- observed=False (default) would include C and a count of 0.
Count NAs
df["col1"] = df["col1"].astype("string")
df.groupby("col1", dropna=False)["col2"].sum()
# output
col1
A       49.0
B       43.0
<NA>    30.0
Name: col2, dtype: float64
- ** Will not work with categorical types **
- So categorical variables, you must convert to strings to be able to group and count NAs
- ** Will not work with categorical types **
Combo
df.groupby("col3").agg({"col1": sum, "col2": max})
      col1  col2
col3
A        1     2
B        8    10
Aggregate with only 1 function
df.groupby("Product_Category")[["UnitPrice(USD)", "Quantity"]].mean()
- See summarize for aggregate by more than 1 function
Extract a group category
df_group = df.groupby("Product_Category")
df_group.get_group('Healthcare')
# or
df[df["Product_Category"] == 'Home']
Group objects are iterable
df_group = df.groupby("Product_Category")
for name_of_group, contents_of_group in df_group:
    print(name_of_group)
    print(contents_of_group)
Get summary stats on each category conditional on a column
df.groupby("Product_Category")[["Quantity"]].describe()
Summarize
agg
can only operate on one column at a time, so for transformations involving multiple columns (e.g. col3 = col1 - col2), see the Mutate section.
With groupby and agg
df.groupby("cat_col_name").agg(new_col_name = ("num_col_name", "func_name"))
- To reset the index (i.e. ungroup)
- use the .groupby arg as_index=False
- use .reset_index() at the end of the chain (outputs a dataframe instead of a series)
  - arg: drop=True discards the index instead of inserting it back as a column
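e.g., a sketch using as_index=False ("mean" chosen as an example function):

df.groupby("cat_col_name", as_index=False).agg(new_col_name=("num_col_name", "mean"))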
Group aggregate with more than 1 function
df.groupby("Product_Category")[["Quantity"]].agg([min, max, sum, 'mean'])
- When you mention ‘mean’ (with quotes), .aggregate() searches for a function mean belonging to pd.Series i.e. pd.Series.mean().
- Whereas, if you mention mean (without quotes), .aggregate() will search for a function named mean in default Python, which is unavailable and will throw a NameError exception.
Use a different aggregate function on specific columns
function_dictionary = {'OrderID': 'count', 'Quantity': 'mean'}
df.groupby("Product_Category").aggregate(function_dictionary)
Filter by Group, then Summarize by Group (link)
Option 1
(purchases
    .groupby("country")
    .apply(lambda df: df[df["amount"] <= df["amount"].median() * 10])
    .reset_index(drop=True)
    .groupby("country")
    .apply(lambda df: (df["amount"] - df["discount"]).sum())
    .reset_index()
    .rename(columns={0: "total"}))
- Filters each country by the median of its amount times 10.
- For each country, subtracts a discount from amount, then sums and renames the column from “0” to “total”.
Option 2
(purchases
    .assign(country_median=lambda df:
        df.groupby("country")["amount"].transform("median")
    )
    .query("amount <= country_median * 10")
    .groupby("country")
    .apply(lambda df: (df["amount"] - df["discount"]).sum())
    .reset_index()
    .rename(columns={0: "total"}))
- Here the median amount per country is calculated first and assigned to each row in purchases.
- query is used to filter by each country’s median.
- Then, apply and groupby are used in the same manner as option 1.
Join
Using merge:
pd.merge(df1,df2)
- More versatile than
join
- Automatically detects a common column
- Method: “how = ‘inner’” (i.e. default is inner join)
- ‘outer’ , ‘left’ , ‘right’ are available
- On columns with different names
- on = “col_a”
- left_on = ‘left df col name’
- right_on = ‘right df col name’
- copy = True (default)
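A sketch of the differently-named-keys case (key names hypothetical):

pd.merge(df1, df2, how='left', left_on='left_id', right_on='right_id')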
Using join
df1.set_index('Course', inplace=True)
df2.set_index('Course', inplace=True)
df3 = df1.join(df2, on='Course', how='left')
- instance method that joins on the indexes of the dataframes
- The column that we match on for the left dataframe doesn’t have to be its index. But for the right dataframe, the join key must be its index
- Can use multiple columns as the index by passing a list, e.g. [“Course”, “Student_ID”]
- * indexes do NOT have to be unique *
- Faster than merge
Distinct
- Find number of unique values:
data.Country.nunique()
- Display unique values:
df["col3"].unique()
- Create a boolean column to indicate duplicated rows:
df.duplicated(keep=False)
- Check for duplicate ids:
df_loans[df_loans.duplicated(keep=False)].sort_index()
- Count duplicated ids:
df_check.index.duplicated().sum()
- Drop duplicated rows:
df.drop_duplicates()
Replace values
In the whole df
1 value
df.replace(to_replace='?', value=np.nan, inplace=True)
- replaces all values == ? with NaN
- faster than loc method
Multiple values
df.replace(["Male", "Female"], ["M", "F"], inplace=True)  # list
df.replace({"United States": "USA", "US": "USA"}, inplace=True)  # dict
Only in specific columns
df.replace(
    {
        "education": {"HS-grad": "High school", "Some-college": "College"},
        "income": {"<=50K": 0, ">50K": 1},
    },
    inplace=True,
)
- replacement only occurs in “education” and “income” columns
Using Indexes
Example (2 rows, 1 col):
df.loc[['Four of Pentacles', 'Five of Pentacles'], 'suit'] = 'Pentacles'
- index values: “Four of Pentacles”, “Five of Pentacles”
- column name: “suit”
- Replaces the values in those cells with the value “Pentacles”
Example (1 row, 2 cols):
df.loc['King of Wands', ['suit', 'reversed_keywords']] = [
    'Wands', 'impulsiveness, haste, ruthlessness'
]
- index value: “King of Wands”
- column names: “suit”, “reversed_keywords”
- In “suit,” replaces the value with “Wands”
- In “reversed_keywords,” replaces the value with “impulsiveness, haste, ruthlessness”
Replace Na/NaNs in a column with a constant value
df['col_name'].fillna(value=0.85, inplace=True)
Replaces Na/NaNs in a column with the value of a function/method
df['price'].fillna(value=df.price.median(), inplace=True)
Replaces Na/NaNs in a column with a group value (e.g. group by fruit, then use price median)
# median
df['price'].fillna(df.groupby('fruit')['price'].transform('median'), inplace=True)
Forward-Fill
df['price'].fillna(method='ffill', inplace=True)
df['price'] = df.groupby('fruit')['price'].ffill(limit=1)
df['price'] = df.groupby('fruit')['price'].ffill()
- forward-fill: fills Na/NaN with previous non-Na/non-NaN value
- forward-fill, limited to 1: only fills with the previous value if there’s a non-Na/non-NaN 1 spot behind it.
- May leave NAs/NaNs
- forward-fill by group
Backward-Fill
df['price'].fillna(method='bfill', inplace=True)
- backward-fill: fills Na/NaN with next non-Na/non-NaN value
Alternating between backward-fill and forward-fill
df['price'] = df.groupby('fruit')['price'].ffill().bfill()
- alternating methods: for the 1st group a forward fill is performed; for the next group, a backward fill is performed; etc.
Interpolation
df['price'].interpolate(method='linear', inplace=True)
df['price'] = df.groupby('fruit')['price'].apply(lambda x: x.interpolate(method='linear'))          # by group
df['price'] = df.groupby('fruit')['price'].apply(lambda x: x.interpolate(method='linear')).bfill()  # by group with backwards-fill
- Interpolation (e.g. linear)
- May leave NAs/NaNs
- Interpolation by group
- Interpolation + backwards-fill
- Interpolation (e.g. linear)
Apply a conditional: says fill na with mean_price where “weekday” column is TRUE; if FALSE, fill with mean_price*1.25
mean_price = df.groupby('fruit')['price'].transform('mean')
df['price'].fillna(mean_price.where(cond=df.weekday, other=mean_price*1.25), inplace=True)
Strings
Misc
- regex is slow
- Stored as “object” type but “StringDtype” is available. This new Dtype is optional for now but may become required in the future
- Regex with Examples in python and pandas
Filter (see Filter section)
Replace pattern using regex
df['colB'] = df['colB'].str.replace(r'\D', '')
df['colB'] = df['colB'].replace(r'\D', r'', regex=True)
- “r” indicates you’re using regex
- replaces all non-numeric patterns (e.g. letters, symbols) with an empty string
- Replace pattern with list comprehension (more efficient than loops)
With {{re}}
import re
df['colB'] = [re.sub('[^0-9]', '', x) for x in df['colB']]
- re is the regular expressions library
- Replaces everything that is not a number with an empty string (the caret inside the brackets is a negation)
Split w/list output
>>> df["group"].str.split("-")
0    [A, 1B]
1    [B, 1B]
2    [A, 1C]
3    [A, 1B]
4    [C, 1C]
- “-” is the delimiter. A-1B –> [A, 1B]
Split into separate columns
df["group1"] = df["group"].str.split("-", expand=True)[0]
df["group2"] = df["group"].str.split("-", expand=True)[1]
- expand = True splits into separate columns
- BUT you have to manually create the new columns in the old df by assigning each column to a new column name in the old df.
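Alternatively, both columns can be assigned in one step (a common idiom, splitting only once):

df[["group1", "group2"]] = df["group"].str.split("-", expand=True)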
Concatenate strings from a list
words = ['word'] * 100000  # ['word', 'word', ...]
sentence = "".join(words)
- More efficient than “+” operator
Concatenate string columns
List comprehension is fastest
df['all'] = [p1 + ' ' + p2 + ' ' + p3 + ' ' + p4
             for p1, p2, p3, p4 in zip(df['a'], df['b'], df['c'], df['d'])]
“+” operator with a space, ” “, as the delimiter
df['all'] = df['a'] + ' ' + df['b'] + ' ' + df['c'] + ' ' + df['d']
- Also fast and have read that this is most efficient for larger datasets
df['colE'] = df.colB.str.cat(df.colD)
can be used for relatively small datasets (up to 100–150 rows)

arr1 = df['owner'].array
arr2 = df['gender'].array
arr3 = []
for i in range(len(arr1)):
    if arr2[i] == 'M':
        arr3.append('Mr. ' + arr1[i])
    else:
        arr3.append('Ms. ' + arr1[i])
df['name5'] = arr3
Vectorizes columns then concatenates with a
+
For-loop w/vectorization was faster than
apply
+ list comprehension or for-loop + itertuples
or for-loop +iterrows
Concatenate string and non-string columns
df['colE'] = df['colB'].astype(str) + '-' + df['colD']
Extract pattern with list comprehension
import re
df['colB'] = [re.search('[0-9]', x)[0] for x in df['colB']]
- re is the regular expressions library
- extracts the first numeric character from each value (re.search only returns the first match)
Extract all instances of pattern into a list
results_ls = re.findall(r'\d+', s)
- Finds each number in string, “s,” and outputs into a list
- Can also use
re.finditer
- fyi
re.search
only returns the first match
Extract pattern using regex
df['colB'] = df['colB'].str.extract(r'(\d+)', expand=False).astype("int")
- “r” indicates you’re using regex
- extracts all numeric patterns and changes type to integer
Remove pattern by
map
loopfrom string import ascii_letters 'colB'] = df['colB'].map(lambda x: x.lstrip('+-').rstrip(ascii_letters)) df[
- pluses or minuses are removed if they are the leading character
lstrip
removes leading characters that match characters providedrstrip
removes trailing characters that match characters provided
Does a string contain one or more digits
any(c.isdigit() for c in s)
Conversions
Convert dict to pandas df (**slow for large lists**)
df = pd.DataFrame.from_dict(acct_dict, orient="index", columns=["metrics"])
result_df = pd.DataFrame.from_dict(search.cv_results_, orient='columns')
print(result_df.columns)
- key of the dict is used as the index and value is column named “metrics”
Convert pandas df to ndarray
ndarray = df.to_numpy()
# using numpy
ndarray = np.asarray(df)
Convert df to list or dict
df
   ColA ColB  ColC
0     1    A     4
1     2    B     5
2     3    C     6

result = df.values.tolist()
[[1, 'A', 4], [2, 'B', 5], [3, 'C', 6]]

df.to_dict()
{'ColA': {0: 1, 1: 2, 2: 3},
 'ColB': {0: 'A', 1: 'B', 2: 'C'},
 'ColC': {0: 4, 1: 5, 2: 6}}

df.to_dict("list")
{'ColA': [1, 2, 3], 'ColB': ['A', 'B', 'C'], 'ColC': [4, 5, 6]}

df.to_dict("split")
{'index': [0, 1, 2],
 'columns': ['ColA', 'ColB', 'ColC'],
 'data': [[1, 'A', 4], [2, 'B', 5], [3, 'C', 6]]}
Convert string variable to datetime
df.date_of_birth = df.date_of_birth.astype("datetime64[ns]")
df['Year'] = pd.to_datetime(df['Year'])
Convert all “object” type columns to “category”
for col in X.select_dtypes(include=['object']):
    X[col] = X[col].astype('category')
Convert column to numeric
df['GDP'] = df['GDP'].astype(float)
Sample and Simulate
Simulate
Random normal variable (and group variable named strata)
df1 = pd.DataFrame({'strata': 1, 'y': np.random.normal(loc=10, scale=1, size=size)})
- 2 columns “strata” (all 1s) and “y”
- loc = mean, scale = sd, size = number of observations to create
Sample
Rows
df.sample(n=15, replace=True, random_state=2)     # with replacement
sample_df = df.sample(int(len(tps_df) * 0.2))     # sample 20% of the data
- This method is faster than sampling using random indices with NumPy
Rows and columns
tps.sample(5, axis=1).sample(7, axis=0)
- 5 columns and 7 rows
Chaining
Need to encapsulate code in parentheses
# to assign to an object
new_df = (
    melb
    .query("Distance < 2 & Rooms > 2")   # query equals filter in Pandas
    .filter(["Type", "Price"])           # filter equals select in Pandas
    .groupby("Type")
    .agg(["mean", "count"])              # calcs average price and row count for each Type; creates subcolumns mean and count under Price
    .reset_index()                       # converts matrix to df
    .set_axis(["Type", "averagePrice", "numberOfHouses"],  # renames Price to averagePrice and count to numberOfHouses
              axis=1,
              inplace=False)
    .assign(averagePriceRounded=lambda x: x["averagePrice"].round(1))  # assign equals mutate in Pandas (?)
    .sort_values(by=["numberOfHouses"],
                 ascending=False)
)
- if
agg["mean"]
then there wouldn’t be a subcolumn mean, just the values of Price would be the mean
Using
pipe
args
- func: Function to apply to the Series/DataFrame
- args: Positional arguments passed to func
- kwargs: Keyword arguments passed to func
Returns: object, the return type of func
Syntax
def f1(df, arg1):
    # do something
    return  # a dataframe

def f2(df, arg2):
    # do something
    return  # a dataframe

def f3(df, arg3):
    # do something
    return  # a dataframe

df = pd.DataFrame(..)  # some dataframe
df.pipe(f3, arg3=arg3).pipe(f2, arg2=arg2).pipe(f1, arg1=arg1)
- function 3 (f3) is executed then function 2 then function 1
Crosstab
Example
df = pd.DataFrame([["A", "X"],
                   ["B", "Y"],
                   ["C", "X"],
                   ["A", "X"]],
                  columns=["col1", "col2"])
print(pd.crosstab(df.col1, df.col2))
col2  X  Y
col1
A     2  0
B     0  1
C     1  0
Pivot Table
Example
print(df)
    Name  Subject  Marks
0   John    Maths      6
1   Mark    Maths      5
2  Peter    Maths      3
3   John  Science      5
4   Mark  Science      8
5  Peter  Science     10
6   John  English     10
7   Mark  English      6
8  Peter  English      4

pd.pivot_table(df,
               index=["Name"],
               columns=["Subject"],
               values='Marks',
               fill_value=0)
Subject  English  Maths  Science
Name
John          10      6        5
Mark           6      5        8
Peter          4      3       10
Example: drop lowest score for each letter grade, then calculate the average score for each letter grade
grades_df.pivot_table(index='name',
                      columns='letter grade',
                      values='score',
                      aggfunc=lambda series: (sorted(list(series))[-1] + sorted(list(series))[-2]) / 2)
letter grade     A     B
name
Arif          96.5  87.0
Kayla         95.5  84.0
- grades_df
- 2 names (“name”)
- 6 scores (“score”)
- Only 2 letter grades associated with these scores (“letter grade”)
- index: each row will be a “name”
- columns: each column will be a “letter grade”
- values: value in the cells will be from the “score” column according to each combination columns in the index and columns args
- aggfunc: uses a lambda to compute the aggregated values
- “series” is used as the variable in the lambda function
- sorts series (ascending), takes the top two values (using negative list indexing), and averages them
- Iterate over a df
- Better to use a vectorized solution if possible
Iteration
iterrows
def salary_iterrows(df):
    salary_sum = 0
    for index, row in df.iterrows():
        salary_sum += row['Employee Salary']
    return salary_sum / df.shape[0]

salary_iterrows(data)
itertuples
def salary_itertuples(df):
    salary_sum = 0
    for row in df.itertuples():
        salary_sum += row._4
    return salary_sum / df.shape[0]

salary_itertuples(data)
- Faster than iterrows
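For this particular calculation, the vectorized equivalent (assuming the same "Employee Salary" column) avoids row iteration entirely and is typically much faster:

data['Employee Salary'].mean()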
Time Series
Misc
- Also see
- Feature Engineering, Time Series >> Misc has a list of python libraries for preprocessing
- A Collection of Must-Know Techniques for Working with Time Series Data in Python
- Collection of preprocessing recipes
- {{datetime}} in bkmks
Operations
Load and set index frequency
Example
# Load the PCE and UMCSENT datasets
df = pd.read_csv(
    filepath_or_buffer='UMCSENT_PCE.csv',
    header=0,
    index_col=0,
    infer_datetime_format=True,
    parse_dates=['DATE']
)
# Set the index frequency to 'Month-Start'
df = df.asfreq('MS')
- “header=0” is default, says 1st row of file is the column names
- “index_col=0” says use the first column as the df index
- “infer_datetime_format=True” says infer the format of the datetime strings in the columns
- “parse_dates=[‘DATE’]” says convert “DATE” to datetime and format
Create date variable
Example: w/
date_range
# DataFrame
date_range = pd.date_range('1/2/2022', periods=24, freq='H')
sales = np.random.randint(100, 400, size=24)
sales_data = pd.DataFrame(
    sales,
    index=date_range,
    columns=['Sales']
)

# Series
rng = pd.date_range("1/1/2012", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
Methods (article)
- pandas.date_range — Return a fixed frequency DatetimeIndex.
- start: the start date of the date range generated
- end: the end date of the date range generated
- periods: the number of dates generated
- freq: default to “D” (daily), the interval between dates generated, it could be hourly, monthly or yearly
- pandas.bdate_range — Return a fixed frequency DatetimeIndex, with the business day as the default frequency.
- pandas.period_range — Return a fixed frequency PeriodIndex. The day (calendar) is the default frequency.
- pandas.timedelta_range — Return a fixed frequency TimedeltaIndex, with the day as the default frequency.
- pandas.date_range — Return a fixed frequency DatetimeIndex.
Coerce to datetime
Source has month first or day first
# month first
# e.g 9/16/2015 --> 2015-09-16
df['joining_date'] = pd.to_datetime(df['joining_date'])

# day first
# e.g 16/9/2015 --> 2015-09-16
df['joining_date'] = pd.to_datetime(df['joining_date'], dayfirst=True)
- default is month first
- * If the first digit is a number that can NOT be a month (e.g. 25), then it will parse it as a day instead. *
Format conditional on source’s delimiter
df['joining_date_clean'] = np.where(
    df['joining_date'].str.contains('/'),
    pd.to_datetime(df['joining_date']),
    pd.to_datetime(df['joining_date'], dayfirst=True)
)
- Like an ifelse. Source dates that have “/” separating values are parsed as month-first and everything else as day-first
Transform partial date columns
Example: String
09/2007
will be transformed to date2007-09-01
import datetime

def parse_first_brewed(text: str) -> datetime.date:
    parts = text.split('/')
    if len(parts) == 2:
        return datetime.date(int(parts[1]), int(parts[0]), 1)
    elif len(parts) == 1:
        return datetime.date(int(parts[0]), 1, 1)
    else:
        assert False, 'Unknown date format'

>>> parse_first_brewed('09/2007')
datetime.date(2007, 9, 1)
>>> parse_first_brewed('2006')
datetime.date(2006, 1, 1)
Extract time components
- Create “year-month” column:
df["year_month"] = df["created_at"].dt.to_period("M")
- “M” is the “offset alias” string for month
- (Docs)
Filter a range
gallipoli_data.loc[
    (gallipoli_data.DateTime >= '2008-01-02')
    & (gallipoli_data.DateTime <= '2008-01-03')
]
Fill in gaps
Example: hacky way to do it
pd.DataFrame(
    sales_data.Sales.resample('h').mean()
)
- frequency is already hourly (‘h’), so taking the mean doesn’t change the values. But NaNs will be added for datetime values that don’t exist.
.fillna(0)
can be added to.mean()
if you want to fill the NaNs with something meaningful (e.g. 0)
Explode interval between 2 date columns into a column with all the dates in that interval
Example: 1 row
pd.date_range(calendar["checkin_date"][0], calendar["checkout_date"][0])
# output
DatetimeIndex(['2022-06-01', '2022-06-02', '2022-06-03',
               '2022-06-04', '2022-06-05', '2022-06-06', '2022-06-07'],
              dtype='datetime64[ns]', freq='D')
- Apply to every row to get a list column of booked days
calendar.loc[:, "booked_days"] = calendar.apply(
    lambda x: list(
        pd.date_range(
            x.checkin_date,
            x.checkout_date + pd.DateOffset(days=1)
        ).date
    ),
    axis=1
)
- Explode the list column into one row per date
# explode
calendar = calendar.explode(
    column="booked_days", ignore_index=True
)[["property", "booked_days"]]
# display the first 5 rows
calendar.head()
Calculations
Get the min/max dates of a dataset:
print(df.Date.agg(['min', 'max']))
(“Date” is the date variable)
Find difference between two date columns
df["days_to_checkin"] = (df["checkin_date"] - df["created_at"]).dt.days
- number of days between the check-in date and the date booking was created (i.e. number of days until the customer arrives)
Add 1 day to a subset of observations
"booking_id"]==1001, "checkout_date"] = df.loc[df["booking_id"]==1001, "checkout_date"] + pd.DateOffset(days=1) df.loc[df[
- adds 1 day to the checkout date of the booking with id 1001
Aggregation
- Misc
resample
(docs) requires a datetime type column set as the index for the dataframe: df.index = df['DateTime']
- btw this function doesn’t resample in the bootstrapping sense of the word. Just a function that allows you to do window calculations on time series
- Common time strings (docs)
- s for seconds
- t for minutes
- h for hours
- w for weeks
- m for months
- q for quarter
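For example, using one of these aliases with resample (column name hypothetical; requires a DatetimeIndex):

df['value'].resample('W').mean()  # weekly averages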
- Rolling-Window
Example: 30-day rolling average
window_mavg_short = 30
stock_df['mav_short'] = stock_df['Close'] \
    .rolling(window=window_mavg_short) \
    .mean()
- Step-Window
Example: Mean temperature every 3 hours
gallipoli_data.Temperature.resample(rule='3h').mean()
- “1.766667” is the average from 03:00:00 to 05:00:00
- “4.600000” is the average from 06:00:00 to 08:00:00
- “h” is the time string from hours
- By default the calculation window starts on the left index (see next Example for right index) and doesn’t include the right index
- e.g. index, “09:00:00”, calculation window includes “09:00:00”, “10:00:00”, and “11:00:00”
Example: Max sunshine every 3 hrs
gallipoli_data['Sunshine Duration'].resample(rule='3h', closed='right').max()
- “closed=‘right’” says include the right index in the calculation but not the left
- e.g. index, “09:00:00”, calculation window includes “10:00:00”, “11:00:00”, and “12:00:00”
Simulation
- Standard GARCH time-series that’s frequently encountered in econometrics

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(987)
time_series = [np.random.normal()*0.1, np.random.normal()*0.1]
sigs = [0.1, 0.1]
for t in range(2000):
    sig_t = np.sqrt(0.1 + 0.24*time_series[-1]**2 + 0.24*time_series[-2]**2 + 0.24*sigs[-1]**2 + 0.24*sigs[-2]**2)
    y_t = np.random.normal() * sig_t
    time_series.append(y_t)
    sigs.append(sig_t)

y = np.array(time_series[2:])
plt.figure(figsize=(16, 8))
plt.plot(y, label="Simulated Time-Series")
plt.grid(alpha=0.5)
plt.legend(fontsize=18)
- Time-series drawn from a beta distribution with time-varying parameters

from scipy.stats import beta

np.random.seed(321)
time_series = [beta(0.5, 10).rvs()]
for t in range(2000):
    alpha_t = 0.5 + time_series[-1] * 0.025 * t
    beta_t = alpha_t * 20
    y_t = beta(alpha_t, beta_t).rvs()
    time_series.append(y_t)

y = np.array(time_series[1:])
plt.figure(figsize=(16, 8))
plt.plot(y, label="Simulated Time-Series")
plt.grid(alpha=0.5)
plt.legend(fontsize=18)
Optimization
Performance
Pandas will typically outperform numpy ndarrays in cases that involve significantly larger volume of data (say >500K rows)
Bad performance by iteratively creating rows in a dataframe
- Better to iteratively append lists then coerce to a dataframe at the end
- Use itertuples instead of iterrows in loops
  - Iterates through the data frame by converting each row of data to a tuple. Makes comparatively fewer function calls and hence carries less overhead.
- tqdm.tqdm is a progress bar for loops
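A minimal sketch of wrapping a loop with tqdm:

from tqdm import tqdm

for row in tqdm(df.itertuples(), total=len(df)):
    pass  # process row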
Libraries
- {{pandarallel}} - A simple and efficient tool to parallelize Pandas operations on all available CPUs.
- {{parallel_pandas}} - A simple and efficient tool to parallelize Pandas operations on all available CPUs.
- {{modin}} - multi-processing package with identical APIs to Pandas, to speed up the Pandas workflow by changing 1 line of code. Modin offers accelerated performance for about 90+% of Pandas API. Modin uses Ray and Dask under the hood for distributed computing.
- {{numba}} - JIT compiler that translates a subset of Python and NumPy code into fast machine code.
works best with functions that involve many native Python loops, a lot of math, and even better, NumPy functions and arrays
Example
@numba.jit
def crazy_function(col1, col2, col3):
    return (col1 ** 3 + col2 ** 2 + col3 * 10) ** 0.5

massive_df["f1001"] = crazy_function(
    massive_df["f1"].values, massive_df["f56"].values, massive_df["f44"].values
)
Wall time: 201 ms
- 9GB dataset
- JIT stands for just in time, and it translates pure Python and NumPy code to native machine instructions
Use numpy arrays
dat['col1001'] = some_function(
    dat['col1'].values, dat['col2'].values, dat['col3'].values
)
Adding the
.values
to the column vectors coerces to ndarraysNumPy arrays are faster because they don’t perform additional calls for indexing, data type checking like Pandas Series
eval
for non-mathematical operations (e.g. boolean indexing, comparisons, etc.)
Example
massive_df.eval("col1001 = (col1 ** 3 + col2 ** 2 + col3 * 10) ** 0.5", inplace=True)
- Used a mathematical operation for his example for some reason
iloc
vsloc
- iloc faster for filtering rows:
dat.iloc[range(10000)]
- loc faster for selecting columns:
dat.loc[:, ["f1", "f2", "f3"]]
- noticeable as data size increases
Memory Optimization
See memory size of an object
data = pd.read_csv("dummy_dataset.csv")
data.info(memory_usage="deep")
data.memory_usage(deep=True) / 1024 ** 2  # displays col sizes in MBs

# or for just 1 variable
data.Country.memory_usage()

# a few columns
memory_usage = df.memory_usage(deep=True) / 1024 ** 2  # displays col sizes in MBs
memory_usage.head(7)
memory_usage.sum()  # total
Use inplace transformation (i.e. don’t create new copies of the df) to reduce memory load
df.fillna(0, inplace=True)
# instead of
df_copy = df.fillna(0)
Load only the columns you need from a file
col_list = ["Employee_ID", "First_Name", "Salary", "Rating", "Company"]
data = pd.read_csv("dummy_dataset.csv", usecols=col_list)
Convert variables to smaller types when possible
- Variables always receive largest memory types
- Pandas will always assign
int64
as the datatype of the integer-valued column, irrespective of the range of current values in the column.
Byte ranges (same bit options for floats)
uint refers to unsigned, only positive integers
int8: 8-bit integer that covers integers from [-2⁷, 2⁷ − 1]
int16: 16-bit integer that covers integers from [-2¹⁵, 2¹⁵ − 1]
int32: 32-bit integer that covers integers from [-2³¹, 2³¹ − 1]
int64: 64-bit integer that covers integers from [-2⁶³, 2⁶³ − 1]
Convert integer column to smaller type:
data["Employee_ID"] = data.Employee_ID.astype(np.int32)
Convert all “object” type columns to “category”
for col in X.select_dtypes(include=['object']):
    X[col] = X[col].astype('category')
- object datatype consumes the most memory. Either use str or category if there are few unique values in the feature
pd.Categorical
data type can speed things up to 10 times while using LightGBM’s default categorical handler
For
datetime
ortimedelta
, use the native formats offered in pandas since they enable special manipulation functions
Function for checking and converting all numerical columns in dataframes to optimal types
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
- Based on the minimum and maximum value of a numeric column and the above table, the function converts it to the smallest subtype possible
Check memory usage before and after conversion
print("Memory usage before changing the datatype:", data.Country.memory_usage())
data["Country"] = data.Country.astype("category")
print("Memory usage after changing the datatype:", data.Country.memory_usage())
Use sparse types for variables with NaNs
- Example:
data["Rating"] = data.Rating.astype("Sparse[float32]")
Specify datatype when loading data
col_list = ["Employee_ID", "First_Name", "Salary", "Rating", "Country_Code"]
data = pd.read_csv("dummy_dataset.csv", usecols=col_list,
                   dtype={"Employee_ID": np.int32, "Country_Code": "category"})