Polars for Feature Engineering

The Polars API is

Simple
Consistent
Organized as a grammar of verbs

Most feature engineering tasks can be expressed with the 7 verbs below.

7 Verbs Get Most Jobs Done
Task	Verb
select/slice columns	select
create/transform/assign columns	with_columns
filter/slice/query rows	filter
join/merge other dataframes	join & concat
group dataframe rows	group_by
aggregate groups	agg
sort dataframe	sort

import polars as pl
df = pl.read_csv("StudentsPerformance.csv")
df.head()

shape: (5, 9)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
i64	str	str	str	str	str	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74
2	"female"	"group C"	"some college"	"standard"	"completed"	69	90	88
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93
4	"male"	"group A"	"associate's degree"	"free/reduced"	"none"	47	57	44
5	"male"	"group C"	"some college"	"standard"	"none"	76	78	75

1. Select Columns

Selecting 1 column

df.select(pl.col('gender')).head()

shape: (5, 1)

gender
str
"female"
"female"
"female"
"male"
"male"

Selecting two or more columns

df.select(pl.col(['gender', 'math score'])).head()

shape: (5, 2)

gender	math score
str	i64
"female"	72
"female"	69
"female"	90
"male"	47
"male"	76

Selecting all the columns

df.select(pl.col('*')).head()

shape: (5, 9)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
i64	str	str	str	str	str	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74
2	"female"	"group C"	"some college"	"standard"	"completed"	69	90	88
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93
4	"male"	"group A"	"associate's degree"	"free/reduced"	"none"	47	57	44
5	"male"	"group C"	"some college"	"standard"	"none"	76	78	75

2. Create Columns

Creating a new column “sum” by summing math score and reading score

df.with_columns(
    (pl.col('math score') + pl.col('reading score')).alias('sum')
).head()

shape: (5, 10)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score	sum
i64	str	str	str	str	str	i64	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74	144
2	"female"	"group C"	"some college"	"standard"	"completed"	69	90	88	159
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93	185
4	"male"	"group A"	"associate's degree"	"free/reduced"	"none"	47	57	44	104
5	"male"	"group C"	"some college"	"standard"	"none"	76	78	75	154

3. Filter

Simple filtering

Selecting female students

df.filter(pl.col('gender') == 'female').head()

shape: (5, 9)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
i64	str	str	str	str	str	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74
2	"female"	"group C"	"some college"	"standard"	"completed"	69	90	88
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93
6	"female"	"group B"	"associate's degree"	"standard"	"none"	71	83	78
7	"female"	"group B"	"some college"	"standard"	"completed"	88	95	92

Multiple filtering

Selecting female students who belong to group B

df.filter(
    (pl.col('gender') == 'female') &
    (pl.col('race/ethnicity') == 'group B')
).head()

shape: (5, 9)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
i64	str	str	str	str	str	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93
6	"female"	"group B"	"associate's degree"	"standard"	"none"	71	83	78
7	"female"	"group B"	"some college"	"standard"	"completed"	88	95	92
10	"female"	"group B"	"high school"	"free/reduced"	"none"	38	60	50

4. Join

df2 = pl.read_csv('LanguageScore.csv')
df.join(df2, on="id").head()

shape: (5, 10)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score	language score
i64	str	str	str	str	str	i64	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74	74
2	"female"	"group C"	"some college"	"standard"	"completed"	69	90	88	67
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93	34
4	"male"	"group A"	"associate's degree"	"free/reduced"	"none"	47	57	44	33
5	"male"	"group C"	"some college"	"standard"	"none"	76	78	75	75

Concat

df2 = df2.drop("id")
pl.concat([df, df2], how="horizontal").head()

shape: (5, 10)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score	language score
i64	str	str	str	str	str	i64	i64	i64	i64
1	"female"	"group B"	"bachelor's degree"	"standard"	"none"	72	72	74	74
2	"female"	"group C"	"some college"	"standard"	"completed"	69	90	88	67
3	"female"	"group B"	"master's degree"	"standard"	"none"	90	95	93	34
4	"male"	"group A"	"associate's degree"	"free/reduced"	"none"	47	57	44	33
5	"male"	"group C"	"some college"	"standard"	"none"	76	78	75	75

5. Group By

Count the rows in each race/ethnicity group

df.group_by('race/ethnicity').len()

shape: (5, 2)

race/ethnicity	len
str	u32
"group D"	262
"group A"	89
"group C"	319
"group B"	190
"group E"	140

6. Aggregate

Average math score for female and male students

df.group_by('gender').agg(pl.col('math score').mean().alias('mean_score'))

shape: (2, 2)

gender	mean_score
str	f64
"female"	63.633205
"male"	68.728216

7. Sort

Sort the dataframe by math score in descending order

df.sort('math score', descending=True).head()

shape: (5, 9)

id	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
i64	str	str	str	str	str	i64	i64	i64
150	"male"	"group E"	"associate's degree"	"free/reduced"	"completed"	100	100	93
452	"female"	"group E"	"some college"	"standard"	"none"	100	92	97
459	"female"	"group E"	"bachelor's degree"	"standard"	"none"	100	100	100
624	"male"	"group A"	"some college"	"standard"	"completed"	100	96	86
626	"male"	"group D"	"some college"	"standard"	"completed"	100	97	99