Polars for Feature Engineering

Polars is a high-performance DataFrame library designed for fast, efficient data processing. Inspired by pandas, the most widely used DataFrame library in Python, Polars is also built to work with large datasets that might not fit into memory.
polars
feature engineering
Author

Vidyasagar Bhargava

Published

January 3, 2024

The Polars API is:

  1. Simple
  2. Consistent
  3. Built around a grammar of verbs

Most feature engineering tasks can be done with the seven verbs below.

7 Verbs Get Most Jobs Done

Task                                    Verb
select/slice columns                    select
create/transform/assign columns         with_columns
filter/slice/query rows                 filter
join/merge other dataframes             join & concat
group dataframe rows                    group_by
aggregate groups                        agg
sort dataframe                          sort

import polars as pl
df = pl.read_csv("StudentsPerformance.csv")
df.head()
shape: (5, 9)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
i64 str str str str str i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74
2 "female" "group C" "some college" "standard" "completed" 69 90 88
3 "female" "group B" "master's degree" "standard" "none" 90 95 93
4 "male" "group A" "associate's degree" "free/reduced" "none" 47 57 44
5 "male" "group C" "some college" "standard" "none" 76 78 75

1. Select Columns

  • Selecting 1 column
df.select(pl.col('gender')).head()
shape: (5, 1)
gender
str
"female"
"female"
"female"
"male"
"male"
  • Selecting two or more columns
df.select(pl.col(['gender', 'math score'])).head()
shape: (5, 2)
gender math score
str i64
"female" 72
"female" 69
"female" 90
"male" 47
"male" 76
  • Selecting all the columns
df.select(pl.col('*')).head()
shape: (5, 9)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
i64 str str str str str i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74
2 "female" "group C" "some college" "standard" "completed" 69 90 88
3 "female" "group B" "master's degree" "standard" "none" 90 95 93
4 "male" "group A" "associate's degree" "free/reduced" "none" 47 57 44
5 "male" "group C" "some college" "standard" "none" 76 78 75

2. Create Columns

  • Creating a new column “sum” by summing math score and reading score
df.with_columns(
    (pl.col('math score') + pl.col('reading score')).alias('sum')
).head()
shape: (5, 10)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score sum
i64 str str str str str i64 i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74 144
2 "female" "group C" "some college" "standard" "completed" 69 90 88 159
3 "female" "group B" "master's degree" "standard" "none" 90 95 93 185
4 "male" "group A" "associate's degree" "free/reduced" "none" 47 57 44 104
5 "male" "group C" "some college" "standard" "none" 76 78 75 154

3. Filter

  • Simple filtering

    selecting female students

df.filter(pl.col('gender') == 'female').head()
shape: (5, 9)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
i64 str str str str str i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74
2 "female" "group C" "some college" "standard" "completed" 69 90 88
3 "female" "group B" "master's degree" "standard" "none" 90 95 93
6 "female" "group B" "associate's degree" "standard" "none" 71 83 78
7 "female" "group B" "some college" "standard" "completed" 88 95 92
  • Filtering on multiple conditions

    selecting female students who belong to group B

df.filter(
    (pl.col('gender') == 'female') &
    (pl.col('race/ethnicity') == 'group B')
).head()
shape: (5, 9)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
i64 str str str str str i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74
3 "female" "group B" "master's degree" "standard" "none" 90 95 93
6 "female" "group B" "associate's degree" "standard" "none" 71 83 78
7 "female" "group B" "some college" "standard" "completed" 88 95 92
10 "female" "group B" "high school" "free/reduced" "none" 38 60 50

4. Join

df2 = pl.read_csv('LanguageScore.csv')
df.join(df2, on="id").head()
shape: (5, 10)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score language score
i64 str str str str str i64 i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74 74
2 "female" "group C" "some college" "standard" "completed" 69 90 88 67
3 "female" "group B" "master's degree" "standard" "none" 90 95 93 34
4 "male" "group A" "associate's degree" "free/reduced" "none" 47 57 44 33
5 "male" "group C" "some college" "standard" "none" 76 78 75 75

Concat

df2 = df2.drop("id")
pl.concat([df, df2], how="horizontal").head()
shape: (5, 10)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score language score
i64 str str str str str i64 i64 i64 i64
1 "female" "group B" "bachelor's degree" "standard" "none" 72 72 74 74
2 "female" "group C" "some college" "standard" "completed" 69 90 88 67
3 "female" "group B" "master's degree" "standard" "none" 90 95 93 34
4 "male" "group A" "associate's degree" "free/reduced" "none" 47 57 44 33
5 "male" "group C" "some college" "standard" "none" 76 78 75 75

5. Group By

Count total elements for each race/ethnicity

df.group_by('race/ethnicity').len()
shape: (5, 2)
race/ethnicity len
str u32
"group D" 262
"group A" 89
"group C" 319
"group B" 190
"group E" 140

6. Aggregate

average math score for females and males

df.group_by('gender').agg(pl.col('math score').mean().alias('mean_score'))
shape: (2, 2)
gender mean_score
str f64
"female" 63.633205
"male" 68.728216

7. Sort

sort the dataframe by math score, highest first

df.sort('math score', descending=True).head()
shape: (5, 9)
id gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
i64 str str str str str i64 i64 i64
150 "male" "group E" "associate's degree" "free/reduced" "completed" 100 100 93
452 "female" "group E" "some college" "standard" "none" 100 92 97
459 "female" "group E" "bachelor's degree" "standard" "none" 100 100 100
624 "male" "group A" "some college" "standard" "completed" 100 96 86
626 "male" "group D" "some college" "standard" "completed" 100 97 99