Commit a86a9e4

Copilot and mmcky committed
Apply code review suggestions from @HumphreyYang and @jstac
Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
1 parent ea41ee3 commit a86a9e4

1 file changed: +12 -8 lines changed

lectures/polars.md

Lines changed: 12 additions & 8 deletions
@@ -63,7 +63,7 @@ More sophisticated statistical functionality is left to other packages, such as
 This lecture will provide a basic introduction to Polars.
 
 ```{tip}
-*Why use Polars over pandas?* One reason is *performance*. As a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing the Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
+*Why use Polars over pandas?* One reason is *performance*: as a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times needed for Polars; in addition, Polars is between 10 and 100 times as fast as pandas for common operations; a great article comparing Polars and pandas can be found [in this JetBrains blog post](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/).
 ```
 
 Throughout the lecture, we will assume that the following imports have taken place
@@ -73,7 +73,6 @@ import polars as pl
 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
-import requests
 ```
 
 Two important data types defined by Polars are `Series` and `DataFrame`.
@@ -97,7 +96,7 @@ s
 ```
 
 ```{note}
-You may notice the above series has no indices, unlike in [pd.Series](pandas:series). This is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks. Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+You may notice the above series has no indices, unlike in [pd.Series](pandas:series); this is because Polars' is column centric and accessing data is predominantly managed through filtering and boolean masks; here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
 ```
 
 Polars `Series` are built on top of Apache Arrow arrays and support many similar operations to Pandas `Series`.
@@ -152,13 +151,13 @@ df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
 
 If we want to update the `AMZN` return to 0, you can use the following chain of methods.
 
-Here, `with_columns` is similar to `select` but adds columns to the same `DataFrame`
+Here `with_columns` is similar to `select` but adds columns to the same `DataFrame`
 
 ```{code-cell} ipython3
 df = df.with_columns(
     pl.when(pl.col('company') == 'AMZN') # filter for AMZN in company column
     .then(0) # set values to 0
-    .otherwise(pl.col('daily returns')) # otherwise keep the original value
+    .otherwise(pl.col('daily returns')) # otherwise keep original value
     .alias('daily returns') # assign back to the column
 )
 df
@@ -378,8 +377,8 @@ df.with_columns(
 df_modified = df.with_columns(
     pl.when(pl.col('cg') == pl.col('cg').max()) # pick the largest cg value
     .then(None) # set to null
-    .otherwise(pl.col('cg')) # otherwise keep the value in the cg column
-    .alias('cg') # update the column with name cg
+    .otherwise(pl.col('cg')) # otherwise keep the value
+    .alias('cg') # update the column
 )
 df_modified
 ```
@@ -390,7 +389,7 @@ df_modified
 df.with_columns([
     pl.when(pl.col('POP') <= 10000) # when population is < 10,000
     .then(None) # set the value to null
-    .otherwise(pl.col('POP')) # otherwise keep the existing value
+    .otherwise(pl.col('POP')) # otherwise keep existing value
     .alias('POP'), # update the POP column
     (pl.col('XRAT') / 10).alias('XRAT') # update XRAT in-place
 ])
@@ -885,9 +884,14 @@ df_pandas = yearly_returns.to_pandas().set_index('year')
 
 fig, axes = plt.subplots(2, 2, figsize=(12, 10))
 
+# Flatten 2-D array to 1-D array
 for iter_, ax in enumerate(axes.flatten()):
     if iter_ < len(indices_list):
+
+        # Get index name per iteration
         index_name = list(indices_list.values())[iter_]
+
+        # Plot pct change of yearly returns per index
         ax.plot(df_pandas.index, df_pandas[index_name])
         ax.set_ylabel("percent change", fontsize=12)
         ax.set_xlabel("year", fontsize=12)
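
For reference, the loop in the last hunk pairs each flattened subplot with an index name by position. A rough sketch of an equivalent pairing with `zip` follows. It is an alternative phrasing, not what the commit does, and `indices_list`, `years`, and `returns` below are synthetic stand-ins rather than the lecture's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumptions): ticker -> display name, plus made-up values
indices_list = {"^GSPC": "S&P 500", "^IXIC": "NASDAQ", "^DJI": "Dow Jones"}
years = [2021, 2022, 2023]
returns = {name: [1.0, -2.0, 3.0] for name in indices_list.values()}

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
flat_axes = axes.flatten()  # 2x2 array of Axes -> 1-D array of 4 Axes

# zip stops at the shorter input, so no explicit length check is needed
for ax, index_name in zip(flat_axes, indices_list.values()):
    ax.plot(years, returns[index_name])
    ax.set_ylabel("percent change", fontsize=12)
    ax.set_xlabel("year", fontsize=12)
    ax.set_title(index_name)

# Hide any subplot that received no data
for ax in flat_axes[len(indices_list):]:
    ax.set_visible(False)
```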