Python · Data engineering
pandas
Jan 25, 2019 · 9 min read

1. What is pandas and why would you use it?

Personally, I find pandas an indispensable tool for everyday data wrangling. Yet even though I use it almost every day, there are certain solutions that I constantly forget about, hence this short blog post with useful snippets.

2. Snippets

datetime

Let’s work on a simple dataframe:

import pandas as pd

df = pd.DataFrame(dict(event_id=[1, 2], time=["2020-01-01 12:00", "2020-01-02 12:00"]))
print(df)
##    event_id              time
## 0         1  2020-01-01 12:00
## 1         2  2020-01-02 12:00
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Data columns (total 2 columns):
##  #   Column    Non-Null Count  Dtype 
## ---  ------    --------------  ----- 
##  0   event_id  2 non-null      int64 
##  1   time      2 non-null      object
## dtypes: int64(1), object(1)
## memory usage: 160.0+ bytes
## None

The first thing you should do is convert the strings into datetimes:

df["time"] = pd.to_datetime(df["time"])
print(df)
##    event_id                time
## 0         1 2020-01-01 12:00:00
## 1         2 2020-01-02 12:00:00
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Data columns (total 2 columns):
##  #   Column    Non-Null Count  Dtype         
## ---  ------    --------------  -----         
##  0   event_id  2 non-null      int64         
##  1   time      2 non-null      datetime64[ns]
## dtypes: datetime64[ns](1), int64(1)
## memory usage: 160.0 bytes
## None

which gives access to handy attributes and methods through the .dt accessor:

print(df["time"].dt.date)
## 0    2020-01-01
## 1    2020-01-02
## Name: time, dtype: object
print(df["time"].dt.year)
## 0    2020
## 1    2020
## Name: time, dtype: int64

pd.to_datetime creates objects of type pandas.Timestamp, which come with useful methods described in the pandas docs, like replace:

print(type(df["time"][0]))
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>
new_times = df["time"].map(lambda x: x.replace(month=2, year=2021))
print(new_times)
## 0   2021-02-01 12:00:00
## 1   2021-02-02 12:00:00
## Name: time, dtype: datetime64[ns]

Filtering is rather straightforward:

print(df["time"] > "2020-01-02 00:01")
# but if you want to be meticulous, you can do it with
## 0    False
## 1     True
## Name: time, dtype: bool
print(df["time"] > pd.Timestamp(2020, 1, 2, 0, 1))
## 0    False
## 1     True
## Name: time, dtype: bool
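These comparisons return boolean masks; to actually select the matching rows, use the mask as an indexer (a minimal sketch on the dataframe above):

late_events = df[df["time"] > pd.Timestamp(2020, 1, 2, 0, 1)]  # keep rows where the mask is True
print(late_events)
##    event_id                time
## 1         2 2020-01-02 12:00:00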

To sum up: you should convert your datetime strings to pandas.Timestamp with pd.to_datetime. From then on you have two kinds of operations at your disposal:

  • using the .dt accessor, which operates on the whole column at once and is handy for filtering,

  • using .map and a lambda for manipulating the pd.Timestamp objects one by one, as in the sketch below.
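Both approaches can be combined, e.g. a vectorized .dt filter followed by an element-wise .map (a minimal sketch on the dataframe above; the variable names are mine):

january = df[df["time"].dt.month == 1]  # vectorized filter via .dt
shifted = january["time"].map(lambda ts: ts.replace(year=ts.year + 1))  # element-wise .map
print(shifted)
## 0   2021-01-01 12:00:00
## 1   2021-01-02 12:00:00
## Name: time, dtype: datetime64[ns]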

grouping and aggregation

Or: a task that you would expect to be easy enough, yet in pandas it is not.

Let’s define a simple dataframe…

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "event_id": [1, 1, 2, 2, 2],
    "result": [12, 10, 15, 11, np.nan],
    "comment": ["good", "bad", "good", "good", "bad"]
})
print(df)
##    event_id  result comment
## 0         1    12.0    good
## 1         1    10.0     bad
## 2         2    15.0    good
## 3         2    11.0    good
## 4         2     NaN     bad

…and group it by the column “event_id”:

grouped = df.groupby("event_id")
print(type(grouped))
## <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

The resulting object is of type DataFrameGroupBy and has an interesting property: we can iterate over it:

for g in grouped:
    print(g)
## (1,    event_id  result comment
## 0         1    12.0    good
## 1         1    10.0     bad)
## (2,    event_id  result comment
## 2         2    15.0    good
## 3         2    11.0    good
## 4         2     NaN     bad)

Interestingly, each item is a tuple (key, df_sub), where key is one of the values of the grouping column (one of set(df["event_id"])) and df_sub is the corresponding subset df[df["event_id"] == key]. With multiple grouping columns, things get even more interesting:

grouped = df.groupby(["event_id", "comment"])
for (event_id, comment), df_sub in grouped:
    print(f"df for event_id: {event_id} and comment: {comment}:")
    print(df_sub, "\n")
## df for event_id: 1 and comment: bad:
##    event_id  result comment
## 1         1    10.0     bad 
## 
## df for event_id: 1 and comment: good:
##    event_id  result comment
## 0         1    12.0    good 
## 
## df for event_id: 2 and comment: bad:
##    event_id  result comment
## 4         2     NaN     bad 
## 
## df for event_id: 2 and comment: good:
##    event_id  result comment
## 2         2    15.0    good
## 3         2    11.0    good

So grouping merely divides the dataset into subsets according to the by argument of the function; no computation happens until you aggregate.
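A quick way to convince yourself (a sanity-check sketch): concatenating the groups back together recovers exactly the original rows.

# groupby merely partitions the rows - gluing the parts back gives the original df
parts = [df_sub for _, df_sub in df.groupby("event_id")]
assert pd.concat(parts).sort_index().equals(df)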

But there is one minor annoyance with grouping: I often get frustrated with MultiIndexes, so I use as_index=False:

grouped = df.groupby(["event_id", "comment"], as_index=False)
for (event_id, comment), df_sub in grouped:
    print(f"df for event_id: {event_id} and comment: {comment}:")
    print(df_sub, "\n")
## df for event_id: 1 and comment: bad:
##    event_id  result comment
## 1         1    10.0     bad 
## 
## df for event_id: 1 and comment: good:
##    event_id  result comment
## 0         1    12.0    good 
## 
## df for event_id: 2 and comment: bad:
##    event_id  result comment
## 4         2     NaN     bad 
## 
## df for event_id: 2 and comment: good:
##    event_id  result comment
## 2         2    15.0    good
## 3         2    11.0    good

That's better.

Once the data is grouped, we can easily obtain the sizes/counts of the groups, like so:

print(grouped.size(), "\n")
##    event_id comment  size
## 0         1     bad     1
## 1         1    good     1
## 2         2     bad     1
## 3         2    good     2
print(grouped.count(), "\n")  # count() excludes missing values: (2, bad) gets 0 because its only result is NaN
##    event_id comment  result
## 0         1     bad       1
## 1         1    good       1
## 2         2     bad       0
## 3         2    good       2
print(df.count(), "\n")  # count() works also for the whole df
## event_id    5
## result      4
## comment     5
## dtype: int64
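By the way, a related shortcut (available since pandas 1.1) is DataFrame.value_counts, which behaves like groupby(...).size() sorted by frequency:

# counts of unique (event_id, comment) combinations, most frequent first
print(df.value_counts(subset=["event_id", "comment"]))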

aggregating

.agg may be used on a dataframe or on a groupby object. Here are some examples for a dataframe:

print(df.agg(min), "\n")
## event_id       1
## result      10.0
## comment      bad
## dtype: object
print(df.agg([min, max, 'mean', np.std]), "\n")
##       event_id     result comment
## min   1.000000  10.000000     bad
## max   2.000000  15.000000    good
## mean  1.600000  12.000000     NaN
## std   0.547723   2.160247     NaN
print(df.agg({'event_id': ['max', 'mean'], 'result': 'std'}), "\n")  # somewhat funny formatting, different from the one returned by groupby.agg
##       event_id    result
## max        2.0       NaN
## mean       1.6       NaN
## std        NaN  2.160247

and for grouped data:

grouped = df.groupby(["event_id", "comment"])
print(grouped.agg(min), "\n")
##                   result
## event_id comment        
## 1        bad        10.0
##          good       12.0
## 2        bad         NaN
##          good       11.0
print(grouped.agg([min, max, 'mean', np.std]), "\n")
# useful for grouping many columns
##                  result                      
##                     min   max  mean       std
## event_id comment                             
## 1        bad       10.0  10.0  10.0       NaN
##          good      12.0  12.0  12.0       NaN
## 2        bad        NaN   NaN   NaN       NaN
##          good      11.0  15.0  13.0  2.828427
print(grouped.agg({'result': [min, max, 'mean', np.std]}), "\n")
##                  result                      
##                     min   max  mean       std
## event_id comment                             
## 1        bad       10.0  10.0  10.0       NaN
##          good      12.0  12.0  12.0       NaN
## 2        bad        NaN   NaN   NaN       NaN
##          good      11.0  15.0  13.0  2.828427

As you can see, we obtained a MultiIndex in the rows, which is IMHO redundant (the keys could just be regular columns). Unfortunately, as_index=False stops working when aggregating, but we can use .reset_index() instead.

grouped = df.groupby(["event_id", "comment"])
print(grouped.agg({'result': [min, 'mean', np.std]}), "\n")
##                  result                
##                     min  mean       std
## event_id comment                       
## 1        bad       10.0  10.0       NaN
##          good      12.0  12.0       NaN
## 2        bad        NaN   NaN       NaN
##          good      11.0  13.0  2.828427
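Appending .reset_index() turns the group keys back into ordinary columns:

print(grouped.agg({'result': 'mean'}).reset_index(), "\n")
##    event_id comment  result
## 0         1     bad    10.0
## 1         1    good    12.0
## 2         2     bad     NaN
## 3         2    good    13.0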

Besides, with many metrics we also get a MultiIndex in the columns, which may be problematic, so it is a good idea to flatten it:

res = grouped.agg([min, max, 'mean', np.std]).reset_index()
res.columns = ["_".join(x) for x in res.columns.values]
print(res)
##    event_id_ comment_  result_min  result_max  result_mean  result_std
## 0          1      bad        10.0        10.0         10.0         NaN
## 1          1     good        12.0        12.0         12.0         NaN
## 2          2      bad         NaN         NaN          NaN         NaN
## 3          2     good        11.0        15.0         13.0    2.828427
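Note the trailing underscores in event_id_ and comment_: for those columns the second level of the flattened tuple is an empty string. A small tweak of the snippet above avoids that:

res = grouped.agg([min, max, 'mean', np.std]).reset_index()
# drop the empty levels before joining, so plain columns keep their original names
res.columns = ["_".join(filter(None, x)) for x in res.columns.values]
print(res.columns.tolist())
## ['event_id', 'comment', 'result_min', 'result_max', 'result_mean', 'result_std']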

Last but not least, .agg can take functions in various forms as arguments:

def mad(x):
    """Mean absolute deviation of vector x, measured around its median."""
    return np.sum(np.abs(x - np.median(x))) / len(x)

functions = [min, 'max', 'std', np.median, lambda x: np.std(x) / np.mean(x), mad]
print(grouped.agg({'result': functions}))
##                  result                                       
##                     min   max       std median <lambda_0>  mad
## event_id comment                                              
## 1        bad       10.0  10.0       NaN   10.0   0.000000  0.0
##          good      12.0  12.0       NaN   12.0   0.000000  0.0
## 2        bad        NaN   NaN       NaN    NaN        NaN  0.0
##          good      11.0  15.0  2.828427   13.0   0.153846  2.0
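Since pandas 0.25 there is also "named aggregation", which produces flat column names right away (a sketch; the keyword names are mine):

# keyword = (column, aggregation function), no column MultiIndex in the result
res = df.groupby(["event_id", "comment"], as_index=False).agg(
    result_min=("result", "min"),
    result_mean=("result", "mean"),
)
print(res)
##    event_id comment  result_min  result_mean
## 0         1     bad        10.0         10.0
## 1         1    good        12.0         12.0
## 2         2     bad         NaN         NaN
## 3         2    good        11.0         13.0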

To sum up, grouping and aggregating work a little like a list comprehension (do something for each item) or map-reduce: groupby divides the dataset into items, and agg runs a function on each of them.

3. Other interesting commands/arguments

4. References

I highly recommend the book Pandas in Action to all readers.