pandas
and why would you use it?
pandas is probably your first choice for working with tabular data in Python. True, there is also datatable, but it's nowhere near as popular as pandas. Effectively it is the only reasonable Python package for this purpose, which makes Python look a little modest compared to R (base, data.table, dplyr - every one of them has a better interface than pandas) when it comes to table processing.
Personally I find pandas a tool that:
- makes any task moderately simple. This may be an advantage for difficult tasks, but for everyday use it is slightly frustrating;
- has weird defaults (see the sketch below), e.g.:
  - to_csv has index=True by default. Who would want that? Especially since read_csv does not read that index column back as the index.
  - groupby has as_index=True by default, which creates MultiIndexes, which seem to be completely useless (or is it just me who doesn't use them?)
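A minimal sketch of both of these defaults (the file name data.csv is made up for illustration, and the exact size() output depends on the pandas version):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df.to_csv("data.csv")  # index=True by default, so an extra unnamed index column ends up in the file
print(pd.read_csv("data.csv").columns.tolist())
## ['Unnamed: 0', 'a']
df.to_csv("data.csv", index=False)  # what you usually want
print(df.groupby("a").size())  # as_index=True by default: "a" becomes the index of the result
print(df.groupby("a", as_index=False).size())  # on recent pandas: "a" and "size" stay regular columns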
Even though I use pandas almost every day, there are certain solutions that I constantly forget about, hence this short blog post with useful snippets.
Let’s work on a simple dataframe:
import pandas as pd
df = pd.DataFrame(dict(event_id=[1, 2], time=["2020-01-01 12:00", "2020-01-02 12:00"]))
print(df)
## event_id time
## 0 1 2020-01-01 12:00
## 1 2 2020-01-02 12:00
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 event_id 2 non-null int64
## 1 time 2 non-null object
## dtypes: int64(1), object(1)
## memory usage: 160.0+ bytes
## None
The first thing that you should do is to convert the string into a datetime:
df["time"] = pd.to_datetime(df["time"])
print(df)
## event_id time
## 0 1 2020-01-01 12:00:00
## 1 2 2020-01-02 12:00:00
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 event_id 2 non-null int64
## 1 time 2 non-null datetime64[ns]
## dtypes: datetime64[ns](1), int64(1)
## memory usage: 160.0 bytes
## None
which provides interesting methods, available via .dt:
print(df["time"].dt.date)
## 0 2020-01-01
## 1 2020-01-02
## Name: time, dtype: object
print(df["time"].dt.year)
## 0 2020
## 1 2020
## Name: time, dtype: int64
to_datetime creates objects of type pandas.Timestamp, which have interesting methods described in the pandas docs, like replace:
print(type(df["time"][0]))
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>
new_times = df["time"].map(lambda x: x.replace(month=2, year=2021))
print(new_times)
## 0 2021-02-01 12:00:00
## 1 2021-02-02 12:00:00
## Name: time, dtype: datetime64[ns]
Filtering is rather straightforward:
print(df["time"] > "2020-01-02 00:01")
# but if you want to be meticulous, you can do it with
## 0 False
## 1 True
## Name: time, dtype: bool
print(df["time"] > pd.Timestamp(2020, 1, 2, 0, 1))
## 0 False
## 1 True
## Name: time, dtype: bool
To sum up: you should convert your datetime strings to pandas.Timestamp with pd.to_datetime. From then on you have two possible types of operations:
- using the .dt property, which may be useful for filtering,
- using .map and a lambda for manipulating pd.Timestamp objects one by one.
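For example, a quick sketch of the first kind, using the df defined above (the thresholds are arbitrary):

print(df[df["time"].dt.year == 2020])  # keeps both rows - they are both from 2020
print(df[df["time"].dt.date > pd.Timestamp("2020-01-01").date()])  # keeps only the row from 2020-01-02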
groupby and agg
or a task that you would expect to be easy enough, yet in pandas it is not.
Let’s define a simple dataframe…
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "event_id": [1, 1, 2, 2, 2],
    "result": [12, 10, 15, 11, np.nan],
    "comment": ["good", "bad", "good", "good", "bad"]
})
print(df)
## event_id result comment
## 0 1 12.0 good
## 1 1 10.0 bad
## 2 2 15.0 good
## 3 2 11.0 good
## 4 2 NaN bad
…and group it by the column “event_id”:
grouped = df.groupby("event_id")
print(type(grouped))
## <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
The resulting object is of type DataFrameGroupBy and has an interesting property: we can iterate over it:
for g in grouped:
    print(g)
## (1, event_id result comment
## 0 1 12.0 good
## 1 1 10.0 bad)
## (2, event_id result comment
## 2 2 15.0 good
## 3 2 11.0 good
## 4 2 NaN bad)
Interestingly, each item is a tuple of (id, df_sub), where id is the value of the grouping column for this particular group (one of set(df["event_id"])) and df_sub is the subset of df with that value, i.e. df[df["event_id"] == id]. With many grouping columns though, things get even more interesting:
grouped = df.groupby(["event_id", "comment"])
for (event_id, comment), df_sub in grouped:
print(f"df for event_id: {event_id} and comment: {comment}:")
print(df_sub, "\n")
## df for event_id: 1 and comment: bad:
## event_id result comment
## 1 1 10.0 bad
##
## df for event_id: 1 and comment: good:
## event_id result comment
## 0 1 12.0 good
##
## df for event_id: 2 and comment: bad:
## event_id result comment
## 4 2 NaN bad
##
## df for event_id: 2 and comment: good:
## event_id result comment
## 2 2 15.0 good
## 3 2 11.0 good
It seems that grouping only divides the dataset into subsets based on the by argument of the function. But there is one minor issue with grouping: I often get frustrated with MultiIndexes, so I use as_index=False:
grouped = df.groupby(["event_id", "comment"], as_index=False)
for (event_id, comment), df_sub in grouped:
    print(f"df for event_id: {event_id} and comment: {comment}:")
    print(df_sub, "\n")
## df for event_id: 1 and comment: bad:
## event_id result comment
## 1 1 10.0 bad
##
## df for event_id: 1 and comment: good:
## event_id result comment
## 0 1 12.0 good
##
## df for event_id: 2 and comment: bad:
## event_id result comment
## 4 2 NaN bad
##
## df for event_id: 2 and comment: good:
## event_id result comment
## 2 2 15.0 good
## 3 2 11.0 good
that’s better.
When the data is grouped, we can easily obtain the size/count of each group. Counting can be done like this:
print(grouped.size(), "\n")
## event_id comment size
## 0 1 bad 1
## 1 1 good 1
## 2 2 bad 1
## 3 2 good 2
print(grouped.count(), "\n")  # excludes missing values - e.g. the (2, bad) group has count 0 for result, despite having one row
## event_id comment result
## 0 1 bad 1
## 1 1 good 1
## 2 2 bad 0
## 3 2 good 2
print(df.count(), "\n") # count() works also for the whole df
## event_id 5
## result 4
## comment 5
## dtype: int64
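Just for comparison, this is what the same size() call returns with the default as_index=True - a Series with a MultiIndex:

print(df.groupby(["event_id", "comment"]).size())
## event_id  comment
## 1         bad        1
##           good       1
## 2         bad        1
##           good       2
## dtype: int64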
.agg may be used for a dataframe or a groupby object. Here are some examples for a dataframe:
print(df.agg(min), "\n")
## event_id 1
## result 10.0
## comment bad
## dtype: object
print(df.agg([min, max, 'mean', np.std]), "\n")
## event_id result comment
## min 1.000000 10.000000 bad
## max 2.000000 15.000000 good
## mean 1.600000 12.000000 NaN
## std 0.547723 2.160247 NaN
print(df.agg({'event_id': ['max', 'mean'], 'result': 'std'}), "\n")  # somewhat funny formatting, different from the one returned by groupby.agg
## event_id result
## max 2.0 NaN
## mean 1.6 NaN
## std NaN 2.160247
and for grouped data:
grouped = df.groupby(["event_id", "comment"])
print(grouped.agg(min), "\n")
##                   result
## event_id comment
## 1        bad        10.0
##          good       12.0
## 2        bad         NaN
##          good       11.0
print(grouped.agg([min, max, 'mean', np.std]), "\n")  # useful for grouping many columns
##                  result
##                     min   max  mean       std
## event_id comment
## 1        bad       10.0  10.0  10.0       NaN
##          good      12.0  12.0  12.0       NaN
## 2        bad        NaN   NaN   NaN       NaN
##          good      11.0  15.0  13.0  2.828427
print(grouped.agg({'result': [min, max, 'mean', np.std]}), "\n")
##                  result
##                     min   max  mean       std
## event_id comment
## 1        bad       10.0  10.0  10.0       NaN
##          good      12.0  12.0  12.0       NaN
## 2        bad        NaN   NaN   NaN       NaN
##          good      11.0  15.0  13.0  2.828427
As you can see, we obtained MultiIndexes in rows, which are IMHO redundant (they could be columns). Unfortunately as_index=False stops working when aggregating, but we can use .reset_index() instead.
grouped = df.groupby(["event_id", "comment"])
print(grouped.agg({'result': [min, 'mean', np.std]}), "\n")
##                  result
##                     min  mean       std
## event_id comment
## 1        bad       10.0  10.0       NaN
##          good      12.0  12.0       NaN
## 2        bad        NaN   NaN       NaN
##          good      11.0  13.0  2.828427
Besides, when using many metrics we get a MultiIndex on columns as well, which may be problematic, so it is a good idea to flatten it:
res = grouped.agg([min, max, 'mean', np.std]).reset_index()
res.columns = ["_".join(x) for x in res.columns.values]
print(res)
## event_id_ comment_ result_min result_max result_mean result_std
## 0 1 bad 10.0 10.0 10.0 NaN
## 1 1 good 12.0 12.0 12.0 NaN
## 2 2 bad NaN NaN NaN NaN
## 3 2 good 11.0 15.0 13.0 2.828427
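A side note: on a reasonably recent pandas (0.25 or newer), named aggregation gives a flat result without any manual flattening; a minimal sketch:

res2 = df.groupby(["event_id", "comment"], as_index=False).agg(
    result_min=("result", "min"),
    result_max=("result", "max"),
    result_mean=("result", "mean"),
    result_std=("result", "std"),
)
print(res2)  # columns: event_id, comment, result_min, result_max, result_mean, result_std - no MultiIndex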
Last but not least, .agg can take functions in various forms as arguments:
def mad(x):
    """mean absolute deviation of vector x (from its median)"""
    return np.sum(np.abs(x - np.median(x))) / len(x)
functions = [min, 'max', 'std', np.median, lambda x: np.std(x) / np.mean(x), mad]
print(grouped.agg({'result': functions}))
##                  result
##                     min   max       std  median  <lambda_0>  mad
## event_id comment
## 1        bad       10.0  10.0       NaN    10.0    0.000000  0.0
##          good      12.0  12.0       NaN    12.0    0.000000  0.0
## 2        bad        NaN   NaN       NaN     NaN         NaN  0.0
##          good      11.0  15.0  2.828427    13.0    0.153846  2.0
To sum up, grouping and aggregating works a little like list comprehensions (do something for each item) or map-reduce: groupby divides the dataset into items, and agg runs a function on each of the items.
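To make the analogy concrete, here is a small sketch (using the same df) in which the "map" step is first written out by hand and then expressed with agg:

# the "map" step done manually, mirroring the iteration shown earlier
for event_id, df_sub in df.groupby("event_id"):
    print(event_id, df_sub["result"].mean())  # 1 -> 11.0, 2 -> 13.0 (mean skips the NaN)
# the same thing expressed with agg
print(df.groupby("event_id")["result"].agg("mean"))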
Finally, a handful of other snippets that I keep forgetting about (a few of them are shown in the sketch after this list):
select_dtypes: df.select_dtypes(include="object") or df.select_dtypes(exclude="object") - chooses only those columns which fulfill the criterion
rename: df.rename(columns={"current_name": "new_name", ...}) - instead of saying rename(..., axis=1), you can say df.rename(columns={...}), which is a little bit more readable
categorical data: df['some_class'].astype('category') - keeping categorical data as the category type is more memory-efficient, as values are stored as integers and their names are stored in something similar to a dict
replace (not str.replace): series.replace('other', 'unknown') - essentially a shorter equivalent of series.loc[series == "other"] = "unknown"; you can also use regex like this: pd.Series(["ab", "ac", "bc"]).replace("a.*", "zz", regex=True), which replaces ab and ac with zz
str.get(): series.str.get(-1) - gets the last element of an iterable object stored in the series: if it is a string, the last letter; if a list/tuple, the last element of that list/tuple
str.split: series.str.split('_', n=2, expand=True) - splits the strings by underscore, but only up to the second (n=2) occurrence of the underscore, and expands the result into several (in this case: 3) columns. An interesting variation of expand is explode: series.explode(), which converts a series of lists into a single series by concatenating the lists, so the result is a series, as opposed to str.split(..., expand=True), where the result is a dataframe
str.replace: series.str.replace("a.*", "zz", regex=True) - replaces everything that starts with a with zz; regex=True is mandatory in order to use regex
str.match: series.str.match("a.*") - marks all the values that start with a as True, the rest as False. In contrast to str.replace, str.match uses regex by default
nlargest: df.nlargest(5, columns="class1") - returns the 5 rows with the largest values of class1
read_csv: pd.read_csv has millions of parameters, but one of them is particularly interesting: pd.read_csv(..., dtype={"col1": int}), which automatically reads the data in the proper format. This can be useful if you don't have an ORM to store information on data types, yet you want the columns to have proper dtypes, even if the whole column is empty (which may happen if you download data from Spark)
axis: it is possible to write axis=0 or axis=1, but axis="rows" and axis="columns" are much more readable
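And a few of the snippets above in action - a minimal sketch on made-up toy data:

s = pd.Series(["ab_cd_ef_gh", "xy_z", "q"])
print(s.str.split("_", n=2, expand=True))  # 3 columns; whatever is left after the second underscore stays in the last one
print(s.str.get(-1))  # last character of each string: "h", "z", "q"
print(pd.Series(["ab", "ac", "bc"]).replace("a.*", "zz", regex=True))  # "zz", "zz", "bc"
df2 = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})
print(df2.select_dtypes(include="object"))  # only the "name" column
df2["name"] = df2["name"].astype("category")  # stored as integer codes plus a small table of category names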
I highly recommend the book Pandas in Action to all readers.