pandas
and why would you use it?
pandas is probably your first choice for working with tabular data in Python. True, there is also datatable, but it's nowhere near as popular as pandas. Effectively it is the only reasonable Python package for this purpose, which makes Python look a little modest compared to R (base, data.table, dplyr - every one of them has a better interface than pandas) when it comes to table processing.
Personally I find pandas a tool that:
- makes any task moderately simple. This may be an advantage for difficult tasks, but for everyday use it is slightly frustrating;
- has weird defaults (see the sketch below), e.g.:
  - to_csv has index=True by default. Who would want that? Especially since read_csv does not read that index column back as the index.
  - groupby has as_index=True by default, which creates MultiIndexes, which seem to be completely useless (or is it just me who doesn't use them?)
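A minimal sketch of both of these defaults (the file name data.csv is made up for illustration, and the exact size() output depends on the pandas version):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df.to_csv("data.csv")  # index=True by default, so an extra unnamed index column ends up in the file
print(pd.read_csv("data.csv").columns.tolist())
## ['Unnamed: 0', 'a']
df.to_csv("data.csv", index=False)  # what you usually want
print(df.groupby("a").size())  # as_index=True by default: "a" becomes the index of the result
print(df.groupby("a", as_index=False).size())  # on recent pandas: "a" and "size" stay regular columns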
Even though I use pandas almost every day, there are certain solutions that I constantly forget about, hence this short blog post with useful snippets.
Let’s work on a simple dataframe:
import pandas as pd
df = pd.DataFrame(dict(event_id=[1, 2], time=["2020-01-01 12:00", "2020-01-02 12:00"]))
print(df)
## event_id time
## 0 1 2020-01-01 12:00
## 1 2 2020-01-02 12:00
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 event_id 2 non-null int64
## 1 time 2 non-null object
## dtypes: int64(1), object(1)
## memory usage: 160.0+ bytes
## None
The first thing that you should do is to convert the string into a datetime:
df["time"] = pd.to_datetime(df["time"])
print(df)
## event_id time
## 0 1 2020-01-01 12:00:00
## 1 2 2020-01-02 12:00:00
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2 entries, 0 to 1
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 event_id 2 non-null int64
## 1 time 2 non-null datetime64[ns]
## dtypes: datetime64[ns](1), int64(1)
## memory usage: 160.0 bytes
## None
which provides interesting methods, available via .dt:
print(df["time"].dt.date)
## 0 2020-01-01
## 1 2020-01-02
## Name: time, dtype: object
print(df["time"].dt.year)
## 0 2020
## 1 2020
## Name: time, dtype: int64
to_datetime creates objects of type pandas.Timestamp, which have interesting methods described in the pandas docs, like replace:
print(type(df["time"][0]))
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>
new_times = df["time"].map(lambda x: x.replace(month=2, year=2021))
print(new_times)
## 0 2021-02-01 12:00:00
## 1 2021-02-02 12:00:00
## Name: time, dtype: datetime64[ns]
Filtering is rather straightforward:
print(df["time"] > "2020-01-02 00:01")
# but if you want to be meticulous, you can do it with
## 0 False
## 1 True
## Name: time, dtype: bool
print(df["time"] > pd.Timestamp(2020, 1, 2, 0, 1))
## 0 False
## 1 True
## Name: time, dtype: bool
To sum up: you should convert your datetime strings to pandas.Timestamp with pd.to_datetime. From then on you have two possible types of operations:
- using the .dt property, which may be useful for filtering,
- using .map and a lambda for manipulating pd.Timestamp objects one by one.
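For example, a quick sketch of the first kind, using the df defined above (the thresholds are arbitrary):

print(df[df["time"].dt.year == 2020])  # keeps both rows - they are both from 2020
print(df[df["time"].dt.date > pd.Timestamp("2020-01-01").date()])  # keeps only the row from 2020-01-02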
groupby and agg
or a task that you would expect to be easy enough, yet in pandas it is not.
Let’s define a simple dataframe…
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "event_id": [1, 1, 2, 2, 2],
    "result": [12, 10, 15, 11, np.nan],
    "comment": ["good", "bad", "good", "good", "bad"]
})
print(df)
## event_id result comment
## 0 1 12.0 good
## 1 1 10.0 bad
## 2 2 15.0 good
## 3 2 11.0 good
## 4 2 NaN bad
…and group it by the column “event_id”:
grouped = df.groupby("event_id")
print(type(grouped))
## <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
The resulting object is of type DataFrameGroupBy and has an interesting property: we can iterate over it:
for g in grouped:
    print(g)
## (1, event_id result comment
## 0 1 12.0 good
## 1 1 10.0 bad)
## (2, event_id result comment
## 2 2 15.0 good
## 3 2 11.0 good
## 4 2 NaN bad)
Interestingly, each item is a tuple of (id, df_sub), where id is the value of the grouping column for this particular group (one of set(df["event_id"])) and df_sub is the subset of df with that value, i.e. df[df["event_id"] == id]. With many grouping columns though, things get even more interesting:
grouped = df.groupby(["event_id", "comment"])
for (event_id, comment), df_sub in grouped:
print(f"df for event_id: {event_id} and comment: {comment}:")
print(df_sub, "\n")
## df for event_id: 1 and comment: bad:
## event_id result comment
## 1 1 10.0 bad
##
## df for event_id: 1 and comment: good:
## event_id result comment
## 0 1 12.0 good
##
## df for event_id: 2 and comment: bad:
## event_id result comment
## 4 2 NaN bad
##
## df for event_id: 2 and comment: good:
## event_id result comment
## 2 2 15.0 good
## 3 2 11.0 good
It seems that grouping only divides the dataset into subsets based on the by argument of the function. But there is one minor issue with grouping: I often get frustrated with MultiIndexes, so I use as_index=False:
grouped = df.groupby(["event_id", "comment"], as_index=False)
for (event_id, comment), df_sub in grouped:
    print(f"df for event_id: {event_id} and comment: {comment}:")
    print(df_sub, "\n")
## df for event_id: 1 and comment: bad:
## event_id result comment
## 1 1 10.0 bad
##
## df for event_id: 1 and comment: good:
## event_id result comment
## 0 1 12.0 good
##
## df for event_id: 2 and comment: bad:
## event_id result comment
## 4 2 NaN bad
##
## df for event_id: 2 and comment: good:
## event_id result comment
## 2 2 15.0 good
## 3 2 11.0 good
that’s better.
When the data is grouped, we can easily obtain the size/count of each group. Counting can be done like this:
print(grouped.size(), "\n")
## event_id comment size
## 0 1 bad 1
## 1 1 good 1
## 2 2 bad 1
## 3 2 good 2
print(grouped.count(), "\n")  # excludes missing values - e.g. the (2, bad) group has count 0 for result, despite having one row
## event_id comment result
## 0 1 bad 1
## 1 1 good 1
## 2 2 bad 0
## 3 2 good 2
print(df.count(), "\n") # count() works also for the whole df
## event_id 5
## result 4
## comment 5
## dtype: int64
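Just for comparison, this is what the same size() call returns with the default as_index=True - a Series with a MultiIndex:

print(df.groupby(["event_id", "comment"]).size())
## event_id  comment
## 1         bad        1
##           good       1
## 2         bad        1
##           good       2
## dtype: int64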
.agg may be used for a dataframe or a groupby object. Here are some examples for a dataframe:
print(df.agg(min), "\n")
## event_id 1
## result 10.0
## comment bad
## dtype: object
print(df.agg([min, max, 'mean', np.std]), "\n")
## event_id result comment
## min 1.000000 10.000000 bad
## max 2.000000 15.000000 good
## mean 1.600000 12.000000 NaN
## std 0.547723 2.160247 NaN
print(df.agg({'event_id': ['max', 'mean'], 'result': 'std'}), "\n")  # somewhat funny formatting, different from the one returned by groupby.agg
## event_id result
## max 2.0 NaN
## mean 1.6 NaN
## std NaN 2.160247
and for grouped data:
grouped = df.groupby(["event_id", "comment"])
print(grouped.agg(min), "\n")
##                   result
## event_id comment
## 1        bad        10.0
##          good       12.0
## 2        bad         NaN
##          good       11.0
print(grouped.agg([min, max, 'mean', np.std]), "\n")  # useful for grouping many columns
##                  result
##                     min   max  mean       std
## event_id comment
## 1        bad       10.0  10.0  10.0       NaN
##          good      12.0  12.0  12.0       NaN
## 2        bad        NaN   NaN   NaN       NaN
##          good      11.0  15.0  13.0  2.828427
print(grouped.agg({'result': [min, max, 'mean', np.std]}), "\n")
##                  result
##                     min   max  mean       std
## event_id comment
## 1        bad       10.0  10.0  10.0       NaN
##          good      12.0  12.0  12.0       NaN
## 2        bad        NaN   NaN   NaN       NaN
##          good      11.0  15.0  13.0  2.828427
As you can see, we obtained MultiIndexes in rows, which are IMHO redundant (they could be columns). Unfortunately as_index=False stops working when aggregating, but we can use .reset_index() instead.
grouped = df.groupby(["event_id", "comment"])
print(grouped.agg({'result': [min, 'mean', np.std]}), "\n")
##                  result
##                     min  mean       std
## event_id comment
## 1        bad       10.0  10.0       NaN
##          good      12.0  12.0       NaN
## 2        bad        NaN   NaN       NaN
##          good      11.0  13.0  2.828427
Besides, when using many metrics we get a MultiIndex on columns as well, which may be problematic, so it is a good idea to flatten it:
res = grouped.agg([min, max, 'mean', np.std]).reset_index()
res.columns = ["_".join(x) for x in res.columns.values]
print(res)
## event_id_ comment_ result_min result_max result_mean result_std
## 0 1 bad 10.0 10.0 10.0 NaN
## 1 1 good 12.0 12.0 12.0 NaN
## 2 2 bad NaN NaN NaN NaN
## 3 2 good 11.0 15.0 13.0 2.828427
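A side note: on a reasonably recent pandas (0.25 or newer), named aggregation gives a flat result without any manual flattening; a minimal sketch:

res2 = df.groupby(["event_id", "comment"], as_index=False).agg(
    result_min=("result", "min"),
    result_max=("result", "max"),
    result_mean=("result", "mean"),
    result_std=("result", "std"),
)
print(res2)  # columns: event_id, comment, result_min, result_max, result_mean, result_std - no MultiIndex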
Last but not least, .agg can take functions in various forms as arguments:
def mad(x):
    """mean absolute deviation of vector x (from its median)"""
    return np.sum(np.abs(x - np.median(x))) / len(x)
functions = [min, 'max', 'std', np.median, lambda x: np.std(x) / np.mean(x), mad]
print(grouped.agg({'result': functions}))
##                  result
##                     min   max       std  median  <lambda_0>  mad
## event_id comment
## 1        bad       10.0  10.0       NaN    10.0    0.000000  0.0
##          good      12.0  12.0       NaN    12.0    0.000000  0.0
## 2        bad        NaN   NaN       NaN     NaN         NaN  0.0
##          good      11.0  15.0  2.828427    13.0    0.153846  2.0
To sum up, grouping and aggregating works a little like list comprehensions (do something for each item) or map-reduce: groupby divides the dataset into items, and agg runs a function on each of the items.
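To make the analogy concrete, here is a small sketch (using the same df) in which the "map" step is first written out by hand and then expressed with agg:

# the "map" step done manually, mirroring the iteration shown earlier
for event_id, df_sub in df.groupby("event_id"):
    print(event_id, df_sub["result"].mean())  # 1 -> 11.0, 2 -> 13.0 (mean skips the NaN)
# the same thing expressed with agg
print(df.groupby("event_id")["result"].agg("mean"))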
Finally, a handful of other snippets that I keep forgetting about (a few of them are shown in the sketch after this list):
select_dtypes: df.select_dtypes(include="object") or df.select_dtypes(exclude="object") - chooses only those columns which fulfill the criterion
rename: df.rename(columns={"current_name": "new_name", ...}) - instead of saying rename(..., axis=1), you can say df.rename(columns={...}), which is a little bit more readable
categorical data: df['some_class'].astype('category') - keeping categorical data as the category type is more memory-efficient, as values are stored as integers and their names are stored in something similar to a dict
replace (not str.replace): series.replace('other', 'unknown') - essentially a shorter equivalent of series.loc[series == "other"] = "unknown"; you can also use regex like this: pd.Series(["ab", "ac", "bc"]).replace("a.*", "zz", regex=True), which replaces ab and ac with zz
str.get(): series.str.get(-1) - gets the last element of an iterable object stored in the series: if it is a string, the last letter; if a list/tuple, the last element of that list/tuple
str.split: series.str.split('_', n=2, expand=True) - splits the strings by underscore, but only up to the second (n=2) occurrence of the underscore, and expands the result into several (in this case: 3) columns. An interesting variation of expand is explode: series.explode(), which converts a series of lists into a single series by concatenating the lists, so the result is a series, as opposed to str.split(..., expand=True), where the result is a dataframe
str.replace: series.str.replace("a.*", "zz", regex=True) - replaces everything that starts with a with zz; regex=True is mandatory in order to use regex
str.match: series.str.match("a.*") - marks all the values that start with a as True, the rest as False. In contrast to str.replace, str.match uses regex by default
nlargest: df.nlargest(5, columns="class1") - returns the 5 rows with the largest values of class1
read_csv: pd.read_csv has millions of parameters, but one of them is particularly interesting: pd.read_csv(..., dtype={"col1": int}), which automatically reads the data in the proper format. This can be useful if you don't have an ORM to store information on data types, yet you want the columns to have proper dtypes, even if the whole column is empty (which may happen if you download data from Spark)
axis: it is possible to write axis=0 or axis=1, but axis="rows" and axis="columns" are much more readable
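And a few of the snippets above in action - a minimal sketch on made-up toy data:

s = pd.Series(["ab_cd_ef_gh", "xy_z", "q"])
print(s.str.split("_", n=2, expand=True))  # 3 columns; whatever is left after the second underscore stays in the last one
print(s.str.get(-1))  # last character of each string: "h", "z", "q"
print(pd.Series(["ab", "ac", "bc"]).replace("a.*", "zz", regex=True))  # "zz", "zz", "bc"
df2 = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})
print(df2.select_dtypes(include="object"))  # only the "name" column
df2["name"] = df2["name"].astype("category")  # stored as integer codes plus a small table of category names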
I highly recommend the book Pandas in Action to all readers.