DOKK Library

Pandas Worksheets

Authors Enthought Inc.

License CC-BY-NC-ND-4.0

1                              Reading and Writing
                               Data with Pandas

Functions to read data are all named pd.read_* where *
is the file type. Series and DataFrames can be saved to disk
using their to_* method.

Usage Patterns
Use pd.read_clipboard() for one-off data extractions.
Use the other pd.read_* methods in scripts for repeatable analyses.

Reading Text Files into a DataFrame
Colors highlight how different arguments map from the data file to a

Other arguments:                                                                               Possible values of parse_dates:
• names: Set or override column names                                                          [0, 2]: Parse columns 0 and 2 as separate dates
• parse_dates: Accepts multiple argument types                                                 [[0, 2]]: Group columns 0 and 2 and parse as single date
• converters: Manually process each element in a column                                        {'Date': [0, 2]}: Group columns 0 and 2, parse as single date in a
• comment: Character indicating commented line                                                 column named Date
• chunksize: Read only a certain number of rows each time                                      Dates are parsed after the converters have been applied.

Parsing Tables from the Web

Writing Data Structures to Disk                                                                Writing Data Structures from and to a Database
Write data structures to disk:                                                                 Read, using SQLAlchemy. Supports multiple databases:
> s_df.to_csv(filename)                                                                        > from sqlalchemy import create_engine
> s_df.to_excel(filename)                                                                      > engine = create_engine(database_url)
Write multiple DataFrames to single Excel file:                                                > conn = engine.connect()
> writer = pd.ExcelWriter(filename)                                                            > df = pd.read_sql(query_str_or_table_name, conn)
> df1.to_excel(writer, sheet_name='First')                                                     Write:
> df2.to_excel(writer, sheet_name='Second')                                                    > df.to_sql(table_name, conn)

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                               
DataFrames (df).
                               Pandas Data Structures:
                               Series and DataFrames
Within Pandas, there are two primary data structures: Series (s) and

            A Series, which maps an index to values. It can be thought of as
            an ordered dictionary or a Numpy array with row labels and a
                                                                                                s_df.head() and s_df.tail()
                                                                                                                                   Number of rows
                                                                                                                                   First/last rows
                                                                                                                                   Series of unique values
                                                                                                                                   Summary stats
                                                                                                                                   Memory usage
df          A DataFrame, which maps index and column labels to values. It                      Indexing and Slicing
            is like a dictionary of Series (columns) sharing the same index,                   Use these attributes on Series and DataFrames for indexing, slicing,
            or like a 2D Numpy array with row and column labels.                               and assignments:
s_df        Applies to both Series and DataFrames.                                                       s_df.loc[ ]               Refers only to the index labels
Manipulations of Pandas objects usually return copies.                                                  s_df.iloc[ ]               Refers only to the integer location,
                                                                                               		                                  similar to lists or Numpy arrays
Creating Series and DataFrames                                                                 s_df.xs(key, level=L)               Select rows with label key in level L
                                                                                               		                                  of an object with MultiIndex.
 n1 'Cary'        0                                                                            Masking and Boolean Indexing
                      > pd.Series(values, index=index, name=name)
 n2 'Lynn'        1                                                                            Create masks with comparisons:
                      > pd.Series({'idx1' : val1,'idx2' : val2}
 n3      'Sam     2                                                                            mask = df['X'] < 0
                      Where values, index, and name are sequences or arrays.
                                                                                               Or isin, for membership mask:
DataFrame		                                                                                    mask = df['X'].isin(list_of_valid_values)
                                                                   Age   Gender Coulmns
> pd.DataFrame(values,                                                                         Use masks for indexing:
                                                          'Cary'   32       M
    index=index, columns=col_names)                                                            df.loc[mask] = 0
                                                         'Lynn'    18       F
> pd.DataFrame({'col1' :
    series1_or _seq,
                                                          'Sam     26       M                  Combine multiple masks with bitwise operators — and (&), or (|), or (^),

'col2': series2_or _seq})
                                                         Index                                 not (~) — and group them with parentheses:

Where values is a sequence of sequences or a 2D array.                                         mask = (df['X'] < 0) & (df['Y'] == 0)

                                                                                               Common Indexing and Slicing Patterns
Manipulating Series and DataFrames
                                                                                               rows and cols can be values, lists, Series, or masks.
Manipulating Columns
                                                                                                               s_df.loc[rows]      Some rows (all columns in a DataFrame)
df.rename(columns={old_name:new_name})                        Renames column
                                                                                                         df.loc[:, cols_list]      All rows, some columns
df.drop(name_or_names, axis='columns')                        Drops column name
                                                                                                           df.loc[rows, cols]      Subset of rows and columns
Manipulating Index                                                                                             s_df.loc[mask]      Boolean mask of rows (all columns)
s_df.reindex(new_index)                             Conform to new index                                   df.loc[mask, cols]      Boolean mask of rows, some columns
s_df.drop(labels_to_drop)                           Drops index labels
s_df.rename                                                                                    Using [ ] on Series and DataFrames
   (index={old_label: new_label})                   Renames index labels
s_df.sort_index()                                   Sorts index labels                         On Series, [] refers to the index labels, or to a slice:
df.set_index(column_name_or_names)                                                                                       s['a']    Value
s_df.reset_index()                                  Inserts index into columns, resets                                    s[:2]    Series, first two rows
                                                    index to default integer index             On DataFrames, [ ] refers to columns labels:
                                                                                                                      df['X']      Series
Manipulating Values
                                                                                                               df[['X', 'Y']]      DataFrame
All row values and the index will follow:
                                                                                               df['new_or_old_col'] = series_or_array
df.sort_values(col_name, ascending=True)
df.sort_values(['X','Y'], ascending=[False, True])                                             Except with a slice or mask, as shown below:
                                                                                                                       df[:2]      DataFrame, first two rows
Important Attributes and Methods                                                                                     df[mask]      DataFrame, rows where mask is True

                        s_df.index        Array-like row labels
                                                                                               Never chain brackets
                        df.columns        Array-like column labels
                                                                                               NO     > df[mask]['X'] = 1
                       s_df.values        Numpy array, data
                        s_df.shape        (n_rows, n_cols)
                                                                                               YES    > df.loc[mask, 'X'] = 1
                s.dtype, df.dtypes        Type of Series or of each column

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                                     
3                                      Computation with
                                       Series and DataFrames
Pandas objects do not behave exactly like Numpy arrays. They follow three main rules of binary operations.

Rule 1: Operations between multiple Pandas objects
implement auto-alignment based on index first.
                                                                                                      Apply a Function to Each Series
                                                                                                      Apply series_to_* function to every column by default
                  > s1 + s2		                                 > s1.add(s2, fill_value=0)              (across rows):
     s1               s2                                      s1                 s2                   df.apply(series_to_value)                 →         Series
 a        1            NaN             a   NaN            a        1                  0       a   1   df.apply(series_to_series)                →         DataFrame

 b        2   +   b        4   =       b   4              b        2   +     b        4   =   b   6
                                                                                                      To apply the function to every row (across columns), set axis=1:
      NaN         c        5           c   NaN                     0         c        5       c   5
                                                                                                      df.apply(series_to_series, axis=1)
Use add, sub, mul, and div, to set fill value.
                                                                                                      Apply a Function to a DataFrame
Rule 2: Mathematical operators (+ - * / exp, log, ...)                                                Apply a function that receives a DataFrame and returns a Series,
apply element by element on the values.
                                                                                                      a DataFrame, or a single value:
                  df + 1                       df.abs()                    np.log(df)
                                                                                                      df.pipe(df_to_series)                     →         Series
                                                                                                      df.pipe(df_to_df)                         →         DataFrame
              X    Y                   X    Y                  X       Y                  X   Y
                                                                                                      df.pipe(df_to_value)                      →         Value
          a   -2 -2             a      -1 -1            a      1       1          a       0   0
          b   -2 -2             b      -1 -1            b      1       1          b       0   0       What Happens with Missing Values?
          c   -2 -2                c   -1 -1            c      1       1          c       0   0       Missing values are represented by NaN (not a number) or NaT
                                                                                                      (not a time).
                                                                                                      • They propagate in operations across Pandas objects
Rule 3: Reduction operations (mean, std, skew, kurt,
sum, prod, ...) are applied column by column by default.                                                (1 + NaN → NaN).
                           > df.sum()                                  Series                         • They are ignored in a "sensible" way in computations;
                                                                                                        They equal 0 in sum, they're ignored in mean, etc.
                               X       Y         > df.sum()            X
                           a                        →                  Y
                                                                                                      • They stay NaN with mathematical operations such
                                                                                                        as np.log(NaN) → NaN.
                                                                                                                       count:   Number of non-null observations
Operates across rows by default (axis=0, or axis='rows').
                                                                                                                         sum:   Sum of values
Operate across columns with axis=1 or axis='columns'.
                                                                                                                        mean:   Mean of values
                                                                                                                         mad:   Mean absolute deviation
Differences Between
                                                                                                                      median:   Arithmetic median of values
Pandas Objects and Numpy Arrays
                                                                                                                         min:   Minimum
When it comes to Pandas objects and Numpy arrays, aligning objects
                                                                                                                         max:   Maximum
on the index (or columns) before calculations might be the most
                                                                                                                        mode:   Mode
important difference. There are built-in methods for most common
                                                                                                                        prod:   Product of values
statistical operations, such as mean or sum, and they apply across
                                                                                                                         std:   Bessel-corrected sample standard deviation
one-dimension at a time. To apply custom functions, use one of
                                                                                                                         var:   Unbiased variance
three methods to do tablewise (pipe), row or column-wise (apply),
                                                                                                                         sem:   Standard error of the mean
or elementwise (applymap) operations.
                                                                                                                        skew:   Sample skewness (3rd moment)
Apply a Function to Each Value                                                                                          kurt:   Sample kurtosis (4th moment)
Apply a function to each value in a Series or DataFrame:                                                        quartile:       Sample quantile (Value at %)
s.apply(value_to_value)                             →       Series                                        value_counts:         Count of unique values
df.applymap(value_to_value)                         →       DataFrame

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                                         
4                              Plotting with Pandas
                               Series and DataFrames
Pandas uses Matplotlib to generate figures. Once a figure is generated with Pandas, all of Matplotlib's functions can be used
to modify the title, labels, legend, etc. In a Jupyter notebook, all plotting calls for a given plot should be in the same cell.

Parts of a Figure                                                                              Setup
An Axes object is what we                                                                      Import packages:
think of as a “plot”. It has a                                                                 > import pandas as pd
title and two Axis objects                                                                     > import matplotlib.pyplot as plt
that define data limits. Each                                                                  Execute this at IPython prompt to display figures in new windows:
Axis can have a label. There                                                                   > %matplotlib
can be multiple Axes objects                                                                   Use this in Jupyter notebooks to display static images inline:
in a Figure.                                                                                   > %matplotlib inline
                                                                                               Use this in Jupyter notebooks to display zoomable images inline:
                                                                                               > %matplotlib notebook

Plotting with Pandas Objects
                        Series                                                       DataFrame                                                Labels

With a Series, Pandas plots values against                     With a DataFrame, Pandas creates one line                  Use Matplotlib to override or add annotations:
the index:                                                     per column:                                                > ax.set_xlabel('Time')
> ax = s.plot()                                                > ax = df.plot()                                           > ax.set_ylabel('Value')
                                                                                                                          > ax.set_title('Experiment A')
Note: When plotting the results of complex manipulations with groupby, it's often useful to                               Pass labels if you want to override the column
stack/unstack the resulting DataFrame to fit the one-line-per-column assumption.                                          names and set the legend location:
                                                                                                                          > ax.legend(labels, loc='best')

Useful Arguments to Plot
                                                                                               • subplots=True: One subplot per column, instead of one line
                                                                                               • figsize: Set figure size, in inches
                                                                                               • x and y: Plot one column against another

Kinds of Plots

df.plot.scatter(x, y)                                    df.plot.hist()         

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                                  
5                              Manipulating
                               Dates and Times
Use a Datetime index for easy time-based indexing and slicing, as well as for powerful resampling and data alignment.
Pandas makes a distinction between timestamps, called Datetime objects, and time spans, called Period objects.

Converting Objects to Time Objects                                                             Creating Ranges of Timestamps
Convert different types like strings, lists, or arrays to Datetime with:                       > pd.date_range(start=None, end=None,
> pd.to_datetime(value)                                                                        		              periods=None, freq=offset,
Convert timestamps to time spans and set the period “duration” with                            		 tz='Europe/London')
frequency offset.                                                                              Specify either a start or end date, or both. Set number of "steps" with
> date_obj.to_period(freq=freq_offset)                                                         periods. Set "step size" with freq. Specify time zones with tz.

Frequency Offsets                                                                              Save Yourself Some Pain: Use ISO 8601 Format
Used by date_range, period_range and resample:                                                 To be consistent and minimize the risk of error or confusion, use ISO

• B: Business day                      • A: Year end                                           format YYYY-MM-DD when entering dates:

• D: Calendar day                      • AS: Year start                                        NO       > pd.to_datetime('12/01/2000') # 1st December

• W: Weekly                            • H: Hourly                                                      Timestamp('2000-12-01 00:00:00')

• M: Month end                         • S: Secondly                                           NO       > pd.to_datetime('13/01/2000') # 13th January!

• MS: Month start                      • L, ms: Milliseconds                                            Timestamp('2000-01-13 00:00:00')

• BM: Business month end               • U, us: Microseconds                                   YES      > pd.to_datetime('2000-01-13') # 13th January

• Q: Quarter end                       • N: Nanoseconds                                                 Timestamp('2000-01-13 00:00:00')

                                                                                               Creating Ranges of Periods
For more, look up "Pandas Offset Aliases" or check out the pandas.
                                                                                               > pd.period_range(start=None, end=None,
tseries.offsets and modules.
                                                                                               		                       periods=None, freq=offset)
Timestamps vs Periods
                                                                                               > s_df.resample(freq_offset).mean()

                                                                                               resample returns a groupby-like object that must be aggregated
                                                                                               with mean, sum, std, apply, etc. (See also the Split-Apply-Combine
                                                                                               cheat sheet.)


Pandas implements vectorized string operations named                                                            Splitting and Replacing
after Python's string methods. Access them through the                                                          Split returns a Series of lists:
str attribute of string Series.                                                                                 > s.str.split()
Some String Methods                                                                                             Access an element of each list with get:
                                                                                                                > s.str.split(char).str.get(1)
> s.str.lower()		                                > s.str.strip()
> s.str.isupper()		                              > s.str.normalize()                                            Return a DataFrame instead of a list:
                                                                                                                > s.str.split(expand=True)
> s.str.len()
                                                                                                                Find and replace with string or regular expressions:
Index by character position:
                                                                                                                > s.str.replace(str_or_regex, new)
> s.str[0]
                                                                                                                > s.str.extract(regex)
True if a regular expression pattern or string is in a Series:
                                                                                                                > s.str.findall(regex)
> s.str.contains(str_or_pattern)

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                                  
6                              Combining
There are numerous tools for combining Series and
DataFrames together, with SQL-type joins and concatena-
tion. Use join if merging on indices, otherwise use merge.
                                                                                               Join on Index
                                                                                               > df.join(other)
                                                                                               Merge DataFrames on indexes. Set on=columns to join on index of
                                                                                               other and on columns of df. join uses pd.merge under the covers.
Merge on Column Values
> pd.merge(left, right, how='inner', on='id')                                                  Concatenating DataFrames
Ignores index, unless on=None. See the section on the how keyword.                             > pd.concat(df_list)

Use on if merging on same column in both DataFrames, otherwise use                             “Stacks” DataFrames on top of each other.

left_on, right_on.                                                                             Set ignore_index=True to replace index with RangeIndex.
                                                                                               Note: Faster than repeated df.append(other_df).


                                                                    left                                left_on='X'     right_on='Y'                      right
                                                               LONG           X                     LONG         X      Y       SHORT                     Y       SHORT
                                                         0      aaaa          a                0     aaaa        a      —          —               0      b        bb
      left       right          how="outer"
                                                         1      bbbb          b                1     bbbb        b      b          bb              1      c        cc
                                                                                               2       —         —       c         cc

                                                               LONG           X                     LONG         X      Y       SHORT                     Y       SHORT
      left       right          how="inner"              0      aaaa          a                0     bbbb        b      b          bb              0      b        bb
                                                         1     bbbb           b                                                                    1      c        cc

                                                               LONG           X                     LONG         X      Y       SHORT                     Y       SHORT
      left       right          how="left"               0      aaaa          a                0     aaaa        a      —          —               0      b        bb
                                                         1      bbbb          b                1     bbbb        b      b          bb              1      c        cc

                                                               LONG           X                     LONG         X      Y       SHORT                     Y       SHORT
      left       right          how="right"              0      aaaa          a                0     bbbb        b      b          bb              0      b        bb
                                                         1      bbbb          b                1       —         —       c         cc              1      c        ctc


Pandas represents missing values                               To find missing values, use:
as NaN (Not a Number), which comes                             > s_df.isnull() or > pd.isnull(obj)
from Numpy and is of type float64.                             > s_df.notnull() or > pd.notnull(obj)
To find and replace these missing                              To replace missing values, use:
values, you can use any number                                 s_df.loc[s_df.isnull()] = 0                              Use mask to replace NaN
of methods.                                                    s_df.interpolate(method='linear')                        Interpolate using different methods
                                                               s_df.fillna(method='ffill')                              Fill forward (last valid value)
                                                               s_df.fillna(method='bfill')                              Or backward (next valid value)
                                                               s_df.dropna(how='any')                                   Drop rows if any value is NaN
                                                               s_df.dropna(how='all')                                   Drop rows if all values are NaN
                                                               s_df.dropna(how='all', axis=1)                           Drop across columns instead of rows

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                                  
1. Split the data based on
   some criteria.
                               Split / Apply / Combine
                               with DataFrames

2. Apply a function to each group
   to aggregate, transform, or filter.
                                                                                                                      Apply/Combine: Aggregation
                                                                                                                      Perform computations on each group. The
                                                                                                                      shape changes; the categories in the grouping
                                                                                                                      columns become the index. Can use built-
3. Combine the results.                                                                                               in aggregation methods: mean, sum, size,
                                                                                                                      count, std, var, sem, describe, first,
The apply and combine steps are                                                                                       last, nth, min, max, for example:
typically done together in Pandas.                                                                                    > g.mean()
                                                               Apply/Combine: Filtering                               ... or aggregate using custom function:
Split: Group By                                                                                                       > g.agg(series_to_value)
                                                               Returns a group only if condition is true.
Group by a single column:                                      > g.filter(lambda x: len(x)>1)                         ... or aggregate with multiple functions at once:
> g = df.groupby(col_name)                                                                                            > g.agg([s_to_v1, s_to_v2])
Grouping with list of column names creates a                                                                          ... or use different functions on different
DataFrame with a MultiIndex:                                                                                          columns:
> g = df.groupby(list_col_names)                                                                                      > g.agg({'Y': s_to_v1, 'Z': s_to_
Pass a function to group based on the index:                                                                          v2})
> g = df.groupby(function)
                                                                                                                            0 a
                                                                                                                            2 a

                                                                                                                            1 b                            a
                                                               Split/Apply/Combine                                          3 b                            b
                                                                                                                            4 ca

                                                                                                                      Other Groupby-Like Operations:
                                                                                                                      Window Functions
Apply/Combine:                                                                                                        • resample, rolling, and ewm (exponential
General Tool: apply
                                                                                                                        weighted function) methods behave like
apply is more general than agg, transform,
                                                                                                                        GroupBy objects. They keep track of which
and filter. It can aggregate, transform or                       Split                   Apply            Combine
                                                                                                                        row is in which “group.” Results must be
filter. The resulting dimensions can change,                     • Groupby               • Apply
                                                                 • Window                • Group-specific               aggregated with sum, mean, count, etc.
for example:
                                                                   Functions               transformations            • resample is often used before rolling,
> g.apply(lambda x: x.describe())                                                        • Aggregation
                                                                                                                        expanding, and ewm when using a Date-
                                                                                         • Group-specific Filtering
                                                                                                                        Time index.
Apply/Combine: Transformation
The shape and the index do not change.                         Split: What’s a GroupBy Object?
> g.transform(df_to_df)                                        It keeps track of which rows are part of
Example, normalization:                                        which group.
> def normalize(grp):                                          > g.groups → Dictionary, where keys are
.      return (                                                group names, and values are indices of rows in
.           (grp - grp.mean())                                 a given group.
.           / grp.var()                                        It is iterable:
.      )                                                       > for group, sub_df in g:
                                                               .         ...
> def normalize(grp):
.     return ((grp - grp.mean())
.         / grp.var())

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit                               
8                              Reshaping DataFrames
                               and Pivot Tables
Let’s explore some tools for reshaping DataFrames from the wide to the long format and back. The long format can be tidy,
which means that each variable is a column, each observation is a row. It is easier to filter, aggregate, transform, sort, and
pivot. Reshaping operations often produces multi-level indices or columns, which can be sliced and indexed.

MultiIndex: A Multi-Level Hierarchical Index                                                   Long to Wide Format and Back with stack()
                                                                                               and unstack()
Often created as a result of:
> df.groupby(list_of_columns)                                                                  Pivot column level to index, i.e. “stacking the columns” (wide to long):
> df.set_index(list_of_columns)                                                                > df.stack()
Contiguous labels are displayed together but apply to each row. The                            Pivot index level to columns, “unstack the columns” (long to wide):
concept is similar to multi-level columns.                                                     > df.unstack()
A MultiIndex allows indexing and slicing one or multiple levels at once.                       If there are multiple indices or column levels, use level number or
Using the Long example from the right:                                                         name to stack/unstack:
                              long.loc[1900]             All 1900 rows                         > df.unstack(1) or > df.unstack('Month')
            long.loc[(1900, 'March')]                    Value 2                               A common use case for unstacking, plotting group data vs index
  long.xs('March', level='Month')                        All March rows                        after groupby:
                                                                                               > (df.groupby(['A', 'B])['relevant'].mean()
Simpler than using boolean indexing, for example:                                                     .unstack().plot())
> long[long.Month == 'March']

Pivot Tables
    > pd.pivot_table(df,
    .     index=cols,                    keys to group by for index
    .     columns=cols2,                 keys to group by for columns
    .     values=cols3,                  columns to aggregate

    .     aggfunc='mean')                what to do with repeated values                       From Wide to Long with melt
                                                                                               Specify which columns are identifiers (id_vars, values will be repeated
Omitting index, columns, or values will use all remaining
                                                                                               for each row) and which are “measured variables” (value_vars, will
columns of df. You can “pivot” a table manually using groupby, stack,
                                                                                               become values in variable column. All remaining columns by default).
and unstack.
                                                                                               > pd.melt(df, id_vars=id_cols, value_vars=value_columns)
                                                                                               > pd.melt(team, id_vars=['Color'],
                                                                                               .          value_vars=['A', 'B', 'C'],
                                                                                               .          var_name='Team',
                                                                                               .          value_name='Score')

df.pivot() vs pd.pivot_table
df.pivot()                     Does not deal with repeated values in index.
                               It's a declarative form of stack and unstack.
pd.pivot_table()               Use if you have repeated values in index
                               (specify aggfunc argument).

Take your Pandas skills to the next level! Register at
©2020 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit