
Training with Pandas

Authors Enthought Inc.

License CC-BY-NC-ND-4.0

1                             Reading and Writing
                              Data with Pandas

Functions to read data are all named pd.read_* where
* is the file type. Series and DataFrames can be saved to
disk using their to_* methods.



Usage Patterns
Use pd.read_clipboard() for one-off data extractions.
Use the other pd.read_* methods in scripts for repeatable
analyses.



Reading Text Files into a DataFrame
[Figure: colors highlight how different arguments map from the
data file to a DataFrame.]




Other arguments:
• names: Set or override column names
• parse_dates: Accepts multiple argument types
• converters: Manually process each element in a column
• comment: Character indicating commented line
• chunksize: Read only a certain number of rows each time

Possible values of parse_dates:
• [0, 2]: Parse columns 0 and 2 as separate dates
• [[0, 2]]: Group columns 0 and 2 and parse as single date
• {'Date': [0, 2]}: Group columns 0 and 2, parse as single date
  in a column named Date
Dates are parsed after the converters have been applied.
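For example, a minimal sketch combining several of these arguments (the file name and column names are illustrative):
>>> df = pd.read_csv('data.csv',                       # illustrative file name
...                  header=0,                         # first row is a header...
...                  names=['Date', 'Time', 'Value'],  # ...but override its column names
...                  parse_dates=[['Date', 'Time']],   # group two columns, parse as one date
...                  comment='#')                      # skip lines starting with '#'
Pass chunksize=n to iterate over the file n rows at a time instead of reading it all at once.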


Parsing Tables from the Web
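As a minimal sketch, pd.read_html fetches a page and returns one DataFrame per HTML table it finds (the URL below is illustrative):
>>> tables = pd.read_html('https://example.com/page_with_tables.html')
>>> df = tables[0]        # pick the table of interest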




Writing Data Structures to Disk
Write data structures to disk:
>>> s_df.to_csv(filename)
>>> s_df.to_excel(filename)
Write multiple DataFrames to a single Excel file:
>>> writer = pd.ExcelWriter(filename)
>>> df1.to_excel(writer, sheet_name='First')
>>> df2.to_excel(writer, sheet_name='Second')
>>> writer.save()

Writing Data Structures from and to a Database
Read, using SQLAlchemy. Supports multiple databases:
>>> from sqlalchemy import create_engine
>>> engine = create_engine(database_url)
>>> conn = engine.connect()
>>> df = pd.read_sql(query_str_or_table_name, conn)
Write:
>>> df.to_sql(table_name, conn)
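A minimal runnable sketch of the database round trip, using an in-memory SQLite database and an illustrative table name:
>>> import pandas as pd
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite://')          # in-memory SQLite, for illustration
>>> conn = engine.connect()
>>> df = pd.DataFrame({'name': ['Cary', 'Lynn'], 'age': [32, 18]})
>>> df.to_sql('people', conn, index=False)       # write the DataFrame to a new table
>>> pd.read_sql('SELECT * FROM people', conn)    # read it back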



2                             Pandas Data Structures:
                              Series and DataFrames
Within Pandas, there are two primary data structures: Series (s)
and DataFrames (df).

s          A Series, which maps an index to values. It can be
           thought of as an ordered dictionary or a Numpy array
           with row labels and a name.
df         A DataFrame, which maps index and column labels to
           values. It is like a dictionary of Series (columns) sharing
           the same index, or like a 2D Numpy array with row and
           column labels.
s_df       Applies to both Series and DataFrames.

Manipulations of Pandas objects usually return copies.

Creating Series and DataFrames
Series:
>>> pd.Series(values, index=index, name=name)
>>> pd.Series({'idx1': val1, 'idx2': val2})
Where values, index, and name are sequences or arrays.
DataFrame:
>>> pd.DataFrame(values,
        index=index, columns=col_names)
>>> pd.DataFrame({'col1': series1_or_seq,
        'col2': series2_or_seq})
Where values is a sequence of sequences or a 2D array.

Manipulating Series and DataFrames
Manipulating Columns
df.rename(columns={old_name: new_name})      Renames column
df.drop(name_or_names, axis='columns')       Drops column name
Manipulating Index
s_df.reindex(new_index)                      Conform to new index
s_df.drop(labels_to_drop)                    Drops index labels
s_df.rename(index={old_label: new_label})    Renames index labels
s_df.sort_index()                            Sorts index labels
df.set_index(column_name_or_names)           Sets the index from column(s)
s_df.reset_index()                           Inserts index into columns, resets
                                             index to default integer index
Manipulating Values
All row values and the index will follow:
df.sort_values(col_name, ascending=True)
df.sort_values(['X', 'Y'], ascending=[False, True])

Important Attributes and Methods
s_df.index                    Array-like row labels
df.columns                    Array-like column labels
s_df.values                   Numpy array, data
s_df.shape                    (n_rows, n_cols)
s.dtype, df.dtypes            Type of Series, or of each column
len(s_df)                     Number of rows
s_df.head() and s_df.tail()   First/last rows
s.unique()                    Series of unique values
s_df.describe()               Summary stats
df.info()                     Memory usage

Indexing and Slicing
Use these attributes on Series and DataFrames for indexing,
slicing, and assignments:
s_df.loc[ ]                Refers only to the index labels
s_df.iloc[ ]               Refers only to the integer location,
                           similar to lists or Numpy arrays
s_df.xs(key, level=L)      Select rows with label key in level L
                           of an object with a MultiIndex

Masking and Boolean Indexing
Create masks with comparisons:
mask = df['X'] < 0
Or use isin for a membership mask:
mask = df['X'].isin(list_of_valid_values)
Use masks for indexing:
df.loc[mask] = 0
Combine multiple masks with bitwise operators: and (&), or (|),
xor (^), not (~). Group them with parentheses:
mask = (df['X'] < 0) & (df['Y'] == 0)

Common Indexing and Slicing Patterns
rows and cols can be values, lists, Series, or masks.
s_df.loc[rows]             Some rows (all columns in a DataFrame)
df.loc[:, cols_list]       All rows, some columns
df.loc[rows, cols]         Subset of rows and columns
s_df.loc[mask]             Boolean mask of rows (all columns)
df.loc[mask, cols]         Boolean mask of rows, some columns

Using [ ] on Series and DataFrames
On a Series, [ ] refers to the index labels, or to a slice:
s['a']            Value
s[:2]             Series, first two rows
On a DataFrame, [ ] refers to column labels:
df['X']           Series
df[['X', 'Y']]    DataFrame
df['new_or_old_col'] = series_or_array
Except with a slice or mask, as shown below:
df[:2]            DataFrame, first two rows
df[mask]          DataFrame, rows where mask is True

Never chain brackets
NO   >>> df[mask]['X'] = 1
     SettingWithCopyWarning
YES  >>> df.loc[mask, 'X'] = 1
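A short runnable sketch tying these pieces together, rebuilding the example Series and DataFrame from the figures above (Cary, Lynn, and Sam with their Age and Gender):
>>> import pandas as pd
>>> s = pd.Series(['Cary', 'Lynn', 'Sam'], index=['n1', 'n2', 'n3'], name='name')
>>> df = pd.DataFrame({'Age': [32, 18, 26], 'Gender': ['M', 'F', 'M']},
...                   index=['Cary', 'Lynn', 'Sam'])
>>> df.loc['Lynn', 'Age']                  # label-based indexing -> 18
>>> df.iloc[0]                             # integer-location indexing -> first row
>>> mask = df['Age'] < 30
>>> df.loc[mask, 'Gender']                 # Gender of the rows where Age < 30
>>> df.sort_values('Age', ascending=False) # reorder rows by Age
>>> df.rename(columns={'Gender': 'Sex'})   # returns a renamed copy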



3                                    Computation with
                                     Series and DataFrames
Pandas objects do not behave exactly like Numpy arrays. They follow three main rules of binary operations.


Rule 1: Operations between multiple Pandas objects
implement auto-alignment based on index first.
With s1 = {a: 1, b: 2} and s2 = {b: 4, c: 5}:
>>> s1 + s2                     →  {a: NaN, b: 6, c: NaN}
>>> s1.add(s2, fill_value=0)    →  {a: 1, b: 6, c: 5}
Use add, sub, mul, and div to set a fill value.

Rule 2: Mathematical operators (+ - * / exp, log, ...)
apply element by element on the values.
df + 1        df.abs()        np.log(df)
Each is applied to every value of df.

Rule 3: Reduction operations (mean, std, skew, kurt,
sum, prod, ...) are applied column by column by default.
>>> df.sum()   →   Series indexed by the column labels (X, Y)
Operates across rows by default (axis=0, or axis='rows').
Operate across columns with axis=1 or axis='columns'.

Differences Between Pandas Objects and Numpy Arrays
When it comes to Pandas objects and Numpy arrays, aligning
objects on the index (or columns) before calculations might be
the most important difference. There are built-in methods for
most common statistical operations, such as mean or sum, and
they apply across one dimension at a time. To apply custom
functions, use one of three methods to do tablewise (pipe), row-
or column-wise (apply), or elementwise (applymap) operations.

Apply a Function to Each Value
Apply a function to each value in a Series or DataFrame:
s.apply(value_to_value)          →   Series
df.applymap(value_to_value)      →   DataFrame

Apply a Function to Each Series
Apply a series_to_* function to every column by default
(across rows):
df.apply(series_to_value)        →   Series
df.apply(series_to_series)       →   DataFrame
To apply the function to every row (across columns), set axis=1:
df.apply(series_to_series, axis=1)

Apply a Function to a DataFrame
Apply a function that receives a DataFrame and returns a Series,
a DataFrame, or a single value:
df.pipe(df_to_series)            →   Series
df.pipe(df_to_df)                →   DataFrame
df.pipe(df_to_value)             →   Value

What Happens with Missing Values?
Missing values are represented by NaN (not a number) or NaT
(not a time).
• They propagate in operations across Pandas objects
  (1 + NaN → NaN).
• They are ignored in a "sensible" way in computations:
  they equal 0 in sum, they're ignored in mean, etc.
• They stay NaN with mathematical operations such
  as np.log(NaN) → NaN.

Built-in reduction methods:
count:          Number of non-null observations
sum:            Sum of values
mean:           Mean of values
mad:            Mean absolute deviation
median:         Arithmetic median of values
min:            Minimum
max:            Maximum
mode:           Mode
prod:           Product of values
std:            Bessel-corrected sample standard deviation
var:            Unbiased variance
sem:            Standard error of the mean
skew:           Sample skewness (3rd moment)
kurt:           Sample kurtosis (4th moment)
quantile:       Sample quantile (value at %)
value_counts:   Count of unique values
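A short runnable sketch of index alignment and the three custom-function hooks (the data are illustrative):
>>> import pandas as pd
>>> s1 = pd.Series({'a': 1, 'b': 2})
>>> s2 = pd.Series({'b': 4, 'c': 5})
>>> s1 + s2                                        # a: NaN, b: 6.0, c: NaN
>>> s1.add(s2, fill_value=0)                       # a: 1.0, b: 6.0, c: 5.0
>>> df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]})
>>> df.applymap(lambda v: v ** 2)                  # elementwise -> DataFrame
>>> df.apply(lambda col: col.max() - col.min())    # column-wise -> Series
>>> df.pipe(lambda d: d.sum().sum())               # tablewise -> single value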




4                             Plotting with Pandas
                              Series and DataFrames
Pandas uses Matplotlib to generate figures. Once a figure is generated with Pandas, all of Matplotlib's functions can be
used to modify the title, labels, legend, etc. In a Jupyter notebook, all plotting calls for a given plot should be in the same cell.



Parts of a Figure
An Axes object is what we think of as a “plot”. It has a
title and two Axis objects that define data limits. Each
Axis can have a label. There can be multiple Axes objects
in a Figure.

Setup
Import packages:
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
Execute this at the IPython prompt to display figures in new windows:
>>> %matplotlib
Use this in Jupyter notebooks to display static images inline:
>>> %matplotlib inline
Use this in Jupyter notebooks to display zoomable images inline:
>>> %matplotlib notebook

Plotting with Pandas Objects
Series: With a Series, Pandas plots values against the index:
>>> ax = s.plot()
DataFrame: With a DataFrame, Pandas creates one line per column:
>>> ax = df.plot()
Labels: Use Matplotlib to override or add annotations:
>>> ax.set_xlabel('Time')
>>> ax.set_ylabel('Value')
>>> ax.set_title('Experiment A')
Pass labels if you want to override the column names and set the legend location:
>>> ax.legend(labels, loc='best')
Note: When plotting the results of complex manipulations with groupby, it's often useful to
stack/unstack the resulting DataFrame to fit the one-line-per-column assumption.
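A minimal sketch pulling these calls together (the data, labels, and title are illustrative):
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> idx = pd.date_range('2022-01-01', periods=100, freq='D')
>>> df = pd.DataFrame({'A': np.random.randn(100).cumsum(),
...                    'B': np.random.randn(100).cumsum()}, index=idx)
>>> ax = df.plot(figsize=(8, 4))        # one line per column
>>> ax.set_xlabel('Date')
>>> ax.set_ylabel('Value')
>>> ax.set_title('Experiment A')
>>> plt.show()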

Useful Arguments to Plot
                                                                                         • subplots=True: One subplot per column, instead of one line
                                                                                         • figsize: Set figure size, in inches
                                                                                         • x and y: Plot one column against another




Kinds of Plots
df.plot.scatter(x, y)        df.plot.bar()        df.plot.hist()        df.plot.box()



5                             Manipulating
                              Dates and Times
Use a Datetime index for easy time-based indexing and slicing, as well as for powerful resampling and data alignment.
Pandas makes a distinction between timestamps, called Datetime objects, and time spans, called Period objects.


Converting Objects to Time Objects
Convert different types, like strings, lists, or arrays, to Datetime with:
>>> pd.to_datetime(value)
Convert timestamps to time spans and set the period “duration” with a frequency offset:
>>> date_obj.to_period(freq=freq_offset)

Creating Ranges of Timestamps
>>> pd.date_range(start=None, end=None,
...               periods=None, freq=offset,
...               tz='Europe/London')
Specify either a start or end date, or both. Set the number of "steps"
with periods. Set the "step size" with freq. Specify time zones with tz.
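A short runnable sketch of these conversions (the dates are illustrative):
>>> import pandas as pd
>>> ts = pd.to_datetime('2000-01-13')      # string -> Timestamp
>>> ts.to_period(freq='M')                 # Timestamp -> Period('2000-01', 'M')
>>> pd.date_range(start='2000-01-01', periods=3, freq='D')
# DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], freq='D')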


Frequency Offsets
Used by date_range, period_range, and resample:
• B: Business day                 • A: Year end
• D: Calendar day                 • AS: Year start
• W: Weekly                       • H: Hourly
• M: Month end                    • S: Secondly
• MS: Month start                 • L, ms: Milliseconds
• BM: Business month end          • U, us: Microseconds
• Q: Quarter end                  • N: Nanoseconds
For more, look up "Pandas Offset Aliases" or check out the
pandas.tseries.offsets and pandas.tseries.holiday modules.

Save Yourself Some Pain: Use ISO 8601 Format
To be consistent and minimize the risk of error or confusion, use
the ISO format YYYY-MM-DD when entering dates:
NO     >>> pd.to_datetime('12/01/2000')   # 1st December
       Timestamp('2000-12-01 00:00:00')
NO     >>> pd.to_datetime('13/01/2000')   # 13th January!
       Timestamp('2000-01-13 00:00:00')
YES    >>> pd.to_datetime('2000-01-13')   # 13th January
       Timestamp('2000-01-13 00:00:00')

Creating Ranges of Periods
>>> pd.period_range(start=None, end=None,
...                 periods=None, freq=offset)

Timestamps vs Periods
Timestamps mark points in time; Periods span intervals at a given frequency.

Resampling
>>> s_df.resample(freq_offset).mean()
resample returns a groupby-like object that must be aggregated
with mean, sum, std, apply, etc. (See also the Split-Apply-Combine
cheatsheet.)
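For example, a minimal sketch aggregating illustrative daily data into weekly totals:
>>> import numpy as np
>>> import pandas as pd
>>> idx = pd.date_range('2000-01-01', periods=14, freq='D')
>>> s = pd.Series(np.arange(14), index=idx)
>>> s.resample('W').sum()          # one row per week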




VECTORIZED STRING OPERATIONS

Pandas implements vectorized string operations named
after Python's string methods. Access them through the
str attribute of string Series.

Some String Methods
>>> s.str.lower()              >>> s.str.strip()
>>> s.str.isupper()            >>> s.str.normalize()
>>> s.str.len()
Index by character position:
>>> s.str[0]
True if a regular expression pattern or string is in a Series:
>>> s.str.contains(str_or_pattern)

Splitting and Replacing
split returns a Series of lists:
>>> s.str.split()
Access an element of each list with get:
>>> s.str.split(char).str.get(1)
Return a DataFrame instead of a list:
>>> s.str.split(expand=True)
Find and replace with a string or regular expressions:
>>> s.str.replace(str_or_regex, new)
>>> s.str.extract(regex)
>>> s.str.findall(regex)
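A short runnable sketch of these string methods on an illustrative Series of names:
>>> import pandas as pd
>>> s = pd.Series(['Cary Grant', 'Lynn Smith', 'Sam Jones'])
>>> s.str.lower()                      # lowercase every value
>>> s.str.split(' ').str.get(1)        # Series of last names
>>> s.str.split(' ', expand=True)      # DataFrame with one column per part
>>> s.str.contains('a')                # boolean mask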



6                             Combining
                              DataFrames
There are numerous tools for combining Series and
DataFrames together, with SQL-type joins and
concatenation. Use join if merging on indices, otherwise
use merge.
Join on Index
>>> df.join(other)
Merge DataFrames on their indexes. Set on=columns to join the index of
other to columns of df. join uses pd.merge under the covers.

Merge on Column Values
>>> pd.merge(left, right, how='inner', on='id')
Ignores the index, unless on=None. See the section on the how keyword.
Use on if merging on the same column in both DataFrames, otherwise use
left_on and right_on.

Concatenating DataFrames
>>> pd.concat(df_list)
“Stacks” DataFrames on top of each other.
Set ignore_index=True to replace the index with a RangeIndex.
Note: Faster than repeated df.append(other_df).


MERGE TYPES: THE HOW KEYWORD

pd.merge(left, right, left_on='X', right_on='Y', how=...), with:

left                 right
   LONG   X             Y   SHORT
0  aaaa   a          0  b   bb
1  bbbb   b          1  c   cc

how='outer'
   LONG   X   Y   SHORT
0  aaaa   a   —   —
1  bbbb   b   b   bb
2  —      —   c   cc

how='inner'
   LONG   X   Y   SHORT
0  bbbb   b   b   bb

how='left'
   LONG   X   Y   SHORT
0  aaaa   a   —   —
1  bbbb   b   b   bb

how='right'
   LONG   X   Y   SHORT
0  bbbb   b   b   bb
1  —      —   c   cc
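As a runnable sketch, the tables above can be reproduced like this (DataFrame contents taken from the figure; missing cells appear as NaN):
>>> import pandas as pd
>>> left = pd.DataFrame({'LONG': ['aaaa', 'bbbb'], 'X': ['a', 'b']})
>>> right = pd.DataFrame({'Y': ['b', 'c'], 'SHORT': ['bb', 'cc']})
>>> pd.merge(left, right, left_on='X', right_on='Y', how='outer')
# rows: (aaaa, a, NaN, NaN), (bbbb, b, b, bb), (NaN, NaN, c, cc)
>>> pd.merge(left, right, left_on='X', right_on='Y', how='inner')
# single row: (bbbb, b, b, bb)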



CLEANING DATA WITH MISSING VALUES

Pandas represents missing values as NaN (Not a Number), which comes
from Numpy and is of type float64. To find and replace these missing
values, you can use any number of methods.

To find missing values, use:
>>> s_df.isnull()     or   >>> pd.isnull(obj)
>>> s_df.notnull()    or   >>> pd.notnull(obj)

To replace missing values, use:
s_df.loc[s_df.isnull()] = 0             Use a mask to replace NaN
s_df.interpolate(method='linear')       Interpolate using different methods
s_df.fillna(method='ffill')             Fill forward (last valid value)
s_df.fillna(method='bfill')             Or backward (next valid value)
s_df.dropna(how='any')                  Drop rows if any value is NaN
s_df.dropna(how='all')                  Drop rows if all values are NaN
s_df.dropna(how='all', axis=1)          Drop across columns instead of rows
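A short runnable sketch of the most common cleanup calls (the data are illustrative):
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'X': [1.0, np.nan, 3.0], 'Y': [np.nan, np.nan, 6.0]})
>>> df.isnull().sum()             # count missing values per column
>>> df.fillna(method='ffill')     # propagate the last valid value forward
>>> df.interpolate()              # linear interpolation by default
>>> df.dropna(how='any')          # keep only complete rows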



7                             Split / Apply / Combine
                              with DataFrames
1. Split the data based on some criteria.
2. Apply a function to each group to aggregate, transform, or filter.
3. Combine the results.
The apply and combine steps are typically done together in Pandas.

Split: Group By
Group by a single column:
>>> g = df.groupby(col_name)
Grouping with a list of column names creates a DataFrame with a MultiIndex:
>>> g = df.groupby(list_col_names)
Pass a function to group based on the index:
>>> g = df.groupby(function)

Split: What's a GroupBy Object?
It keeps track of which rows are part of which group.
>>> g.groups → Dictionary, where keys are group names, and values
are indices of rows in a given group.
It is iterable:
>>> for group, sub_df in g:
...     ...

Apply/Combine: Aggregation
Perform computations on each group. The shape changes; the
categories in the grouping columns become the index. Can use
built-in aggregation methods: mean, sum, size, count, std, var,
sem, describe, first, last, nth, min, max, for example:
>>> g.mean()
... or aggregate using a custom function:
>>> g.agg(series_to_value)
... or aggregate with multiple functions at once:
>>> g.agg([s_to_v1, s_to_v2])
... or use different functions on different columns:
>>> g.agg({'Y': s_to_v1,
...        'Z': s_to_v2})

Apply/Combine: Transformation
The shape and the index do not change.
>>> g.transform(df_to_df)
Example, normalization:
>>> def normalize(grp):
...     return ((grp - grp.mean())
...             / grp.var())
>>> g.transform(normalize)

Apply/Combine: Filtering
Returns a group only if the condition is true.
>>> g.filter(lambda x: len(x) > 1)

Apply/Combine: General Tool: apply
apply is more general than agg, transform, and filter. It can
aggregate, transform, or filter. The resulting dimensions can
change, for example:
>>> g.apply(lambda x: x.describe())

Other Groupby-Like Operations: Window Functions
• resample, rolling, and ewm (exponential weighted function)
  methods behave like GroupBy objects. They keep track of which
  row is in which "group." Results must be aggregated with sum,
  mean, count, etc.
• resample is often used before rolling, expanding, and ewm
  when using a DateTime index.
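A small runnable sketch of the split/apply/combine steps, using the groups a, b, and c from the diagram (the Y column is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'Z': ['a', 'b', 'a', 'b', 'c'],
...                    'Y': [1, 2, 3, 4, 5]})
>>> g = df.groupby('Z')
>>> g.mean()                                  # aggregation: one row per group
>>> g['Y'].transform(lambda s: s - s.mean())  # transformation: same shape as df['Y']
>>> g.filter(lambda x: len(x) > 1)            # filtering: drops the single-row group 'c'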




8                             Reshaping DataFrames
                              and Pivot Tables
Let’s explore some tools for reshaping DataFrames from the wide to the long format and back. The long format can be
tidy, which means that each variable is a column and each observation is a row. Tidy data is easier to filter, aggregate,
transform, sort, and pivot. Reshaping operations often produce multi-level indices or columns, which can be sliced and indexed.


MultiIndex: A Multi-Level Hierarchical Index
Often created as a result of:
>>> df.groupby(list_of_columns)
>>> df.set_index(list_of_columns)
Contiguous labels are displayed together but apply to each row.
The concept is similar to multi-level columns.
A MultiIndex allows indexing and slicing one or multiple levels at
once. Using the Long example (indexed by Year and Month):
long.loc[1900]                      All 1900 rows
long.loc[(1900, 'March')]           Value 2
long.xs('March', level='Month')     All March rows
Simpler than using boolean indexing, for example:
>>> long[long.Month == 'March']

Long to Wide Format and Back with stack() and unstack()
Pivot a column level to the index, i.e. “stacking the columns” (wide to long):
>>> df.stack()
Pivot an index level to the columns, “unstacking the columns” (long to wide):
>>> df.unstack()
If there are multiple index or column levels, use the level number or
name to stack/unstack:
>>> df.unstack(1) or >>> df.unstack('Month')
A common use case for unstacking: plotting group data vs the index
after a groupby:
>>> (df.groupby(['A', 'B'])['relevant'].mean()
...       .unstack().plot())
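A runnable sketch of the Long example referenced above, with a Year index, a Month column level, and the value 2 at (1900, 'March'); the other numbers are illustrative:
>>> import pandas as pd
>>> wide = pd.DataFrame({'January': [1, 4], 'March': [2, 5]},
...                     index=pd.Index([1900, 2000], name='Year'))
>>> wide.columns.name = 'Month'
>>> long = wide.stack()               # wide -> long: Series with a (Year, Month) MultiIndex
>>> long.loc[(1900, 'March')]         # -> 2
>>> long.xs('March', level='Month')   # all March rows
>>> long.unstack()                    # back to the wide format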

Pivot Tables
>>> pd.pivot_table(df,
...       index=cols,          # keys to group by for the index
...       columns=cols2,       # keys to group by for the columns
...       values=cols3,        # columns to aggregate
...       aggfunc='mean')      # what to do with repeated values
Omitting index, columns, or values will use all remaining
columns of df. You can “pivot” a table manually using groupby,
stack, and unstack.

From Wide to Long with melt
Specify which columns are identifiers (id_vars, whose values will
be repeated for each row) and which are “measured variables”
(value_vars, which will become values in a variable column; all
remaining columns by default).
>>> pd.melt(df, id_vars=id_cols, value_vars=value_columns)
>>> pd.melt(team, id_vars=['Color'],
...          value_vars=['A', 'B', 'C'],
...          var_name='Team',
...          value_name='Score')
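A runnable sketch of the melt call above with an illustrative team table, plus pivot_table to go back to the wide format:
>>> import pandas as pd
>>> team = pd.DataFrame({'Color': ['red', 'blue'],
...                      'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
>>> long = pd.melt(team, id_vars=['Color'],
...                value_vars=['A', 'B', 'C'],
...                var_name='Team', value_name='Score')
>>> pd.pivot_table(long, index='Color', columns='Team',
...                values='Score', aggfunc='mean')     # back to one column per team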




df.pivot() vs pd.pivot_table
df.pivot()                    Does not deal with repeated values in index.
                              It's a declarative form of stack and unstack.
pd.pivot_table()              Use if you have repeated values in index
                              (specify aggfunc argument).




Take your Pandas skills to the next level! Register at enthought.com/pandas-mastery-workshop
©2022 Enthought, Inc., licensed under the Creative Commons Attribution – Non-Commercial, No Derivatives 4.0
International License. To view a copy of this license, visit creativecommons.org/licenses/by-nc-nd/4.0/                                   www.enthought.com