# Pandas
- Pandas was originally developed by Wes McKinney at AQR Capital Management,
primarily to address issues in business analytics and quantitative trading.
- The name `pandas` comes from `panel data`.
- Unlike [[numpy|NumPy]], Pandas is designed for tabular and heterogeneous data.
## Topics
- [[serialize|Serialize]] pandas
  - [[parquet|Parquet]] - multiple files - this has the best support for Pandas
    metadata, multi-indexes, extension types, etc.
  - [[hdf5|HDF5]]
  - [[pickle|Pickle]] - preserves absolutely everything, but has potential
    security and backward-compatibility issues.
  - [[feather|Feather]] - not in a single file
## Usage
- Pandas data structures have an _index_, a crucial component. Data is aligned
with respect to the index, for example, when adding two series.
- A `Series` can be thought of as an ordered dictionary.
- Both the index and the `Series`/`DataFrame` can have a name, which is very
handy in analysis.
- `Index` is designed to be immutable; take advantage of it.
- When performing arithmetic, if the indices don't match, the result is indexed
by the union of the indices, and `NaN` is produced for labels missing from
either operand.
- The builtin _extension_ data types in Pandas work better since they handle
missing elements better.
- `df.explode('col_name')` turns each element of a list into its own row while
preserving the other column values. This duplicates index labels, so we may
need to call `df.reset_index(drop=True)` afterwards.
- `series.array` returns the underlying NumPy-like array.
- `df.reindex(columns=columns)` can be used to reindex, or to effectively drop
columns!
- `frame.sub(series, axis='index')` -- match on rows, broadcast over columns!
- `df.apply()`
  - Can take `axis='columns'` to apply the function to each row.
  - The function can return a `Series`.
  - `applymap` (renamed `DataFrame.map` in pandas 2.1) should be used for
    element-wise functions.
- Use `dim` (dimensional table, or categories) and `values` (category codes)
series, with `dim.take(values)` to reconstruct the categorical series.
- `df.reset_index()` moves the index values back into columns!
- `df.plot()` can be used to plot directly; there are sub-methods such as
`df.plot.bar()`. They also take an `ax` object.
- The `patsy` library can be used to construct matrices from a dataframe for
modeling. A `DesignMatrix` is basically a [[numpy|NumPy]] array with additional
metadata.
  - `y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)`
  - `new_X = patsy.build_design_matrices([X.design_info], new_data)`
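Several of the points above (index alignment, `explode`, `sub` with `axis='index'`, and `take`) can be sketched on toy data. The names and values below are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Index alignment: labels present in only one operand produce NaN.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # indexed by the union a, b, c, d

# explode: one row per list element; the original index label is repeated.
df = pd.DataFrame({"key": ["x", "y"], "tags": [[1, 2], [3]]})
exploded = df.explode("tags")
flat = exploded.reset_index(drop=True)  # fresh 0..n-1 index

# sub with axis="index": match the Series on row labels, broadcast over columns.
frame = pd.DataFrame({"p": [1.0, 2.0], "q": [3.0, 4.0]}, index=["r1", "r2"])
row_means = frame.mean(axis="columns")
centered = frame.sub(row_means, axis="index")

# take: rebuild a categorical series from categories (dim) and codes (values).
dim = pd.Series(["apple", "orange"])
values = pd.Series([0, 1, 0, 0])
rebuilt = dim.take(values)
```

Here `total` has `NaN` at `a` and `d`, and `exploded` carries the index labels `0, 0, 1` until the index is reset.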
### Indexing
- Plain `df[index]` indexing is ambiguous: it treats integers as labels if the
index contains integers. Prefer `df.loc[]` when indexing with labels and
`df.iloc[]` when indexing with integer positions to avoid any ambiguity.
- A single index selects columns; the exception is `df[:n]`, which selects the
first `n` rows.
- With `df.loc[]`, first index selects rows, second selects columns.
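- Chained indexing should not be used for assignment, because the value may be
written to a temporary copy. Use a single `df.loc[rows, cols] = value` instead.
- `argsort` returns the integer positions that would sort the data (an
indexer), which can then be passed to `df.take(indexer)`.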
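A minimal sketch of these indexing rules, using a made-up frame whose integer index is deliberately out of order so that labels and positions differ:

```python
import pandas as pd

df = pd.DataFrame({"a": [30, 10, 20], "b": [1.0, 2.0, 3.0]}, index=[2, 0, 1])

by_label = df.loc[0, "a"]     # row labelled 0
by_position = df.iloc[0, 0]   # first row by position

first_two = df[:2]            # a slice selects rows, not columns

# Assign through one .loc call instead of chaining df[mask]["b"] = 0.0
df.loc[df["a"] > 15, "b"] = 0.0

# argsort yields the positions that would sort column "a"; take applies them.
indexer = df["a"].to_numpy().argsort()
sorted_df = df.take(indexer)
```

With this data, `by_label` is `10` while `by_position` is `30`, which is exactly the ambiguity that `loc` and `iloc` remove.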
### GroupBy
- `df.groupby()` is a "split-apply-combine" process.
- The `GroupBy` object returned by `df.groupby()` has some optimized aggregation
methods builtin.
- You can also call non-optimized versions on it via `df.groupby().agg(func)`.
Passing a list of aggregation methods computes them all. Passing a list of
`(name, function)` tuples gives names to the aggregated columns.
- Passing a dictionary to `df.agg()` can run different aggregations on different
columns.
- The returned object is iterable, yielding `(key, group)` pairs. We can index
it with column names to restrict the aggregation to those columns, or call
`get_group(key)` to get only the split we need.
- Similarly, we can use the `transform` method on the groups; the function must
return either an object of the same size as the group, or a scalar value to be
broadcast.
- For builtin aggregations, we can pass a string instead of a function/lambda.
- `pd.crosstab(df['col1'], df['col2'])` is a simple way to compute a frequency
table on two columns, this is easier than `groupby`.
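The GroupBy operations above might look like this on a small, invented frame. The column names are hypothetical, and the keyword form of `agg` is used here to name the outputs rather than the tuple form.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "side": ["l", "r", "l", "l"],
    "score": [1.0, 3.0, 10.0, 20.0],
})
grouped = df.groupby("team")

means = grouped["score"].mean()                      # optimized builtin
stats = grouped["score"].agg(["min", "max"])         # several aggregations
named = grouped["score"].agg(lowest="min", highest="max")  # named outputs
per_column = grouped.agg({"score": "sum", "side": "count"})

keys = [key for key, _ in grouped]                   # iteration yields pairs
group_a = grouped.get_group("a")                     # a single split

# transform broadcasts the group mean back to the original shape.
demeaned = df["score"] - grouped["score"].transform("mean")

# Frequency table of two columns without an explicit groupby.
counts = pd.crosstab(df["team"], df["side"])
```

`demeaned` keeps the original row order and index, which is what makes `transform` convenient for group-wise normalization.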
### [[time-series|Time Series]]
- There are many `DateOffset` objects to be used! They also have `rollforward`
and `rollback` methods.
- `resample` is more convenient than `groupby` for time-based aggregation.
- The `Period` object is also very handy!
- A `pd.Grouper()` object can be created to facilitate time-based `groupby`
operations; the time values must be the index, or a column named via the `key`
argument.
- `rolling`, `ewm`, etc. are available for rolling window operations.
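As a rough illustration of the time series tools listed above, here is a sketch on six made-up daily values that straddle a month boundary. The dates and window sizes are arbitrary choices.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Six daily observations: Jan 29, 30, 31 and Feb 1, 2, 3 of 2024.
ts = pd.Series(range(6), index=pd.date_range("2024-01-29", periods=6, freq="D"))

# Offsets can roll a date to the next or previous anchor point.
offset = MonthEnd()
forward = offset.rollforward(pd.Timestamp("2024-01-15"))
backward = offset.rollback(pd.Timestamp("2024-01-15"))

# Monthly aggregation, labelled by month start, two equivalent ways.
monthly = ts.resample("MS").sum()
monthly_grouper = ts.groupby(pd.Grouper(freq="MS")).sum()

# Periods represent whole time spans and support arithmetic.
next_month = pd.Period("2024-01", freq="M") + 1

# Window operations: fixed-size rolling mean and exponentially weighted mean.
rolling_mean = ts.rolling(window=3).mean()
ewm_mean = ts.ewm(span=3).mean()
```

The first two entries of `rolling_mean` are `NaN` because a full window of three observations is not yet available.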
<!-- cSpell:words dmatrices rollbackward rollforward -->