# Pandas

- Pandas was originally developed by Wes McKinney at AQR Capital Management, primarily to address needs in business analytics and quantitative trading.
- The name `pandas` comes from _panel data_.
- Unlike [[numpy|NumPy]], Pandas is designed for tabular and heterogeneous data.

## Topics

- [[serialize|Serialize]] pandas
  - [[parquet|Parquet]]
    - multiple files
    - this has the best support for Pandas metadata, multi-indexes, extension types, etc.
  - [[hdf5|HDF5]]
  - [[pickle|Pickle]]
    - preserves absolutely everything, but has potential security and backward compatibility issues.
  - [[feather|Feather]]
    - not in a single file

## Usage

- Pandas data structures have an _index_, a crucial component. Data is aligned with respect to the index, for example, when adding two series.
- A `Series` can be thought of as an ordered dictionary.
- Both the index and the `Series`/`DataFrame` can have a name, which is very handy in analysis.
- `Index` is designed to be immutable; take advantage of it.
- In arithmetic between objects whose indexes differ, the result's index is the union of the two, and labels missing from either side produce `NaN`.
- The built-in _extension_ data types in Pandas often work better because they handle missing values more gracefully.
- `df.explode('col_name')` turns each element of a list-valued cell into its own row while preserving the other column values. This duplicates index labels, so we may need to follow it with `df.reset_index(drop=True)`.
- `series.array` returns the underlying data as a pandas `ExtensionArray` (for NumPy-backed data, a thin wrapper around the NumPy array). Use `series.to_numpy()` to get an actual NumPy array.
- `df.reindex(columns=columns)` can be used to reindex, or to effectively drop columns.
- `frame.sub(series, axis='index')`: match on rows, broadcast over columns.
- `df.apply()`
  - Can pass `axis='columns'`
  - Can return a `Series`
  - `applymap` should be used for element-wise functions (since pandas 2.1 it is deprecated in favor of `DataFrame.map`)
- Use a `dim` series (dimension table, or categories) and a `values` series (category codes), with `dim.take(values)` to reconstruct the categorical series.
- `df.reset_index()` moves the index values back into columns.
- `df.plot()` can be used to plot directly; there are sub-methods such as `df.plot.bar()`. They also take an `ax` object.
- The `patsy` library can be used to construct design matrices from a DataFrame for modeling. A `DesignMatrix` is essentially a [[numpy|NumPy]] array with additional metadata.
  - `y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)`
  - `new_X = patsy.build_design_matrices([X.design_info], new_data)`

### Indexing

- Plain `[]` indexing is ambiguous: it treats integers as labels if the index contains integers. Prefer `.loc[]` when indexing with labels and `.iloc[]` when indexing with integer positions to avoid any ambiguity.
- On a DataFrame, a single key in `df[key]` selects a column; the exception is slicing, e.g. `df[:n]` selects the first `n` rows.
- With `df.loc[]`, the first index selects rows and the second selects columns.
- Chained indexing should not be used for assignment: it may write to a temporary copy and leave the original unchanged. Assign with a single `df.loc[rows, cols] = value` instead.
- `argsort` returns the integer positions that would sort the data (an indexer), which can then be passed to `df.take(indexer)`.

### GroupBy

- `df.groupby()` is a "split-apply-combine" process.
- The `GroupBy` object returned by `df.groupby()` has some optimized aggregation methods built in.
- You can also apply non-optimized functions via `df.groupby().agg(func)`. Passing a list of aggregation functions computes them all. Passing a list of `(name, function)` tuples names the resulting columns.
- Passing a dictionary to `agg()` runs different aggregations on different columns.
- The returned object is iterable, yielding `(key, group)` pairs. Indexing it with a column name restricts the operation to that column, and `get_group(key)` returns only the split we need.
- Similarly, we can use the `transform` method on the groups; the function either returns an object of the same size as the group, or a scalar value that is broadcast.
- For built-in aggregations, we can pass a string instead of a function/lambda.
- `pd.crosstab(df['col1'], df['col2'])` is a simple way to compute a frequency table on two columns; this is easier than `groupby`.

### [[time-series|Time Series]]

- There are many `DateOffset` objects to choose from. They also have `rollforward` and `rollback` methods.
- For time-based aggregation, `resample` is usually more convenient than `groupby`.
- The `Period` object is also very handy.
- A `pd.Grouper()` object can be created to facilitate `groupby` operations. With only `freq`, it groups on the index, so the time must be the index; pass `key=` to group on a datetime column instead.
- `rolling`, `ewm`, etc., for rolling window operations.

<!-- cSpell:words dmatrices rollbackward rollforward -->
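
The time series tools listed above can be sketched together as follows. This is a minimal illustration on made-up data, not a recommended workflow: the minute-level series, the `key`/`value` column names, and the chosen frequencies are all assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one reading per minute for two hours.
idx = pd.date_range("2024-01-01 00:00", periods=120, freq="min")
ts = pd.Series(np.arange(120, dtype=float), index=idx)

# resample: bucket the minute-level readings into 60-minute means.
hourly = ts.resample("60min").mean()

# rolling / ewm: window-based statistics over the same series.
rolled = ts.rolling(3).mean()      # first two values are NaN
smoothed = ts.ewm(span=3).mean()   # exponentially weighted mean

# pd.Grouper with key=: combine a time bucket with another group key
# when the timestamps live in a column rather than the index.
df = pd.DataFrame({
    "time": idx,
    "key": np.tile(["a", "b"], 60),
    "value": np.arange(120),
})
by_key_hour = df.groupby(["key", pd.Grouper(key="time", freq="60min")])["value"].sum()

# DateOffset: roll a date forward or back to the nearest month end.
month_end = pd.offsets.MonthEnd().rollforward(pd.Timestamp("2024-01-15"))
prev_month_end = pd.offsets.MonthEnd().rollback(pd.Timestamp("2024-01-15"))

# Period: arithmetic in units of the period's frequency.
next_month = pd.Period("2024-01", freq="M") + 1
```

Here `resample` needs no explicit key because the timestamps are the index, while the `pd.Grouper(key="time", ...)` form is useful when time is one of several grouping dimensions.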
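
For the GroupBy notes earlier, a small sketch of the split-apply-combine pieces may help. The DataFrame and its `key`, `kind`, and `value` columns are hypothetical, chosen only so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "kind": ["x", "x", "y", "y", "x"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = df.groupby("key")

# Built-in (optimized) aggregation, selected on one column.
means = grouped["value"].mean()

# A list of aggregations computes them all; strings name built-ins.
summary = grouped["value"].agg(["mean", "max"])

# (name, function) tuples rename the resulting columns.
named = grouped["value"].agg([("average", "mean"), ("largest", "max")])

# A dictionary runs different aggregations on different columns.
per_column = grouped.agg({"value": "sum", "kind": "count"})

# Iterating yields (key, group) pairs; get_group fetches one split.
sizes = {key: len(group) for key, group in grouped}
group_a = grouped.get_group("a")

# transform broadcasts the per-group mean back to the original shape.
demeaned = df["value"] - grouped["value"].transform("mean")

# crosstab: frequency table of two columns.
table = pd.crosstab(df["key"], df["kind"])
```

`transform` is the piece to reach for when the result must line up row by row with the original frame, as in the demeaning step.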
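
The index behavior described under Usage and Indexing (alignment in arithmetic, `explode`, row-wise `sub`, and `loc` versus `iloc`) can be illustrated with a short sketch. All series, frames, and labels here are invented for the example.

```python
import pandas as pd

# Arithmetic aligns on the index; the result covers the union of labels,
# and labels present on only one side become NaN.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])
total = s1 + s2

# explode duplicates index labels, so reset the index afterwards.
df = pd.DataFrame({"id": [1, 2], "tags": [["x", "y"], ["z"]]})
exploded = df.explode("tags")
flat = exploded.reset_index(drop=True)

# sub with axis="index" matches the series on row labels
# and broadcasts it across the columns.
frame = pd.DataFrame({"p": [1.0, 2.0], "q": [3.0, 4.0]}, index=["r1", "r2"])
row_offsets = pd.Series([1.0, 2.0], index=["r1", "r2"])
shifted = frame.sub(row_offsets, axis="index")

# With an integer index, loc uses labels and iloc uses positions.
s = pd.Series(["first", "second", "third"], index=[2, 0, 1])
by_label = s.loc[0]
by_position = s.iloc[0]

# Assign through a single loc call instead of chained indexing.
frame.loc[frame["p"] > 1, "q"] = 0.0
```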