# Parquet

[Apache Parquet](https://parquet.apache.org)
[Databricks Parquet](https://databricks.com/glossary/what-is-parquet)

[[databricks]] [[apache]]

## Goal

Parquet is built:

- from the ground up with complex [[nested]] data in mind.
- for efficient compression and encoding.
- for everyone.

## Benefits of using Parquet

Parquet has the benefits of a [[columnar]] storage format.

- Free and open source.
- Open to any project in the Hadoop ecosystem. [[hadoop]]
- Language agnostic.
- Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
- Supports complex data, big data, and nested data
  - since Parquet is built from the ground up using the [record shredding and assembly algorithm](https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper), with all complex [[nested]] data in mind
  - using this algorithm is superior to simply flattening nested namespaces.
  - [[record-shredding-and-assembly-algorithm]]
- Particularly efficient in compression and encoding.

![[parquet-cost-reduction.png]]

## File Format

![Parquet File Format](https://parquet.apache.org/images/FileLayout.gif)

### Concepts

- _Block_ is a block in HDFS; the file format is designed to work well on top of HDFS.
- _File_ must include the file metadata and holds the data organized into one or more row groups.
- _Row group_ is merely logical; each row group contains exactly one column chunk per column.
- _Column chunk_ is a chunk of data for a particular column. Column chunks are physically contiguous in the file.
- _Pages_ are divisions of a column chunk; a page is the indivisible unit of compression and encoding.

Parallelism can be achieved at the level of:

- MapReduce: file / row group
- IO: column chunk
- Encoding/compression: page

A pyarrow sketch that inspects this layout is under Examples at the end of this note.

![[parquet-file-format.svg]]

### Types

```
BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays.
```

Logical types are built on top of the primitive types, e.g. a string is a BYTE_ARRAY with a UTF8 annotation.

### Data Pages

[[database/dremel]] encoding with definition and repetition levels. A data page contains the definition levels, the repetition levels, and the encoded values. If the data is not nested, i.e. the _path depth_ to the column is 1, only the values themselves are required. The compression and encoding are specified in the page metadata. See [[dremel|Dremel Paper]] for a detailed explanation; the nested-schema sketch under Examples shows how a nested column is shredded.

## Modules

- [parquet-mr](https://github.com/apache/parquet-mr): the core Java components of Parquet, providing Hadoop IO formats, Pig loaders, and other Java-based utilities.
- [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet-rs](https://github.com/apache/parquet-rs): read and write Parquet files in [[c++|C++]] or [[rust|Rust]].
- [Parquet and Pandas DataFrame](https://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet): when reading Parquet files in [[python-pandas|Pandas]], the data is always read into memory first (see the Pandas sketch under Examples below).

## Others

[Delta Lake project based on Parquet](https://delta.io)
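## Examples

To make the File Format concepts concrete, here is a minimal pyarrow sketch (the file name `example.parquet` and the tiny table are made up for illustration, and `pyarrow` is assumed to be installed) that writes a small table and prints the row group and column chunk metadata:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny two-column table, forcing two row groups of two rows each.
table = pa.table({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
pq.write_table(table, "example.parquet", row_group_size=2)

pf = pq.ParquetFile("example.parquet")

# File-level metadata: number of rows, number of row groups, schema, ...
print(pf.metadata)

# Each row group holds exactly one column chunk per column; the chunk
# metadata records the physical type, encodings, compression codec,
# and byte offsets of its pages.
chunk = pf.metadata.row_group(0).column(0)
print(chunk)
```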
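The record shredding and logical-type points can be seen by writing a nested column and printing the Parquet schema of the resulting file. Another hedged sketch with made-up data, again assuming `pyarrow`:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One row per document; "links" is a nested, repeated field (a list of structs).
table = pa.table({
    "doc_id": [1, 2],
    "links": [
        [{"url": "https://a.example", "weight": 0.5}],
        [{"url": "https://b.example", "weight": 0.9},
         {"url": "https://c.example", "weight": 0.1}],
    ],
})
pq.write_table(table, "nested.parquet")

# The printed Parquet schema shows the shredded representation: repeated
# groups for the list, and "url" stored as a BYTE_ARRAY with a UTF8/String
# annotation, rather than a flattened namespace.
print(pq.ParquetFile("nested.parquet").schema)
```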
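For the Pandas point under Modules: `pandas.read_parquet` always materializes the result in memory, but because the format is columnar the read can be restricted to just the columns you need. A small sketch with made-up column names, assuming `pandas` plus the `pyarrow` (or `fastparquet`) engine:

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000), "value": [i * 0.1 for i in range(1000)]})
df.to_parquet("data.parquet")  # needs the pyarrow or fastparquet engine installed

# Column pruning: only the "value" column chunks are read and decoded,
# but the result is still a fully in-memory DataFrame.
subset = pd.read_parquet("data.parquet", columns=["value"])
print(subset.head())
```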