# Parquet
[Apache Parquet](https://parquet.apache.org)
[Databricks Parquet](https://databricks.com/glossary/what-is-parquet)
[[databricks]] [[apache]]
## Goal
Parquet is built:
- from the ground up with complex [[nested]] data in mind.
- for efficient compression and encoding.
- for everyone.
## Benefits for using Parquet
Parquet has the benefits of a [[columnar]] data format.
- Free and open source.
- Open to any project in the Hadoop ecosystem. [[hadoop]]
- Language agnostic.
- Used for analytics (OLAP) use cases, typically in conjunction with traditional
OLTP databases.
- Supports complex data, big data, and nested data - Parquet is built from the
ground up using the
[record shredding and assembly algorithm](https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper)
with complex [[nested]] data in mind, which is superior to simply flattening
nested namespaces. A PyArrow sketch of writing nested data follows below.
- [[record-shredding-and-assembly-algorithm]]
- Particularly efficient in compression and encoding.
![[parquet-cost-reduction.png]]
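As a minimal sketch of the nested-data support, the snippet below writes a table
with a repeated struct column using PyArrow; the schema, column names, and file
name are made up for illustration.
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical nested schema: each row holds a list of event structs.
schema = pa.schema([
    ("user_id", pa.int64()),
    ("events", pa.list_(pa.struct([("type", pa.string()), ("ts", pa.int64())]))),
])

table = pa.table(
    {
        "user_id": [1, 2],
        "events": [
            [{"type": "click", "ts": 100}, {"type": "view", "ts": 101}],
            [{"type": "click", "ts": 200}],
        ],
    },
    schema=schema,
)

# On write the nested column is shredded into flat column chunks and
# reassembled on read, instead of the caller flattening it by hand.
pq.write_table(table, "users.parquet")
print(pq.read_table("users.parquet").to_pydict())
```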
## File Format

### Concepts
- _Block_ is a block in HDFS; the file format is designed to work well on top
of HDFS blocks.
- _File_ is an HDFS file that must include the file metadata; the data is
organized into one or more row groups.
- _Row group_ is a purely logical horizontal partitioning of the data into
rows; each row group contains exactly one column chunk per column.
- _Column chunk_ is a chunk of the data for a particular column. Column chunks
are physically contiguous in the file.
- _Pages_ are the units into which column chunks are divided; a page is the
smallest unit of encoding and compression.
Parallelism can be applied at the level of (see the sketch after the diagram):
- MapReduce - File/Row Group
- IO - Column chunk
- Encoding/Compression - Page
![[parquet-file-format.svg]]
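A PyArrow sketch of this hierarchy, assuming the hypothetical users.parquet
file from the earlier example; row groups can be read independently, which is
what a framework like MapReduce parallelizes over.
```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("users.parquet")
print(pf.metadata.num_row_groups)   # row groups in the file
print(pf.metadata.num_columns)      # columns, i.e. column chunks per row group

# Each row group can be read on its own, so one file can be split
# across workers at row-group granularity.
for i in range(pf.metadata.num_row_groups):
    rg = pf.read_row_group(i)       # returns a pyarrow.Table
    print(i, rg.num_rows)
```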
### Type
```
BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays.
```
Logical types are built on top of the primitive types. For example, a string is
stored as a BYTE_ARRAY with a UTF8 annotation.
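A sketch of how common Arrow types map onto these primitives, assuming
PyArrow's default type mapping; the column names are illustrative.
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("flag", pa.bool_()),        # stored as BOOLEAN
    ("count", pa.int32()),       # INT32
    ("big_count", pa.int64()),   # INT64
    ("ratio", pa.float32()),     # FLOAT
    ("score", pa.float64()),     # DOUBLE
    ("name", pa.string()),       # BYTE_ARRAY with a UTF8 annotation
])
table = pa.table({
    "flag": [True], "count": [1], "big_count": [2],
    "ratio": [0.5], "score": [1.5], "name": ["a"],
}, schema=schema)

pq.write_table(table, "types.parquet")
print(pq.read_schema("types.parquet"))  # shows the logical annotations
```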
### Data Pages
Data pages use [[database/dremel]] encoding with definition and repetition
levels. A page contains the definition levels, the repetition levels, and the
encoded values. If the data is not nested, i.e. the _path depth_ to the column
is 1, only the values themselves are required.
The compression and encoding are specified in the page metadata.
See the [[dremel|Dremel Paper]] for a detailed explanation.
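PyArrow does not expose individual pages, but the column chunk metadata reports
the compression codec and the encodings its pages use; a sketch, assuming the
hypothetical types.parquet file from above.
```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("types.parquet").metadata
col = meta.row_group(0).column(0)   # metadata for one column chunk

print(col.path_in_schema)           # the column's path in the schema
print(col.compression)              # codec, e.g. SNAPPY
print(col.encodings)                # encodings used by this chunk's pages
print(col.total_compressed_size, col.total_uncompressed_size)
```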
## Modules
- [parquet-mr](https://github.com/apache/parquet-mr) is the core set of Parquet
components; it provides Hadoop IO formats, Pig loaders, and other Java-based
utilities.
- [parquet-cpp](https://github.com/apache/parquet-cpp) and
[parquet-rs](https://github.com/apache/parquet-rs) read and write Parquet files
in [[c++|C++]] or [[rust|Rust]].
- [Parquet and Pandas DataFrame](https://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet):
it looks like reading Parquet files in [[python-pandas|Pandas]] always loads
them fully into memory first. A usage sketch follows this list.
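A short usage sketch with Pandas; the file and column names are made up, and
selecting only the needed columns is one way to limit how much ends up in
memory.
```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})
df.to_parquet("scores.parquet", engine="pyarrow")

# read_parquet materializes the result as an in-memory DataFrame;
# column projection keeps that footprint down.
subset = pd.read_parquet("scores.parquet", columns=["score"])
print(subset.head())
```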
## Others
[Delta Lake project based on Parquet](https://delta.io)