# Data Management
- Data components
- Data objects
- Attributes
- Values
## Description
- Types
- Tabular
- Categorical: Qualitative
- Nominal
- Ordinal
- Numerical: Quantitative
- Discrete
- Continuous
- Text
- Time Series
- Image
- Network
- A _statistic_ ($\bar x$) computed from a _sample_ is an _estimate_ of a
_parameter_ ($\mu$) of the _population_ (sketched below).
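A minimal NumPy sketch of this relationship; the population, its parameters, and the sample size are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: 1,000,000 draws with true parameter mu = 5.0.
population = rng.normal(loc=5.0, scale=2.0, size=1_000_000)
mu = population.mean()  # the parameter, usually unobservable in practice

# A sample of 100 objects; its mean (the statistic x-bar) estimates mu.
sample = rng.choice(population, size=100, replace=False)
x_bar = sample.mean()

print(f"parameter mu ~ {mu:.3f}, statistic x-bar = {x_bar:.3f}")
```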
## Pre-processing
- Replace missing values (sketched after this list)
- Substitute missing values with a dummy value or the mean
- Substitute missing values with the most frequent value
- Reduce Data
- Attribute Selection (select the most useful attributes/variables)
- Remove outliers
- Record sampling (randomly, or by defined rules)
- Create new features from existing ones
- Discretize data
- Data normalization (sketched after this list)
- Min-max scaling (normalization to a fixed range, e.g. $[0, 1]$)
- Standardization (zero mean, unit variance; less distorted by outliers than min-max scaling)
- Correlation/Covariance analysis
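A minimal pandas sketch of the two imputation rules above; the toy frame and its column names are invented:

```python
import pandas as pd

# Invented toy data with missing entries.
df = pd.DataFrame({
    "age": [23.0, None, 31.0, None, 40.0],             # numerical attribute
    "city": ["Oslo", "Oslo", None, "Bergen", "Oslo"],  # categorical attribute
})

# Numerical: substitute missing values with the mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical: substitute missing values with the most frequent value (mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```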
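And a sketch of the two rescaling rules, written in plain NumPy so the formulas stay visible; the feature vector `x` is made up, with an outlier to show the difference:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # note the outlier at 100

# Min-max scaling: map the feature onto [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance (z-scores).
x_std = (x - x.mean()) / x.std()

print(x_minmax)  # the outlier squeezes every other value near 0
print(x_std)
```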
## Reduction
- "Curse of Dimensionality", exponentially many training points are needed as
dimensionality increases.
- High dimensionality causes **sparsity**, while good models need to cover as
many regions as possible.
- Numerosity Reduction
- Simple random sampling
- Adopt _stratified sampling_ for sparse datasets (sketched after this list)
- Dimensionality Reduction
- Feature selection (Heuristic search)
- Remove redundant attributes
- Remove irrelevant attributes
- Methods
- Best single attribute under the independence assumption
- Forward step-wise selection (addition; sketched after this list)
- Backward step-wise selection (elimination)
- (Latent) Feature extraction
- Principal Component Analysis (PCA), projecting onto the top eigenvectors of the covariance matrix (sketched after this list)
- Singular Value Decomposition (SVD), keeping only the top singular values/vectors for a low-rank approximation
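A stratified-sampling sketch in pandas; the `label` column that defines the strata and the 50% rate are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "x": range(10),
    "label": ["a"] * 8 + ["b"] * 2,  # "b" is a rare stratum
})

# Sample 50% within each stratum rather than 50% of the whole frame,
# so the rare "b" group keeps its proportion instead of vanishing.
sample = df.groupby("label").sample(frac=0.5, random_state=0)
print(sample)
```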
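A greedy forward step-wise selection sketch; the linear model, the diabetes toy dataset, and 5-fold cross-validated $R^2$ are arbitrary stand-ins for whatever model and score a given task uses:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

# Greedily add the single attribute that most improves the CV score;
# stop as soon as no addition helps.
while remaining:
    scores = {
        j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
        for j in remaining
    }
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected attribute indices:", selected)
```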
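And PCA from first principles (center, covariance, eigendecomposition) to make the eigenvector remark concrete; the random data and `k = 2` are made up, and in practice `sklearn.decomposition.PCA` packages the same steps:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # made-up data: 200 objects, 5 attributes

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Keep the k eigenvectors with the largest eigenvalues and project.
k = 2
top = np.argsort(eigvals)[::-1][:k]
X_reduced = Xc @ eigvecs[:, top]

print(X_reduced.shape)  # (200, 2)
```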
## Visualization
See [[visualization]].
## Validation
- Data Quality
- Accuracy
- Completeness
- Consistency
- Timeliness
- Frequent Tests (sketched after this list)
- Null Test
- Distribution Test
- Volume Test
- Uniqueness Test
- Correlation Analysis
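Minimal pandas assertions for the four tests above; the frame, columns, and thresholds are invented, and a real pipeline would typically use a data-quality framework rather than bare asserts:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "amount": [10.0, 12.5, None, 9.0],
})

# Null test: required columns must not contain missing values.
assert df["id"].notna().all(), "null test failed on 'id'"

# Uniqueness test: a key column must not contain duplicates.
assert df["id"].is_unique, "uniqueness test failed on 'id'"

# Volume test: row count within an expected range.
assert 1 <= len(df) <= 10_000, "volume test failed"

# Distribution test (crude): numeric values within a plausible range.
assert df["amount"].dropna().between(0, 100).all(), "distribution test failed"
```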