# Data Management

- Data components
    - Data objects
    - Attributes
    - Values

## Description

- Types
    - Tabular
        - Categorical: Qualitative
            - Nominal
            - Ordinal
        - Numerical: Quantitative
            - Discrete
            - Continuous
    - Text
    - Time Series
    - Image
    - Network
- A _statistic_ ($\bar x$) of the _sample_ is an _estimate_ of a _parameter_ ($\mu$) of the _population_.

## Pre-processing

- Replace missing values (sketch below)
    - Substitute missing values with dummy values or the mean
    - Substitute missing values with the most frequent value
- Reduce data
    - Attribute selection (select the most useful attributes/variables)
    - Remove outliers
    - Record sampling (randomly, or by defined rules)
- Create new features from existing ones
- Discretize data (sketch below)
- Data normalization (sketch below)
    - Min-max scaling (normalization)
    - Standardization (to bring outliers closer)
- Correlation/Covariance analysis

## Reduction

- "Curse of Dimensionality": exponentially many training points are needed as dimensionality increases.
- High dimensionality causes **sparsity**, while good models need to cover as many regions as possible.
- Numerosity Reduction
    - Simple random sampling
    - Adopt _stratified sampling_ for sparse datasets (sketch below).
- Dimensionality Reduction
    - Feature selection (heuristic search)
        - Remove redundant attributes
        - Remove irrelevant attributes
        - Methods
            - Best single attribute under the independence assumption
            - Forward step-wise selection (addition)
            - Backward step-wise selection (elimination)
    - (Latent) Feature extraction
        - Principal Component Analysis (PCA), identifying the eigenvectors/eigenvalues of the covariance matrix (sketch below).
        - Singular Value Decomposition (SVD), reducing the number of observations.

## Visualization

See [[visualization]].

## Validation

- Data Quality
    - Accuracy
    - Completeness
    - Consistency
    - Timeliness
- Frequent Tests (sketch below)
    - Null Test
    - Distribution Test
    - Volume Test
    - Uniqueness Test
    - Correlation Analysis
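
A minimal sketch of the missing-value substitutions listed under Pre-processing, assuming pandas; the DataFrame and its column names are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, np.nan, 31, 45, np.nan],
    "city": ["Rome", "Milan", None, "Rome", "Rome"],
})

# Substitute with a dummy value.
df["city_dummy"] = df["city"].fillna("UNKNOWN")

# Substitute with the mean (numeric attributes).
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Substitute with the most frequent value (the mode).
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```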
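
Discretization can be sketched with `pd.cut`, which bins a numeric attribute into ordinal categories; the bin edges and labels here are made up:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 44, 67])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 120],
              labels=["child", "young", "adult", "senior"])
print(bins.tolist())  # ['child', 'child', 'young', 'adult', 'senior']
```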
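
Min-max scaling maps a value to $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, while standardization maps it to $z = \frac{x - \mu}{\sigma}$. A NumPy sketch on a made-up array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])  # note the outlier

# Min-max scaling: maps values into [0, 1]; the outlier pins the range.
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance; the outlier is expressed
# in standard deviations rather than pinned to the range endpoint.
standard = (x - x.mean()) / x.std()

print(minmax)    # [0.     0.0204 0.0408 1.    ]
print(standard)  # roughly [-0.63 -0.58 -0.53  1.73]
```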
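
A sketch of stratified record sampling with pandas' `groupby(...).sample(...)`; the DataFrame and its `label` column are hypothetical. Each stratum contributes the same fraction, so rare classes survive the sampling:

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "label": ["common"] * 90 + ["rare"] * 10,
})

# Sample 20% from each stratum independently.
sample = df.groupby("label").sample(frac=0.2, random_state=0)
print(sample["label"].value_counts())  # common 18, rare 2
```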
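
A from-scratch PCA sketch matching the eigenvector description above, using only NumPy; the data matrix is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]   # inject a redundant direction

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by explained variance
components = eigvecs[:, order[:2]]       # keep the top-2 eigenvectors

X_reduced = Xc @ components              # project onto 2 dimensions
print(X_reduced.shape)                   # (200, 2)
```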
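
The frequent tests under Validation can be sketched as plain assertions over a pandas DataFrame; the column names and thresholds here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "amount": [10.0, 12.5, 9.9, 11.2],
})

# Null test: required columns must not contain missing values.
assert df["id"].notna().all(), "null test failed on 'id'"

# Volume test: row count within an expected range.
assert 1 <= len(df) <= 1_000_000, "volume test failed"

# Uniqueness test: a primary-key column has no duplicates.
assert df["id"].is_unique, "uniqueness test failed on 'id'"

# Distribution test: a crude range check on a numeric column.
assert df["amount"].between(0, 100).all(), "distribution test failed"
```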