
Unit 1: Summary – Data Mining

Introduction to Data Warehousing

Data warehousing forms the foundational backbone of modern data analysis and decision-support systems. It involves collecting, storing, and managing large volumes of historical data in a central repository, enabling efficient querying and analysis. Unlike operational databases that are designed for transactional processing (OLTP), data warehouses are designed for analytical processing (OLAP) – supporting decision-making, trend analysis, and data mining.

1. Components of a Data Warehouse

A data warehouse is not a single software or hardware product, but rather a blend of various components and tools that together support data analysis and decision-making. Key components include:

a. Data Sources

These are internal and external systems that generate or store data. They include transactional databases (like sales or inventory systems), flat files, Excel sheets, APIs, and even data from web services or social platforms.

b. ETL (Extract, Transform, Load) Tools

These tools extract raw data from source systems, transform it into a suitable format, and load it into the data warehouse. The transformation may include data cleansing, de-duplication, normalization, or encoding.
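
As a rough illustration of the extract-transform-load flow, the Python sketch below reads rows from a hypothetical CSV export of a sales system, cleans and reshapes them, and writes them into a warehouse table using SQLite. The file name, table name, and columns are invented for the example.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source-system export (hypothetical file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: cleanse and standardize records before loading.
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):            # drop incomplete records
            continue
        cleaned.append((
            row["customer_id"].strip(),
            row["order_date"][:10],               # keep only the YYYY-MM-DD part
            round(float(row["amount"]), 2),       # standardize currency precision
        ))
    return cleaned

def load(records, db_path="warehouse.db"):
    # Load: insert the transformed records into the warehouse fact table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                "(customer_id TEXT, order_date TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))
```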

c. Staging Area

This is a temporary storage area where data is held during the ETL process. It ensures that the data can be validated, cleansed, and formatted before being loaded into the warehouse.

d. Data Storage/Repository

This is the central location where processed data resides, typically modeled using specialized DBMS schemas (like star or snowflake schemas) optimized for read-intensive operations.

e. Metadata

Metadata is “data about data”. It defines the structure, source, transformations, data lineage, and meaning of the data. Metadata helps users understand the origin, context, and use of data elements in the warehouse.

f. Query and Reporting Tools

These tools help users perform ad-hoc queries, generate reports, create dashboards, and conduct data analysis. Examples include Tableau, Power BI, SAP BusinessObjects, and Oracle BI.

g. OLAP Engine

Online Analytical Processing (OLAP) engines support fast analysis of multidimensional data through data cubes, allowing slicing, dicing, drilling down, and rolling up data across dimensions.
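
A data cube can be approximated in code with a pivot table. The pandas sketch below, using invented sales data, builds a small cube over year, region, and product, then rolls it up and slices it; it is only meant to make the cube operations concrete.

```python
import pandas as pd

# Hypothetical sales records: each row is one fact at the finest granularity.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "East", "West", "West"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "amount":  [100, 150, 80, 120, 200, 90],
})

# Build a small "cube": amount aggregated over year x region x product.
cube = sales.pivot_table(values="amount",
                         index=["year", "region"],
                         columns="product",
                         aggfunc="sum", fill_value=0)

# Roll up: aggregate away the region dimension (coarser granularity).
rollup = cube.groupby(level="year").sum()

# Slice: fix one dimension value (year = 2024) and inspect the rest.
slice_2024 = cube.xs(2024, level="year")

print(cube, rollup, slice_2024, sep="\n\n")
```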

2. Building a Data Warehouse: Steps and Strategies

Constructing a data warehouse is a complex process that requires strategic planning and execution. Here are the typical phases:

a. Requirement Analysis

This involves identifying business needs, understanding user expectations, and determining what data needs to be collected and how it will be used.

b. Data Modeling

Data models (conceptual, logical, and physical) are created to define how data will be stored and accessed. The star schema and snowflake schema are commonly used; both are sketched in DDL after the list below.

·         Star Schema: Central fact table connected to multiple dimension tables.

·         Snowflake Schema: More normalized, with dimensions broken into sub-dimensions.
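
The difference between the two shapes is easiest to see in table definitions. The following is a minimal, hypothetical DDL sketch (table and column names are invented): in the star form the product dimension carries its category directly, while in the snowflake form the category is split out into a sub-dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Star schema: one fact table joined directly to denormalized dimensions.
con.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
CREATE TABLE sales_fact  (date_key INTEGER, product_key INTEGER, amount REAL);
""")

# Snowflake schema: the product dimension is normalized into a sub-dimension.
con.executescript("""
CREATE TABLE category_dim   (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE product_dim_sf (product_key INTEGER PRIMARY KEY, name TEXT, category_key INTEGER);
CREATE TABLE sales_fact_sf  (date_key INTEGER, product_key INTEGER, amount REAL);
""")
```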

c. Selection of Hardware & Software

The underlying infrastructure must support high-performance computation and storage. This includes selecting a relational DBMS (like Oracle, Teradata), hardware configurations (servers, storage), and network setups.

d. Designing the ETL Process

A strong ETL strategy is critical. ETL jobs must be scheduled, monitored, and optimized for minimal data loss and high efficiency.
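
Scheduling is usually handled by an external orchestrator (for example cron or a workflow tool), but the monitoring side can be sketched in a few lines. The Python snippet below wraps hypothetical extract, transform, and load callables so that every run records its duration, record count, and any failure.

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_etl_job(extract, transform, load):
    # Monitored ETL run: record timing, record counts, and failures for each execution.
    start = time.time()
    try:
        records = transform(extract())
        load(records)
        logging.info("ETL job succeeded: %d records in %.1fs",
                     len(records), time.time() - start)
    except Exception:
        logging.exception("ETL job failed after %.1fs", time.time() - start)
        raise
```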

e. Data Loading and Testing

Once the ETL pipeline is ready, data is loaded into the warehouse, and comprehensive testing (data integrity, consistency, accuracy) is conducted.

f. Deployment and Maintenance

After successful validation, the warehouse is deployed to end-users. Regular maintenance (patches, performance tuning, metadata updates) is necessary.

3. Mapping the Data Warehouse to a Multiprocessor Architecture

Data warehouses are typically built for high-performance analysis, requiring parallelism and scalability. Multiprocessor architectures help achieve these goals.

a. Shared Memory Architecture

All processors access a single memory pool. It’s suitable for smaller systems but has scalability limitations.

b. Shared Disk Architecture

Each processor has its own memory but shares common disk storage. This model offers high reliability and moderate performance.

c. Shared Nothing Architecture

Each processor has its own memory and disk. It is the most scalable option and is commonly used in distributed data warehouse systems, including Hadoop-based platforms and MPP (Massively Parallel Processing) databases such as Amazon Redshift.

d. Benefits of Multiprocessor Architectures

·         Parallel Query Execution: Tasks are divided among processors for faster response times (see the sketch after this list).

·         Scalability: New nodes can be added with minimal configuration.

·         Fault Tolerance: Failures in one node don’t crash the whole system.
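
The shared-nothing idea can be imitated in miniature: each worker aggregates only its own partition of the data, and a coordinator merges the partial results. The Python sketch below is a toy illustration of that division of work, not how any particular MPP engine is implemented.

```python
from multiprocessing import Pool
from collections import Counter

# Each partition stands in for the local data of one shared-nothing node.
partitions = [
    [("East", 100), ("West", 150), ("East", 80)],
    [("East", 120), ("West", 200)],
    [("West", 90), ("East", 60)],
]

def local_aggregate(rows):
    # Runs on one "node": sum sales per region over local data only.
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

if __name__ == "__main__":
    # Parallel step: one task per partition, executed by separate workers.
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(local_aggregate, partitions)

    # Coordinator step: merge the partial aggregates into the final answer.
    final = sum(partials, Counter())
    print(dict(final))
```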

4. DBMS Schemas for Decision Support

Schemas define how data is logically organized. For data warehouses, schemas are optimized for analytical operations rather than transactions.

a. Star Schema

A central fact table is connected to dimension tables. It offers:

·         Simplified queries (see the query sketch after this list)

·         Faster aggregations

·         An intuitive structure that business users find easy to understand
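
A typical star-join query shows why this shape keeps analytical SQL simple: one fact table, one join per dimension, and a GROUP BY. The query below assumes the hypothetical sales_fact, date_dim, and product_dim tables from the earlier DDL sketch.

```python
# Hypothetical star-join query against the tables sketched in section 2b.
star_join_query = """
SELECT d.year,
       p.category,
       SUM(f.amount) AS total_sales
FROM sales_fact  AS f
JOIN date_dim    AS d ON f.date_key = d.date_key
JOIN product_dim AS p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, p.category;
"""
```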

b. Snowflake Schema

Dimension tables are normalized into sub-dimensions. It reduces redundancy but increases query complexity.

c. Fact Constellation (Galaxy Schema)

Multiple fact tables share dimension tables. It supports complex business models like multi-fact analysis (e.g., sales and shipments).

5. Data Extraction, Cleanup, and Transformation Tools

ETL processes are critical in ensuring that only clean, accurate, and consistent data is loaded into the data warehouse.

a. Data Extraction

This involves accessing data from various source systems. It can be full extraction or incremental (only new/updated records); a minimal incremental-extraction sketch follows the tool list below. Common tools include:

·         Apache NiFi

·         Informatica

·         Talend
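
Incremental extraction is usually driven by a high-water mark such as a last-modified timestamp. The sketch below shows the idea against a hypothetical source table in SQLite; the table and column names are assumptions for illustration.

```python
import sqlite3

def extract_incremental(source_db, last_loaded_at):
    """Pull only rows changed since the previous load (high-water-mark pattern)."""
    con = sqlite3.connect(source_db)
    rows = con.execute(
        "SELECT order_id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded_at,),
    ).fetchall()
    con.close()
    # The new high-water mark is the latest timestamp seen in this batch.
    new_mark = rows[-1][3] if rows else last_loaded_at
    return rows, new_mark
```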

b. Data Cleansing

Cleansing ensures data quality and consistency by the steps below (sketched in code after the list):

·         Removing duplicates

·         Standardizing formats (e.g., date, currency)

·         Filling in missing values

·         Correcting errors (e.g., typos)
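
The cleansing steps above translate almost one-for-one into dataframe operations. A minimal pandas sketch with invented data and column names:

```python
import pandas as pd

raw = pd.DataFrame({
    "customer":   ["Alice", "alice ", "Bob", None],
    "order_date": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-02-11"],
    "amount":     [100.0, 100.0, None, 55.5],
})

clean = raw.copy()
clean["customer"] = clean["customer"].str.strip().str.title()        # standardize text formats
clean["order_date"] = pd.to_datetime(clean["order_date"])            # standardize date formats
clean["amount"] = clean["amount"].fillna(clean["amount"].median())   # fill in missing values
clean = clean.dropna(subset=["customer"])                            # drop records missing a key
clean = clean.drop_duplicates(subset=["customer", "order_date"])     # remove duplicates
```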

c. Data Transformation

Transforming data prepares it for storage and analysis. Typical transformations include the following (each is sketched in code below):

·         Aggregation

·         Normalization

·         Encoding

·         Filtering

These operations are often automated through scripts or tools such as Microsoft SSIS or IBM DataStage.
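
As a rough illustration, the pandas sketch below applies each of the transformation types listed above to a small invented dataset.

```python
import pandas as pd

orders = pd.DataFrame({
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["A", "A", "B", "B", "A"],
    "amount":  [100.0, 150.0, 80.0, 120.0, 60.0],
})

# Filtering: keep only rows relevant to the analysis.
large = orders[orders["amount"] >= 80]

# Aggregation: roll detail rows up to one row per region.
by_region = large.groupby("region", as_index=False)["amount"].sum()

# Normalization: rescale amounts to the [0, 1] range (min-max scaling).
by_region["amount_norm"] = (
    (by_region["amount"] - by_region["amount"].min())
    / (by_region["amount"].max() - by_region["amount"].min())
)

# Encoding: turn the categorical region into indicator columns.
encoded = pd.get_dummies(by_region, columns=["region"])
```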

6. Metadata: The Hidden Backbone

Metadata provides context to data and makes the data warehouse navigable and usable.

Types of Metadata (each illustrated with a small example after this list):

1.      Technical Metadata: Information about data structures, data types, schemas, and transformation logic.

2.      Business Metadata: Describes data in business terms (e.g., “Customer Lifetime Value”).

3.      Operational Metadata: Logs about data loads, job execution times, failure records, etc.
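
In practice, all three kinds are simply structured records kept alongside the data. The snippet below shows one invented example of each; the field names and values are hypothetical.

```python
# Hypothetical metadata entries; field names and values are illustrative only.

technical_metadata = {
    "table": "sales_fact",
    "column": "amount",
    "data_type": "DECIMAL(12,2)",
    "source": "orders.order_total",
    "transformation": "rounded to 2 decimal places; converted to USD",
}

business_metadata = {
    "term": "Customer Lifetime Value",
    "definition": "Projected net revenue from a customer over the whole relationship",
    "owner": "Marketing analytics team",
}

operational_metadata = {
    "job": "load_sales_fact_daily",
    "run_started": "2024-03-01T02:00:00Z",
    "run_finished": "2024-03-01T02:14:32Z",
    "rows_loaded": 1204388,
    "status": "SUCCESS",
}
```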

Why Metadata Matters:

·         Enhances user understanding

·         Supports lineage tracking

·         Aids in impact analysis

·         Facilitates troubleshooting and audits

Conclusion

Data warehousing is a critical component of modern data ecosystems. It enables informed business decision-making through efficient data storage, transformation, and access.
