Differences in Kimball vs. Inmon Approach in Data Warehouse Design

When working on a data warehouse project, there are two well-known methodologies for data warehouse system development: the Corporate Information Factory (CIF) and the Business Dimensional Lifecycle (BDL). Which one is better for the business? The following summary reviews the advantages and disadvantages of each approach.

Corporate Information Factory Definition and Main Principles

This approach, defined by Bill Inmon, is top-down: data is normalized to third normal form, and data marts are derived from the enterprise data warehouse. The enterprise data warehouse is a single repository of enterprise data and creates a framework for Decision Support Systems (DSS). For this top-down approach, the data integration requirements are enterprise-wide.

Corporate Information Factory Pros and Cons

Pros:

  • Maintenance is fairly easy
  • Subsequent project costs are lower

Cons:

  • Building the data warehouse can be time consuming
  • There can be a high initial cost
  • Longer time for start-up
  • Specialist team required

Business Dimensional Lifecycle Definition and Main Principles

This approach, defined by Ralph Kimball, is bottom-up: data marts are created first to provide reporting. The data architecture is a collection of conformed dimensions and conformed facts that are shared across two or more data marts. The data integration requirements for this bottom-up approach are scoped to individual business areas.
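To make the idea of a conformed dimension concrete, here is a minimal sketch (using SQLite and hypothetical table names, not drawn from any particular project) in which two data marts share a single date dimension so facts from both can be reported on the same calendar:

```python
# A minimal sketch of conformed dimensions; table names and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date, amount REAL);
    CREATE TABLE fact_shipments (date_key INTEGER REFERENCES dim_date, units INTEGER);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20240101, "2024-01-01", 2024), (20240102, "2024-01-02", 2024)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(20240101, 99.5), (20240102, 42.0)])
conn.executemany("INSERT INTO fact_shipments VALUES (?, ?)", [(20240101, 3), (20240102, 1)])

# Because both marts use the same (conformed) date dimension, their measures
# roll up on the same keys and can be compared side by side.
for row in conn.execute("""
    SELECT d.full_date, SUM(s.amount) AS revenue, SUM(sh.units) AS units_shipped
    FROM dim_date d
    LEFT JOIN fact_sales s ON s.date_key = d.date_key
    LEFT JOIN fact_shipments sh ON sh.date_key = d.date_key
    GROUP BY d.full_date
"""):
    print(row)
```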

Business Dimensional Lifecycle Pros and Cons

Pros:

  • Takes less time to build the data warehouse
  • Low initial cost with fairly predictable subsequent costs
  • Fast initial set up
  • Only a generalist team is required

Cons:

  • Maintenance can be difficult, redundant and subject to revisions

In the top-down approach, unlike the bottom-up approach, there is an enterprise data warehouse, relational tools, a normalized data model, greater complexity in design, and a discrete time frame. In the bottom-up approach, unlike the top-down approach, there are dimensional tools, a process orientation, and a slowly changing time frame.

Both CIF and BDL use Extract, Transform and Load (ETL) to load the data warehouse, but how the data is modeled, loaded and stored differs. The different architectures impact the delivery time of the data warehouse and the ability to accommodate changes in the ETL design.
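As a rough illustration of the ETL pattern both methodologies rely on, here is a minimal sketch (the source file, column names, and target table are hypothetical) that extracts rows from a CSV export, applies a simple transformation, and loads the result into a SQLite warehouse table:

```python
# A minimal ETL sketch; the CSV source, its columns, and the target table are made up.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)          # assumed columns: order_id, amount, order_date

def transform(rows):
    for row in rows:
        yield (int(row["order_id"]),
               round(float(row["amount"]), 2),  # standardize the measure
               row["order_date"][:10])          # keep just the date portion

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id INTEGER, amount REAL, order_date TEXT)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("orders_export.csv")), warehouse)
```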

#DataWarehouse #DataAnalytics

 

Comparison of Persistence Mechanisms (XML, RDBMS, NXD)

Two illustrative approaches to support the persistency of eXtensible Markup Language (XML) data include the relational database management system (RDBMS) and the native XML database (NXD).

A relational database, which most people are familiar with, stores data in tables made up of rows and columns.  Examples include Oracle, Sybase, IBM DB2, Microsoft SQL Server and Informix.  XML documents can be converted into relational form and queried with languages like SQL.  In an RDBMS, the persistence framework contains the logic for mapping application classes to the database or other persistent storage sources.  When refactoring the database, the metadata that describes these mappings may need to be updated.  Approaches to storing XML data in an RDBMS include flat storage and shredded storage.  Flat storage places an entire XML document in a single cell, whereas shredded storage decomposes (normalizes) it into many rows and columns.
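The sketch below contrasts the two storage styles on a made-up order document, using SQLite purely for illustration (the tables and XML shape are assumptions, not part of any vendor's feature set):

```python
# A minimal sketch of flat vs. shredded storage of an XML document; data is made up.
import sqlite3
import xml.etree.ElementTree as ET

doc = "<order id='42'><item sku='A1' qty='2'/><item sku='B7' qty='1'/></order>"
conn = sqlite3.connect(":memory:")

# Flat storage: the whole document is kept in one cell.
conn.execute("CREATE TABLE orders_flat (order_id INTEGER, xml_doc TEXT)")
conn.execute("INSERT INTO orders_flat VALUES (?, ?)", (42, doc))

# Shredded storage: the document is decomposed into relational rows.
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
root = ET.fromstring(doc)
for item in root.findall("item"):
    conn.execute("INSERT INTO order_items VALUES (?, ?, ?)",
                 (int(root.get("id")), item.get("sku"), int(item.get("qty"))))

# The shredded rows can now be queried with plain SQL.
print(conn.execute("SELECT sku, qty FROM order_items WHERE order_id = 42").fetchall())
```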

While RDBMSs are the most commonly used type of persistence mechanism, there are other options, including native XML databases.  NXD examples include dbXML, OpenLink Software’s Virtuoso, Software AG’s Tamino and X-Hive/DB.  NXDs work with XML nodes and documents.  A native XML database is based on “containers” that are designed to work with XML data.  Generally speaking, NXDs are not intended to replace existing databases but to provide a tool for storage of, and operations on, XML documents.  In terms of storage, NXDs are good at storing XHTML- or DocBook-type data where the structure is less rigid.  NXDs usually store a modified form of the entire XML document in the file system, in a compressed or pre-parsed binary form.  It is also possible to map the document structure directly to the underlying storage.
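This is not an actual NXD API, but the node-oriented, XPath-style access pattern that native XML databases expose can be sketched with Python's built-in ElementTree on a toy document (the document and element names are invented for illustration):

```python
# Node-oriented querying over an XML document, the access style NXDs support natively.
# This uses ElementTree on a made-up document; it is not a real NXD client library.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<articles>
  <article id="1"><title>XML Storage</title><tag>nxd</tag></article>
  <article id="2"><title>Relational Mapping</title><tag>rdbms</tag></article>
</articles>
""")

# Query by document structure rather than by rows and columns.
for article in doc.findall(".//article[tag='nxd']"):
    print(article.get("id"), article.findtext("title"))
```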

The following summarizes the general pros and cons of each mechanism:

Relational Databases

Pros:

  • Mature technology
  • Dominate the persistence mechanism market
  • Several well-established vendors
  • Standards, such as Structured Query Language (SQL) and JDBC, are well defined and accepted
  • Significant experience base of developers

Cons:

  • Object-relational impedance mismatch can be a significant challenge
  • Mapping objects to relational databases takes time to learn

XML Databases

Pros:

  • Native support for persisting XML data structures
  • For XML intensive applications it removes the need for marshalling between XML structures and the database structure

Cons:

  • Emerging technology
  • Standards for XML data access (an XML equivalent of SQL) are not yet fully in place
  • Not well suited for transactional systems

There is no one right answer on how to approach XML persistence.  While there are many new tools in the marketplace, having a consistent and well-documented approach is important.  The next decade will show whether XML can live up to its promise for data storage.

#Persistence #XML #RDBMS #NXD

 

Implementing Database Internet Connectivity

The Internet is fundamentally designed to move information from one place to another. For an organization wishing to implement database connectivity, there are several different considerations.

One option for developing a program that sends and receives information to databases online is ADO.Net, a set of libraries included in the Microsoft .NET Framework that can connect to a data source and modify data stored in a relational database.
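ADO.Net itself is used from .NET languages such as C#; as a rough, language-neutral analogue of the same connect-query-update pattern, the sketch below uses Python's pyodbc against a hypothetical SQL Server database (the driver string, server, credentials, and table are all assumptions, not part of ADO.Net):

```python
# A rough analogue of the connect-query-update pattern ADO.Net supports, sketched
# with pyodbc. The connection details and the customers table are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-server;DATABASE=SalesDb;UID=app_user;PWD=secret"
)
cursor = conn.cursor()

# Read data from a relational table.
cursor.execute("SELECT customer_id, balance FROM customers WHERE balance > ?", 1000)
for customer_id, balance in cursor.fetchall():
    print(customer_id, balance)

# Modify data and commit the change.
cursor.execute("UPDATE customers SET balance = balance * 1.01 WHERE customer_id = ?", 42)
conn.commit()
conn.close()
```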

Advantages of ADO.Net

  • Fast performance
  • Ease of use
  • Cross database compatibility
  • Optimized SQL provider
  • XML support and reliance
  • Ability to work with data offline
  • Supports multiple views
  • Lots of features
  • Rich object model
  • Features bridge gap between traditional data access and XML development
  • Supported by many compilers

Disadvantages of ADO.Net

  • Managed-only access
  • Customization is complex
  • No direct calls via ADO.Net or stored procedures to a database
  • Slow Internet connection results in load time delays
  • Steep learning curve

Another relatively common option for implementing database Internet connectivity is Fusion Middleware (FMW), a family of software products from Oracle that are intended to simplify integration.  Just like a good plumbing system, Fusion Middleware provides the fixtures and pipes that allow information to pass through.

Advantages of FMW

  • Flow of real-time information access within and among systems
  • Streamlined processes result in improved efficiency
  • Can be used in many software systems
  • Ability to maintain integrity of information across many systems

Disadvantages of FMW

  • High development costs
  • Limited skilled workers with FMW experience
  • Potential to jeopardize real time performance
  • Limited benchmarking for performance
  • Some platforms not covered by Middleware

Choosing the right data architecture is critical to leveraging an organization’s data assets.  The right approach depends on the situation and the kind of application needed.  Thus, defining the need is the first step.  Compared to the current state, how are the company’s needs being met or not being met?  It is essential to talk to diverse stakeholders to ensure the needs of the people using the system are represented.

The next step can include narrowing down the options, which can be done in a variety of ways, from reviewing the advantages and disadvantages above to selecting a product with a Quality Function Deployment (QFD) matrix customized to your particular needs.

Once the list is narrowed, vendors can demo the product.  Good vendor communication is also something to pay attention to as most road bumps can be resolved with thoughtful communication.

A sound process will help ensure successful data integration, which in turn can help organizations maintain a competitive edge in the digital world.

#DataArchitecture #InternetConnectivity #ComputerScience

 

 

Implications of Brewer’s Cap Theorem

This isn’t about my favorite baseball cap representing my team, the Colorado Rockies. This is about the CAP Theorem, an important topic in the field of computer science that was proposed by Eric Brewer in 2000.

New advances in computer science and mobile communications have given users the opportunity to better access information and services regardless of their physical location. Mobile users can now query and update databases virtually. Understanding the CAP Theorem is vital to understanding the key tradeoffs that need to be made in the design and implementation of such distributed systems.

The challenges of distributed systems, especially as they relate to scaling up and down, are described by Brewer’s CAP Theorem. CAP stands for consistency, availability, and partition tolerance. Consistency means that every read sees the most recent write, so all nodes return the same data. Availability means the system always responds to requests, without errors or timeouts, regardless of load or node failures. Partition tolerance means the system keeps operating even when the network is unreliable and messages between nodes are delayed or lost.

As society has moved from mainframes to distributed servers, a new problem has appeared: some servers are up, but the network connecting them is not. When that happens, the CAP theorem helps the practitioner know which guarantees can be kept and which cannot. Also, when the system is designed, the intent and use of the database need to be thought through to ensure the right tradeoffs are made.

Figure 1.1 shows a diagram of the CAP Theorem, where only two of the three conditions can be achieved at a given moment. For example, if a delay is acceptable, consistency can be sacrificed. Conversely, for a sales processing system, giving up consistency in favor of availability and partition tolerance may not be the right trade-off. Some systems guarantee strong consistency and provide best-effort availability, other systems guarantee availability and provide best-effort consistency, while others sacrifice both consistency and availability.
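The trade-off can be sketched with a toy replica that, during a simulated partition, either refuses writes (choosing consistency) or accepts them and risks temporary divergence (choosing availability). The class and flags below are invented for illustration only:

```python
# A toy sketch of the CP vs. AP choice during a network partition; names are made up.

class Replica:
    def __init__(self, prefer_consistency: bool):
        self.data = {}
        self.prefer_consistency = prefer_consistency
        self.partitioned = False            # simulates losing contact with other replicas

    def write(self, key, value):
        if self.partitioned and self.prefer_consistency:
            # CP: refuse the write rather than risk divergent replicas.
            raise RuntimeError("unavailable during partition (choosing consistency)")
        # AP: accept the write now; replicas may temporarily disagree.
        self.data[key] = value
        return "ok"

cp, ap = Replica(prefer_consistency=True), Replica(prefer_consistency=False)
cp.partitioned = ap.partitioned = True

print(ap.write("order-1", "shipped"))   # AP system stays available, may serve stale data elsewhere
try:
    cp.write("order-1", "shipped")
except RuntimeError as err:
    print(err)                          # CP system sacrifices availability during the partition
```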

Examples of systems favoring consistency and availability (CA) include standard relational databases like SQL Server, MySQL, Oracle, and PostgreSQL. Examples favoring availability and partition tolerance (AP) include Cassandra, CouchDB, and DynamoDB. Examples favoring partition tolerance and consistency (CP) include MongoDB, HBase, Memcached, and Redis.

The goal of understanding the CAP Theorem is to learn how to balance consistency and availability requirements. The professional can leverage the CAP Theorem to create innovative strategies for handling partitions and for recovery.

#CapTheorem #ComputerScience #TradeOffs

 

 

Disaster Recovery and System Wide Failure (2PC vs 3PC)

The Two-Phase Commit (2PC) protocol and the Three-Phase Commit (3PC) protocol are the two most popular algorithms for deciding whether to commit or abort distributed transactions in a Distributed Database Management System (DDBMS).

Two-phase commit (2PC) enables databases to be returned to a former state if an error condition occurs, and it helps databases remain synchronized.  A coordinator is required, whose role is to reach consensus among a set of processes in two phases.  In the first phase, the coordinator contacts all the processes, proposes the transaction, and solicits their responses.  After collecting the responses, the coordinator decides to commit if all processes agreed, or to abort if any process disagreed.  In the second phase, the coordinator contacts all the processes again and communicates the commit or abort decision.
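A minimal sketch of that sequence follows; the participant class and its voting logic are made up for illustration and stand in for the databases involved in the transaction:

```python
# A minimal two-phase commit sketch; the participant class and its votes are made up.

class Participant:
    def __init__(self, name, will_agree=True):
        self.name, self.will_agree = name, will_agree
    def prepare(self, txn):
        # Phase 1: vote yes/no and hold locks until the decision arrives.
        return self.will_agree
    def commit(self, txn):
        print(f"{self.name}: committed {txn}")
    def abort(self, txn):
        print(f"{self.name}: aborted {txn}")

def two_phase_commit(participants, transaction):
    # Phase 1: the coordinator proposes the transaction and collects votes.
    votes = [p.prepare(transaction) for p in participants]
    # Decision: commit only if every participant voted yes.
    decision = "commit" if all(votes) else "abort"
    # Phase 2: the coordinator broadcasts the decision.
    for p in participants:
        p.commit(transaction) if decision == "commit" else p.abort(transaction)
    return decision

print(two_phase_commit([Participant("db1"), Participant("db2", will_agree=False)], "T1"))
```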

In the three-phase commit (3PC) protocol, all the nodes in a distributed system likewise agree to commit or abort a transaction, but unlike two-phase commit, three-phase commit is non-blocking.  The protocol adds a prepare-to-commit (pre-commit) phase: if the coordinator receives a yes from all processes during the voting phase, it first sends a pre-commit message, and only after that does it ask all the processes to commit.
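Extending the earlier sketch, the extra pre-commit round looks roughly like this (again, the participant class and method names are invented for illustration):

```python
# A minimal three-phase commit sketch; the participant class and its methods are made up.

class Participant3PC:
    def __init__(self, name, will_agree=True):
        self.name, self.will_agree, self.state = name, will_agree, "init"
    def can_commit(self, txn):            # phase 1: vote
        return self.will_agree
    def pre_commit(self, txn):            # phase 2: acknowledge the intended commit
        self.state = "pre-committed"
        return True
    def do_commit(self, txn):             # phase 3: apply the change
        self.state = "committed"
    def abort(self, txn):
        self.state = "aborted"

def three_phase_commit(participants, txn):
    if not all(p.can_commit(txn) for p in participants):
        for p in participants:
            p.abort(txn)
        return "abort"
    if not all(p.pre_commit(txn) for p in participants):
        for p in participants:
            p.abort(txn)
        return "abort"
    # A participant that reached the pre-committed state may also commit on a
    # timeout, which is what removes the blocking behavior of 2PC.
    for p in participants:
        p.do_commit(txn)
    return "commit"

print(three_phase_commit([Participant3PC("db1"), Participant3PC("db2")], "T1"))
```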

When terminating a distributed transaction, the two-phase commit protocol can get stuck because it is a blocking protocol: the system may be unable to resolve the transaction.  Once a cohort sends an agreement message to the coordinator, it holds the resources associated with the transaction until it receives the coordinator’s commit or abort message.  A failure of the coordinator at that point prevents the cohorts from recovering on their own.

The three-phase commit protocol, on the other hand, eliminates this blocking problem.  If a message times out, for example, the remaining processes can unanimously agree that the operation was aborted.  The pre-commit phase aids recovery when a process fails, or when both the coordinator and a process node fail, during the commit phase.  In the event of a system-wide power-off failure, the two-phase commit protocol might not recover the data to its initial state if it was stuck in a blocking state.  With three-phase commit, the model is able to prevent blocking because crashes can be detected accurately.  One limitation, however, is that the protocol does not function correctly with network partitions or asynchronous communication.  It is also important in this situation to do a system-wide backup as part of the disaster recovery plan.

In conclusion, 3PC is the better protocol both for terminating a distributed transaction and for recovering from a system-wide power-off failure.

#3PC #ComputerScience #DisasterRecovery #SystemFailure

 

Conservative Concurrency Control vs. Optimistic Concurrency Control


Companies are collecting, storing and analyzing more data than ever before in human history.  It is estimated that 90% of all the data in the world has been created in the past few years.  The data within a database needs to be managed effectively for it to provide optimal value to the company and its customers.  Database management systems support that data management process.

A transaction is a group of tasks that represents the minimum processing unit.  Transaction throughput is the number of transactions that can be performed in a given period.  When multiple transactions run together, problems can arise because of concurrency.

Concurrency control is the process of managing simultaneous execution of transactions in a shared database.  The purpose of concurrency control is to enforce isolation, preserve database consistency and to resolve read-write and write-write conflicts.  Concurrency control is essential to ensure atomicity, isolation, and serializability of concurrent transactions.

If only one transaction is allowed to execute at a time (that is, in serial order), performance can be poor.  At the highest level, concurrency control is a method for scheduling or controlling the operations of transactions in such a way that they can be executed safely.  For transactions to execute safely, they must not cause the database to reach an inconsistent state.

For conservative concurrency control, the pros include that all transactions are executed correctly, the data remains consistent, and the database is relatively stable and reliable.  The cons include that transactions can be slow, run times can be longer, and throughput can be decreased.

For optimistic concurrency control, the strengths include transactions that execute efficiently, relatively safe data content and potentially higher throughput.  The cons include the risk of data interference, hidden errors, and transactions that may need to be rolled back and retried when conflicts are detected.
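The difference can be sketched on a single record: the conservative approach locks before writing, while the optimistic approach reads freely and validates a version number at write time, retrying on conflict. The record structure and function names below are made up for illustration:

```python
# A minimal sketch of conservative (lock-based) vs. optimistic (version-check)
# concurrency control on a single record; the record layout is made up.
import threading

record = {"value": 100, "version": 1}
record_lock = threading.Lock()

def conservative_update(delta):
    # Conservative: acquire the lock first, so no conflicting writer can interleave.
    with record_lock:
        record["value"] += delta

def optimistic_update(delta):
    # Optimistic: read without locking, then validate the version at write time.
    read_value, read_version = record["value"], record["version"]
    new_value = read_value + delta
    with record_lock:                      # brief critical section only for validation
        if record["version"] != read_version:
            return False                   # conflict detected: caller must retry
        record["value"], record["version"] = new_value, read_version + 1
        return True

conservative_update(10)
while not optimistic_update(-5):
    pass                                   # retry until the version check succeeds
print(record)
```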

If concurrency is adequately controlled, then it is possible to maximize transaction throughput while avoiding the chance of corrupting the database.

In a world where over 2.5 billion gigabytes of data are generated every day, effectively managing that data is essential to a company’s performance.  These data management concepts can help the information technology professional implement database management systems effectively.

#Concurrency #DMS #ComputerScience

 

 

The Impact of Distributed Data Management Systems in Healthcare


A distributed database management system (DDBMS) is used when there are large datasets.  It is a centralized application that manages a distributed database as if it were all stored on the same computer.  The logically interrelated databases are used to make the distribution of data transparent to users.  A centralized database system, on the other hand, keeps all data in one single database at one single location.  There are pros and cons to a DDBMS versus a centralized database system as they relate to system architecture, system functions and suitable applications.

Many companies have multiple locations that may benefit from a DDBMS.  For example, consider a company like HCA, with hospitals located across the country.  Each state may have a different database that holds medical records, appointment history, and so on.  The management at each hospital can query its own data, but the corporate office can also perform queries across the country.  Also, as new hospitals are added to the system, they can be added to the network without disrupting the operations of other sites.
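A corporate-level query over several site databases can be sketched as a simple fan-out: the same question is asked of each local database and the answers are combined. The site files and the appointments table below are hypothetical, with SQLite standing in for each site's DBMS:

```python
# A sketch of a corporate query fanned out across per-site databases.
# The site files and table layout are hypothetical; SQLite stands in for each site's DBMS.
import sqlite3

SITE_DATABASES = ["hospital_co.db", "hospital_tx.db", "hospital_fl.db"]

def appointments_per_site(month: str):
    results = {}
    for path in SITE_DATABASES:
        conn = sqlite3.connect(path)       # each site keeps its own local data
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM appointments WHERE strftime('%Y-%m', visit_date) = ?",
            (month,),
        ).fetchone()
        results[path] = count
        conn.close()
    return results                          # the corporate office sees the combined picture

# Example (assuming the site databases exist): print(appointments_per_site("2024-03"))
```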

Another benefit of a DDBMS is that users can access data stored at other sites.  For example, take a hospital system that is trying to understand the number of falls occurring across its facilities.  If the Chief Medical Officer needs to query that data often, it can be placed or replicated at the site nearest her, potentially enhancing the speed of database access.

Also, if there is a DDBMS failure at one of the hospital sites, for example, the entire system does not break down.  A DDBMS, unlike a centralized DBMS, can continue to function despite such local failures.

However, unlike a centralized DBMS, a DDBMS is more complex in the sense that it hides the distributed nature of the data from the user and allows data replication, which, if unmanaged, can create challenges in reliability and performance.

Another serious risk in the hospital example with a DDBMS is security.  In the world of HIPAA, where data breaches can be very expensive and brand damaging, it is harder to control security when the approach is not centralized.

Also, with the field of evidence-based medicine advancing (algorithms that determine best-case scenarios for treatment plans given the inputs), a DDBMS may prove more challenging in execution.

Do the pros outweigh the cons? What has been your experience?

#BigData #DataAnalytics #DDBMS

 

Artificial Intelligence & Discrimination

In this rapidly changing digital world, artificial intelligence helps machines take on more complex responsibilities on a regular basis.  In Singapore, what started as a program to prevent terrorism is now applied to immigration policy, the property market and even school curricula.  At a basic level, algorithms collect data about users, and that data can then determine access to many pieces of our civil society.

One of the challenges that comes with this opportunity is understanding cognitive bias to make sure that we are not programming machines to mirror discrimination.  There is growing evidence that artificial intelligence applications threaten to discriminate against legally protected groups.  One example came in 2015, when it was discovered that Google’s photo application, which applies automatic labels to pictures, was classifying images of Black people as gorillas.  The rules the models learn are driven by the specific images they are fed, and how that programming happens matters if our world is going to be based on those selections.  In theory, society should be supported by the advancements of artificial intelligence, not locked into a world where a history of discrimination is mirrored.  For example, what does this mean for robots that are armed to replace police work, or for private individuals that want security robots?  Computer scientists at Carnegie Mellon University found that women are less likely than men to be shown ads on Google for high-paying jobs.  Combine examples like this with how few women go into the field of computer science, and it is evident that there could be a problem if these issues are not addressed.

Part of the root cause has to do with the fact that programming can be done with good intentions, but if diversity is not represented upfront, then there can be unintentional biases downstream.  Also, flawed algorithms are not immediately discoverable, and companies have little to no self-interest in making this area more transparent.

Potential solutions include having more diversity upfront in how we program the machines and potentially more public policy to drive transparency and accountability.  Ethical considerations need to be taught in computer programming classes, as these values are fundamental to minimizing discrimination.  Specifically, with unbiased machine learning being the subject of a lot of research, there is an opportunity for teachers to offer a consensus view of discrimination.  Also under consideration is how combining anti-discrimination law, data protection law and algorithmic fairness could support future design as it relates to artificial intelligence.  Seen from a glass-half-full standpoint, artificial intelligence offers an unprecedented opportunity to build inclusive practices into company processes, and there is a way to capitalize on the benefits while mitigating the risks.

 

Data Privacy

The 2018 Cambridge Analytica case forced a worldwide discussion on whether or not data privacy is a human right.  In that instance, 50 million Facebook profiles were harvested in Cambridge Analytica’s major data breach.

Companies that are trusted by consumers are abusing that trust every day by sharing information with third parties.  Many of today’s phone applications come with real-time tracking data, and that information is being capitalized on for profit.  The European Union successfully passed legislation to transform EU data privacy law, which now includes a range of individual rights designed to protect consumers whose personal information is collected, processed and stored by companies.  With the European General Data Protection Regulation now in force, organizations risk fines of 20 million euros or 4 percent of annual revenue, whichever is greater.  The tides are changing, or more appropriately, catching up to the rapidly changing digital environment.

Another case study that demonstrates the issues in the data privacy space is how biomedical health data is handled.  Due to electronic medical records, more health data is being saved than ever before.  Merging different tables to create new insights can surface sensitive information that is then sold to third parties.  The data sold to third parties is sometimes thought to be non-identifiable, but if multiple datasets are combined, a primary key can sometimes be reconstructed that traces the data back to a particular individual.  And even though the United States granted individuals a right to access their lab and genomic data in 2014, there is not a clear legal framework or set of ethical and accountability guidelines for the use of that data.

Regarding the cause of this problem, some of it may have to do with the rapidly advancing digital world and the specialization of knowledge.  With the rapid pace of technological development, from wearables to stem cell research, public policy has just not caught up yet.  Also, each area of computer science involves specialized knowledge, which makes it more difficult for key stakeholders to know what questions to ask and what controls to put in place to safeguard citizens.

Citizens, providers, and administrators need to be better aware of privacy issues and have proper guidance on how to manage them across different parts of the delivery system.  Solutions could include future legislation, similar in theme to the recent advancements in Europe, that recognizes data privacy as a human right and has consequences for those who violate the policy.  Also, more education for the public will likely help build momentum to balance the scales in the future.  Finally, clearer guidelines regarding roles and responsibilities as they relate to third-party data would also be a useful start.