Why Did Amazon, Google, Netflix and Facebook Switch to NoSQL?

While relational databases are the most widely used technology in big data, they are not well suited to handling the exponential growth of real-time data. The growth of information on the internet, for example, is a challenge for relational databases. Each day the world creates 2.5 quintillion bytes of data, and 90% of that data is unstructured. By 2020, it is estimated that over 40 zettabytes of data will be created.

To help overcome the challenges of this unstructured growth, many developers have been switching to “NoSQL” or “Not Only SQL” databases. NoSQL database systems are distributed, non-relational databases that also use non-SQL languages and mechanisms to work with data. NoSQL databases can be found in companies like Amazon, Google, Netflix, and Facebook that depend on large volumes of data not suited to relational databases. These databases can work efficiently with today's unstructured data, such as social media, email, and documents. NoSQL offers a simple query language with high scalability and reliability.

Relational database management systems (RDBMS) have several other limitations besides the handling of unstructured data. For example, scaling a relational database means distributing it across multiple servers, which can be challenging. There is also a caching layer issue, where a distributed cache can cause de-normalization. Additionally, sharding can cause problems when data has to be rebalanced. Finally, dealing with billions of rows in a traditional database can get expensive.

On the other hand, with NoSQL databases, the workload can be automatically spread across multiple servers. Also, unlike an RDBMS, NoSQL is highly distributable, with clusters of servers that can hold the database. Data is cached in memory in a way that is transparent to application developers and users. It also allows easy scaling to adapt to the complexity of the cloud. With lots of open-source options, NoSQL technology enables developers to try the software before buying the product. Since a DBA is not needed to refactor SQL and create materialized views, this can also potentially reduce cost.
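As a rough illustration of how a workload can be spread across servers, here is a minimal Python sketch of hash-based key placement. The node names and the `shard_for` helper are hypothetical, and the modulo scheme is a simplification chosen for brevity, not the method of any particular NoSQL product.

```python
import hashlib

def shard_for(key, nodes):
    """Pick the node that stores a key by hashing the key (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # The same key always hashes to the same node while the node list is unchanged.
    return nodes[int(digest, 16) % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]  # hypothetical server names
keys = ["user:1", "user:2", "user:3", "user:4"]
placement = {key: shard_for(key, nodes) for key in keys}
```

With plain modulo hashing, adding or removing a node changes the placement of most keys, which is the rebalancing problem mentioned earlier. Consistent hashing is a common way to reduce that movement.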

While NoSQL is an expanding field that challenges many assumptions made by companies around maintaining legacy systems, it is a credible movement that is solving real problems posed by big data.

#BigData #NoSQL

Two Important Tools in the Digital Revolution: Data Warehouses and Data Mining

Data warehouses allow us to store large amounts of data and then mine that data.  Data mining helps provide insight into customer behavior or business strategy by exploring patterns that are relevant to business success.  Both are part of business intelligence, which allows data to be transformed into actionable insights.

Data Warehouse

The first use of the term dates back to the 1980s, and during the 1990s data warehousing emerged as its own research area. Today’s data warehouses are at the very core of modern decision making. Data warehouses store data from different sources making it available in a multidimensional form or aggregate form for creating knowledge to drive decision making. It can also be used to analyze patterns and trends. Data warehouses store detailed historical data for use in decision support systems. At a high level, data warehousing systems are designed to support OLAP which helps with data analysis and visualization.  Data warehouses receive much of their value through data mining techniques that produce knowledge.
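To make the idea of an "aggregate form" concrete, the following Python sketch sums a few made-up sales rows along one dimension at a time. The rows and the `roll_up` helper are assumptions for illustration only; a real warehouse would do this with its own query engine.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, quarter, sales_amount)
sales = [
    ("East", "Q1", 100.0),
    ("East", "Q2", 150.0),
    ("West", "Q1", 80.0),
    ("West", "Q2", 120.0),
]

def roll_up(rows, dimension_index):
    """Sum the sales amount for each value of the chosen dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension_index]] += row[2]
    return dict(totals)

by_region = roll_up(sales, 0)   # {'East': 250.0, 'West': 200.0}
by_quarter = roll_up(sales, 1)  # {'Q1': 180.0, 'Q2': 270.0}
```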

Data Mining

Data mining can occur on many different data sets including the data warehouse. At a simple level, data mining is just applying mathematical techniques to extract patterns and trends to provide information. With data sets growing exponentially over the last few years, there is more demand to turn raw data into knowledge. In artificial intelligence and machine learning, data mining is a rapidly growing field.

There is a general process for turning raw data into knowledge. First, the raw data must be selected, and then pre-processing can occur to detect outliers and other trends. Some algorithms do not work well with outliers, so the outliers may need to be removed, or they may indicate a data quality issue. Missing values can also be detected at this stage. Next, the data is transformed, which also helps normalize it. Correlated variables can be better understood at this point. If the variables are uncorrelated, more valuable information may be provided. Finally, the actual data mining is performed on the transformed data. Algorithms, such as a decision tree, can be applied to find trends and patterns. Once the data is accurately divided into patterns, the user looks at the patterns and applies an interpretation.
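The steps described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the readings are invented, the outlier rule (more than five median absolute deviations from the median) is just one possible choice, and the final "mining" step is a one-rule split standing in for a real algorithm such as a decision tree.

```python
from statistics import median

raw = [12.0, 15.0, None, 14.0, 13.0, 90.0, 16.0]  # hypothetical readings

# Step 1: selection and pre-processing (missing values, outliers).
missing_count = sum(1 for x in raw if x is None)
present = [x for x in raw if x is not None]

med = median(present)
mad = median(abs(x - med) for x in present)  # median absolute deviation
outliers = [x for x in present if abs(x - med) > 5 * mad]
cleaned = [x for x in present if abs(x - med) <= 5 * mad]

# Step 2: transformation (min-max normalization to the range 0..1).
# Assumes the cleaned values are not all identical.
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]

# Step 3: mining, reduced here to a single threshold rule.
groups = {
    "low": [x for x in normalized if x < 0.5],
    "high": [x for x in normalized if x >= 0.5],
}
```

Whether the value 90.0 should be dropped or investigated as a data quality issue is still a judgment for the analyst; the code only flags it.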

There are many data mining functionalities, including but not limited to classification and prediction, cluster analysis, and trend and evolution analysis. In data mining, it is essential to be able to identify which of the patterns is most useful for achieving the goal. Data mining is about information discovery, which provides support for better decision making. Generally, patterns are attractive if they are understood and validate a hypothesis that the user wants to confirm. Potential applications for data mining include targeted marketing, cross-selling, market segmentation, risk analysis, competitive analysis and text mining.

There are some common issues in data mining. There can be challenges in the methodology, in handling noise and incomplete data, and in integrating discovered knowledge with existing data. User interaction can also be a challenge, including ad-hoc mining, visualization of data mining results and interactive mining of knowledge at different levels of abstraction.

Business Intelligence

Business intelligence can help organizations improve efficiency. There is no good in storing data if you cannot turn it into actionable information. There are many analytical support tools today for data mining that can help you identify key trends and patterns in your organization’s data to help you move your business forward.  We live in the age of the digital revolution, where information is power.

#DigitalRevolution #DataWarehouse #DataMining #BusinessIntelligence

 

Artificial Intelligence:  Supervised Versus Unsupervised Learning

While people learn quite a bit from the human experience, machines learn from following instructions.  However, machines can also learn from experience which in the world of a computer means learning from previous data.

Supervised learning and unsupervised learning are two different machine learning methods.  The supervised learning approach is used for most practical learning and analyzes labeled data to produce an inferred function for mapping inputs to outputs.  The algorithm makes generalizations based on the data to understand and predict new situations.  Examples include recommendations on Amazon, face recognition technology and even a robot learning to sort garbage using visual identification.
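As a small illustration of learning a mapping from labeled examples, here is a one-nearest-neighbor classifier in Python. The measurements, labels and the `predict` helper are invented for the sketch, and nearest-neighbor is only one of many supervised methods.

```python
import math

# Hypothetical labeled training data: (height_cm, weight_kg) -> label
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def predict(point):
    """Return the label of the training example closest to the new point."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], point))
    return nearest[1]

first = predict((152.0, 52.0))   # closest examples are labeled "small"
second = predict((182.0, 88.0))  # closest examples are labeled "large"
```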

Unsupervised learning tries to find hidden structure within unlabeled data.  The data is clustered into different groups through partitioning in order to understand its structure or distribution.  Applications can include market segmentation for targeting customers, fraud detection in banking, image segmentation and gene clustering.
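Clustering through partitioning can be sketched with a tiny one-dimensional k-means in Python. The spending figures, the starting centers and the `kmeans_1d` helper are assumptions made for the example; real segmentation work would use more features and a tested library.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer; no labels are provided.
spend = [10.0, 12.0, 11.0, 95.0, 100.0, 105.0]
centers, clusters = kmeans_1d(spend, centers=[min(spend), max(spend)])
```

No labels are supplied here. The two groups (low spenders and high spenders) emerge from the data alone, which is the point of the unsupervised approach.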

Challenges for machine learning techniques include issues around volume, velocity, and variability.  More specifically, there are challenges in decision making, modeling, human interactions and data-driven scalability.

There are also some situations where there is a large amount of input data and only some of the data is labeled, which creates a semi-supervised learning situation. An example would be where there are a lot of images, but only a few of the images are labeled.  This scenario can happen given the reasonably low cost of storing unlabeled data.  In this instance, unsupervised learning can help provide insight into the structure of the input variables and supervised learning can help make predictions on the unseen data.

These tools are widely accessible and straightforward for people to set up on personal computers or to create simple models to help advance business goals.

#DataMining #SupervisedLearning #UnsupervisedLearning #AI

Business Intelligence and Decision Support Systems

Business intelligence has transformed the way critical decisions can be made in the digital age. Early generations included the Decision Support System (DSS), which has both a data warehouse system and an online analytical processing (OLAP) system. However, there are important differences between the data warehouse system and the OLAP system regarding suitable applications, system architecture, and system functions.

A data warehouse provides data for decision making. Online analytical processing, or OLAP, helps with data analysis and visualization. Data warehousing systems are designed to support OLAP.  The DBMS that runs decision making queries is the Decision Support System.

The purpose of the data warehouse is to store detailed historical data, whereas the OLAP server is for analytics.  Even though the two have differences, they can work together to achieve business goals.

Regarding access, data warehouses have read-only access and singular list-oriented queries and reports.  OLAP servers have both read and write access with iterative and comparative analytic investigation access modes.  Also, where a data warehouse can have slow query response, the OLAP server is fast with more consistent query response.

Regarding data storage, the data warehouse holds cross-subject data, single-subject-area data marts and historical data. With the OLAP server, there are many cubes where each cube is a single subject area.

In an OLAP server, the data is dimensional and hierarchical. The design goal of the data structure in a data warehouse is list-oriented query whereas the OLAP server design goal is analysis.  However, the data warehouse can also store terabytes of data whereas the OLAP server typically deals more with gigabytes. Comparatively, the hardware investment in a data warehouse is not cheap whereas the cost of the OLAP server can vary.
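To show what "dimensional and hierarchical" can mean in practice, here is a small Python sketch of a cube with a product dimension and a time dimension, where months roll up into quarters. The cell values and the helper names (`quarter_of`, `roll_up_to_quarter`, `slice_product`) are hypothetical and do not correspond to any OLAP server's API.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, month); values are units sold.
cells = {
    ("widget", "2023-01"): 10,
    ("widget", "2023-02"): 12,
    ("widget", "2023-04"): 7,
    ("gadget", "2023-01"): 5,
}

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn', one level up the time hierarchy."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up_to_quarter(cube):
    """Aggregate monthly cells into quarterly cells."""
    out = defaultdict(int)
    for (product, month), units in cube.items():
        out[(product, quarter_of(month))] += units
    return dict(out)

def slice_product(cube, product):
    """Fix one value of the product dimension and keep the time dimension."""
    return {time: units for (p, time), units in cube.items() if p == product}

quarterly = roll_up_to_quarter(cells)
widget_by_quarter = slice_product(quarterly, "widget")
```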

Regarding implementation, the data warehouse is also slow, taking months or years, whereas an OLAP implementation can happen in days or weeks. The adaptability of a data warehouse is also low, whereas the OLAP server is easily modified.

A Decision Support System (DSS) analyzes business data and presents it so that users can make business decisions. The data warehouse for decision support takes data from different sources and then uses advanced tools and technologies to support the development of Decision Support Systems.

The Decision Support System starts with information sources, then moves through the data warehouse, to the OLAP servers that then go to the client for querying, reporting and data mining.  OLAP provides high levels of functionality for decision making in analyzing large collections of historical data stored by the data warehouse.

With DSS focused on flexibility and adaptability to accommodate changes in the environment and decision making approach of the user, advancements are likely to continue to evolve over the next few years.

Overall, the data warehouse and OLAP can work together to provide information needed to support business goals.

#DataWarehouse #DecisionSupportSystem #OLAP

 

Differences in Kimball vs. Inmon Approach in Data Warehouse Design

When working on a data warehouse project, there are two well-known methodologies for data warehouse system development: the Corporate Information Factory (CIF) and the Business Dimensional Lifecycle (BDL). Which one is better for business? The following summary reviews the advantages and disadvantages of each approach.

Corporate Information Factory Definition and Main Principles

This approach, defined by Bill Inmon, is top-down: data is normalized to third normal form, and the enterprise data warehouse creates data marts. It is a single repository of enterprise data and creates a framework for Decision Support Systems (DSS). For this top-down approach, the data integration requirements are enterprise-wide.
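A toy version of the top-down idea can be sketched with Python's built-in sqlite3 module: normalized enterprise tables come first, and a data mart is derived from them. The table names, the sample rows and the use of a view as the "data mart" are all assumptions for illustration, not a prescribed CIF design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized enterprise tables: each fact is stored once.
    CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT,
                           region_id INTEGER REFERENCES region(region_id));
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id),
                         amount REAL);
    INSERT INTO region VALUES (1, 'East'), (2, 'West');
    INSERT INTO customer VALUES (10, 'Acme', 1), (11, 'Globex', 2);
    INSERT INTO orders VALUES (100, 10, 40.0), (101, 10, 60.0), (102, 11, 25.0);

    -- A data mart derived from the enterprise tables (a view, for brevity).
    CREATE VIEW mart_sales_by_region AS
        SELECT r.region_name AS region, SUM(o.amount) AS total
        FROM orders o
        JOIN customer c ON c.customer_id = o.customer_id
        JOIN region r ON r.region_id = c.region_id
        GROUP BY r.region_name;
""")

mart = conn.execute(
    "SELECT region, total FROM mart_sales_by_region ORDER BY region"
).fetchall()
```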

Corporate Information Factory Pros and Cons

Pros:

  • Maintenance is fairly easy
  • Subsequent project costs are lower

Cons:

  • Building the data warehouse can be time consuming
  • There can be a high initial cost
  • Longer time for start-up
  • Specialist team required

Business Dimensional Lifecycle Definition and Main Principles:

This approach, defined by Ralph Kimball, is bottom-up: data marts are created first to provide reporting. The data architecture is a collection of conformed dimensions and conformed facts that are shared between two or more data marts. For this bottom-up approach, the data integration requirements are those of the individual business areas.
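The idea of a dimension shared between data marts (what Kimball calls a conformed dimension) can be sketched with Python's built-in sqlite3 module. The two fact tables, the customer dimension and the sample rows are hypothetical; the point is only that both marts join to the same dimension, so their results line up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table shared by two data marts.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);

    -- Sales mart and support mart, each with its own fact table.
    CREATE TABLE fact_sales (customer_id INTEGER REFERENCES dim_customer(customer_id),
                             amount REAL);
    CREATE TABLE fact_support (customer_id INTEGER REFERENCES dim_customer(customer_id),
                               tickets INTEGER);

    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
    INSERT INTO fact_support VALUES (1, 3), (2, 1);
""")

# Because both marts use the same customer keys, their measures can be combined.
rows = conn.execute("""
    SELECT d.name,
           (SELECT SUM(s.amount) FROM fact_sales s WHERE s.customer_id = d.customer_id),
           (SELECT SUM(t.tickets) FROM fact_support t WHERE t.customer_id = d.customer_id)
    FROM dim_customer d
    ORDER BY d.customer_id
""").fetchall()
```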

Business Dimensional Lifecycle Pros and Cons

Pros:

  • Takes less time to build the data warehouse
  • Low initial cost with fairly predictable subsequent costs
  • Fast initial set up
  • Only a generalist team is required

Cons:

  • Maintenance can be difficult, redundant and subject to revisions

In the top-down approach, unlike the bottom-up approach, there is an enterprise data warehouse, relational tools, normalized data model, complexity in design, and a discrete time frame. In the bottom-up approach, unlike the top-down approach, there are dimensional tools, process orientation and a slowly changing time frame.

Both CIF and BDL use Extract, Transform and Load (ETL) to load the data warehouse. But, how the data is modeled, loaded and stored is different. The different architecture impacts the delivery time of the data warehouse and the ability to accommodate changes in ETL design.

#DataWarehouse #DataAnalytics

 

Comparison of Persistence Mechanisms (XML, RDBMS, NXD)

Two illustrative approaches to support the persistency of eXtensible Markup Language (XML) data include the relational database management system (RDBMS) and the native XML database (NXD).

A relational database, the kind most people are familiar with, consists of tables with fields and rows.  Examples include Oracle, Sybase, IBM DB2, Microsoft SQL Server and Informix.  An RDBMS stores data in tables that are organized in columns.  XML documents can be converted for storage in a relational database and queried with tools like SQL. In an RDBMS, the persistence framework contains the logic for mapping application classes to the database or other persistent storage sources.  When the database is refactored, the metadata that describes the mappings may need to be updated.  Approaches to storing XML data in an RDBMS include flat storage and shredded storage.  Flat storage puts an entire XML document in a single cell, whereas shredded storage normalizes it into millions of parts.
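The difference between flat and shredded storage can be shown with a short Python sketch that uses the standard library's sqlite3 and ElementTree modules. The order document, the table names and the column layout are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical XML document to persist.
doc = "<order id='7'><item sku='A1' qty='2'/><item sku='B2' qty='1'/></order>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flat_docs (doc_id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")

# Flat storage: the whole document goes into one cell.
conn.execute("INSERT INTO flat_docs (doc_id, body) VALUES (?, ?)", (7, doc))

# Shredded storage: the document is broken into rows and columns.
root = ET.fromstring(doc)
order_id = int(root.get("id"))
for item in root.findall("item"):
    conn.execute(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        (order_id, item.get("sku"), int(item.get("qty"))),
    )

# Shredded data can be queried with ordinary SQL.
total_qty = conn.execute(
    "SELECT SUM(qty) FROM order_items WHERE order_id = 7"
).fetchone()[0]

# Flat data comes back as the original text and must be parsed again to be used.
stored = conn.execute("SELECT body FROM flat_docs WHERE doc_id = 7").fetchone()[0]
```

Flat storage keeps the document intact but hides its contents from SQL, while shredding makes the contents queryable at the cost of a mapping step.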

While RDBMSs are the most commonly used type of persistence mechanism, there are also other options, including native XML databases.  NXD examples include dbXML, OpenLink Software’s Virtuoso, Software AG’s Tamino and X-Hive/DB.  NXDs have XML nodes and documents.  A native XML database is based on “containers” that are designed to work with XML data.  Generally speaking, NXDs are not intended to replace existing databases but provide a tool for storage and operations on XML documents.  In terms of storage, NXDs are good at storing XHTML or DocBook type data where the data is less rigid.  NXDs usually store a modified form of the entire XML document in the file system in a compressed or pre-parsed binary form.  Mapping the structures is also a possibility.
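To give a feel for working with documents and nodes directly, the Python sketch below runs a path query over a small in-memory collection of XML documents. It uses the standard library's ElementTree as a stand-in; it is not a native XML database, and the collection contents and the `query` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical "container" of XML documents, keyed by document name.
collection = {
    "guide.xml": "<book><chapter><title>Intro</title><para>Hello</para></chapter>"
                 "<chapter><title>Usage</title></chapter></book>",
    "notes.xml": "<book><chapter><title>Notes</title></chapter></book>",
}

def query(docs, path):
    """Run a path expression against every document and collect matching node text."""
    results = []
    for name, text in docs.items():
        root = ET.fromstring(text)
        results.extend((name, node.text) for node in root.findall(path))
    return results

titles = query(collection, "./chapter/title")
```

No table mapping is involved: the query is expressed in terms of the document structure itself, which is the convenience the XML database pros below refer to.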

The following summarizes the general pros and cons of each mechanism:

Relational Databases

Pros:

  • Mature technology
  • Dominate the persistence mechanism market
  • Several well-established vendors
  • Standards, such as Structured Query Language (SQL) and JDBC well defined and accepted
  • Significant experience base of developers

Cons:

  • Object-relational impedance mismatch can be a significant challenge
  • Mapping objects to relational databases takes time to learn

XML Databases

Pros:

  • Native support for persisting XML data structures
  • For XML intensive applications it removes the need for marshalling between XML structures and the database structure

Cons:

  • Emerging technology
  • Standards (the XML equivalent of SQL) not yet in place for XML data access
  • Not well suited for transactional systems

There is no one right answer on how to approach XML persistence.  While there are many new tools in the marketplace, having a consistent and well documented approach is important. The next decade will illustrate whether or not XML can live up to the hype of promises for data storage.

#Persistence #XML #RDBMS #NXD