Non-SQL, non-relational systems are either niche markets or are fast becoming SQL/RM systems.
MapReduce Systems#
The Evolution of Big Data Processing: From Hadoop to BigTable#
In the world of big data, various technologies have come and gone, each bringing its own set of solutions and improvements. Let's take a look at the journey of data processing, focusing on key players like Google, Yahoo, and Meta, and how they have shaped the landscape over the years.
Google's Transition to BigTable#
Google, a pioneer in web crawling and data processing, initially relied on periodic crawls of the internet. The data collected from these crawls was batch-processed until 2014, when Google moved the crawl database to BigTable, a highly scalable database designed for real-time updates, which was crucial for keeping the crawl database current.
Understanding Map and Reduce#
The core of many data processing tasks involves two main operations: Map and Reduce.
- Map: A user-defined function (UDF) that performs computations or filtering on the data.
- Reduce: Essentially a GROUP BY operation that aggregates the results.
A typical query might look like this:
SELECT map() FROM crawl_table GROUP BY reduce()
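The pattern can be sketched as a single-process word count (a hedged sketch: `map_fn` and `reduce_fn` are hypothetical UDF names, and a real framework would distribute both phases across a cluster):

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce model: map() emits
# key/value pairs, the framework groups them by key, and reduce()
# aggregates each group, mirroring
# SELECT map() FROM crawl_table GROUP BY reduce().

def map_fn(document):
    # UDF: tokenize a document and emit (word, 1) for each word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # UDF: aggregate all counts for one word (a GROUP BY on the key).
    return (key, sum(values))

def map_reduce(documents):
    groups = defaultdict(list)
    for doc in documents:                 # "map" phase
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(                          # "reduce" phase, once per group
        reduce_fn(key, values) for key, values in groups.items()
    )

counts = map_reduce(["the quick fox", "the lazy dog"])
# counts["the"] == 2
```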
Yahoo and the Birth of Hadoop#
Yahoo created Hadoop, an open-source framework that ran on top of the Hadoop Distributed File System (HDFS). Hadoop was revolutionary because it allowed for the distributed processing of large data sets across clusters of computers using simple programming models.
The Rise and Fall of Hadoop#
Hadoop, with its MapReduce (MR) programming model, was excellent for "one-shot" tasks such as text processing and Extract, Transform, Load (ETL) operations. However, it was a batch processing system, which meant it wasn't suitable for tasks requiring real-time updates. This limitation led Google to migrate to BigTable.
Enhancements and Alternatives#
- Hive: Developed by Meta (formerly Facebook), Hive provided a SQL-like interface on top of Hadoop, making it easier for users to query large datasets.
- Presto: Also created by Meta, Presto was designed to replace Hive, offering faster query performance.
- Cloudera: Rebranded itself around Hadoop and built Impala, a relational database management system (RDBMS) that runs on HDFS but bypasses the Hadoop MapReduce layer.
- MapR: Developed Drill, another tool aimed at improving data query performance.
The Decline of Hadoop#
Around 2014, Hadoop began to lose its prominence. Newer technologies and frameworks emerged, offering better performance, real-time processing, and more user-friendly interfaces. The shift towards these modern solutions marked the end of Hadoop's era as the go-to big data processing framework.
Key-value Stores#
What Are Key-Value Stores?#
Imagine you have a simple box where you can store items. Each item has a label (key) and a content (value). This is essentially what a Key-Value (KV) store does. It's a basic way to store and retrieve data using pairs of keys and values.
How Do They Work?#
Key-Value stores operate on a straightforward model:
- Key: Think of this as a unique identifier or label.
- Value: This is the actual data or content associated with the key.
You can perform three main operations:
- Get: Retrieve the value associated with a key.
- Set: Store a value with a specific key.
- Delete: Remove the key and its associated value.
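These three operations can be sketched with a minimal in-memory store (the `KVStore` class here is hypothetical; production systems such as Redis or Memcached expose the same three verbs over a network protocol):

```python
# Minimal in-memory sketch of the key-value interface:
# get, set, and delete over a plain dictionary.

class KVStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("session:42", {"user": "alice"})
store.get("session:42")      # -> {"user": "alice"}
store.delete("session:42")
store.get("session:42")      # -> None
```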
Why Use Key-Value Stores?#
- Simplicity: They are the simplest model for data storage.
- Performance: They often have higher and more predictable performance compared to traditional relational databases (RDBMS).
- Specialized Use: They are particularly useful for narrowly focused applications, such as caching and storing session data.
Examples of Key-Value Stores#
- Memcached: Often used for caching to speed up web applications.
- Redis: Known for its speed and flexibility, often used for real-time applications.
- Dynamo KV: Developed by Amazon, used for highly scalable and distributed systems.
Embedded Storage Managers#
Some Key-Value stores can run within the same address space as a higher-level application. This means they are embedded directly into the application, providing fast and efficient data storage.
Examples include:
- BerkeleyDB
- Google's LevelDB
- RocksDB
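The embedded idea can be illustrated with Python's bundled sqlite3 module standing in for an embedded engine (a sketch only; LevelDB and RocksDB expose native get/put library APIs rather than SQL, but the in-process property is the same):

```python
import sqlite3

# The storage engine runs inside the application's address space:
# no network hop, no separate server process. sqlite3 ships with
# Python and stands in here for engines like LevelDB or RocksDB.

db = sqlite3.connect(":memory:")           # embedded, in-process store
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

def put(k, v):
    db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))

def get(k):
    row = db.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
    return row[0] if row else None

put("user:1", "Alice")
get("user:1")   # -> "Alice"
```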
Limitations of Key-Value Stores#
While they are great for simple use cases, Key-Value stores are not ideal for applications that require complex data relationships. For instance, if you need to perform operations that involve multiple pieces of related data, a relational database (RDBMS) might be a better choice.
When to Choose RDBMS Over Key-Value Stores#
Even for simple applications, an RDBMS might be a better option because it offers a path forward if the application's complexity increases over time. Relational databases can handle more complex queries and relationships between data.
Examples of Integration and Replacement#
- MySQL: Exposes a pluggable storage engine API that lets developers replace the default storage manager.
- Meta (formerly Facebook): Built RocksDB and, via the MyRocks storage engine, used it to replace InnoDB in its MySQL deployments.
- MongoDB: Uses WiredTiger’s Key-Value store, which also allows it to function as a database management system (DBMS).
Document Databases#
In the world of databases, document databases have become increasingly popular due to their flexibility and performance benefits. But what exactly are document databases, and why are they gaining traction? Let's break it down in simple terms.
What is a Document Database?#
A document database represents data as a collection of record objects. Think of it as a digital filing cabinet where each file (or document) contains data in a structured format. This structure is often represented using field/value pairs, much like entries in a dictionary.
Example in JSON#
One common way to represent data in a document database is using JSON (JavaScript Object Notation). Here's a simple example:
{
  "name": "First Last",
  "orders": [
    { "id": 123, "items": [...] },
    { "id": 456, "items": [...] }
  ]
}
In this example, a document contains a name and a list of orders, each with its own ID and items.
Evolution from Old Formats#
Before JSON, data was often formatted using SGML and XML. These formats were more complex and less efficient compared to JSON.
Performance Benefits#
One key advantage of document databases is performance. By denormalizing entries into nested structures, you can store related data together. This approach reduces the need to dispatch multiple queries to retrieve related data, solving the N+1 problem commonly seen in Object-Relational Mappings (ORMs).
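The N+1 effect can be sketched with hypothetical toy data: the normalized layout needs one query for the user plus one per order, while the denormalized document returns everything in a single fetch.

```python
# Sketch of the N+1 pattern that a denormalized document avoids.
# All data and helper names are hypothetical toy examples.

users = {1: {"name": "First Last"}}
orders_by_user = {1: [{"id": 123}, {"id": 456}]}

def fetch_normalized(user_id):
    queries = 1                      # SELECT * FROM users WHERE id = ?
    user = dict(users[user_id])
    user["orders"] = []
    for order in orders_by_user[user_id]:
        queries += 1                 # one SELECT per order (the "+N")
        user["orders"].append(order)
    return user, queries

documents = {1: {"name": "First Last",
                 "orders": [{"id": 123}, {"id": 456}]}}

def fetch_document(user_id):
    return documents[user_id], 1     # a single lookup returns everything

_, n_queries = fetch_normalized(1)   # -> 3 round trips
_, d_queries = fetch_document(1)     # -> 1 round trip
```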
Drawbacks of Denormalization#
However, denormalization isn't without its issues:
- Data Duplication: If the relationship isn't one-to-many, you might end up with duplicated data.
- Speed: Prejoined data isn't always faster than traditional joins.
- Data Independence: There's a lack of flexibility in how data can be manipulated independently.
Modern Trends in NoSQL#
MongoDB is the most popular document database, but the landscape is evolving. Many NoSQL databases now offer SQL-like interfaces, blending the best of both worlds. Examples include:
- DynamoDB's PartiQL
- Cassandra's CQL
- Aerospike's AQL
- Couchbase's SQL++
- MongoDB's SQL interface for its Atlas service
These databases have their proprietary query languages inspired by SQL, reflecting a shift in the NoSQL message from "Do not use SQL" to "Not only SQL." This means that SQL is still useful for certain tasks, and modern systems are increasingly integrating SQL-like capabilities.
Column-Family / Wide-Column#
Understanding Column-Family NoSQL Databases#
When we talk about NoSQL databases, there's a special category known as column-family databases. This type of database offers a unique way to store and manage data, and it's different from traditional relational databases.
What Makes Column-Family Databases Unique?#
Column-family databases simplify the document data model by allowing only one level of nesting. This means that while you can have complex data structures, they can't be nested multiple levels deep. Each record in this type of database can have optional attributes, and the cells within these records can contain arrays of values.
An Example to Illustrate#
Imagine you have a database that maps user identifiers to JSON documents. Here's how it might look:
- User1 ->
{"name": "Alice", "accounts": [123, 432], "email": "xxx@xxx.edu"}
In this example, User1 has attributes like name, accounts, and email. The accounts attribute is an array containing multiple values.
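The one-level nesting rule can be sketched as a small validation check (a hypothetical helper, not any real system's API):

```python
# Sketch of the column-family constraint: records may have optional
# attributes and array-valued cells, but values cannot nest further.

def is_valid_wide_column(record):
    for value in record.values():
        if isinstance(value, dict):
            return False                      # no nested objects
        if isinstance(value, list) and any(
            isinstance(v, (dict, list)) for v in value
        ):
            return False                      # arrays of scalars only
    return True

user1 = {"name": "Alice", "accounts": [123, 432], "email": "xxx@xxx.edu"}
nested = {"name": "Bob", "address": {"city": "NYC"}}   # nests too deep

is_valid_wide_column(user1)    # -> True
is_valid_wide_column(nested)   # -> False
```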
The Pioneers: Google's BigTable#
Google's BigTable was the first database to introduce the column-family model, laying the groundwork for this type of NoSQL database. Following its success, Cassandra and HBase were developed as open-source implementations of BigTable's design.
Limitations to Keep in Mind#
While column-family databases offer flexibility and scalability, they do come with some limitations:
- Lack of Joins: Unlike relational databases, you can't perform joins between tables in column-family databases. This can make complex queries more challenging.
- Lack of Secondary Indexes: Secondary indexes are used to improve the speed of data retrieval. However, column-family databases often lack this feature, which can affect performance for certain types of queries.
In summary, column-family NoSQL databases like BigTable, Cassandra, and HBase offer a powerful way to handle large volumes of data with a simplified data model. However, they come with limitations such as the lack of joins and secondary indexes, which are important to consider when choosing the right database for your needs.
Text Search Engines#
Understanding Text Search Systems: From SMART to Modern Solutions#
In the 1960s, the SMART system revolutionized information retrieval with the introduction of the vector space model. This model involved tokenizing documents into a "bag of words" and creating full-text indexes, also known as inverted indexes, to facilitate efficient content queries.
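A minimal sketch of such an inverted index, assuming a crude whitespace tokenizer and hypothetical toy documents:

```python
from collections import defaultdict

# Minimal sketch of an inverted (full-text) index: tokenize each
# document into a bag of words, then map every term to the set of
# documents containing it, so queries avoid sequential scans.

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking saves the dog",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():       # crude tokenizer
        index[term].add(doc_id)

def search(*terms):
    # Documents containing ALL query terms (intersect the postings lists).
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

search("quick", "dog")   # -> {3}
```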
Today, leading text search systems like Elasticsearch and Solr dominate the field. Both of these systems rely on Lucene as their internal search library. However, they do come with certain limitations. For instance, they offer limited to no transaction capabilities and can experience significant downtime when recovering from data corruption, as this often requires rebuilding the document index from scratch.
On the other hand, traditional relational database management systems (RDBMSs) such as Oracle, Microsoft SQL Server, MySQL, and PostgreSQL also support full-text search indexes. Text data is inherently unstructured, meaning it lacks a predefined data model. To efficiently search through this unstructured data, systems aim to extract structure in the form of metadata and indexes, thereby avoiding the inefficiency of sequential searches, akin to finding a needle in a haystack.
There are three primary ways to manage text data within applications:
1. Multiple Systems Approach: This involves using a specialized text search system like Elasticsearch alongside an RDBMS for operational workloads. While this method allows the use of best-of-breed systems, it necessitates additional ETL (Extract, Transform, Load) processes to transfer data from the operational RDBMS to the text database.
2. Integrated RDBMS Approach: Some RDBMSs offer robust text-search integration capabilities, albeit with divergent APIs in SQL. An example of this is Django Haystack, which provides a unified interface for different search backends.
3. Polystore System: This approach uses middleware to mask the differences between systems, presenting a unified interface to the user. This can simplify the management of text data across different systems.
In summary, while the technology for text search has advanced significantly since the 1960s, each approach to managing text data comes with its own set of trade-offs. Whether using specialized systems, integrated RDBMSs, or polystore systems, the goal remains the same: to efficiently and effectively search through unstructured text data.
Array Databases#
Understanding Arrays in Modern Data Management#
Arrays are a fundamental concept in data management and come in various forms, each suited for different applications. Arrays can be thought of as collections of numbers or values organized in different dimensions. Let's break down the types of arrays and their uses:
- Vectors (1-Dimensional Arrays): Imagine a list of numbers, like a shopping list but with numerical values. Vectors are exactly that: single-dimensional arrays.
- Matrices (2-Dimensional Arrays): Think of a spreadsheet with rows and columns. Each cell holds a value, making it a matrix.
- Tensors (3-or-More-Dimensional Arrays): Picture a stack of spreadsheets, one on top of the other. That stack is a 3-dimensional tensor; tensors generalize to any number of dimensions.
Real-World Examples#
Arrays are used in various scientific and technical fields. For instance, scientific surveys that measure geographic regions might store sensor data in an array format like this:
{latitude, longitude, time, [vector-of-values]}
This array holds multiple values for each geographic point over time.
In genomic sequencing and computational fluid dynamics, arrays help manage and process large datasets efficiently.
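The {latitude, longitude, time, [vector-of-values]} layout can be sketched as a dense 3-D array in plain Python (grid sizes and readings are hypothetical; a real array DBMS would add chunking, compression, and indexing):

```python
# Sketch of sensor data as a dense 3-D array indexed by
# (lat, lon, time) cell coordinates, each cell holding a
# vector of readings.

N_LAT, N_LON, N_TIME = 2, 3, 4

# One (initially empty) reading vector per (lat, lon, time) cell.
grid = [[[[] for _ in range(N_TIME)]
             for _ in range(N_LON)]
             for _ in range(N_LAT)]

grid[0][1][2] = [14.5, 1013.2]      # e.g. [temperature, pressure]

# An array DBMS answers offset and slice queries: all readings at one
# (lat, lon) point across time is a slice along the time dimension.
time_series = grid[0][1]            # -> 4 cells, one holding the vector
```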
Arrays in Machine Learning#
Arrays are the backbone of most machine learning datasets. They store and manage the vast amounts of data required to train and test machine learning models.
Evolution of Array Database Management Systems (DBMS)#
- 1960s onward: The concept of array databases began with early systems such as APL (1960s), followed later by PICDMS.
- Older Array DBMS: Systems like Rasdaman and kdb+ are still in use today.
- Newer Array DBMS: SciDB and TileDB are modern systems designed to handle complex array data efficiently.
For scientific data, specialized formats like HDF5 and NetCDF are popular. HDF5 is often used for satellite imagery and other gridded scientific data.
Challenges with Real-World Array Data#
Real-world array data doesn't always fit neatly into a regular grid. For example, geospatial data is often divided into irregular shapes, requiring sophisticated metadata to map these shapes correctly. To handle this, array DBMSs must use advanced indexing and data structures to traverse the array dimensions efficiently.
Business Applications and SQL Support#
Interestingly, business applications rarely use dedicated array DBMSs. For array management to reach a broad audience, it therefore needs to be integrated into general-purpose products.
The SQL:2023 standard has introduced support for true multi-dimensional arrays (SQL/MDA), inspired by Rasdaman's RQL. This update allows SQL to represent arrays with arbitrary dimensions using integer-based coordinates, making it more versatile for handling complex data structures.
Vector Databases#
In the world of databases, a new type of database management system (DBMS) is gaining traction, known as a vector DBMS. While it shares some similarities with traditional array DBMSs, it is specifically designed to handle one-dimensional array data: vectors. Let's break down what this means and why it's important.
What is a Vector DBMS?#
A vector DBMS is a specialized database system used to store and manage single-dimension embeddings. These embeddings are generated by AI tools that convert data, such as text or images, into vectors. These vectors represent the latent semantic meaning of the data, essentially capturing its essence in a numerical format.
For example, imagine converting each Wikipedia article into a vector using a tool like Google BERT. This vector, along with additional metadata about the article (such as title, author, and content), is stored in the vector database. The structure would look something like this:
{title, data, author, [embedding-vector]}
The size of these embeddings typically ranges from hundreds to thousands of dimensions.
Key Differences from Array DBMSs#
The primary difference between vector and array DBMSs lies in their query patterns:
- Vector DBMSs: Designed for similarity searches, they find the closest match to a given input vector in a high-dimensional space. This is particularly useful for applications like recommendation systems, where finding similar items is crucial.
- Array DBMSs: These systems search for exact matches at a specific offset in a vector or extract slices across multiple vectors.
How Vector DBMSs Work#
Vector DBMSs are essentially document-oriented databases equipped with specialized indexes for approximate nearest neighbor (ANN) searches. These indexes significantly speed up the process of finding similar vectors. The database can also use metadata to filter records either before (pre-filter) or after (post-filter) the vector search.
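The filter-then-search flow can be sketched with a brute-force scan (toy records and two-dimensional embeddings, all hypothetical; a real vector DBMS replaces the linear scan with an ANN index such as HNSW):

```python
import math

# Brute-force sketch of a vector search with a metadata pre-filter:
# keep only records matching the filter, then rank the survivors by
# cosine similarity to the query embedding.

records = [
    {"title": "A", "year": 2020, "embedding": [1.0, 0.0]},
    {"title": "B", "year": 2023, "embedding": [0.9, 0.1]},
    {"title": "C", "year": 2023, "embedding": [0.0, 1.0]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query, top_k=1, year=None):
    # Pre-filter on metadata, then rank by vector similarity.
    candidates = [r for r in records if year is None or r["year"] == year]
    candidates.sort(key=lambda r: cosine(query, r["embedding"]),
                    reverse=True)
    return [r["title"] for r in candidates[:top_k]]

search([1.0, 0.05], top_k=1, year=2023)   # -> ["B"]
```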
Popular Vector DBMSs#
Some of the most popular vector DBMSs include Pinecone, Milvus, and Weaviate. Additionally, text search engines like Elasticsearch, Solr, and Vespa have incorporated vector search capabilities.
Integration with AI Tools#
One of the major advantages of vector DBMSs is their seamless integration with AI tools such as ChatGPT and LangChain. These databases can natively transform a record's data into an embedding upon insertion and use the same transformation to convert a query's input into an embedding for the ANN search. This eliminates the need for external transformations, simplifying the process.
Vector Search in Traditional RDBMS#
Interestingly, traditional relational database management systems (RDBMS) like Oracle, SingleStore, Rockset, and ClickHouse have also started supporting vector search through extensions; PostgreSQL, for instance, uses the pgVector extension. This development is notable because it took less than a year to add vector search capabilities to these systems, whereas it took them several years to add JSON support in response to NoSQL databases. The rapid turnaround was possible because these RDBMSs integrated open-source libraries like pgVector, DiskANN, and FAISS rather than building their vector indexes from scratch.
Graph Databases#
The Rise of Graph Databases: A Decade of Innovation#
Over the past decade, graph databases have garnered significant attention from both academia and industry. This surge in interest is largely due to their ability to model semi-structured information effectively, which is essential for applications such as social media and entity relationship models.
Key Approaches in Graph Databases#
Two primary approaches dominate the landscape of graph databases:
1. Resource Description Framework (RDF) and Triplestores:
   - RDF models a directed graph with labeled edges.
   - Triplestore databases are designed to handle these RDF triples, enabling efficient storage and querying of graph data.
2. Property Graph:
   - Maintains a directed multi-graph structure.
   - Supports key-value labels for both nodes and edges, offering more flexibility in data representation.
Popular Graph Database Management Systems (DBMS)#
- Neo4j:
  - The most popular graph DBMS for Online Transaction Processing (OLTP) applications.
  - Uses pointers to connect edges but does not cluster nodes with their "parent" or "offspring" records; a traditional RDBMS would instead reconstruct such relationships with joins.
  - Well suited to analytics, such as determining which user has the most friends under 30 years old.
- TigerGraph and JanusGraph:
  - Focus on advanced query languages and efficient storage mechanisms tailored for graph data.
- Giraph and Turi (GraphLab):
  - Provide a computing fabric that supports parallel execution, making them suitable for large-scale graph computations.
Graph Analytics#
Graph analytics involve operations like finding the shortest path, cut sets, or determining cliques within a graph. These operations are crucial for deriving meaningful insights from graph data. Developers can write their own algorithms using abstractions that hide the underlying system topology, simplifying the development process.
However, distributed algorithms often struggle to outperform single-node implementations due to the high communication costs involved. A more effective strategy is to compress the graph into a space-efficient data structure that fits into the memory of a single node, then run queries against this structure.
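A basic graph analytic over such an in-memory structure can be sketched as an unweighted shortest-path search via breadth-first search (the graph data is hypothetical):

```python
from collections import deque

# Unweighted shortest path via BFS over an in-memory adjacency list,
# following the advice above to query a compact single-node structure.

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def shortest_path(start, goal):
    # BFS explores nodes in order of hop count, so the first path that
    # reaches the goal is a shortest one.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

shortest_path("A", "E")   # -> ["A", "B", "D", "E"]
```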
Integration with Relational Databases#
While traditional RDBMS can be used for graph data, vanilla SQL lacks the expressiveness needed for complex graph queries, often requiring multiple client-server roundtrips for traversal operations. However, the introduction of SQL:2023, which includes Property Graph Queries (SQL/PGQ), has significantly narrowed the functionality gap between RDBMSs and native graph DBMSs.
Recent studies have shown that SQL/PGQ in DuckDB can outperform leading graph DBMSs by up to 10 times, highlighting the potential for RDBMSs to handle graph data more efficiently.
Example of Tables in Graph Databases#
To illustrate, consider the following table structures:
- Node Table:
  - Columns: node_id, node_data
- Edge Table:
  - Columns: node_id_1, node_id_2, edge_data
These tables can be used to represent graph data in an RDBMS, though specialized graph DBMSs offer more optimized solutions.
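As a sketch, here are those two tables in an ordinary RDBMS (SQLite via Python's sqlite3), with a recursive common table expression performing the multi-hop traversal in a single statement rather than one client-server roundtrip per hop (the data is hypothetical):

```python
import sqlite3

# Node and edge tables in a plain RDBMS, plus a recursive CTE that
# finds every node reachable from node 1 in any number of hops.

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node (node_id INTEGER PRIMARY KEY, node_data TEXT);
    CREATE TABLE edge (node_id_1 INTEGER, node_id_2 INTEGER,
                       edge_data TEXT);
    INSERT INTO node VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO edge VALUES (1, 2, 'friend'), (2, 3, 'friend');
""")

reachable = db.execute("""
    WITH RECURSIVE r(id) AS (
        SELECT 1
        UNION
        SELECT e.node_id_2 FROM edge e JOIN r ON e.node_id_1 = r.id
    )
    SELECT id FROM r WHERE id != 1 ORDER BY id
""").fetchall()
# reachable == [(2,), (3,)]
```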
Conclusion#
The review of database types concludes that non-SQL, non-relational systems are either niche markets or evolving into SQL/Relational Model (RM) systems.
Key points include:
- MapReduce Systems: Considered legacy technology.
- Key-value Stores: Many have matured into RM systems or are used for specific problems, often outperformed by modern RDBMSs.
- Document Databases: Converging with RDBMSs, with diminishing differences.
- Column-Family Systems: Niche market, mainly due to Google's influence.
- Text Search Engines: Used in polystore architectures; RDBMSs need better search capabilities to integrate these functions.
- Array Databases: Preferred for scientific applications due to RDBMS inefficiencies with arrays.
- Vector Databases: Single-purpose, expected to be integrated into RDBMSs via extendable type systems.
- Graph Databases: OLTP graph applications will be served by RDBMSs, while analytic graph applications require specialized in-memory data structures.
There are also attempts to rebrand old data models as new, such as graph-relational and document-relational models. Non-SQL interfaces over RDBMSs address some SQL shortcomings but are unlikely to displace SQL's established user base and ecosystem.