Tuesday 9 January 2024

Unraveling Data Architecture: Data Fabric vs. Data Mesh

In this first post of this year, I would like to talk about modern data architectures and specifically about two prominent models, Data Fabric and Data Mesh. Both are potential solutions for organisations working with complex data and database systems.

While both try to make our data lives better and try to bring us the abstraction we need when working with data, they are different in their approach.

But when would it be best to use one over the other? Let's try to understand that.

Definitions of Data Fabric and Data Mesh

Some definitions first. I came up with these after reading up on the internet and in various books on the subject.

Data Fabric is a centralised data architecture focusing on creating a unified data environment. In its environment, Data Fabric integrates different data sources and provides seamless data access, governance and management, often via tools and automation. Keywords to remember from the Data Fabric definition are centralised, unified, seamless data access.

Data Mesh is a paradigm shift, a completely different way of doing things. It is based on domains, most likely inspired by Domain Driven Design (DDD), and captures data products in domains by decentralising data ownership (autonomy) and access. That is, the domains are the owners of the data and are themselves responsible for creating and maintaining their own data products. The responsibility is distributed. Keywords to take away from Data Mesh are decentralised, domain ownership, autonomy and data products.

Criteria for Choosing between Data Fabric and Data Mesh

Data Fabric might be preferred when there is a need for centralised governance and control over the data entities, or when a unified view is needed across all database systems and data sources in the organisation or its departments. Transactional database workloads, where consistency and integrity of data operations are paramount, are very suitable for this type of data architecture.

Data Mesh can be more suitable for organisational cultures or departments where scalability and agility are priorities. Because of its domain-driven design, Data Mesh might be a better fit for organisations or departments that are decentralised, innovative, and require their business units to swiftly and independently decide how to handle their data. Analytical workloads and other big data workloads may be more suitable to Data Mesh data architectures.

Ultimately, the decision between these data architectures hinges on the data processing workload and the alignment of diverse data sources. There is no universal, one-size-fits-all solution. Organisations and their departments operate within unique cultural and environmental contexts, often necessitating thorough research, proof of concept, and pattern evaluation to identify the optimal architectural fit.

Remember, in the realm of data architecture, the data workload reigns supreme - it dictates the design.

Friday 24 November 2023

Using vector databases for context in AI

In the realm of Artificial Intelligence (AI), understanding and retaining context stands as a pivotal factor for decision-making and enhanced comprehension. Vector databases are the foundational pillars for encapsulating your own data to be used in conjunction with AI and LLMs. They empower these systems to absorb and retain intricate contextual information.

Understanding Vector Databases

Vector databases are specialised data storage systems engineered to efficiently manage and retrieve vectorised data, also known as embeddings. These databases store information in a vector format, where each data entity is represented as a multidimensional numerical vector that encapsulates its attributes and relationships and so preserves rich context. That is, text, video or audio is translated into numbers with many dimensions. Mathematics is then used to calculate the proximity between these vectors. Loosely speaking, this resembles what happens inside an LLM, which also works on vector representations and the similarity between them. A bit like how our brains do pattern recognition. The vector database is the database where the vectors are stored. Without a vector database in architectures like RAG, it is very hard to bring your own data or context into an LLM app, as the model will only know what it was trained on, which is largely data from the public internet. Vector databases enable you to bring your own data to AI.
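To make the idea of "proximity between vectors" a bit more concrete, here is a minimal sketch in Python using cosine similarity, one common way to measure it. The three-dimensional embeddings and the names `EMBEDDINGS` and `most_similar` are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example only.
EMBEDDINGS = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "holiday": [0.0, 0.2, 0.9],
}

def most_similar(query, embeddings):
    """Return the name whose vector is closest to the query vector."""
    return max(embeddings, key=lambda name: cosine_similarity(query, embeddings[name]))

if __name__ == "__main__":
    print(most_similar([1.0, 0.0, 0.0], EMBEDDINGS))
```

A vector database does essentially this kind of comparison, but at scale and usually with approximate indexes rather than a scan over every stored vector.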

Examples of Vector Databases

Several platforms offer vector databases, such as Pinecone, Faiss by Facebook, Annoy, Milvus, and Elasticsearch with dense vector support. These databases cater to diverse use cases, offering functionalities tailored to handle vast amounts of vectorised information, be it images, text, audio, or other complex data types.

Importance in AI Context

Within the AI landscape, vector databases play a pivotal role in serving specific data and context for AI models. In particular, in the Retrieval-Augmented Generation (RAG) architecture, where retrieval of relevant information is an essential part of content generation, vector databases act as repositories, storing precomputed embeddings from your own private data. These embeddings encode the semantic and contextual essence of your data, facilitating efficient retrieval in your AI apps and bots. Bring a vector database to your AI apps and you bring your own data with it: your agents and chatbots will speak your data!
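The retrieval step of RAG can be sketched roughly as follows. This is a toy illustration under loud assumptions: `embed` is a word-count stand-in for a real embedding model, `TinyVectorStore` is a hypothetical in-memory class rather than any product's API, and the prompt wording is made up. The point is only the flow: embed the question, find the closest stored texts, and hand them to the model as context.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Hypothetical in-memory store holding (embedding, text) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text):
        self._items.append((embed(text), text))

    def search(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: similarity(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(question, store, k=2):
    """Assemble retrieved context and the question into one prompt string."""
    context = "\n".join(store.search(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real application the resulting prompt would be sent to an LLM, and the store would be one of the vector databases mentioned above.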

Advantages for Organisations and AI Applications

Organisations can harness the prowess of vector databases within Retrieval-Augmented Generation (RAG) architectures to elevate their AI applications and enable them to use organisation-specific data:

Enhanced Contextual Understanding: By leveraging vector databases, AI models grasp nuanced contextual information, enabling more informed decision-making and more precise content generation based on specific and private organisational context.
Improved Efficiency in Information Retrieval: Vector databases expedite the retrieval of pertinent information by enabling similarity searches based on vector representations, augmenting the speed and accuracy of AI applications.
Scalability and Flexibility: These databases offer scalability and flexibility, accommodating diverse data types and expanding corpora, essential for the evolving needs of AI-driven applications.
Optimised Resource Utilisation: Vector databases streamline resource utilisation by efficiently storing and retrieving vectorised data, thus optimising computational resources and infrastructure.

Closing Thoughts

In the AI landscape, where the comprehension of context is paramount, vector databases emerge as linchpins, fortifying AI systems with the capability to retain and comprehend context-rich information. Their integration within Retrieval-Augmented Generation (RAG) architectures not only elevates AI applications but also empowers organisations to glean profound insights, fostering a new era of context-driven AI innovation from data.

In essence, the power vested in vector databases will reshape the trajectory of AI, propelling it toward unparalleled contextualisation and intelligent decision-making based on in-house data that organisations own.

But the enigma persists: What precisely will be the data fuelling the AI model?

Sunday 16 April 2023

VS Code container development

If you're a software developer, you know how important it is to have a development environment that is flexible, efficient, and easy to use. PyCharm is a popular IDE (Integrated Development Environment) for Python developers, but there are other options out there that may suit your needs better. One such option is Visual Studio Code, or VS Code for short.

After using PyCharm for a while, I decided to give VS Code a try, and I was pleasantly surprised by one of its features: the remote container development extension. This extension allows you to develop your code in containers, with no footprint on your local machine at all. This means that you can have a truly ephemeral solution, enabling abstraction to the maximum.

So, how does it work? First, you need to create two files: a Dockerfile and a devcontainer.json file. These files should be located in a hidden .devcontainer folder at the root location of any of your GitHub projects.

The Dockerfile is used to build the container image that will be used for development. Here's a sample Dockerfile that installs Python3, sudo, and SQLite3:

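FROM ubuntu:20.04

# Avoid interactive prompts from apt during the image build
ARG DEBIAN_FRONTEND=noninteractive

# Update the package index and install Python 3, sudo and SQLite 3 in one layer,
# so the install never runs against a stale cached package index
RUN apt-get update -y && \
    apt-get install -y python3 sudo sqlite3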

The devcontainer.json file is used to configure the development environment in the container. Here's a sample devcontainer.json file that sets the workspace folder to "/workspaces/alpha", installs the "ms-python.python" extension, and forwards port 8000:

{
  "name": "hammer",
  "build": {
    "context": ".",
    "dockerfile": "./Dockerfile"
  },
  "workspaceFolder": "/workspaces/alpha",
  "extensions": [
    "ms-python.python"
  ],
  "forwardPorts": [
    8000
  ]
}
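A missing brace or comma in devcontainer.json is easy to overlook, so it can help to sanity-check the file before committing it. Below is a small sketch of such a check in Python. The helper name `check_devcontainer` is hypothetical, and the only rule it enforces beyond valid JSON is the presence of "build.dockerfile", which this post's setup relies on. Note that it uses a strict JSON parser, so a file containing comments would be reported as invalid here even though VS Code accepts comments in devcontainer.json.

```python
import json

def check_devcontainer(text):
    """Parse devcontainer.json text and return a list of problems found."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]
    problems = []
    build = config.get("build", {})
    # This post builds the container from a Dockerfile, so expect a reference to it.
    if not isinstance(build, dict) or "dockerfile" not in build:
        problems.append('missing "build.dockerfile"')
    return problems

# Sample configuration mirroring the one described in this post.
SAMPLE = """
{
  "name": "hammer",
  "build": {"context": ".", "dockerfile": "./Dockerfile"},
  "workspaceFolder": "/workspaces/alpha",
  "extensions": ["ms-python.python"],
  "forwardPorts": [8000]
}
"""

if __name__ == "__main__":
    print(check_devcontainer(SAMPLE) or "looks fine")
```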

Once you have these files ready, you can clone your GitHub code down to a Visual Studio Code container volume. Here's how to do it:

Start Visual Studio Code
Make sure you have the "Remote Development" extension installed and enabled
Go to the "Remote Explorer" extension from the button menu
Click "Clone Repository in Container Volume" at the bottom left
In the Command Palette, choose "Clone a repository from GitHub in a Container Volume" and pick your GitHub repo.

That's it! You are now tracking your code inside a container volume, built by a Dockerfile which is also being tracked on GitHub together with all your environment-specific extensions you require for development.

The VS Code remote container development extension is a powerful tool for developers who need a flexible, efficient, and easy-to-use development environment. By using containers, you can create an ephemeral solution that allows you to abstract away the complexities of development environments and focus on your code. If you're looking for a new IDE or just want to try something different, give VS Code a try with the remote container development extension.

Wednesday 12 January 2022

Is Data Hub the new Staging environment?

"A data hub is an architectural pattern that enables the mediation, sharing, and governance of data flowing from points of production in the enterprise to points of consumption in the enterprise" Ted Friedman, datanami.com

Aren't relational databases, data marts, data warehouses and more recently data lakes enough? Why is there a need to come up with yet another strategy and paradigm for database management?

To begin answering the above questions, I suggest we start by looking at the history of data management and figure out how data architecture developed a new architectural pattern like Data Hub. After all, history is important. As a famous quote from Martin Luther King Jr. says, "We are not makers of history. We are made by history."

Relational architecture

A few decades ago, businesses began using relational databases and data warehouses to keep a consistent and coherent record of their interests. The relational architecture still keeps the clocks ticking with its well-understood architectural structures and relational data models. It is a sound and consistent architectural pattern based on mathematical theory which will continue serving data workloads. The relational architecture brilliantly serves the very specific use case of transactional workloads, where the data semantics are defined in advance, before any data is stored in any system. If implemented correctly, the relational model can become a hub of information that is centralised and easy to query. It is hard to see how the relational architecture itself could have caused a paradigm shift towards something like a data hub. Most likely it is something else. Could it be cloud computing?

When the cloud came, it changed everything. The Cloud brought along an unfathomable proliferation of apps and an incredible amount of raw and unorganised data. With this outlandish amount of disorganised data in the pipes, the suitability of the relational architecture for data storage had to be re-examined and reviewed. Faced with a data deluge, the relational architecture couldn't scale quickly and couldn't serve the analytical workloads and the needs of the business in a reasonable time. Put simply, there was no time to understand and model data. The sheer weight of the number of unorganised chunks of data coming from the cloud, structured and unstructured, at high speeds, propelled the engineers to look for a new architectural pattern.

Data Lake

In a data lake, the structured and unstructured data chunks are stored raw and no questions are asked. Data is no longer organised or kept in well-understood data models, and it can be stored indefinitely and in abundance. Moreover, very conveniently, the process of understanding the data and creating a data model in a data lake is deferred to the future, an approach known as schema-on-read. We have to admit, the data lake is the new monolith where data is only stored, a mega data dump yard indeed. This new architectural pattern also brought with it massively parallel processing (MPP) data platforms, tools and disciplines, such as machine learning, which became the standard methods for extracting business insights from the absurd amounts of data found in a data lake. Unfortunately, the unaccounted amounts of unknown data living in a data lake didn't help us understand data better and made the life of engineers even more difficult. Does a data lake have any redundant data or bad data? Are there complex data silos living in a data lake? These are still hard questions to answer, and the chaotic data lakes looked like they were missing a mediator.
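Schema-on-read can be illustrated with a few lines of Python. In this sketch the raw records, the field names and the `read_payments` helper are all invented for the example: the "lake" accepts any JSON line as-is, and a schema is imposed only at the moment somebody reads the data for a particular purpose.

```python
import json

# Raw records stored exactly as they arrived, with no agreed structure.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "ts": "2022-01-10"}',
    '{"user": "ben", "amount": 7}',
    '{"device": "sensor-1", "reading": 21.4}',
]

def read_payments(raw_lines):
    """Apply a 'payments' schema at read time; skip records that do not fit."""
    for line in raw_lines:
        record = json.loads(line)
        if "user" in record and "amount" in record:
            # Types are normalised here, on read, not when the data was stored.
            yield {"user": str(record["user"]), "amount": float(record["amount"])}

if __name__ == "__main__":
    for payment in read_payments(RAW_EVENTS):
        print(payment)
```

The sensor record is silently ignored by this reader, which hints at the problem described above: nobody is forced to account for what is actually sitting in the lake.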

Data Hub

Could the mediator be a "data hub"? It is an architectural pattern based on the hub and spoke architecture. A data hub, which itself is another database system, integrates and stores critical and important data and metadata for mediation, from diverse and complex transactional and analytical workloads and data sources. Once the data is stored, the data hub becomes the tool to "harmonise" and "enrich" data and then radiate it to the AI, Machine Learning and other enterprise insights and reporting systems, via its spokes.

What's more, while sharing the data in its spokes, the data hub can also help engineers to govern, secure and catalogue the data landscape of the enterprise. The separation of data via mediation from the source and target database systems inside a data hub also offers engineers the flexibility to operate and govern independently of the source and target systems. But this reminds me of something.

If the data hub paradigm is a mediator presented to understand, organise, correct and enrich data, and to bring order to the data chaos of data lake monoliths, doesn't the data hub look similar to the data management practice engineers have been following for decades, which we all know as "Staging"? Is the data hub the evolved version of staging?

Conclusion

The most difficult thing in anything you do is to persuade yourself that there is some value in doing it. It is the same when adopting a new architectural pattern as a data management solution. You have to understand where the change is coming from and see the value before you embark on using it. The data upsurge brought by the internet and cloud computing caused changes to be made in data architecture and data storage solutions. The data hub is a new architectural pattern in data management, introduced to mediate the chaos of fast-flowing data tsunamis around us, and we hope it will help us tally everything up.


Saturday 17 October 2020

Oracle Apex 20.2 REST API data sources and database tables

Oracle Apex 20.2 is out and has a very interesting new feature, REST Data Source Synchronisation.

Why is the REST Data Source Synchronisation feature interesting?

Oracle Apex REST Data Source Synchronisation is exciting because it lets you query REST endpoints on the internet on a schedule or on-demand basis and saves the results automatically in database tables.

I think this feature will suit slow-changing data accessible with REST APIs very well. That is, if the data behind a REST endpoint is known to change, say, a few times a day, why should we call the REST endpoint via HTTP every time we want to display data on an Apex page? Why would one want to render a page with data over HTTP if that data changes only once a day? Why should we cause network traffic and keep machines busy for data which is not changing often? Or maybe, by requirement, you only need to query a REST endpoint once a day and store the result somewhere for data warehousing.

Wouldn't it be better to store the data in a database table and render it from there every time a page is viewed?

This is exactly what the REST Data Source Synchronisation does. It queries the REST API endpoint and saves the JSON response as data in a database table on a schedule of your choice or on demand.
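To illustrate the general pattern, and not how Oracle Apex implements it internally, here is a rough Python sketch that pulls JSON from an endpoint and keeps a local copy in a database table. Everything specific in it is an assumption made for the example: the `disruptions` table, the `id` and `description` fields, the helper names, and SQLite standing in for the Oracle database.

```python
import json
import sqlite3
import urllib.request

def fetch_json(url):
    """Fetch and decode a JSON document from a REST endpoint over HTTP."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def sync_to_table(conn, rows):
    """Replace the local copy of the data with the latest rows from the API."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS disruptions (id TEXT PRIMARY KEY, description TEXT)"
    )
    with conn:  # commit on success, roll back if anything fails
        conn.execute("DELETE FROM disruptions")
        conn.executemany(
            "INSERT INTO disruptions (id, description) VALUES (?, ?)",
            [(row["id"], row["description"]) for row in rows],
        )

def read_for_page(conn):
    """What a page render would do: read from the table, with no HTTP call."""
    return conn.execute(
        "SELECT id, description FROM disruptions ORDER BY id"
    ).fetchall()
```

A scheduler would call something like `sync_to_table(conn, fetch_json(url))` once a day with the real endpoint URL, while every page view only calls `read_for_page(conn)`.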

For my experiment, I used a free public REST endpoint from the TfL API which holds data for TfL transportation disruptions, and I configured this endpoint to synchronise with my database table every day at 5am.

I even created the Oracle Apex REST Data Source inside the apex.oracle.com platform. I used the key provided by the TfL API Dev platform to make the call from there to the TfL REST endpoint, and I managed to sync the data once a day and show it on an Oracle Apex Faceted Search page and some charts.

I was able to do all this with zero coding, just pointing the Oracle Apex REST Data Source I created for the TfL API to a table and scheduling the sync to happen once a day at 5am.

To see the working app, go to this link: https://apex.oracle.com/pls/apex/databasesystems/r/tfl-dashboard/home

Screenshots of the app below