Innovation starts with a blank page, where creativity and imagination can fully express themselves. It is about putting aside what we know, erasing misconceptions and norms. It is about using every element around us to discover what we hadn’t seen before.
Experimenting with data requires this sort of freedom. Data exploration and data mining are crucial steps in digital innovation. They need raw data: untouched data, without any pre-interpretation that would introduce bias.
Bias is at the heart of many public concerns about Artificial Intelligence. It can be introduced by humans, but also by machines when processing and delivering data.
One system, one role
Traditionally, company applications have been built for a specific business process within the application landscape: a CRM, a marketing automation tool, an ERP, etc. Each such application creates and interprets data according to its specific process-related logic. Applications require some sort of data storage to function properly and to persist their state as data. A change in data or a new record entry often triggers actions within the application itself or across integrated systems. To trigger such actions, the application emits “business events” (to a message queue, an ESB, an API, etc.) that translate the change in the state of data into a specific message based on a system logic. The very logic behind this message is often written into the application source code itself, provided the need for such an event was defined or anticipated in the first place.
Let’s take a concrete example. In company X, a marketing tool is implemented to segment customers and to create, manage, send and track campaigns. Imagine now that John Smith, a prospect, receives an email marketing campaign and clicks on a button “accept this deal”. The marketing tool has specific logic that reflects how the marketing team works as well as how marketing KPIs are calculated: when a prospect clicks on “accept this deal”, it means the campaign was successful for that prospect. The tool will then generate and emit, for instance, the following business event: “Prospect John Smith answered positively to campaign A”.
As in many companies, marketing is of course closely linked to sales. Such an event from the marketing tool will trigger an action in the CRM so that the sales team can develop a deal. In other words, the CRM reacts to the emitted business event and reflects the sales team’s logic by creating a lead for “John Smith”.
John Smith has been, is and will probably be many things for this company across the different systems and processes: prospect, lead, client, ticket, warranty owner, order, newsletter subscriber etc. But originally, John Smith is John Smith. He is a person with a specific behavior and history that are interpreted and labelled by information systems across the company.
Data scientists are not only interested in the business context data (all information linked to business events, such as personal data, product or campaign information, etc.), but also in these original events across time (history). These original events come without pre-interpretation of any kind. They are the source of information from which applications deduce their business events: the raw event data or, in common terms, “anything that happens”. In our example above, the raw event data would typically be “John Smith opens email campaign A” and “John Smith clicks on button ‘accept the deal’”, as opposed to the interpretation of those facts by the marketing tool: “John Smith answered positively to campaign A”.
Of course, to understand the relationship that John Smith has with the company, business events (“create lead in CRM”) remain essential, whereas the raw event data contains additional, neutral, descriptive information about a person’s behavior.
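To make the distinction concrete, here is a minimal sketch (in Python, with purely hypothetical field names and values) of what the two raw events and the derived business event could look like as messages:

```python
# Hypothetical raw event data: neutral descriptions of "anything that happens",
# captured as-is, without any business interpretation.
raw_events = [
    {"timestamp": "2021-03-01T09:12:04Z", "subject": "john.smith@example.com",
     "action": "email_opened", "campaign_id": "A"},
    {"timestamp": "2021-03-01T09:12:31Z", "subject": "john.smith@example.com",
     "action": "button_clicked", "button": "accept this deal", "campaign_id": "A"},
]

# Hypothetical business event: the marketing tool's interpretation of the raw
# events above, encoded according to its own process-related logic.
business_event = {
    "timestamp": "2021-03-01T09:12:31Z",
    "type": "campaign_response",
    "prospect": "John Smith",
    "campaign_id": "A",
    "outcome": "positive",  # "clicked the button" is interpreted as "answered positively"
}
```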
A high volume of data hidden in database systems
Let’s face it, capturing raw event data does not happen without difficulty.
First of all, if every single raw description of “what happens” needs to be collected, processed and distributed, your systems will be dealing with a high volume of data, which can easily impact their performance and reliability. Typical request-response protocols from application to application would not be able to keep up with the pace, while concurrency issues would strike sooner or later.
Secondly, it is important to note that this data is not always available in all storage systems. However, it can be found in the most commonly used ones: databases. Raw event data is located in the database’s journal, the invisible part hidden behind the usual database tables: the commit log. Capturing this data in databases can involve a lot of tedious work on the application code itself. It used to be very difficult to access this raw event data, as it often required opening up the application and changing its code to really get a grasp on it. Needless to say, the workload, cost and risk this represents was a deal-breaker for many IT managers considering event-stream processing… Often no one even dares to touch those (legacy) systems anymore!
Recent innovations to unlock the potential of data
Over the past years, two big innovations have revolutionized the capture, the processing and, more generally, the usage of this raw event data.
The first one is the rise of event stream processing platforms such as Apache Kafka. Apache Kafka was originally built as a message queue by LinkedIn to process a huge amount of daily messages (here is an insightful article explaining the philosophy behind Apache Kafka: “Using logs to build a solid data infrastructure (or: why dual writes are a bad idea)”). Donated to the Apache Software Foundation as open source software, this next-generation message queue quickly evolved into an event streaming platform that captures, processes, stores and distributes millions of event messages in real time. It suffices to connect a data source to this platform once for its data to be centralized and readily available to every consumer. We call this “connect once, index at will”.
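As a minimal sketch of this “connect once, index at will” idea, the snippet below uses the kafka-python client with a hypothetical broker at localhost:9092 and a hypothetical topic raw-events: a producer publishes a raw event once, and any number of consumer groups can then read the same stream independently, each at its own pace.

```python
# Minimal sketch with kafka-python; broker address, topic name and event
# fields are hypothetical and purely illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

# Produce a raw event once...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-events", value={
    "subject": "john.smith@example.com",
    "action": "button_clicked",
    "button": "accept this deal",
    "campaign_id": "A",
})
producer.flush()

# ...and consume it from as many independent consumer groups as needed.
# A CRM integration and a data-science notebook can each read the full stream
# simply by using different group_id values.
consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    group_id="data-science-exploration",
    auto_offset_reset="earliest",  # replay the stream from the beginning
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```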
The second innovation is the availability of Change Data Capture (CDC) technology. A framework such as Debezium (open source, backed by Red Hat), integrated into Kafka Connect connectors, enables a non-invasive way to capture changes in data and metadata in real time from different systems (relational databases, transactional systems, mainframe systems and applications). It streams this raw event data to its rightful place in the data pipeline without impacting the source system.
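To give an idea of how non-invasive this is, here is a rough sketch of registering a Debezium PostgreSQL connector through the Kafka Connect REST API. All hostnames, credentials and table names are hypothetical, and the exact configuration properties depend on the Debezium version and connector used; the point is that the source application itself is never touched.

```python
# Sketch: register a Debezium PostgreSQL connector via the Kafka Connect
# REST API. Hosts, credentials and table names below are hypothetical.
import json
import requests

connector = {
    "name": "crm-db-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "crm-db.internal",  # hypothetical host
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "crm",
        # "topic.prefix" in recent Debezium versions; older ones use "database.server.name"
        "topic.prefix": "crm",
        "table.include.list": "public.leads,public.campaign_responses",
    },
}

# Kafka Connect exposes a REST API (default port 8083) for managing connectors.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```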
We often use this powerful combination of CDC and Kafka at our clients, to facilitate integration but also to ensure that all raw event data is unlocked and ready to use at any time, for any present or future analytics use case. It allows data scientists to work with all the data, without bias introduced by machines, and it optimizes a big part of the data engineering effort (see our article: "Data engineering, the immersed part of the digitalization iceberg").
So next time you need to integrate your systems, think of Kafka as an integration middleware that can bring you more than real-time event stream processing capabilities… It is a perfect ally for AI and data science, which makes it a fantastic enabler of data innovation.
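To illustrate what “ready to use” can mean on the data-science side, the sketch below reads the change stream produced by the hypothetical connector above (assuming JSON messages with the Debezium schema envelope disabled) and simply prints the neutral facts: which operation happened, when, and what the row looks like afterwards.

```python
# Sketch: tap into the CDC stream for exploration. Topic name follows the
# hypothetical connector above; the exact message layout depends on the
# converter settings of the Kafka Connect deployment.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "crm.public.campaign_responses",      # <topic.prefix>.<schema>.<table>
    bootstrap_servers="localhost:9092",
    group_id="data-science-exploration",  # independent of any operational consumer
    auto_offset_reset="earliest",         # replay the full history
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    change = message.value
    if change is None:  # skip tombstone records
        continue
    # Debezium change events carry the row state before and after the change,
    # the operation type (c=create, u=update, d=delete) and a timestamp.
    print(change.get("op"), change.get("ts_ms"), change.get("after"))
```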