The key role of data integration in leveraging the power of machine intelligence

By: Bijoy Khandelwal  |  June 10, 2019

Introduction

Today, Artificial Intelligence (AI) is possibly the most widely discussed technology trend. There are strong views about the applications and benefits of AI: some feel that it will bring about an apocalyptic day of machines controlling human beings, while others believe that AI will be extremely beneficial for humans.

In the near term, AI and machine learning are becoming handy tools for solving important business and social problems. In many cases, machine intelligence is being used as a supplement to human decision making, thereby “augmenting” human intelligence. In this context, ‘Augmented Intelligence’ is proving to be more critical than Artificial Intelligence.

In recent times, several business and social applications have used the power of machine intelligence to solve complex day-to-day challenges. Examples include identifying crop failure and deforestation from satellite images, recognizing disease and anomalies in medical images, predicting machine failure from sensor data, real-time product recommendations, fraud and cyber-threat detection, and many more.

Hence there is no question about the immense benefits of machine intelligence (whether it is being used as artificial intelligence or augmented intelligence). There are five key components that influence the development and usage of machine intelligence; these are:

  • Data
  • Quantified monetary or social benefit of machine intelligence
  • Humans, who develop, use or are impacted by machine intelligence
  • The technology which is used to gather data and apply machine intelligence
  • The algorithms which are used to extract machine intelligence from data

Of the above components, data is the most critical element for building and applying machine intelligence. Without data, machine intelligence is impossible; data is the heart of, and the primary input to, machine intelligence.

For machine intelligence to work, one needs historical data to build the intelligence and current data (data as of the present moment) to apply it. Extensive research in academia and industry has demonstrated that higher volumes of data and the availability of multiple data sources are the most important drivers of more accurate machine intelligence. Hence, data is the competitive edge for an organization, or a nation, seeking to lead in this space.

However, the data in most organizations (including Government agencies) is extremely disaggregated. Putting all the data together and resolving data quality challenges is the first and most difficult step in building machine intelligence. In addition, the ability to tap into new data sources or obtain external data remains the key differentiator in building strong machine intelligence that is difficult for a competitor to replicate.

Technology and algorithms are the other key enablers for building machine intelligence. Previously, most of the technology required for building machine intelligence was encapsulated in proprietary tools and was hence prohibitively costly and out of reach for mid-to-small-sized organizations. However, the open-source revolution has broken that barrier and made very sophisticated technologies available at near-zero cost. This has consequently fuelled the adoption of machine learning across SMEs and helped democratise machine intelligence.

Challenge

Popular technology culture is rife with the belief that “mathematically complex” algorithms and methodologies are the biggest challenges for leveraging the power of machine intelligence. However, a detailed discussion with any practitioner will reveal that this is far from reality.

Integrating data across multiple sources and resolving data quality issues is THE SINGLE MOST IMPORTANT CHALLENGE in leveraging machine intelligence. It is also the biggest challenge even if one merely wants to derive simple rules or insights from data. Once this formidable challenge is resolved, generating insights or building a machine learning model is comparatively easy.

If data integration is the biggest challenge, the second biggest is the time and customized coding required to implement a machine learning model (or even a simple rule) within the production systems of an organization.

Due to data integration challenges, most data-driven initiatives within organisations take a very long time to demonstrate tangible results. This causes budget escalation and frustration, because senior leadership has to wait a long time before seeing any measurable impact.

Resolving the data integration challenges

Step-1: pre-built connectors

The first step in resolving the data integration challenge is to use pre-built connectors to major databases and enterprise applications. These connectors act like pre-built pipes through which any table from a source system can be transferred to a data lake. While vanilla pre-built connectors are great for connecting to source data, one needs to be careful about the following challenges:

  • Change data capture
  • Ensuring zero data loss while connecting to streaming data sources
  • Ability to perform incremental data loads in cases where a well-defined incrementation field (e.g. date modified) does not exist in the source system
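
A common workaround for the last challenge is to fingerprint each row and transfer only rows whose fingerprint has changed since the previous load. The sketch below assumes each source row carries a stable primary key (called `id` here, purely for illustration); it is a minimal sketch, not a production connector:

```python
import hashlib
import json

def row_hash(row):
    """Stable fingerprint of a row's contents."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def incremental_load(source_rows, known_hashes):
    """Return rows that are new or changed since the last load, plus the
    updated {primary_key: hash} map to persist for the next run."""
    changed, new_hashes = [], dict(known_hashes)
    for row in source_rows:
        h = row_hash(row)
        if known_hashes.get(row["id"]) != h:
            changed.append(row)
            new_hashes[row["id"]] = h
    return changed, new_hashes

# First full load, then a second run where one row changed and one is new:
_, hashes = incremental_load([{"id": 1, "amt": 10}, {"id": 2, "amt": 20}], {})
delta, hashes = incremental_load(
    [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}, {"id": 3, "amt": 5}], hashes)
```

On the second run, `delta` contains only the changed row (`id` 2) and the new row (`id` 3), so unchanged data is never re-transferred.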

Step-2: agile mini-data-marts

Once connectors are in place to bring any source table into a data lake, it is critical to summarize the data and combine information from multiple tables into data-marts. These data-marts are used to build machine learning models or to generate insights. In any organization this can be a daunting task, because one needs to work out the relationships between a few thousand tables, and in most cases comprehensive, well-documented metadata simply does not exist. To achieve this, one must adopt an agile approach: create quick workflows, scripts or data pipelines that combine a small subset of tables into mini-data-marts driving specific use cases.
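
As a minimal sketch of what such a mini-data-mart looks like, the example below joins two hypothetical tables (a customer table and an order table; all names and values are illustrative) into a one-row-per-customer mart for a customer-value use case, using an in-memory SQLite database:

```python
import sqlite3

# Two hypothetical source tables landed in the lake for this use case.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'retail'), (2, 'corporate');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 900.0);
""")

# The mini-data-mart: a small subset of tables summarized to one row
# per customer, shaped for the specific use case.
con.execute("""
    CREATE TABLE mart_customer_value AS
    SELECT c.cust_id, c.segment,
           COUNT(o.order_id) AS n_orders,
           SUM(o.amount)     AS total_spend
    FROM customers c LEFT JOIN orders o ON o.cust_id = c.cust_id
    GROUP BY c.cust_id, c.segment
""")
rows = con.execute("SELECT * FROM mart_customer_value ORDER BY cust_id").fetchall()
```

The point is the scope, not the tooling: each mart combines only the handful of tables a single use case needs, so it can be built without first mapping thousands of table relationships.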

Step-3: data quality management

Data has quality if it satisfies the requirements of its intended use; it lacks quality to the extent that it does not. The same data element may be of high quality for one purpose and of very poor quality for another. Hence, the agile mini-data-mart approach calls for an agile data quality management framework. For this purpose, it is essential to break data quality down into components so that the right data quality checks are applied for the relevant use case.
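
One way to break data quality into components, sketched below with illustrative check names and a hypothetical threshold, is to define small reusable checks (completeness, validity, and so on) and let each use case select only the checks that matter for its intended use:

```python
def completeness(rows, field):
    """Fraction of rows where `field` is present and non-null."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def validity(rows, field, predicate):
    """Fraction of rows whose `field` value satisfies a business rule."""
    return sum(1 for r in rows if predicate(r.get(field))) / len(rows)

def run_checks(rows, checks, threshold=0.95):
    """Run only the checks selected for this use case; flag scores below threshold."""
    results = {}
    for name, check in checks.items():
        score = check(rows)
        results[name] = (round(score, 2), score >= threshold)
    return results

# Illustrative data: one missing age and one impossible age.
rows = [{"age": 34}, {"age": None}, {"age": 210}]
results = run_checks(rows, {
    "age_complete": lambda rs: completeness(rs, "age"),
    "age_valid": lambda rs: validity(rs, "age",
                                     lambda v: v is not None and 0 <= v <= 120),
})
```

A different use case could reuse the same components with different fields, predicates and thresholds, which is exactly the agility the mini-data-mart approach demands.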

Step-4: machine-aided metadata creation

Steps 2 and 3 above help organizations quickly create many mini-data-marts. But this often leads to governance issues and a lack of clarity about the information content (or metadata) of each data-mart. A machine-aided metadata creation utility, which enables builders of mini-data-marts to draft metadata quickly, is a great enabler in the data integration journey.
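
A minimal sketch of such a utility, with illustrative field names, is to profile each column of a mart and draft the mechanical parts of the metadata automatically, leaving only the business description for the human builder to fill in:

```python
def draft_metadata(rows):
    """Profile a mini-data-mart (a list of dicts) and draft metadata for a
    human to confirm: the builder only adds business descriptions instead
    of starting from a blank page."""
    meta = {}
    for field in rows[0]:
        values = [r[field] for r in rows if r[field] is not None]
        meta[field] = {
            "inferred_type": type(values[0]).__name__ if values else "unknown",
            "null_count": sum(1 for r in rows if r[field] is None),
            "distinct_count": len(set(values)),
            "description": "",  # to be filled in by the data-mart builder
        }
    return meta

meta = draft_metadata([{"cust_id": 1, "segment": "retail"},
                       {"cust_id": 2, "segment": None}])
```

Even this small amount of machine assistance makes metadata creation cheap enough that builders actually do it, which is what keeps a proliferation of mini-data-marts governable.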

Step-5: ability to use human judgement on top of machine learning models

While automatic hyperparameter tuning and built-in cross-validation help in creating machine learning models, business users most often want to combine human judgement with model predictions when making automated decisions. For this purpose, interactive decision-tree and rule-creation utilities are critical to help business users leverage the combined power of machine learning and human judgement.
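
The kind of output such a rule-creation utility produces can be sketched as a decision function that layers human-authored rules over the model's score. The rule fields (`on_sanctions_list`, `customer_years`) and the threshold below are purely illustrative assumptions:

```python
def decide(model_score, record, approve_threshold=0.7):
    """Combine a model's predicted probability with business rules that a
    user might author interactively; fields and threshold are illustrative."""
    # Human-judgement overrides take precedence over the model...
    if record.get("on_sanctions_list"):
        return "reject"
    if record.get("customer_years", 0) >= 10:
        return "approve"
    # ...otherwise fall back to the model score.
    if model_score >= approve_threshold:
        return "approve"
    return "refer_to_human"

decisions = [
    decide(0.9, {}),                           # model alone is confident
    decide(0.5, {}),                           # model unsure, escalate
    decide(0.95, {"on_sanctions_list": True}), # rule overrides the model
    decide(0.2, {"customer_years": 12}),       # rule overrides the model
]
```

Keeping the rules explicit and separate from the model lets business users adjust them without retraining anything.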

Step-6: single-click deployment of models and rules

Once models and rules are developed, they must be applied to real-time data. It is critical to be able to easily publish them as web services so that the results of a model (or rule) are available to any front-end system.
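
What "publishing as a web service" amounts to can be sketched with Python's standard WSGI interface: a front-end posts JSON features and receives the model's result as JSON. The `score` function here is a stand-in for a real trained model, and the `income` field is an illustrative assumption:

```python
import io
import json

def score(features):
    """Stand-in for a trained model; a real service would load the
    published model or rule here."""
    return {"approved": features.get("income", 0) > 50000}

def app(environ, start_response):
    """Minimal WSGI web service: any front-end system can call the model."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length) or b"{}")
    payload = json.dumps(score(features)).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [payload]

# In-process call (no network needed) showing the request/response shape.
req = json.dumps({"income": 60000}).encode()
environ = {"wsgi.input": io.BytesIO(req), "CONTENT_LENGTH": str(len(req))}
response = b"".join(app(environ, lambda status, headers: None))
```

A "single-click deployment" utility essentially generates and hosts this wrapper automatically, so the model is callable the moment it is approved.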

Step-7: making models explainable

Another key challenge with most current machine learning models is the lack of explainability. When any complex machine learning model is developed, it is a must to use methodologies like Local Interpretable Model-agnostic Explanations (LIME) or Shapley values to provide explainability to end users. One should keep in mind that ultimately it is humans who use machine learning models in real business processes; hence, explainability is key to adoption.
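
To show the shape of output an end user needs, the sketch below computes per-feature contributions for a simple linear scorer, with illustrative weights and features. This is not LIME or Shapley values, which are required for genuinely complex models, but the deliverable is the same: a score plus a ranked list of the features driving it:

```python
def explain_linear(weights, baseline, record):
    """Per-feature contributions for a linear scorer:
    score = baseline + sum(weights[f] * record[f]).
    Weights and field names are illustrative assumptions."""
    contributions = {f: w * record.get(f, 0.0) for f, w in weights.items()}
    score = baseline + sum(contributions.values())
    # Rank features by the magnitude of their contribution to the score.
    ranked = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
    return score, ranked

score, ranked = explain_linear(
    {"income_k": 0.02, "prior_defaults": -0.5}, 0.1,
    {"income_k": 10, "prior_defaults": 1})
```

An end user seeing "prior_defaults pulled the score down by 0.5" can sanity-check and trust the decision, which is precisely why explainability drives adoption.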

Approach to address the challenges

To implement the steps above, it is important to have a nimble team and pre-built technology frameworks for each step. Such frameworks could be developed from scratch or used as pre-built utilities available in various data platforms. In addition, it is important to have an agile, use-case-driven approach to problem solving. Often the mind-set of the organization plays a bigger role than the technology challenges, so an agile, learn-as-you-go approach is critical for success.

Bijoy Khandelwal
Head of Engineering & Products at Actify Data Labs