Almost every industry has witnessed an exponential increase in the volume of data and the velocity of change in data, facilitated by a tremendous increase in computation power. Simultaneously, with globalisation and increased competition, most companies have witnessed significant margin pressures. These concurrent phenomena have propelled most companies to explore ways and means of generating insights and predictions from their internal and external data. The potency of using data for driving competitive advantage has become so well recognised that most companies have started viewing data as an information asset rather than as a by-product of business processes. To this end, analysis of data and creation of data-driven strategies have become an integral part of the overall management philosophy.
An enormous amount of research, in both industry and academia, has focused on tools, techniques and methodologies for generating insights, predictions and strategies through the analysis of data, and "data science" has become a catch-all term for the entire set of activities that focuses on generating business value out of data. Some companies have started to view data science as a business function much like marketing, finance, information technology or operations, while others believe that data science is embedded in every function, and that every function needs to use the tools and techniques of analytics to deliver incremental business value.
Irrespective of the approach that an organization adopts for developing data science capability, it is critical to have the right people (and team structure), the right processes and the right tools. Actify Data Labs works with organizations to help build strong data science capabilities.
Building future ready data science teams and how we do it differently
Building data science organizations for the future requires individuals who bring multi-disciplinary skills and expertise. As per the seminal article by Thomas H. Davenport and D.J. Patil, a data scientist needs to bring together a wide variety of skills, including:
- Deep knowledge of quantitative methods including statistical learning, machine learning, simulation and optimization
- A clear understanding of the business context and how data science solutions are used (consumed) by business owners
- Familiarity with the organization's data environment, external data sources and methodologies for managing large volumes of data, unstructured data and streaming data
- Strong programming ability
- A good grasp of the technology landscape
- The ability to manage change
- Excellent storytelling skills
Building a team of multi-skilled data scientists can be a challenge for most organizations. We can help accelerate the process in multiple ways:
- Provide a seed team that can kick-start the data science journey
- Provide high-end skilled resources across both data engineering and data science who could work closely with the existing team to enhance scale and expertise
- Provide a set of advanced data engineering and data science experts to act as guides and reviewers to help the existing team
- Train data engineering and data science teams and transfer capability
The process of developing machine learning models requires a rigorous set of checks to ensure correctness of problem identification, data sufficiency and data quality analysis, methodology selection and testing. We help create a review process that addresses issues such as data sufficiency challenges, methodologies for missing value imputation, handling of truncated data, cross-validation requirements, and interpretability of black-box models using Local Interpretable Model-Agnostic Explanations (LIME) and Shapley values. We also help put in place processes for ongoing tracking of machine learning models, including tracking of stability, discrimination, accuracy, and the correlation structure and predictive pattern of all predictors.
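One widely used measure for the stability tracking mentioned above is the population stability index (PSI), which compares the score distribution at development time against the distribution on recent data. The source does not name a specific statistic, so the following is an illustrative sketch of the idea, not a description of our review templates:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between two score samples.

    Bin edges come from the quantiles of the expected (development)
    sample; a PSI above ~0.25 is commonly read as a major shift.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range scores
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)        # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Synthetic scores standing in for model output over time.
rng = np.random.default_rng(0)
dev_scores = rng.normal(0.5, 0.1, 10_000)   # development sample
stable = rng.normal(0.5, 0.1, 10_000)       # similar recent population
shifted = rng.normal(0.6, 0.1, 10_000)      # drifted recent population

print(psi(dev_scores, stable))   # small: population unchanged
print(psi(dev_scores, shifted))  # large: signals score drift
```

In an ongoing monitoring process, a PSI computed each scoring cycle can trigger a deeper review of discrimination and accuracy when it crosses a threshold.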
Hive, Spark, SQL, R or Python based data preparation pipeline
The first step in building a machine learning model is to bring together data from multiple systems into a single dataset on which the model will be developed. This process includes extracting data, matching data across multiple sources, and aggregating data across the entities of interest (e.g. customer, supplier, equipment). To this end, we build data pipelines using either conventional SQL processing or a high-level processing platform like R or Python. Where data volumes are significantly higher, we use appropriate big data technologies like Hive or Spark to process the data. Based on the infrastructure and processing volumes, we tune the processing parameters to balance disk-based and in-memory processing.
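The extract-match-aggregate pattern described above can be sketched in a few lines of pandas. The tables and column names here are hypothetical stand-ins for real source extracts:

```python
import pandas as pd

# Hypothetical extracts -- in practice these would come from SQL
# queries or file drops from the respective source systems.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "corporate"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [120.0, 80.0, 200.0, 50.0, 75.0, 25.0],
})

# Aggregate transactions to one row per entity of interest ...
txn_summary = (
    transactions.groupby("customer_id")
    .agg(txn_count=("amount", "size"), txn_total=("amount", "sum"))
    .reset_index()
)

# ... then match against the customer master to form the model table.
model_table = customers.merge(txn_summary, on="customer_id", how="left")
print(model_table)
```

The same logical steps map directly onto SQL (`GROUP BY` plus `JOIN`) or Spark DataFrame operations when volumes outgrow a single machine.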
Data pipeline preparation within ADAPTify
We can also leverage our data platform ADAPTify to build data pipelines. ADAPTify comes in two flavours: an RDBMS version and a Hadoop version. The RDBMS version uses SQL for data processing, while the Hadoop version uses Spark. We create drag-and-drop data processing pipelines within ADAPTify, which can be run in an ad-hoc manner or scheduled based on pre-set scheduling requirements. Workflows can be shared across users to facilitate collaboration, and can also be leveraged to standardize data processing across multiple projects.
R and Python based model development and validation process
We create modelling pipelines and utilities in R, Python and Spark. These include a standard set of processes and utilities:
- Data sufficiency and data quality utilities
- Missing value imputation utilities (distinguishing data that is missing at random from missingness that is part of the data-generating process), using MCMC and similar methodologies
- K-fold cross validation
- Outlier detection
- Industry specific feature engineering libraries – financial services, retail and sensor data-based feature engineering libraries
- Univariate predictive power assessment
- Automated hyperparameter tuning
- Model assessment utilities
- Simulation and what-if analysis utilities that are based on an input variance-covariance matrix
- Local Interpretable Model-Agnostic Explanations (LIME) and Shapley value utilities
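As one concrete example from the list above, K-fold cross validation partitions the data into K folds, trains on K-1 of them and scores on the held-out fold, rotating through all splits. A minimal sketch using scikit-learn and synthetic data (the dataset and model choice here are illustrative, not part of our standard utilities):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a prepared modelling dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 5-fold cross validation: train on 4 folds, score on the held-out
# fold, and rotate through all 5 splits.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc"
)
print(scores.mean())
```

The spread of the fold-level scores, not just their mean, is what the review process examines: a large variance across folds suggests an unstable model or insufficient data.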
We also create checklists and review templates that help evaluate and govern machine learning model development processes.
Machine Learning model development and validation process within ADAPTify
Our data platform ADAPTify allows users to build machine learning models using simple drag-and-drop components. It also provides a wide set of helper components, such as a K-fold cross validator, missing value imputer, outlier detector and univariate information value analyser. ADAPTify allows users to build modelling workflows (pipelines) which can be copied and shared across users, so that experienced users can create processes that serve as references for other members of the data science organization. Specific utilities are also provided for handling text and image data.
ADAPTify also provides a wide set of ongoing model validation components, covering stability, accuracy, discriminatory power, variable strength and correlation structures, and supports simulations and what-if analyses using a user-defined correlation matrix.