Comparing Patterns of Big Data Integration

admin | April 9th, 2016

Previously, I introduced five key patterns of big data integration. Here, I review those patterns, then dive into the pros and cons of each.

Five key big data integration patterns

With increased reliance on Hadoop and Spark for data management, processing and analytics, data integration strategies should evolve to exploit big data platforms in support of digital business, Internet of Things (IoT) and analytics use cases. While Hadoop is used for batch data processing, Spark supports low-latency processing. Integration leaders should understand the various patterns of integration described below, and align use cases with vendor offerings.

Native ETL and ELT on Apache Hadoop & Spark platforms – 

    Data integration occurs natively on the Hadoop & Spark platforms, using data integration tools that can generate native code (such as Pig, Hive, MapReduce and Spark).
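To make "generating native code" concrete: these tools typically compile a visual integration flow into jobs for the underlying engine, such as a MapReduce program. A rough conceptual sketch in plain Python (standing in for generated MapReduce code, with hypothetical input lines) of the map, shuffle, and reduce phases:

```python
from collections import defaultdict

# Hypothetical input records; in a real job these would come from HDFS splits.
lines = ["spark etl spark", "hadoop etl"]

# Map phase: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum the counts for each word.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # e.g. {'spark': 2, 'etl': 2, 'hadoop': 1}
```

A code-generating integration tool emits the equivalent of this logic as Pig, Hive, MapReduce or Spark code, so the transformation runs where the data lives rather than in the tool's own engine.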

Data integration offering specific to Hadoop & Spark platforms – 

    Incumbent vendors providing a dedicated big data integration offering (distinct from their traditional data integration offering) that runs on Hadoop & Spark platforms. A separate offering potentially avoids disruption to traditional data integration workflows.

The same offering for both traditional and big data integration needs – 

  • Vendors leveraging the same offering for traditional data and big data integration.
  • Incumbent vendors evolving their traditional data integration product to support big data integration.
  • Emerging vendors providing data integration products that support both traditional and big data integration.

Data pipelines on Hadoop & Spark platforms – 

  • Vendors providing an end-to-end data management solution (ingestion, organization, transformation, enrichment, and quality) that includes integration.
  • Vendors providing an end-to-end analytics solution (ingestion, organization, transformation, enrichment, and analytics) that includes integration.
  • Vendors providing a framework for building and deploying data applications, including integration.

Self-service data preparation using Hadoop & Spark platforms – 

    Self-service data preparation offerings using Hadoop & Spark platforms to support the processing requirements for data preparation.
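At its core, self-service data preparation applies cleansing and standardization steps that these offerings surface through point-and-click interfaces and push down to the big data platform for execution. A minimal sketch in plain Python (the records and field names are hypothetical) of the kinds of transformations involved:

```python
# Hypothetical raw records needing preparation before analysis.
raw = [
    {"name": "  Alice ", "country": "us", "age": "34"},
    {"name": "Bob", "country": "US", "age": None},
]

def prepare(record):
    """Trim whitespace, standardize codes, and cast types."""
    return {
        "name": record["name"].strip(),
        "country": record["country"].upper(),
        # Keep missing ages as None (unknown) rather than failing.
        "age": int(record["age"]) if record["age"] is not None else None,
    }

clean = [prepare(r) for r in raw]
print(clean)
```

In a self-service tool, each of these steps is chosen interactively by a business user, and the resulting pipeline is executed on Hadoop & Spark so it scales to large datasets.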


Table 1 delineates the pros and cons of each of these patterns.

Table 1. Pros and cons of big data integration patterns.
Pattern 1. Native ETL and ELT in Apache Hadoop & Spark (or other distributed data processing) platforms

  Pros:
  • Potential to be cost-effective.
  • Leverages distributed & parallel processing power.
  • ETL/ELT and data management can occur on the same platform.
  • Flexibility to mix and match Hadoop & Spark ecosystem tools.

  Cons:
  • Skill set scarcity.
  • Approach differs from traditional data integration.
  • May lack robust orchestration and troubleshooting.
  • May lack seamless integration with data quality and metadata products.

Pattern 2. Data integration product offering specific to Hadoop & Spark platforms (distinct from traditional data integration offering)

  Pros:
  • Leverages distributed & parallel processing power.
  • Robust orchestration and troubleshooting.
  • Avoids disruption to traditional data integration workflows by avoiding an upgrade to the existing product.

  Cons:
  • Requires different skill sets for big data and traditional data integration.
  • Investment in multiple products (assuming existing ownership of a traditional data integration offering).
  • Lack of a holistic approach to data integration.
  • Maturity may still be evolving.

Pattern 3. The same offering for both traditional and big data integration needs

  Pros:
  • Robust orchestration and troubleshooting.
  • Familiar user interface.
  • Skill set shareable between big data and traditional data integration.
  • Can embed open-source tools (e.g., Apache Kafka, Storm, and Spark MLlib).
  • Some offerings support running the entire integration flow natively in Hadoop & Spark data processing platforms.
  • Holistic approach to data integration.

  Cons:
  • Some offerings may only partially leverage Hadoop & Spark (via push-down processing), while others leverage them natively.
  • Troubleshooting can span both big data and traditional data integration platforms.
  • Maturity may still be evolving.

Pattern 4. Data pipelines in Hadoop & Spark platforms

  Pros:
  • Entire data management or analytical solution runs natively in Hadoop & Spark platforms.
  • Can support traditional data integration needs.

  Cons:
  • May lack compliance with other enterprise data quality, data governance and metadata products and standards.
  • Maturity may still be evolving.

Pattern 5. Self-service data preparation using Hadoop & Spark platforms

  Pros:
  • Can support traditional data integration.
  • Robust orchestration and troubleshooting.
  • Familiar user interface.
  • Skill set shareable between big data and traditional data integration.
  • Can embed open-source tools (e.g., Apache Kafka & Storm).
  • Some offerings support running the entire integration flow natively in Hadoop & Spark platforms.

  Cons:
  • Multiple data footprints requiring governance may reside in the data processing platform, which may be distinct from the data management platform.
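The distinction above between partially leveraging Hadoop & Spark via push-down processing and running natively comes down to where the transformation executes. A rough sketch in plain Python, using sqlite3 purely as a stand-in for the target engine (the table and data are hypothetical): in the ETL style the tool transforms the data itself, while in the push-down (ELT) style the tool loads raw data and delegates the transformation to the engine as a query.

```python
import sqlite3

rows = [("alice", 34), ("bob", 29)]

# ETL style: the integration tool transforms in its own engine, then loads.
etl_loaded = [(name.upper(), age) for name, age in rows]

# Push-down (ELT) style: load raw data, then let the target engine
# (sqlite3 here, Hadoop/Spark in practice) execute the transformation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany("INSERT INTO people VALUES (?, ?)", rows)
elt_loaded = con.execute(
    "SELECT upper(name), age FROM people ORDER BY age DESC"
).fetchall()
```

Both styles produce the same result here; the trade-off is that push-down exploits the distributed engine's parallelism, while in-tool transformation can bottleneck on the integration server.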


Information leaders should leverage the analyses presented above to ascertain the benefits of incorporating the Hadoop & Spark platforms to satisfy their data integration requirements. In particular, they should:

  • Identify use cases (if any) that might benefit from exploiting big data integration platforms.
  • Understand the patterns of big data integration and align their use cases with the relevant pattern(s).
  • Investigate approaches taken by vendors to leverage Hadoop & Spark for big data integration.


Unabashed.io: The information contained herein was obtained from sources understood to be reliable. Lakshmi Randall disclaims all warranties regarding its accuracy, completeness or adequacy, and is not liable for errors, omissions or inadequacies therein. This blog represents the opinions of Lakshmi Randall and therefore should not be construed as statements of fact.

Category: Big Data Integration, Data Integration, Hadoop, Spark