- HDInsight Essentials(Second Edition)
- Rajesh Nadipalli
The next generation Hadoop-based Enterprise data architecture
We will now see how a modern data architecture addresses the pain points of a legacy EDW and prepares the organization to handle the big wave of data. It is designed to handle both structured and unstructured data in a cost-effective and scalable manner. This provides the business with a wide range of new capabilities and opportunities to gain insights.
Instead of a complete EDW replacement, this architecture leverages the existing investment by preserving end-user interfaces that require relational stores. In this model, Hadoop becomes the primary data store and the EDW is used to store aggregates.
The following figure shows you how to transition from legacy EDW-based solutions to a hybrid Hadoop-based ecosystem, where the EDW's role is reduced to hosting aggregated data and enabling queries via well-established relational tools:

The following figure shows you the new reference architecture for a Hadoop-based Data Lake:

Let's take a look at the stack from bottom to top.
Source systems
The following are the data sources in the next generation architecture:
- OLTP: These databases store data for transactional systems such as CRM and ERP, covering manufacturing, inventory, shipping, and other functions
- XML and text files: Data is also received in the form of text files, which are generally delimited, XML, or some other fixed format known within the organization
- Unstructured: Information from websites, Word documents, PDF documents, and other forms that have no fixed structure or semantics
- Machine-generated data: Data captured from automated systems such as telemetry is used primarily for monitoring and performance
- Audio, video, and images: Audio, video recordings, and images that are difficult to analyze due to their binary formats
- Web clicks and logs: Click stream and logs from websites that provide you with valuable information about consumer behavior
- Social media: Messages, tweets, and posts on social media platforms such as Twitter, Facebook, and Google that provide you with consumer sentiment
Data Lake
This is the heart of the architecture that includes storage and compute.
The following are the key data stores for a Data Lake:
- Hadoop HDFS: HDFS is a core component of Hadoop. It is 100 percent open source and provides a data store that can scale with business needs and run on commodity hardware. In this new architecture, all the source data first lands in HDFS and is then processed and exported to other databases or applications.
- Hadoop HBase: HBase is a distributed and scalable NoSQL database that provides a low-latency option on Hadoop. It uses HDFS to store its data files and is hosted on the Hadoop cluster.
- Hadoop MPP databases: MPP stands for massively parallel processing. These databases store their data in HDFS and can be accessed through SQL or APIs, enabling easier integration with existing applications. We are seeing a lot of innovation in this area.
- Legacy EDW and DM: This architecture leverages the current investment in EDW, DM, and MDM. The size of the EDW, however, is reduced as HDFS takes over the heavy lifting and only the summary data is hosted in the EDW.
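The division of labor between these stores can be illustrated with a small, self-contained Python sketch. This is not Hadoop code; it simply models the flow described above, with a list standing in for raw records landed in HDFS and a hypothetical daily summary standing in for the aggregates exported to the EDW:

```python
from collections import defaultdict

# Raw events land first in the Data Lake (standing in for HDFS);
# every record is kept at full fidelity.
data_lake = [
    {"date": "2015-01-01", "user": "alice", "amount": 20.0},
    {"date": "2015-01-01", "user": "bob", "amount": 35.0},
    {"date": "2015-01-02", "user": "alice", "amount": 15.0},
]

def export_daily_summary(raw_records):
    """Aggregate raw records by date; only this much smaller
    summary is pushed to the relational EDW."""
    totals = defaultdict(float)
    for record in raw_records:
        totals[record["date"]] += record["amount"]
    return dict(totals)

edw_summary = export_daily_summary(data_lake)
print(edw_summary)  # {'2015-01-01': 55.0, '2015-01-02': 15.0}
```

The detail remains queryable in the Data Lake, while the EDW holds only the compact summary that existing relational tools expect.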
The following are the processing mechanisms for Data Lake:
- Hadoop Batch (MapReduce): MapReduce is a core Hadoop component and is a good fit to replace ETL batch jobs. MapReduce has built-in fault tolerance and runs on the same HDFS data nodes, so it can scale as demand increases.
- Hadoop Streaming (Storm): Storm provides a distributed real-time computation system on top of the Hadoop cluster. A good use of this technology is real-time security alerts on dashboards, which require low latency and cannot wait for a complete batch execution.
- Hadoop Real time (Tez): Tez is an extensible framework that allows developers to write native YARN applications that can handle workloads ranging from interactive to batch. Additionally, projects such as Hive and Pig can run over Tez and benefit from performance gains over MapReduce.
- Hadoop Oozie workflows: Oozie enables the creation of workflow jobs that orchestrate Hive, Pig, and MapReduce tasks.
- Legacy ETL and stored procedures: This block in the architecture represents the legacy ETL code, which will gradually shrink as the Hadoop ecosystem builds more capabilities to handle various workloads.
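To make the MapReduce model concrete, here is a minimal pure-Python sketch of its three phases, map, shuffle, and reduce, applied to the classic word-count problem. On a real cluster, Hadoop distributes these phases across data nodes; this single-process version only illustrates the programming model:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, like a Hadoop mapper."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle (group pairs by key) and reduce (sum counts per word)."""
    shuffled = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(shuffled, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

log_lines = ["error warn error", "warn error"]
counts = dict(reduce_phase(map_phase(log_lines)))
print(counts)  # {'error': 3, 'warn': 2}
```

Because each mapper works on its own slice of input and each reducer on its own set of keys, the same logic parallelizes naturally, which is what makes MapReduce a good replacement for ETL batch jobs.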
User access
This part of the architecture remains identical to the traditional data warehouse architecture, with BI dashboards, operational reports, analytics, and ad hoc queries. The new architecture does, however, provide additional capabilities such as fraud detection, predictive analytics, 360-degree views of customers, and longer history retention for reports.
The new architecture will also require provisioning and monitoring capabilities similar to those of the EDW-based architecture, including managing deployments, monitoring jobs, and operations.
A Data Lake architecture does have additional components from the Hadoop stack that add to the complexity; these require new tools that typically come with the Hadoop distribution, such as Ambari.
Data governance processes and tools built for the EDW-based architecture can be extended to the Data Lake-based architecture.
Current tools for security on a Hadoop-based Data Lake are not sophisticated, but they will improve over the next few years as Hadoop adoption gains steam.
The core of Hadoop is essentially a filesystem, and managing it requires metadata for organizing, transforming, and publishing the information in these files. This calls for a new metadata component in the Data Lake architecture. Apache HCatalog does have some basic metadata capabilities, but it needs to be extended to capture operational and business-level metadata.
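The kind of operational and business-level metadata meant here can be sketched with a small Python registry. This is a hypothetical illustration, not an HCatalog API; the class, field names, and sample dataset are all invented for the example:

```python
import datetime

class DatasetRegistry:
    """Hypothetical metadata registry for Data Lake files.

    Captures technical metadata (path, format) plus operational
    metadata (load time, row count) and business metadata (owner,
    description) -- the categories HCatalog would need to be
    extended to cover.
    """
    def __init__(self):
        self._entries = {}

    def register(self, name, path, fmt, owner, description, row_count):
        self._entries[name] = {
            "path": path,                # technical metadata
            "format": fmt,               # technical metadata
            "owner": owner,              # business metadata
            "description": description,  # business metadata
            "row_count": row_count,      # operational metadata
            "loaded_at": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
        }

    def lookup(self, name):
        return self._entries[name]

registry = DatasetRegistry()
registry.register(
    name="web_clicks_raw",
    path="/data/landing/web_clicks/2015-01-01",
    fmt="delimited-text",
    owner="marketing",
    description="Raw click-stream events from the public website",
    row_count=1000000,
)
print(registry.lookup("web_clicks_raw")["owner"])  # marketing
```

With such a registry, a user browsing the lake can answer not just "where is this file?" but "who owns it, what does it mean, and when was it loaded?", which is what distinguishes a governed Data Lake from a mere file dump.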