Data Architecture with SAP — Data Fabric
This blog post is a repost from SAP Community.
Today we find a lot of different data architectures available like:
- Data Warehouse
- Data Lake
- Data Lakehouse
- Data Fabric
- Data Mesh
To be honest, the fight for an integrated, central, harmonized, canonical data model is lost. And this is good news, because in recent decades we invested so much chasing this illusory goal that we forgot what is important: creating value from data. Therefore we see the rise of new concepts like Data Mesh, which finds a way to create value in a coordinated, distributed way. And Data Fabric, which in simple words does two things for me: helping a broad range of users find the distributed data they need, wherever it is located, and enabling transparency and access to it.
Defining Data Fabric
While the term “Fabric” can be understood as “the structure of something; the parts of something that hold it together and make it what it is”, the term “Data Fabric” has some variants and has changed over time. I would like to approach the term Data Fabric using different definitions.
According to Claudia Imhoff, Forrester coined the term “Data Fabric” in 2013. They evaluated “Big Data Fabrics” in their Waves of 2016 and 2018, but since the Forrester Wave 2020 they have talked about the “Enterprise Data Fabric”.
Orchestrating disparate data sources intelligently and securely in a self-service manner, leveraging data platforms such as data lakes, data warehouses, NoSQL, translytical, and others to deliver a unified, trusted, and comprehensive view of customer and business data across the enterprise to support applications and insights.
– Forrester, 2021
Gartner, and especially Mark Beyer, is known for propagating the Data Fabric:
A design concept that serves as an integrated layer (fabric) of data and connecting processes. A data fabric utilizes continuous analytics over existing, discoverable and inferenced metadata assets to support the design, deployment and utilization of integrated and reusable data across all environments, including hybrid and multi-cloud platforms.
What I found was that NetApp was one of the earliest vendors to come up with the term Data Fabric in the context of delivering a solution.
A data fabric that seamlessly connects different clouds, whether they are private, public, or hybrid environments. A data fabric unifies data management across distributed resources to allow consistency and control of data mobility, security, visibility, protection, and access.
– NetApp, 2016
So we see different perspectives on Data Fabrics. Forrester and Gartner both see a unified view on data, built on automation and technologies that help handle it. While Gartner emphasizes the strong usage of metadata, Forrester sees a more platform-oriented approach, orchestrating all company data. NetApp shows possibly a more vendor-like, technical approach, but came up with this idea very early, identifying current market challenges.
Data Fabric Architecture
There are different perspectives on which components are included in a Data Fabric architecture and how they should be composed. Unlike with a Data Warehouse or a Data Lake, technologies and concepts are not as mature and well defined here. Therefore a lot of individual vendor-specific implementations can be found, and in the end a Data Fabric does not even need to be built on components of one single vendor, even if a tight integration and a unified user experience would be preferred. Gartner and Forrester have defined reference architectures for how a Data Fabric should look.
Typical elements which can be found as part of a Data Fabric are:
Putting definitions and components together, the following picture is my idea of a rather generic, maybe a little simplified, logical reference architecture:
Data Fabric with SAP
SAP itself started propagating SAP HANA Smart Data Access (SDA) as a functionality for an In-Memory Data Fabric together with BW 7.4 in 2014. Today I see a more differentiated view, as SAP communicates the term “Fabric Virtual Table” as an extended approach to SDA, able to virtualize or replicate as needed, in the sense of “HANA Cloud is a data fabric that virtualizes others in the cloud so you can access. By default, not replicate data, just virtualize. Unified access layer via SQL to remote sources supported”, as Tammy Powlas noted from an SAP webinar.
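To make this tangible, below is a minimal sketch of how such a virtual access layer is set up in SAP HANA Cloud via SDA, using Python with the hdbcli driver. All host names, credentials, schema and table names are placeholders of mine, not taken from an SAP example:

```python
from hdbcli import dbapi

# Connect to SAP HANA Cloud (placeholder host and credentials).
conn = dbapi.connect(
    address="<your-instance>.hanacloud.ondemand.com",
    port=443,
    user="DBADMIN",
    password="<password>",
    encrypt=True,
    sslValidateCertificate=True,
)
cursor = conn.cursor()

# Register a remote HANA system as an SDA remote source.
cursor.execute("""
    CREATE REMOTE SOURCE "SALES_REMOTE" ADAPTER "hanaodbc"
    CONFIGURATION 'Driver=libodbcHDB.so;ServerNode=<remote-host>:443;encrypt=true'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=<remote-user>;password=<password>'
""")

# Create a virtual table on top of it: queries are federated to the
# remote system at runtime; by default no data is replicated.
cursor.execute("""
    CREATE VIRTUAL TABLE "MYSCHEMA"."VT_SALES"
    AT "SALES_REMOTE"."<NULL>"."REMOTE_SCHEMA"."SALES"
""")
```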
But according to Forrester, Data Fabric is not just virtualization:
Data virtualization gives you direct access to transactional systems in real time via a data abstraction layer, while data fabric delivers end-to-end data management capabilities, supporting many more stack components such as data catalog, data preparation, and data modeling.
Today I see SAP more aligned to the Forrester perspective of an Enterprise Data Fabric. SAP's Andreas Wesselmann defines it as follows: “Data fabric focuses on automating the processing, integration, transformation, preparation, curation, governance, and orchestration of all data assets in order to enable real-time analytics and insights for successful business outcomes.” He gives the following guiding principles for an SAP Data Fabric:
- Integrate any data — solve the data deluge and integrate any kind of data
- Ensure data quality — Discover, prepare, and govern all your data assets in the same tool
- Democratize data — through self-service data preparation and automation capabilities
- Any cloud or on-premise — deploy on any mix of hyperscalers, hybrid, or on-premise
- Reuse any engine — orchestrate any SAP or non-SAP data processing engine
In his blog he sees SAP HANA (Cloud) and SAP Data Intelligence (Cloud) as the main technologies enabling the Data Fabric. From what Forrester evaluates, this is generally true, but it is also more than that:
If we create an SAP Data Fabric based on current cloud solutions, a possible instance would look like this:
SAP Data Intelligence (Cloud) plays a major role in this concept and brings with it data orchestration, a data catalog, and data governance features like data quality management. It delivers a lot of platform features for connecting complex system landscapes.
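To give an impression of the orchestration side, here is the rough shape of a custom Python operator as scripted in the SAP Data Intelligence Modeler. Treat this as a hedged sketch: the `api` object is injected by the pipeline runtime (the snippet only runs inside a graph), and the port names and transformation logic are assumptions of mine:

```python
# Script body of a custom Python operator in the SAP Data Intelligence
# Modeler. The `api` object is provided by the pipeline runtime, so this
# snippet runs inside a graph, not as a standalone script.

def on_input(msg):
    # msg.body carries the payload handed over by the upstream operator.
    rows = msg.body
    # Example data quality step: keep only complete records (assumed schema).
    cleaned = [row for row in rows if all(value is not None for value in row)]
    # Hand the result to the downstream operator via the assumed output port.
    api.send("output", api.Message(cleaned, msg.attributes))

# Register the callback on the assumed input port named "input".
api.set_port_callback("input", on_input)
```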
For a deeper dive into and explanation of the integration capabilities of SAP HANA Cloud and its underlying technologies, Smart Data Access (SDA) and Smart Data Integration (SDI), I recommend the blog “SAP HANA Cloud: switch between data federation, replication and snapshot” by Maxime Simon.
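Based on that blog, the switch between federation, replication, and snapshot is essentially a property of the virtual table itself. A rough sketch, reusing the cursor from the earlier sketch; treat the exact statements as an illustration and check the blog and the SAP documentation for your HANA Cloud version:

```python
# Switch the virtual table from pure federation to a snapshot replica:
# the data is copied once and queries are then served from the local copy.
cursor.execute('ALTER VIRTUAL TABLE "MYSCHEMA"."VT_SALES" ADD SHARED SNAPSHOT REPLICA')

# Refresh the snapshot on demand, for example after a nightly load.
cursor.execute('ALTER VIRTUAL TABLE "MYSCHEMA"."VT_SALES" REFRESH SNAPSHOT REPLICA')

# Drop the replica to fall back to live federation against the source.
cursor.execute('ALTER VIRTUAL TABLE "MYSCHEMA"."VT_SALES" DROP REPLICA')
```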
Both SAP Data Warehouse Cloud and SAP HANA Cloud can extend to hyperscaler services for data processing and machine learning through data federation. Both strongly build on the SAP HANA Cloud integration capabilities mentioned above.
In the past, the challenge of data federation lay in the availability of remote data sources. In times of cloud data sources with 99.9% availability, this may no longer be a reason to replicate all data into one database.
Data federation is part of data virtualization and handles queries against remote sources, on premises or in the cloud. In the following, SAP shows examples of how data federation can be used with typical hyperscalers to extend the reach of SAP's technologies while managing everything from a single point of access.
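Once such virtual tables are in place, a federated query is plain SQL. In the sketch below, the remote virtual table from the earlier example is joined with an assumed local table; the optimizer pushes down to the remote source what it can:

```python
# A federated join: "VT_SALES" lives in the remote source, "PRODUCTS"
# is a local HANA table. All object and column names are illustrative.
cursor.execute("""
    SELECT p."PRODUCT_NAME", SUM(s."REVENUE") AS "TOTAL_REVENUE"
    FROM "MYSCHEMA"."VT_SALES" AS s
    JOIN "MYSCHEMA"."PRODUCTS" AS p ON s."PRODUCT_ID" = p."PRODUCT_ID"
    GROUP BY p."PRODUCT_NAME"
""")
for product_name, total_revenue in cursor.fetchall():
    print(product_name, total_revenue)
```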
Information on data federation based on SAP HANA can be found here:
- Data Federation Between SAP HANA Cloud and Amazon Redshift Through SAP HANA smart data integration
- Federating Queries in HANA Cloud from Amazon Athena using Athena API Adapter to Derive Insights
- Data Federation between SAP HANA Cloud and Amazon S3 to Blend Business Data with External Data
- Connect Google BigQuery to SAP HANA Cloud as a Remote Source
Information about data federation based on SAP Data Warehouse Cloud can be found here:
- Federating Queries in SAP Data Warehouse Cloud from Amazon Athena to Derive Insights
- Data Federation Between SAP Data Warehouse Cloud and Azure Data Explorer
- Data Federation Between SAP Data Warehouse Cloud and Amazon Redshift
- Data Federation Between SAP Data Warehouse Cloud ( DWC ) and Google BigQuery
Currently SAP is developing capabilities to make use of machine learning on hyperscalers, steered by SAP Data Warehouse Cloud. They describe it as follows:
“In a nutshell, FedML 2.0 now allows the data scientists to completely automate the end-to-end flow from data sourcing to model training, deployment, prediction and to persist the results back in SAP DWC too, all with just a few lines of code.”
For references, see the GitHub repository and the corresponding blogs; a simplified sketch of the underlying pattern follows the list:
- FedML — The Federated Machine Learning Libraries for Hyperscalers 2.0
- Federated Machine Learning using SAP Data Warehouse Cloud and Amazon SageMaker 2.0
- Federated Machine Learning using SAP Data Warehouse Cloud and Google Vertex AI 2.0
- Federated Machine Learning using SAP Data Warehouse Cloud and Azure Machine Learning 2.0
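I won't reproduce the FedML API here; see the GitHub repository above for the real classes. The following sketch only illustrates the underlying pattern with plain hdbcli and scikit-learn: source training data live from the SQL endpoint of a Data Warehouse Cloud space, train in the hyperscaler environment, and persist predictions back. All endpoint, view, and column names are assumptions of mine:

```python
import pandas as pd
from hdbcli import dbapi
from sklearn.linear_model import LinearRegression

# Connect to the SQL endpoint of a Data Warehouse Cloud space
# (placeholder host and credentials).
conn = dbapi.connect(
    address="<space-endpoint>.hanacloud.ondemand.com",
    port=443,
    user="<space-user>",
    password="<password>",
    encrypt=True,
)
cursor = conn.cursor()

# Source the training data live via the endpoint instead of exporting files.
cursor.execute('SELECT "FEATURE_1", "FEATURE_2", "TARGET" FROM "SALES_VIEW"')
df = pd.DataFrame(cursor.fetchall(),
                  columns=["FEATURE_1", "FEATURE_2", "TARGET"])

# Train the model in the hyperscaler environment, e.g. inside a
# SageMaker or Vertex AI notebook.
model = LinearRegression().fit(df[["FEATURE_1", "FEATURE_2"]], df["TARGET"])

# Persist the predictions back, as FedML 2.0 does with its own helpers.
predictions = model.predict(df[["FEATURE_1", "FEATURE_2"]])
cursor.executemany('INSERT INTO "PREDICTIONS" VALUES (?, ?)',
                   [(i, float(p)) for i, p in enumerate(predictions)])
conn.commit()
```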
Having unified access to all data on the one side, we have to consider that on the consumption side we serve different personas. While I show different possibilities for consumption via SAP tools, access is basically open to all kinds of 3rd-party tools, too.
While technically SAP HANA could be the main integration layer, SAP Data Warehouse Cloud enables additional possibilities while integrating smoothly with the underlying SAP HANA Cloud and its capabilities.
As self-service data preparation and modeling become increasingly important, SAP Data Warehouse Cloud additionally delivers the following aspects:
- Spaces — deliver a self-service infrastructure to handle data management in a more decentralized, business-oriented way
- Data Sharing and Data Marketplace — enable sharing and integrating external data as well as internal data from other Spaces
- Prebuilt business content — make use of SAP's experience in integrating SAP sources and building analytical content
- Business Modeling — the Business Layer enables non-technical users to focus on KPIs and the re-use of business logic based on a semantic layer
Conclusion
Be aware that a Data Fabric only makes sense with a mindset of data democracy and a good data culture in the company. Introducing SAP HANA Cloud or SAP Data Warehouse Cloud does not automatically mean having a Data Fabric. It means organizing this technology as the single access point for the whole organization, driving data quality, establishing high data security standards, and establishing good data governance.
In most cases a company does not start on a green field, so it should develop a clear vision of what Data Fabric means for it, and then take the steps that create the most value. A Data Fabric still has a lot of complexity, even if the tools used are there to help handle it. In the end you still have to deal with harmonization, data understanding, the business behind the data, data semantics, and so on.
Data Fabric is no self-service, one-click solution that makes data management specialists obsolete. If not done right, you just implement additional tools, making your landscape more complex and expensive than before. As with other advanced approaches like Data Mesh, for some companies the classical Data Warehouse can still be the single source of truth and the best approach to handle their analytical needs.
Ultimately, technology is a very dynamic matter, and therefore the Data Fabric concept is constantly evolving. The term is also shaped differently by different providers. I am looking forward to comments, additions, and views on the topic of Data Fabric.