Data Catalogs — Market, Capabilities & Roles

Peter Baumann
Oct 3, 2022

This article gives you a compact overview of the solutions that matter, the capabilities of current Data Catalogs, and the roles involved in introducing a Data Catalog in your company.

Data Catalogs play an important role in today's distributed data landscapes. They help you manage your data assets and give you an overview of relevant and sensitive data within your company, supporting the data culture and all kinds of data workers (which is possibly everyone in your company). New data architecture approaches such as Data Mesh, Data Fabric and the Modern Data Stack are also pushing Data Catalogs to the next level: they use knowledge graphs and active metadata and are evolving into platforms that create “Data Intelligence”.

Fraunhofer defines a Data Catalog as follows:

A Data Catalog is an integrated platform for data curation, matching data supply and demand. It offers users functions to register data; to retrieve and use data; and to assess and analyze data. A Data Catalog therefore should provide a data inventory (for data supply) and features for data discovery (for data demand) as key components. Additional features should support data governance, data assessment, and data analytics, alongside with appropriate features for catalog administration and data collaboration.

— Fraunhofer, 2022

For reference, here is an overview of the Data Catalog solutions that market research has recently looked at:

Fig. 1: Market view and evaluations of Data Catalog solutions

Some remarks for Fig. 1:

  • The color coding of a cell always refers to its own column only
  • For platform evaluations, the wider spectrum was taken into account, not only the narrow Data Catalog view
  • Columns marked with * were only added for solutions that had already been evaluated; the sources may describe more products
  • All market studies are from 2022, as far as can be traced
  • The values are normalized to 0–5
  • The Fraunhofer evaluation refers to the equally weighted sum of the individually considered and evaluated factors
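
The last two remarks can be illustrated with a small sketch. It assumes that normalization means a linear rescaling from each study's own scale to 0–5, and that "equally weighted" means every factor gets the weight 1/n. The article does not state how the figure was actually computed, so the function names and the sample scales below are hypothetical.

```python
from statistics import mean


def normalize(score: float, lo: float, hi: float, target_max: float = 5.0) -> float:
    """Linearly rescale a score from the range [lo, hi] to [0, target_max].

    Assumption: the source study uses a bounded numeric scale.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if not lo <= score <= hi:
        raise ValueError("score lies outside the source scale")
    return (score - lo) / (hi - lo) * target_max


def equally_weighted(factors: dict[str, float]) -> float:
    """Combine factor scores with equal weights 1/n, i.e. take their mean."""
    if not factors:
        raise ValueError("at least one factor is required")
    return mean(factors.values())


# Hypothetical example: a study that rates on a 1-5 scale, and three
# already normalized factor scores for one solution.
rescaled = normalize(3, lo=1, hi=5)
overall = equally_weighted({"inventory": 4.0, "discovery": 2.0, "governance": 3.0})
```

Dividing by the number of factors keeps the combined value on the same 0–5 scale as the individual columns, which makes it comparable across the table.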

In short, the market researchers take different views. BARC sees Data Intelligence Platforms as a further development of Data Catalogs. Forrester considers Data Catalogs in the context of DataOps. Dresner has also looked at solutions that are strongly embedded in analytics products (and are therefore only complementary), as does the website Notion.so.

Comparing the studies, two patterns quickly emerge.

Pattern 1 — The always considered leaders:

  • Alation
  • Collibra
  • Informatica
  • Atlan

These four solutions are almost always considered; Atlan is covered by every source except BARC. In many cases, they are also the leading solutions, with a broad spectrum of capabilities.

Pattern 2 — the non-inclusion of Open Source Data Catalogs:

  • Amundsen (LF AI & Data Foundation — initiated by Lyft)
  • Atlas (Apache — initiated by Cloudera)

These solutions are rarely covered by the bigger market researchers, who typically look only at commercial offerings. We also see that their maturity is often not comparable to that of the commercial offerings. On the other hand, they are community driven and adopt the latest concepts quickly. Another project I would mention, but for which I have not yet seen any coverage, is DataHub, initiated by LinkedIn (although the name is possibly a little misleading).

A capability model for evaluating Data Catalogs can be helpful, but it should be adapted to your individual needs and must be well understood:

Fig. 2: Fraunhofer capability model for Data Catalogs

These capability models can be very helpful for working out the fit between solutions and a company's demands and priorities. Defining use cases and the necessary capabilities should be an early step in selecting the right Data Catalog for a company.

With Data Catalogs, looking only at capabilities and technology is not enough. For a Data Catalog to be used meaningfully, it must be set up to suit the departments and business requirements, and a clear role model must be introduced in the context of Data Governance. The roles can be linked to workflows and processes and, as a rule, are stored in the Data Catalog itself. Depending on the complexity of the organization and the use case for your Data Catalog, the following roles should be considered:

  • Data Owner: Assigned to a business unit and responsible for specific types of data (e.g. data of a specific product). Responsibilities include quality, security, and compliance of data.
  • Data Steward: Responsible for managing data according to business requirements. Data Stewards implement defined data policies and procedures at a business and technical level. They have knowledge of business and data requirements and translate these requirements into technical specifications.
  • Enterprise Data Steward: Assumes an oversight role in data governance activities. Responsibilities include providing leadership, guidance and coordination to Data Stewards across multiple areas, developing a data governance vision, ensuring data governance is executed in alignment with that vision and the business objectives, and creating and managing data governance artifacts.
  • Data Catalog Manager: Responsible for providing the Data Catalog solution. Responsibilities include the high availability of the solution, the timely delivery of updates, and the support of Data Catalog users. Overall, the Data Catalog Manager seeks to improve the experience of all Data Catalog users and acts as a key facilitator.
  • Data Protection Officer: Responsible for ensuring that data users process personal data in accordance with applicable data protection regulations. Data Protection Officers monitor data processing activities within the organization to ensure that data is processed properly. They also take steps to proactively prevent misuse of personal data and promote data protection through technology.
  • Data Architect: Responsible for defining data objects and for creating, deploying and maintaining conceptual and logical data models and mapping them to physical data models.
  • Data Engineer: Responsible for creating and providing the data basis for data analysis. This includes tasks such as data discovery, data preparation, and implementing and maintaining data pipelines.
  • Data Scientist: Responsible for developing, deploying, and maintaining advanced analytics models and therefore more focused on long-term and future developments than a Data Analyst.
  • Data Analyst: Responsible for developing, deploying and maintaining reports and ad hoc analyses and therefore more focused on short-term or past developments than a Data Scientist.
  • Data Citizen: Comes from a business area outside the Advanced Analytics environment and is able to combine domain expertise with Data Science skills. Data Citizens thus bridge the gap between the business world and Advanced Analytics.
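
How roles can be stored with a catalog entry and linked to workflows can be sketched as a simple data structure. This is not the model of any particular Data Catalog product: the workflow step names, the `CatalogAsset` class and the helper functions are hypothetical, and only a subset of the roles listed above is included.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    DATA_OWNER = "Data Owner"
    DATA_STEWARD = "Data Steward"
    DATA_PROTECTION_OFFICER = "Data Protection Officer"
    DATA_ENGINEER = "Data Engineer"


# Hypothetical workflows: each step lists the roles that have to sign off.
WORKFLOW_APPROVERS: dict[str, list[Role]] = {
    "publish_dataset": [Role.DATA_STEWARD, Role.DATA_OWNER],
    "grant_access_to_personal_data": [Role.DATA_OWNER, Role.DATA_PROTECTION_OFFICER],
}


@dataclass
class CatalogAsset:
    """A catalog entry that stores which people hold which role for it."""
    name: str
    assignments: dict[Role, list[str]] = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        self.assignments.setdefault(role, []).append(person)


def approvers_for(asset: CatalogAsset, step: str) -> dict[Role, list[str]]:
    """Resolve the people who must approve a workflow step for this asset."""
    return {role: asset.assignments.get(role, []) for role in WORKFLOW_APPROVERS[step]}


def missing_roles(asset: CatalogAsset, step: str) -> list[Role]:
    """Roles required by the step that nobody holds yet for this asset."""
    return [role for role, people in approvers_for(asset, step).items() if not people]


# Hypothetical asset with an owner and a steward, but no Data Protection Officer yet.
orders = CatalogAsset("sales.orders")
orders.assign(Role.DATA_OWNER, "alice")
orders.assign(Role.DATA_STEWARD, "bob")
gaps = missing_roles(orders, "grant_access_to_personal_data")
```

A check like `missing_roles` makes gaps in the role model visible before a workflow is started, which is one reason to keep role assignments next to the metadata rather than in a separate document.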

The Data Catalog market offers different approaches to Data Catalogs and to the contexts in which they are used. We currently see a dynamic evolution driven by new technological and organizational developments. So far, I see little consideration of Data Catalogs in classical data management, because they are strongly tied to Data Governance and processes and do not actively manage data; they ‘just’ make use of the metadata of the underlying systems. But as data landscapes become more distributed and complex and companies make increasing use of Data Catalogs, I expect to see more consideration in the coming years.


Peter Baumann

As a Consultant for Data & Analytics Strategy I help my customers with topics around Data Strategy. Opinions reflect my personal view. I work @INFOMOTION