With its Fall ’18 release, Talend has published an extension of Talend Metadata Manager: Talend Data Catalog. This introduction describes the new component and what it makes possible.

What is a Data Catalog?

In short, a data catalog is a central collection of information about datasets. It consists of metadata, descriptions and object definitions, such as tables, synonyms, views and indexes. By maintaining this information about data sources, transformations and flows in one place, the data catalog provides, as part of data governance, a central inventory of business information and how it is processed. Any user can consult it, from analysts to data scientists and developers.
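To make the idea concrete, here is a minimal sketch of a catalog as a searchable inventory of object definitions. The names `CatalogEntry` and `DataCatalog` and all fields are our own illustrative assumptions; they are not part of Talend Data Catalog or its API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one object; no actual data is stored here."""
    source: str                      # e.g. a database or file share (hypothetical)
    name: str                        # object name within the source
    object_type: str                 # "table", "view", "synonym", "index", ...
    description: str = ""
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """Illustrative in-memory inventory keyed by (source, name)."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[(entry.source, entry.name)] = entry

    def search(self, term: str) -> list[CatalogEntry]:
        """Case-insensitive match on name, description or tags."""
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry("crm_db", "customers", "table",
                              "Customer master data", {"personal-data"}))
catalog.register(CatalogEntry("crm_db", "idx_orders_date", "index",
                              "Index on order date"))
hits = catalog.search("personal")
```

A real catalog adds harvesting, versioning and access control on top of this; the sketch only shows why a single inventory makes data findable for every type of user.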

Talend Data Catalog

Talend Data Catalog (TDC) is part of the new release of Talend Metadata Manager, which is used for data governance. Within information management, TDC focuses on “minimizing risk and maximizing data usage”.

TDC focuses on creating and controlling a central data catalog. It gives users a secure, single point of control where they can work together to improve the accessibility, accuracy and relevance of the data. This makes it a single source of “reliable data” for the entire organization and supports, among other things, compliance with privacy legislation.

How can TDC be used?

  • Works as a spider/crawler and uses machine learning (and smart semantics) to map all data automatically
  • Indexes data lakes, data warehouses, local apps, etc.
  • Improves data accuracy, compatibility, security and relevance
  • Supports data privacy, regulatory compliance tracking, version control, and audit trails
  • Ensures that end users have faster access to reliable data

Important TDC Features:

  • Data Catalog
    • Multi-layer search, data sampling, semantic discovery, categorizing and auto-profiling
    • Management capabilities with data tagging, adding comments, review, promotion, certification
    • Data relationship discovery
    • Automatic detection of data lakes and other data sources
  • Crawlers and Connectors
    • Crawling and retrieving data from any supported data source (RDBMS, cloud, big data, NoSQL, files)
    • Retrieving from Talend Data Integration, Talend MDM, Talend Data Preparation
    • Retrieving from Salesforce.com and SAP
    • HiveQL parsing
    • SQL parsing
    • Retrieving from any supported tool (data modeling, business intelligence, data integration)
  • Design and Productivity Tools
    • Metadata search and analysis
    • Business Glossary
    • Metadata documentation and enrichment
    • Optional: data modeling and forward engineering
  • Management and Monitoring
    • Metadata documentation and end-to-end data lineage
    • Impact analysis and notification of changes
    • Version control system
    • Approval workflows for business glossary authoring
  • Customizable user interface and REST API

TDC Use Cases

TDC offers many capabilities, as described above, but three activities are key to understanding the tool: discover, curate and explore.

Scenario: a retail use case with a focus on privacy-sensitive data. The administrator must ensure that customers’ personal data is used in accordance with privacy legislation.


Discover

  • Suppose a company processes three types of personal data, namely first name, last name and e-mail address.
  • Someone in the team (data scientist or developer) adds a new file to the catalog without notifying the other team members.
  • TDC automatically notifies the administrator of this.
  • If the file had been modified, the administrator could automatically compare it with the previous versions in the catalog. In this case the file is new, so the administrator views the data directly.
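The difference between a new and a modified file can be illustrated with a small sketch that compares two catalog snapshots by content fingerprint. This is our own simplified illustration, not how TDC detects changes internally; the function names and file names are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to notice that a file has changed."""
    return hashlib.sha256(content).hexdigest()


def classify_changes(previous: dict[str, str],
                     current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots that map file name -> fingerprint."""
    return {
        # Present now, unknown before: nothing to compare against.
        "new": sorted(n for n in current if n not in previous),
        # Known before, but the content differs: candidates for a version diff.
        "modified": sorted(n for n in current
                           if n in previous and current[n] != previous[n]),
        "removed": sorted(n for n in previous if n not in current),
    }


previous = {"customers.csv": fingerprint(b"first_name,last_name\n")}
current = {
    "customers.csv": fingerprint(b"first_name,last_name\n"),
    "newsletter.csv": fingerprint(b"first_name,last_name,email\n"),
}
changes = classify_changes(previous, current)
```

Here `newsletter.csv` ends up under `"new"`, which matches the scenario: there is no earlier version to compare with, so the administrator has to inspect the data itself.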

[Figure: Talend Data Catalog Discover]

Curate

  • While validating the file, the administrator sees that the e-mail address is not masked and has been published without the customer’s consent.
  • The administrator adds a note stating that it is not a valid file.
  • The administrator uses an existing Talend Data Integration (DI) job to mask the e-mail address.
  • TDC automatically refreshes the catalog with the updated data.
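What such a masking step does can be sketched in a few lines. In the scenario this work is done by a Talend DI job; the Python below is only our illustration of the idea, and the function names and the `email` column name are assumptions.

```python
def mask_email(address: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, sep, domain = address.partition("@")
    if not sep or not local:
        return address  # not an e-mail address; leave unchanged
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def mask_rows(rows: list[dict[str, str]], column: str = "email") -> list[dict[str, str]]:
    """Return copies of the rows with the given column masked."""
    return [{**row, column: mask_email(row[column])} for row in rows]


rows = [{"first_name": "Jane", "last_name": "Doe", "email": "jane.doe@example.com"}]
masked = mask_rows(rows)
```

Note that this simple mask still reveals the domain and the length of the local part. Whether that is acceptable depends on the privacy requirements; stricter approaches such as hashing or tokenization may be needed.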

Explore

  • The administrator can track the lineage of the data (Impact & Lineage).
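Lineage and impact analysis both come down to walking a graph of data flows, in opposite directions. The sketch below assumes a hand-written, hypothetical lineage graph; in TDC this information is harvested from the connected sources and tools rather than typed in.

```python
from collections import deque

# Hypothetical data flows: each dataset maps to the datasets derived from it.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["dwh.dim_customer", "newsletter.csv"],
    "dwh.dim_customer": ["bi.sales_report"],
}


def downstream(graph: dict[str, list[str]], start: str) -> list[str]:
    """Impact: everything that is derived, directly or indirectly, from start."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(graph: dict[str, list[str]], target: str) -> list[str]:
    """Lineage: everything target was derived from, found by reversing the edges."""
    reverse: dict[str, list[str]] = {}
    for source, targets in graph.items():
        for t in targets:
            reverse.setdefault(t, []).append(source)
    return downstream(reverse, target)


impact = downstream(LINEAGE, "crm.customers")
origin = upstream(LINEAGE, "newsletter.csv")
```

In the retail scenario, `upstream` answers where the unmasked e-mail addresses in the new file came from, and `downstream` answers which reports and datasets are affected if the customer source changes.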

Traceability

  • The administrator can also perform end-to-end semantic tracing (track and trace for all personal data in the organization).

More information

Would you like more information about Talend Data Catalog or a hands-on demonstration? Please feel free to contact us. We will be happy to give you more insight into, and examples of, the possibilities for data governance, information management, documentation and data lineage.