
Data Discovery

In other blogs, we’ve looked at the types of data that AI can work with and how to spot opportunities where AI may add value. The next step is understanding the data itself. Before training a model or running inference, it is important to know what’s inside your datasets, how reliable they are, and where the gaps may be. This is the role of data discovery, the process that transforms raw inputs into trustworthy foundations for AI.

What is Data Discovery?

Data discovery is the process of exploring and analyzing raw data from different sources to understand its contents, quality, and relationships. It is the first step in any data-driven project, whether for business intelligence, analytics, or AI, because it transforms raw data into knowledge you can act on.

Core Activities in Data Discovery

  1. Exploratory Data Analysis (EDA): EDA is the hands-on process of investigating datasets to summarize their main characteristics. Using techniques like visualization, it helps you uncover patterns, spot anomalies, and form initial hypotheses.
  2. Data Profiling: This is the detective work that digs into the details. Profiling reveals what types of data are present, how values are distributed, and where gaps or errors exist. It helps determine if the data is trustworthy and usable.
  3. Data Visualization: Strictly speaking a tool rather than a separate activity, visualization is used throughout the discovery process. It turns large sets of numbers into easy-to-understand charts, graphs, and dashboards, making patterns and outliers stand out at a glance.
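
To make profiling concrete, here is a minimal sketch using only the Python standard library. The records, column names, and the `profile` helper are hypothetical, made up for illustration; a real project would more likely use a dedicated profiling tool or a dataframe library.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records; None marks a missing value.
records = [
    {"sensor_id": "A1", "temp_c": 21.5, "status": "ok"},
    {"sensor_id": "A2", "temp_c": None, "status": "ok"},
    {"sensor_id": "A3", "temp_c": 22.1, "status": "fault"},
    {"sensor_id": "A1", "temp_c": 95.0, "status": "ok"},
]

def profile(rows):
    """Summarize each column: row count, missing values, distinct values, basic stats."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(2),
        }
        # Add numeric statistics only when every present value is a number.
        if present and len(numeric) == len(present):
            summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        report[col] = summary
    return report

report = profile(records)
for column, summary in report.items():
    print(column, summary)
```

Even on four rows, the summary surfaces the kind of questions discovery is meant to raise: one temperature reading is missing, and the maximum of 95.0 sits far from the other readings and deserves a closer look.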

Related Processes That Follow Data Discovery

The insights gained during data discovery often lead to these next steps:

  • Data Preparation and Cleaning: Discovery often reveals issues like missing values or inconsistencies. The actual process of filling gaps, correcting errors, and structuring the data for analysis happens after discovery.
  • Data Lineage and Governance: Trust in data depends on knowing its origins. The discovery phase can highlight the need to track where data comes from and how it has changed (lineage) and to establish rules for sensitive information (see note below on governance).
  • Active Experimentation: If discovery shows that key data is missing, companies might then choose to collect new information by deploying sensors, running pilot tests, or conducting surveys. This process of generating new data happens after a need has been identified through discovery.
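
As a small illustration of the cleaning step that follows discovery, the sketch below fills gaps with the median of the observed values. The readings and the `fill_missing_with_median` helper are hypothetical, and median imputation is only one of several possible strategies; whether it is appropriate depends on the data and the use case.

```python
from statistics import median

# Hypothetical readings; None marks a gap that discovery has flagged.
readings = [21.5, None, 22.1, 21.9, None]

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from, return a copy unchanged
    fill = median(observed)
    return [fill if v is None else v for v in values]

cleaned = fill_missing_with_median(readings)
print(cleaned)
```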

Digging a Bit Deeper

Without wanting to complicate the picture, I should acknowledge that there is more to this topic.

The Blurring of "Discovery" and "Catalog"

This is perhaps the biggest point of confusion in the industry.

  • Data Catalog: A data catalog is a passive inventory. It's like a library's card catalog, documenting metadata (data about data) such as what data exists, where it's stored, who owns it, and how it's defined. A data catalog's primary purpose is to help people find and understand data.
  • Data Discovery: This is the active process of exploring and analyzing the data itself to find hidden patterns and insights.

The controversy/variation: The lines have become very blurry. Many modern data catalog tools now include robust data discovery capabilities (like data profiling, visualization, and automated lineage) and market themselves as "data discovery platforms." Conversely, many business intelligence and analytics tools, which are used for discovery, now have built-in catalog features. The debate is whether "discovery" is a human-led process or an automated tool function. A common view is that the two are complementary: you use a data catalog to find the right data assets, and then you use discovery tools to explore them.
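
The distinction can be sketched in a few lines of Python. The `CatalogEntry` fields, the dataset names, and the `find_datasets` helper are all hypothetical; the point is only that a catalog search touches metadata, not the data itself.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Metadata about a dataset: describes the data without containing it."""
    name: str
    location: str
    owner: str
    description: str
    tags: list = field(default_factory=list)

# Hypothetical catalog contents.
catalog = [
    CatalogEntry("sensor_readings", "lake/raw/sensors/", "ops-team",
                 "Temperature readings from factory sensors", ["iot", "raw"]),
    CatalogEntry("customer_orders", "warehouse.sales.orders", "sales-team",
                 "Confirmed customer orders", ["sales", "curated"]),
]

def find_datasets(entries, keyword):
    """Catalog step: search names, descriptions, and tags, never the data itself."""
    keyword = keyword.lower()
    return [
        e for e in entries
        if keyword in e.name.lower()
        or keyword in e.description.lower()
        or keyword in (t.lower() for t in e.tags)
    ]

matches = find_datasets(catalog, "sensor")
print([e.name for e in matches])
```

Once the catalog has pointed you to `sensor_readings`, the discovery work proper (profiling, visualizing, questioning the values) begins on the data stored at that location.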

The Role of AI and "Smart Discovery"

The term "smart data discovery" is gaining traction. It refers to using AI and machine learning to automate parts of the discovery process.

  • AI-driven features: This includes automatic data profiling, relationship mapping, and generating natural language insights from data.
  • The variation: While this promises to make discovery faster and more accessible, there's a debate about how much you can trust the machine. A human eye is still critical for validating the AI's findings and applying business context. The "black box" nature of some AI models can also make it difficult to understand how an insight was derived, which can be a barrier to trust.
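
To give a feel for automated relationship mapping, here is a deliberately simple sketch. It is a plain value-overlap heuristic, not machine learning and not how any particular product works; the tables, the `suggest_relationships` helper, and the 0.9 threshold are assumptions chosen for illustration.

```python
def value_overlap(child_values, parent_values):
    """Fraction of distinct non-missing child values also present in the parent column."""
    child = {v for v in child_values if v is not None}
    parent = {v for v in parent_values if v is not None}
    if not child:
        return 0.0
    return len(child & parent) / len(child)

def suggest_relationships(tables, threshold=0.9):
    """Return (child column, parent column, overlap) candidates across tables."""
    columns = [
        (table, col, vals)
        for table, cols in tables.items()
        for col, vals in cols.items()
    ]
    suggestions = []
    for child_t, child_c, child_vals in columns:
        for parent_t, parent_c, parent_vals in columns:
            if child_t == parent_t:
                continue
            # The parent column must look like a key: non-empty and unique.
            non_missing = [v for v in parent_vals if v is not None]
            if not non_missing or len(set(non_missing)) != len(non_missing):
                continue
            score = value_overlap(child_vals, parent_vals)
            if score >= threshold:
                suggestions.append(
                    (f"{child_t}.{child_c}", f"{parent_t}.{parent_c}", score)
                )
    return suggestions

# Hypothetical tables, each stored as column name -> list of values.
tables = {
    "orders": {"order_id": [1, 2, 3, 4], "customer_id": [10, 11, 10, 12]},
    "customers": {"customer_id": [10, 11, 12, 13], "name": ["Ann", "Bo", "Cy", "Di"]},
}

suggestions = suggest_relationships(tables)
print(suggestions)
```

The output is a list of candidates, not confirmed facts. Two unrelated columns of small integers could overlap by coincidence, which is exactly why a human still needs to validate the suggestions against business context.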

Competing Architectural Paradigms

Data discovery is heavily influenced by the underlying data architecture. The rise of new architectures has changed how discovery is performed.

  • Data Lakes: These repositories hold raw data in its original form, whether structured, semi-structured, or unstructured, and were initially difficult for discovery. Without enforced schemas or descriptive metadata, a data lake could easily become a "data swamp," making it nearly impossible to find or trust data.
  • Data Warehouses: These are highly structured and curated, which makes discovery straightforward, but their fixed schemas make them rigid and slow to adapt to new data sources.
  • Data Lakehouses: The new hybrid architecture, combining the flexibility of a data lake with the structure of a data warehouse, is designed to make discovery easier. It places a structured metadata layer on top of a data lake, which helps solve the data swamp problem and makes data more discoverable for both BI and AI workloads.

Industry practice variation: Organizations are still in the process of adopting these new architectures, so the methods and tools for discovery vary widely depending on whether they are working in a legacy data warehouse, a data lake, or a modern lakehouse environment.

Governance and Why it Matters

Governance is the rulebook for data: the set of policies and procedures put in place to make sure data is a valuable asset, not a liability. That means having clear standards for data quality, security, and privacy, so that we can trust the data to be consistent and safe and know we're following the rules.
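
One small way such a policy can show up in practice is a check that flags columns likely to hold sensitive information. The sketch below is an assumption-laden illustration: the pattern list, category names, and `flag_sensitive_columns` helper are hypothetical, and matching on column names alone will miss sensitive data stored under innocuous names.

```python
import re

# Hypothetical policy: column-name patterns treated as sensitive.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "national_id": re.compile(r"ssn|national[-_]?id", re.IGNORECASE),
}

def flag_sensitive_columns(column_names):
    """Map each column whose name matches a policy pattern to its category."""
    flagged = {}
    for name in column_names:
        for category, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(name):
                flagged[name] = category
                break  # first matching category wins
    return flagged

flags = flag_sensitive_columns(
    ["order_id", "Customer_Email", "mobile_number", "temp_c"]
)
print(flags)
```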


Next Steps:

  • In an upcoming post, I will map out a journey leading to the use of boards like the ALC-4096-AIH.
  • This website focuses on AI accelerator boards; see our first board, the ALC-4096-AIH. Another board is coming and will be announced soon.
  • Contact us to discuss projects or opportunities. 
James Henry August 31, 2025