Stage 2: Labeling & Dataset Creation – Giving Meaning to Patterns
In Stage 1 we saw how unsupervised learning can reveal hidden structure in raw data: clusters of similarity, outliers, and natural groupings. But patterns alone aren’t enough. To build a useful model, we need to tell the AI what those patterns mean. That’s where labeling and dataset creation come in.
This stage is often the most time-consuming and resource-intensive step in an AI project, but it’s also the one that transforms raw observations into actionable intelligence.
The Importance of Labeling
Without labels, supervised training is impossible. Labels act as the ground truth, the “answers” that let a model learn the mapping from input data to outcomes. For example:
- A cluster of vibration sensor readings becomes labeled as “normal operation” vs “bearing fault.”
- Video frames from a jobsite are tagged as “person,” “forklift,” or “idle area.”
- Audio clips are annotated as “generator start,” “ambient noise,” or “anomaly.”
By attaching human-understood meaning to the discovered structure, we move from unexplored data to a structured dataset that’s ready for training.
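One way to picture this step is as attaching a human-readable label to each sample that Stage 1 only assigned to a cluster. A minimal sketch in Python (the field names and sample IDs here are illustrative, not a fixed schema):

```python
from dataclasses import dataclass

@dataclass
class LabeledSample:
    sample_id: str   # reference to the raw data (file, row, or frame)
    cluster_id: int  # cluster assigned in Stage 1 (unsupervised)
    label: str       # human-assigned ground truth

# Hypothetical vibration-sensor samples from two Stage 1 clusters
samples = [
    LabeledSample("vib_000142", cluster_id=3, label="bearing fault"),
    LabeledSample("vib_000143", cluster_id=0, label="normal operation"),
]

print(samples[0].label)  # bearing fault
```

The cluster ID preserves the link back to the unsupervised structure, so disagreements between clusters and human labels can later be reviewed.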
Methods of Labeling
- Manual Annotation: Human experts review data samples and apply labels. This provides accuracy but can be slow and costly at scale.
- Semi-Automated Labeling: Unsupervised clusters from Stage 1 are pre-grouped, and humans only confirm or adjust them. This speeds up the process while maintaining quality.
- Rule-Based Tagging: In some cases, logs or events can provide automatic labels. For example, when a machine reports “ON” in its control system, any corresponding sensor data can be labeled as such.
- Synthetic Data & Augmentation: Sometimes it’s easier to generate additional labeled examples (e.g., simulated video footage, augmented sensor signals) to strengthen the dataset.
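Rule-based tagging is the easiest of these to automate. A small sketch, assuming hypothetical control-system “ON”/“OFF” events with timestamps, that labels each sensor reading with the machine state in effect at that moment:

```python
# Control-system events and sensor readings as (seconds, value) pairs.
# All timestamps and values here are made up for illustration.
events = [(0, "ON"), (120, "OFF"), (300, "ON")]
readings = [(10, 0.42), (150, 0.05), (310, 0.47)]

def state_at(t, events):
    """Return the most recent control-system state at time t."""
    state = "OFF"  # assume OFF before the first logged event
    for ts, s in events:
        if ts <= t:
            state = s
        else:
            break
    return state

# Attach the rule-derived label to every reading
labeled = [(t, v, state_at(t, events)) for t, v in readings]
# [(10, 0.42, 'ON'), (150, 0.05, 'OFF'), (310, 0.47, 'ON')]
```

Because the labels come from logs rather than human judgment, they are cheap at scale, but any gaps or clock skew in the event log propagate directly into the dataset.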
Dataset Creation Workflow
On a Digital View AI project, labeling often follows this flow:
- Collect raw data using the Pi CM5 and accelerator-enabled board.
- Identify clusters and anomalies from Stage 1 (unsupervised learning).
- Label representative samples using domain expertise, automation, or hybrid approaches.
- Build a balanced dataset, making sure all categories of interest are represented.
- Store in a structured format that can be used for supervised training in Stage 3.
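The final “store in a structured format” step can be as simple as a CSV manifest that maps each sample file to its label, split into training and validation sets. A minimal sketch; the file names and column names are assumptions, not a Digital View convention:

```python
import csv
import random

# Hypothetical labeled samples: (path to raw data, label)
rows = [
    ("vib_000142.npy", "bearing fault"),
    ("vib_000143.npy", "normal operation"),
    ("vib_000144.npy", "normal operation"),
    ("vib_000145.npy", "bearing fault"),
    ("vib_000146.npy", "normal operation"),
]

random.seed(0)          # reproducible split
random.shuffle(rows)
split = int(0.8 * len(rows))  # 80/20 train/validation split

for name, subset in [("train.csv", rows[:split]), ("val.csv", rows[split:])]:
    with open(name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "label"])  # header a trainer can read
        writer.writerows(subset)
```

A manifest like this keeps the raw data untouched and lets the Stage 3 training script load samples lazily on the Pi.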
Challenges & Considerations
- Scale: A single deployment can generate millions of data points. Labeling only the most informative samples is key.
- Bias: Labels reflect human judgment; if the labeling is inconsistent, the model will inherit that bias.
- Cost: Labeling is often the bottleneck in time and money. Strategies like semi-supervised methods help reduce this burden.
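“Labeling only the most informative samples” can start from the Stage 1 clusters themselves: label the sample nearest each cluster centroid (a representative) and the samples farthest from it (likely anomalies or edge cases). A pure-Python sketch on 1-D features for brevity; a real pipeline would use numpy or scikit-learn on full feature vectors:

```python
def pick_samples(values, k_centroid=1, k_outlier=1):
    """Pick cluster representatives (nearest to the centroid) and
    outliers (farthest from it) as the first candidates to label."""
    centroid = sum(values) / len(values)
    by_dist = sorted(values, key=lambda v: abs(v - centroid))
    return by_dist[:k_centroid], by_dist[-k_outlier:]

# One hypothetical cluster of sensor features; one value is a likely anomaly
cluster = [0.9, 1.0, 1.1, 1.05, 4.2]
reps, outliers = pick_samples(cluster)
# reps == [1.1], outliers == [4.2]
```

Labeling a handful of representatives per cluster plus the outliers typically covers far more of the data distribution than labeling the same number of random samples.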
Labeling is Fundamental
For Digital View boards, labeling isn’t just an academic step. It determines whether the deployed edge model in later stages will be useful in the field. An unlabeled cluster might just look like “Activity A,” but once labeled, it becomes “Excavator operation after hours,” which is suddenly actionable for safety or compliance monitoring.
The value of edge AI doesn’t come from running models; it comes from running the right models trained on the right labels.
Looking Ahead
With a labeled dataset in hand, we’re ready for Stage 3: Supervised Training - building models that learn from those labels and can later be optimized for the accelerators on Digital View boards.