Data labeling is one of the most fundamental aspects of machine learning. It is also an area where organizations often struggle, both to classify data correctly and to reduce potential bias.
With data labeling technology, a dataset used to train a machine learning model is first analyzed and given a label that provides a category and a definition of what the data actually is. While data labeling is an important component of the machine learning process, it has also recently been shown to be highly inconsistent. The need for accurate data labeling has fueled a bustling market of data labeling providers.
Among the most popular data labeling technologies is the open-source Label Studio, backed by San Francisco-based startup Heartex. The new Label Studio 1.6 update, released today, gives users new features to help them better analyze and label data inside videos.
According to Michael Malyuk, co-founder and CEO of Heartex, the challenge for most companies with artificial intelligence (AI) is having good data to work with.
“We think of labeling as a broader category of dataset development, and Label Studio is an end-to-end solution that allows you to do any kind of dataset development,” says Malyuk.
Defining data labeling categories is a challenge
While the main new feature in Label Studio 1.6 is the ability to play video, Malyuk emphasizes that the technology is useful for any type of data, including text, audio, time series, and video.
Among the biggest problems with any labeling approach, for all data types, is defining the categories used for data labels.
“Some people can name things one way, some people can name things another way, but they basically mean the same thing,” Malyuk said.
He explains that Label Studio provides a classification of labels that users can choose from to describe a piece of data, be it a text, audio, or image file. If two or more people in the same organization label the same data differently, Label Studio flags the conflict so that it can be analyzed and remedied. Label Studio offers both a manual conflict resolution system and an automated approach.
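The conflict-detection idea described above can be illustrated with a minimal sketch. This is not Label Studio's API or its actual algorithm; the record format, the sample data, and the function names `find_conflicts` and `resolve_by_majority` are hypothetical, and majority voting is just one possible automated resolution strategy.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (item_id, annotator, label).
annotations = [
    ("clip-1", "alice", "car"),
    ("clip-1", "bob", "automobile"),
    ("clip-2", "alice", "pedestrian"),
    ("clip-2", "bob", "pedestrian"),
]


def find_conflicts(records):
    """Return {item_id: {annotator: label}} for items whose annotators disagree."""
    by_item = defaultdict(dict)
    for item_id, annotator, label in records:
        by_item[item_id][annotator] = label
    return {
        item_id: labels
        for item_id, labels in by_item.items()
        if len(set(labels.values())) > 1
    }


def resolve_by_majority(labels):
    """Pick the most common label; return None on a tie so a human can decide."""
    ranked = Counter(labels.values()).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the item for manual review
    return ranked[0][0]


conflicts = find_conflicts(annotations)
```

Here "car" and "automobile" are flagged as a conflict on the same clip, which is the kind of naming mismatch Malyuk describes. A tie is returned as `None` so that it falls back to manual resolution.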
Vector databases vs data labeling?
The data labeling process can often involve manual work, with humans assigning a label or confirming that the label is correct.
There are several approaches to automating the process. Startup Lightweight AI is using a self-supervised machine learning model that can integrate with Label Studio. Then there are vendors that use vector databases to convert data into mathematical representations, rather than using data labels to identify the data and its relationships.
Malyuk says that vector databases have their uses and can be effective for tasks such as finding similarities. In his view, the problem is that the vector approach is not effective with unstructured data types such as audio and video. He notes that vector databases can use identifiers for common objects.
“As soon as you start deviating from that common sense into something a little bit different, it gets very complicated without manual labels,” says Malyuk.
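To make the similarity-search idea concrete, here is a minimal sketch of how a vector approach compares items without labels. It does not use any real vector database. The three-dimensional embeddings, the file names, and the function names are made up for illustration, and cosine similarity is assumed as the distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up embeddings; a real system would get these from a trained model.
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1],
    "truck.jpg": [0.0, 0.1, 0.95],
}


def most_similar(query, index, k=1):
    """Return the names of the k stored vectors closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda kv: cosine_similarity(query, kv[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

The sketch can say that two vectors are close, but it cannot say what either item is. That is the gap Malyuk argues manual labels fill once the data moves away from common objects.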
How data labeling can identify and minimize AI bias
Bias in AI is an ongoing challenge that many in the industry are trying to combat. At the root of machine learning is the actual data, and the way the data is labeled can also lead to bias. Bias can be intentional, and it can also be circumstantial.
“If you label a dataset very subjectively in the morning before coffee and then after coffee, you can get very different answers,” says Malyuk.
While you can’t always guarantee that data labeling is performed only by people who are fully caffeinated, there are procedures that can help. Malyuk says that on the software side, Label Studio provides a way to build a process in which everyone contributes. The system builds matrices that compare annotators and how each of them labeled the same items. It’s an approach that Malyuk says can identify bias for a particular label.
The open-source Label Studio technology is intended for individuals and small teams, while the commercial project provides enterprise features for larger teams in terms of security, collaboration, and openness.
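The matrices mentioned above can be pictured with a small sketch. This is an illustration only, not Label Studio's implementation; the sample data and the function names `pairwise_agreement` and `contested_labels` are hypothetical, and simple percent agreement is assumed as the metric.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: labels[annotator][item_id] = chosen label.
labels = {
    "alice": {"t1": "positive", "t2": "negative", "t3": "positive", "t4": "neutral"},
    "bob": {"t1": "positive", "t2": "negative", "t3": "negative", "t4": "neutral"},
    "carol": {"t1": "positive", "t2": "neutral", "t3": "negative", "t4": "neutral"},
}


def pairwise_agreement(labels):
    """Fraction of shared items on which each pair of annotators chose the same label."""
    matrix = {}
    for a, b in combinations(sorted(labels), 2):
        shared = labels[a].keys() & labels[b].keys()
        if not shared:
            continue  # no overlap, so agreement is undefined for this pair
        same = sum(labels[a][item] == labels[b][item] for item in shared)
        matrix[(a, b)] = same / len(shared)
    return matrix


def contested_labels(labels):
    """Count, per label, the items where it was chosen without unanimous agreement."""
    items = set().union(*(votes.keys() for votes in labels.values()))
    counts = Counter()
    for item in items:
        chosen = {votes[item] for votes in labels.values() if item in votes}
        if len(chosen) > 1:
            counts.update(chosen)
    return counts


agreement = pairwise_agreement(labels)
contested = contested_labels(labels)
```

In this toy data, "negative" is involved in more disagreements than any other label. That makes it the first label to review for an unclear definition or a skewed annotator.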
“With open source, we are user-focused, and we are trying to make the lives of individual users as easy as possible from a labeling perspective,” said Malyuk. “With business, we focus on the organization and whatever the business needs, it’s there.”