According to a 2020 MicroStrategy survey, 94% of enterprises report that data and analytics are crucial to their growth strategy. And yet, surprisingly, as much as 73% of the data that enterprises collect is never used, including the vast majority of what is termed “categorical data.”
Why would enterprises ignore an entire class of data, especially when it is essential to high-priority use cases like personalization, customer 360, fraud detection and prevention, network performance monitoring, and supply chain management?
The simple answer is that using categorical data with today’s tools is complex, and most data scientists aren’t trained to use it. Figuring out how to use categorical data will help companies solve complex problems that have long evaded them. And they’ll be able to do so with data they already have.
Here’s a look at categorical data, why it’s hard to wrangle, and how it could be useful.
Categorical Data 101
There are two main types of data: categorical and numerical. Numerical data, as the name implies, refers to numbers. Categorical data is everything else.
More specifically, categorical data describes categories or groups: values that label what something is rather than measure how much of it there is.
Examples of categorical data include:
- A list of most popular baby names;
- Census data, such as citizenship, gender, and occupation;
- ID numbers, phone numbers, and email addresses;
- Brands (Audi, Mercedes-Benz, Kia, etc.).
In some instances, the same information can be expressed either numerically or categorically. For example, a weather forecast can read “60% chance of rain” or “partly cloudy.” Both tell us roughly the same thing, but the data takes a different form.
The Challenges of Categorical Data
The same thing that makes categorical data so powerful makes it challenging. While it is easy for you and me to tell the relative difference between a dog and a plane versus a dog and a cat, doing so computationally is not so straightforward.
To express the difference between two pieces of categorical data, one typically needs graph-based analytical tools or a background in graph theory. This is why “knowledge graphs” have recently become a hot topic.
Since graph tools are not so widespread in today’s enterprise and academic landscape, data scientists instead fall back on the statistical techniques they know and for which tools are readily available. Most machine learning algorithms can only handle numerical data, which leaves two options. One is to count instances of categorical values, which has real but limited utility. The other is to turn categorical data into numeric values using one of several encoding techniques. These techniques all tend to be slow and to produce poor results, and they can even make some goals, like anomaly detection, impossible.
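To make the two fallback options concrete, here is a minimal sketch in Python. The observations are made up, and ordinal encoding (numbering the categories in sorted order) is just one of the several encoding techniques, chosen here for illustration.

```python
from collections import Counter

# Hypothetical observations of a categorical field.
observations = ["dog", "cat", "plane", "dog"]

# Option 1: count how often each category appears.
counts = Counter(observations)

# Option 2: ordinal encoding, assigning each category an integer
# in sorted order (cat -> 0, dog -> 1, plane -> 2).
codes = {category: index
         for index, category in enumerate(sorted(set(observations)))}
encoded = [codes[value] for value in observations]

# The integers imply distances that carry no real meaning.
dog_cat_gap = abs(codes["dog"] - codes["cat"])
dog_plane_gap = abs(codes["dog"] - codes["plane"])
print(counts, encoded, dog_cat_gap, dog_plane_gap)
```

In this sketch the encoded gap between a dog and a cat equals the gap between a dog and a plane, because the numbers reflect alphabetical order and nothing else. That is one way an encoding can erase exactly the distinction a person finds obvious.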
Using categorical data comes with another challenge: high cardinality. Cardinality refers to the number of distinct values a particular category can take. For example, the cardinality of a list of all models of iPhone ever made is a relatively manageable 34. On the other hand, a list of serial numbers for all 2.2 billion iPhones sold since production began represents a high-cardinality data set.
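As a rough illustration, cardinality can be measured by counting the distinct values in a column. The records, model names, and serial numbers below are invented, and the `cardinality` helper is a hypothetical name used only for this sketch.

```python
# Hypothetical device records; the values are made up for illustration.
records = [
    {"model": "iPhone 12", "serial": "SN-0001"},
    {"model": "iPhone 12", "serial": "SN-0002"},
    {"model": "iPhone 13", "serial": "SN-0003"},
    {"model": "iPhone 12", "serial": "SN-0004"},
]

def cardinality(rows, column):
    """Return the number of distinct values observed in one column."""
    return len({row[column] for row in rows})

# Repeated models collapse into a small set of distinct values.
model_cardinality = cardinality(records, "model")
# Serial numbers are unique, so this grows with every new record.
serial_cardinality = cardinality(records, "serial")
print(model_cardinality, serial_cardinality)
```

The model column stays small no matter how many devices are sold, while the serial column gains a new distinct value with every row. The second kind of column is the one that strains traditional approaches.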
The size and complexity of traditional analytical approaches quickly spiral out of control with high-cardinality data. Additionally, almost all tools for turning categorical values into numbers (like one-hot encoding) require a fixed set of possible values known in advance. Because some values in a high-cardinality data set are not known ahead of time, this poses a problem: those tools cannot represent data they have never seen.
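A hand-rolled one-hot encoder shows both problems at small scale. This is a simplified sketch, not any particular library’s implementation; the function names are hypothetical, and raising an error on an unseen value is one possible policy among several.

```python
def fit_vocabulary(values):
    """Fix the set of known categories, in sorted order."""
    return sorted(set(values))

def one_hot(value, vocabulary):
    """Return a 0/1 vector with one slot per known category."""
    if value not in vocabulary:
        # The encoder has no slot for a value it never saw.
        raise ValueError(f"unseen category: {value!r}")
    return [1 if value == known else 0 for known in vocabulary]

brands = ["Audi", "Kia", "Mercedes-Benz", "Kia"]
vocabulary = fit_vocabulary(brands)
kia_vector = one_hot("Kia", vocabulary)

# A brand that was not in the original data cannot be encoded.
try:
    one_hot("Volvo", vocabulary)
    unseen_encoded = True
except ValueError:
    unseen_encoded = False
print(vocabulary, kia_vector, unseen_encoded)
```

Each vector is as long as the vocabulary, so a column with billions of distinct values would need billions of slots per row. Any value missing when the vocabulary was fixed has nowhere to go.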
With all these challenges, you can begin to understand why enterprises end up ignoring categorical data altogether.
So, What Can You Do with Categorical Data?
The enormous and unrealized value of categorical data for enterprises resides in its ability to represent the relationships between values in a way humans can readily understand and express.
These relationships can include all the properties associated with an object – I am tall, blonde, married, and have two children – or the relationship between two objects – I wrote this article, and you are reading this article.
You can use categorical data to efficiently group and connect classes of objects; for example, you can show all tall, blonde, married authors and the readers of their articles, organized by geographic area and hobby. In doing so, you can uncover insights that would otherwise stay hidden.
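The grouping described above can be sketched with plain sets and dictionaries. Every name, article, and region here is invented, and the data layout is an assumption made for brevity; hobbies are left out to keep the example short.

```python
from collections import defaultdict

# Hypothetical people and their categorical properties.
people = {
    "ana": {"tall", "blonde", "married", "author"},
    "ben": {"tall", "married", "author"},
    "cho": {"reader"},
    "dev": {"reader"},
}
# Hypothetical relationships between people and articles.
wrote = {"ana": ["article-1"], "ben": ["article-2"]}
read_by = {"article-1": ["cho", "dev"], "article-2": ["dev"]}
region = {"cho": "Oregon", "dev": "Texas"}

# Step 1: keep people who have every wanted property (subset test).
wanted = {"tall", "blonde", "married", "author"}
matching_authors = sorted(
    name for name, props in people.items() if wanted <= props
)

# Step 2: follow author -> article -> reader, grouping readers by region.
readers_by_region = defaultdict(set)
for author in matching_authors:
    for article in wrote.get(author, []):
        for reader in read_by.get(article, []):
            readers_by_region[region[reader]].add(reader)
print(matching_authors, dict(readers_by_region))
```

Nothing here is converted to a number. The query works entirely by matching labels and following relationships, which is the kind of reasoning categorical data supports well.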
When you combine this “relationship thinking” with a computer’s ability to process enormous amounts of data, the astonishing power of categorical data becomes apparent.
The Strengths of Graph Technology
With the emergence of graph technology in recent years, enterprises can finally represent these relationships directly.
A graph is built of nodes and edges; you can picture nodes as circles and edges as arrows that connect them. The node-edge-node pattern connects two categorical values (the nodes) through a relationship (the edge). This is a natural way to represent data because the node-edge-node pattern corresponds perfectly to the subject-predicate-object pattern at the core of natural human language. So anything you can say in words can be represented naturally in a graph. You can then analyze the relationships between values by following the connections between them in the graph.
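Here is a minimal sketch of the node-edge-node idea, storing each statement as a subject-predicate-object triple. The triples are illustrative, and using hop count as a rough measure of how closely two values are related is an assumption of this example, not a description of any specific graph product.

```python
from collections import defaultdict, deque

# Hypothetical subject-predicate-object triples (node-edge-node).
triples = [
    ("dog", "is_a", "mammal"),
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("animal", "is_a", "thing"),
    ("plane", "is_a", "vehicle"),
    ("vehicle", "is_a", "thing"),
    ("author", "wrote", "article"),
    ("reader", "is_reading", "article"),
]

def build_adjacency(triples):
    """Map each node to its neighbors, ignoring edge direction."""
    neighbors = defaultdict(set)
    for subject, _predicate, obj in triples:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)
    return neighbors

def hops(neighbors, start, goal):
    """Breadth-first search: fewest edges between two nodes, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # no path connects the two nodes

graph = build_adjacency(triples)
dog_to_cat = hops(graph, "dog", "cat")
dog_to_plane = hops(graph, "dog", "plane")
author_to_reader = hops(graph, "author", "reader")
print(dog_to_cat, dog_to_plane, author_to_reader)
```

In this toy graph a dog is two hops from a cat and five hops from a plane, which matches the intuition that the first pair is more alike. The author and reader are linked through the article they share.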
The challenge of using categorical data is like having a pantry of canned food and no can opener. There’s food there, but you have no tools to access it. Instead of looking at the same data with the same approach, the next generation of streaming graph data tools needs to make categorical data more accessible and usable. We already see the success of categorical data as the key to improving anomaly detection in cybersecurity. But it’s only now that the tools for using this data to solve challenging problems are becoming available.
About the author: Ryan Wright is the Founder & CEO of thatDot, and has been leading software teams focused on data infrastructure and data science for two decades. He has served as principal engineer, director of engineering, and principal investigator on DARPA-funded research programs.