Data modeling has consistently proved to be one of the most multifaceted, influential disciplines in the data ecosystem. It touches everything from data governance to data science, and it comprises a plethora of areas, including conceptual, canonical, logical, physical, entity-relationship, and domain models.
Because data modeling is such a sprawling, essential aspect of data management, the numerous developments affecting the former serve as a microcosm of those impacting the latter. Scrutiny of these trends not only reveals the present and future state of data modeling, but also the trajectory of data-driven processes in general.
Therefore, it’s extremely consequential that many facets of data modeling have become automated and implicit in elements of data cataloging, Business Intelligence, and data virtualization. “It depends on your lens,” explained Adrian Estala, Starburst VP of Data Mesh Consulting. “If you’re the data engineer, you’re talking about data models. If I’m a consumer, I don’t even know what a data model is. I don’t care; I’m just trying to answer this question.”
However, for business users to consume data to answer their questions—for data to be understandable and relevant to whatever objective they require data for—there are a number of concerns involving semantics, schema, and business metadata that must be defined and implemented. Numerous strides have been made to accelerate these and other fundamentals of data modeling.
Consequently, this discipline is characterized by an almost newfound sense of fluidity that will continue to improve its efficacy in employing data to solve business problems.
The notion of schema will almost always remain foundational to, if not synonymous with, data modeling. Rigid relational paradigms still exist, yet have been supplemented by more pliant, schema-on-read options including JSON, Avro, and Parquet. As a result, this dimension of schema for data integrations is no longer as limiting as it once was. In terms of the business use of data, however, schema is part of a conceptual model that’s necessary to depict “an internal representation of the world,” commented Franz CEO Jans Aasman. That worldview is the business concepts that data represents; oftentimes that view consists of established enterprise knowledge about what data objects are in relation to business domains.
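The schema-on-read idea can be illustrated with a minimal sketch: raw JSON records are stored as-is, and a schema (here, just the fields a consumer cares about) is applied only at read time. The records and field names below are hypothetical stand-ins, not taken from any tool discussed in this article.

```python
import json

# Two raw records with different shapes; in a schema-on-write world,
# the second record (missing "revenue") might be rejected at load time.
raw_records = [
    '{"id": 1, "name": "Acme", "revenue": 120000}',
    '{"id": 2, "name": "Globex", "region": "EMEA"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: project each record onto the
    requested fields, tolerating extras and defaulting missing ones."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in schema}

rows = list(read_with_schema(raw_records, ["id", "name", "revenue"]))
# The second row loads cleanly; its "revenue" is simply None.
```

The point is that the schema lives with the query, not the storage, so differently shaped records can coexist in the same source.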
Such a domain model, which is also called an ontology or subject area model, provides definitions of specific characteristics of business objects. In that respect, “you can compare an ontology to a relational database schema,” Aasman remarked. For example, if a person was a concept in a domain model, “You could say a person is a human,” Aasman mentioned. “It can have zero or more children, have zero or more spouses, have preferably two legs and two arms, and have an address.” One of the most significant developments pertaining to domain models in 2023 is that there are now a number of vertical-specific ones for industries like manufacturing, finance, life sciences, supply chain, and more.
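Aasman’s person example translates directly into code. A real domain model would typically live in an ontology language such as OWL, but the same constraints can be sketched in plain Python for illustration; the class and field names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Human:
    name: str

# "A person is a human" becomes a subclass relationship; the
# cardinalities ("zero or more children, zero or more spouses")
# become list fields that may legitimately stay empty.
@dataclass
class Person(Human):
    address: Optional[str] = None
    children: list = field(default_factory=list)
    spouses: list = field(default_factory=list)

alice = Person(name="Alice", address="12 Main St")
alice.children.append(Person(name="Bob"))
```

An ontology adds what this sketch lacks: machine-readable definitions that tools can reason over, rather than constraints buried in application code.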
Oftentimes, these industry-specific domain models are credible starting points for organizations to tailor according to their specific use cases, needs, and standards. More importantly, perhaps, they’re frequently accompanied by taxonomies to clarify the semantics used to describe data and their relation to business objects. These hierarchies of enterprise definitions are integral to building a department’s or organization’s worldview because they clarify the terms that describe it.
According to Yseop CEO Emmanuel Walckenaer, it’s not uncommon for vendors to expedite the subject area model process by offering industry-specific taxonomies. There are several means of automating the management of vocabularies necessary for the semantic clarity such models provide, including:
- Codeless Options: Users can avail themselves of tools that don’t involve code to refine their ontologies, including the specific words (and meanings) in vocabularies. “If a customer has a specific vocabulary and doesn’t want to use the word ‘revenue’ but ‘sales’ or whatever, everybody has their own sense of adage or jargon,” Walckenaer posited. “We offer a no-code studio so he can do that very easily without specialists.”
- Inference Techniques: There are also mechanisms involving business rules and terminology that reinforce semantic consistency for subject models. According to Josh Good, Qlik VP of Global Product Marketing, “In other tools, that can be a requirement: that you have to go in there and put all that business glossary and business rules information in. [Our software engine] will infer them.” Even with such techniques available, organizations can improve their accuracy by uploading vocabularies or taxonomies as desired.
- Recommendations: It has become fairly common for data catalogs to recommend glossary items via an artful combination of data harvesting (in which organizations simply point tools at sources to ingest metadata and aspects of schema), cognitive computing, and regular expressions. According to Alation Field CTO John Wills, some solutions “automatically infer and extend metadata about assets using [software engines]. Examples include adding classifications, tags and descriptions, and suggesting glossary terms.”
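The regular-expression leg of such recommendation pipelines can be sketched simply: harvested column names are matched against patterns, and each hit yields a suggested glossary term or classification. The patterns and term names below are illustrative assumptions; real catalogs combine this with machine-learning inference over profiled values.

```python
import re

# Hypothetical pattern-to-term mapping; a production catalog would
# maintain far richer rules and confidence scores.
PATTERNS = {
    re.compile(r"e[-_]?mail", re.I): "Email Address",
    re.compile(r"(^|_)ssn($|_)", re.I): "Social Security Number",
    re.compile(r"rev(enue)?|sales", re.I): "Revenue",
}

def suggest_terms(column_name):
    """Return glossary terms whose patterns match a harvested column name."""
    return [term for pattern, term in PATTERNS.items()
            if pattern.search(column_name)]

suggest_terms("customer_email")   # -> ["Email Address"]
suggest_terms("net_revenue_usd")  # -> ["Revenue"]
```

Suggestions like these are surfaced to stewards for approval rather than applied automatically, which keeps the human curation step in the loop.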
Business and Technical Metadata
Granted, the ready availability of industry-specific subject area models and taxonomies will rarely replace the effectiveness of those that have been specifically devised by organizations for their own use cases. However, these preset approaches, and the means of tailoring them with low-code mechanisms, operate as a credible launching point to reduce the time and effort required for data modeling, so organizations can focus on using data for analytics and other applications. “You don’t have to go out and do a project where you do a huge amount of work and hope people use it,” Good reflected. “This enables a type of approach that makes [modeling] more useful because one, you get it today, and two, it gets smarter tomorrow.”
Additionally, advancements in software engines for analytics, federated query options, and data visualizations can reduce the time spent crafting data models for these applications. Good referenced a tool that supports natural language conversations between users and structured datasets. By concentrating on the connections between tables and specific fields, it enables users to answer multiple questions via “many to many joins, and then traverse them in both directions at the same time, to get the correct answer from both questions,” Good stipulated. “That’s a very unique thing. Normally you’d have to build two totally separate data models and take a very specific perspective on the data to prevent double counting and things like that.”
As the assortment of choices for hastening the process of constructing taxonomies indicates, these hierarchies of definitions—which are searchable according to categories and terms—have an enduring relevance for data models. Even when organizations use pre-built taxonomies (or if they need to create a new one), their ultimate success depends on how well they curate them. Best practices for doing so include:
- Assembling a Corpus: According to Aasman, it’s critical for subject matter experts to “find in a particular domain the top hundred or thousands of documents that really cover almost every word that they find important in their work.” This step not only assists with devising a taxonomy for a domain model, but also with applying it to natural language technologies.
- Word Extraction: The next step is to single out the words in the corpus that are representative of important concepts in the model. “There is software you can use to analyze the corpus and come up with the most frequently used words that are not in your taxonomy,” Aasman noted.
- Gamification: When applying a domain model and its taxonomy to rigorous analytics exercises such as Natural Language Processing, it’s incumbent upon organizations to make it as exhaustive as possible. This practice also yields the best taxonomies for other applications. “You must look at a taxonomy like a game where you get a point for every concept you get out of the documents,” Aasman said. “The question is who can get the most points. Meaning, how can I make as many alternative synonyms for the most important concepts that I need?”
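The word-extraction step Aasman describes can be sketched with standard-library tooling: count terms across the corpus and surface the most frequent ones not yet in the taxonomy as candidates for subject matter experts. The corpus, taxonomy, and stopword list below are toy stand-ins.

```python
from collections import Counter
import re

corpus = [
    "The pump pressure exceeded the valve tolerance",
    "Replace the valve when pressure drops below tolerance",
]
taxonomy = {"pump", "valve"}          # concepts already captured
stopwords = {"the", "when", "below"}  # noise to ignore

# Count every word that is neither already in the taxonomy nor a stopword.
counts = Counter(
    word
    for doc in corpus
    for word in re.findall(r"[a-z]+", doc.lower())
    if word not in taxonomy and word not in stopwords
)

# The top candidates are handed to experts for review, not auto-added.
candidates = [word for word, _ in counts.most_common(3)]
```

“pressure” and “tolerance” each appear twice, so they surface ahead of one-off words; in practice the review step, and the synonym “game” described above, determine what actually enters the taxonomy.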
The centrality of data modeling to data management will likely remain for some time. What’s most notable about this fact is that it’s becoming considerably simpler to perform some of the basic tasks associated with this discipline. From flexible, schema-on-read models to the multitude of options for clarifying semantics and implementing the basics of domain models, data modeling’s effectiveness is increasing almost as much as the effort it requires is decreasing. This will continue to benefit enterprise users of all types in the coming year.
About the Author
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.