Network Theory 2.0 - Part 2: Data Ancestry
Second part of an article series on Network Theory 2.0 and the potential future of the Digital Data Universe.
We live in a processed world.
Nearly everything we wear, touch, and eat has at some point gone through a human-designed process or value chain, whether that is production, logistics, or cultivation.
The same also applies to our Data.
Although we have limited knowledge of where all the things around us come from, we generally have some idea of how they were made. Beyond the “Made in Country X” label, we all (kind of) know how apples are grown and where milk comes from. But with Data it is a completely different ballgame.
We know disturbingly little about the data we consume.
We trust that the person behind an account or an email is who they say they are. We trust that the publication date on a newsletter is correct.
But we simply do not know for sure.
And when we put data into a network with high Data Gravity (Part 1 of this series on Network Effects 2.0), it becomes even more complex.
Knowing where our data comes from - its roots, how it was produced, and by whom - is becoming critical not only to keeping our digital systems secure, but also to how we live our digital lives.
In this article, we will take a deep dive into three aspects of Data Ancestry:
Why AI is “muddying the water” and challenging trust in our data
“Data Ancestry” in the modern data stack and how it can build trust
How new business models may be built upon Data Ancestry
Let’s dive in!
Muddy AI Waters - Hysteresis
The other day I discussed the future of “organic” vs. “synthetic” data with a presenter from Microsoft (Maxim Salnikov - definitely worth following). Although it is hard to forecast the “market share” synthetic data will take, the conversation raised an important issue:
Artificial Intelligence models train on large training sets.
And the AI models used to generate content for the Internet (blog posts, copywriting, consumer content) have been trained on - well - the Internet.
Thus, if recent trends continue, AI models will inevitably end up training on their own former outputs, generating all sorts of interesting and problematic feedback loops.
What we will see is Data Hysteresis - the dependence of data on its own history.
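To make that feedback loop concrete, here is a toy model (my own illustrative assumption, not a forecast): suppose that at every training generation a fixed share of the corpus is replaced by model-generated content, and that the model was trained on the previous generation’s corpus. Content that has never passed through an AI then shrinks geometrically.

```python
# Toy model of Data Hysteresis. Assumptions (illustrative only): at each
# training generation, a fraction `synthetic_mix` of the corpus is replaced
# by model output, and that model was trained on the previous corpus.

def ai_touched_share(generations: int, synthetic_mix: float = 0.2) -> float:
    """Share of the corpus whose lineage includes at least one AI pass."""
    never_touched = (1.0 - synthetic_mix) ** generations
    return 1.0 - never_touched

for gen in range(0, 11, 2):
    print(f"generation {gen:2d}: {ai_touched_share(gen):.0%} of the corpus is AI-influenced")
```

With a 20% mix per generation, roughly nine out of ten data points carry AI somewhere in their ancestry after ten generations - which is exactly why the questions below matter.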
How will you know which data came from where?
What data is “real” and what has been “processed” by an AI?
Well, it sounds like we will first need to know the data’s ancestry:
Where does our Data come from?
Data Ancestry in the Modern Data Stack
The “modern data stack”, whether you use that term to refer to the Internet or to your own company’s enterprise architecture, is built on data flows.
You know them as APIs, data pipelines, or other data transfer processes.
But they all have a common problem - Data Explainability.
Understanding where your data comes from, how it was generated, and thus how you should be using it - all of this is a major challenge today.
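One possible direction, sketched below under my own assumptions (the field names source, produced_by, parents, and transformations are illustrative, not an existing standard or product), is to carry ancestry metadata with every record as it moves through a pipeline:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class Record:
    payload: dict
    source: str                      # originating system or author
    produced_by: str = "human"       # e.g. "human" or "ai:<model-name>"
    parents: list = field(default_factory=list)          # fingerprints of upstream records
    transformations: list = field(default_factory=list)  # audit trail of processing steps

    def fingerprint(self) -> str:
        """Content hash that downstream records can use to reference this one."""
        blob = json.dumps(self.payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

def transform(record: Record, step: str, new_payload: dict, actor: str) -> Record:
    """Derive a new record while preserving the ancestry of the old one."""
    return Record(
        payload=new_payload,
        source=record.source,
        produced_by=actor,
        parents=record.parents + [record.fingerprint()],
        transformations=record.transformations
        + [f"{step} @ {datetime.now(timezone.utc).isoformat()}"],
    )

raw = Record(payload={"text": "quarterly report draft"}, source="crm-export")
summary = transform(raw, step="summarize", new_payload={"text": "short summary"}, actor="ai:gpt-4")
print(summary.produced_by, summary.parents)  # the AI step is visible in the record's ancestry
```

Nothing in this sketch stops a malicious party from simply lying in produced_by, which is why the cryptographic angle discussed later in this article matters.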
In fact, one of the major threats in CyberSecurity is the undetected manipulation of historical data. If someone hacks your historical records, be it in finance, operations, or HR, and you make decisions based on that data - you are digitally exposed to manipulation.
This is why it is so important to protect even the seemingly most non-critical data sets from intrusion.
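A simple, well-established building block for this is a hash chain: each entry’s hash covers the previous entry’s hash, so silently rewriting history breaks every subsequent link. The sketch below is illustrative only; a real deployment would add signatures, key management, and secure storage of the chain head.

```python
import hashlib
import json

def chain(records: list) -> list:
    """Running hash chain over an ordered list of records."""
    prev, hashes = "", []
    for rec in records:
        blob = json.dumps(rec, sort_keys=True) + prev   # each hash covers the previous one
        prev = hashlib.sha256(blob.encode()).hexdigest()
        hashes.append(prev)
    return hashes

ledger = [{"q": "2022-Q4", "revenue": 1_200_000}, {"q": "2023-Q1", "revenue": 1_350_000}]
original = chain(ledger)

ledger[0]["revenue"] = 900_000     # an attacker quietly rewrites history
assert chain(ledger) != original   # every later hash changes, so the edit is detectable
```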
Protection is only part of the picture, however. Being able to trace your data and to identify which data was generated from which source, whether it was “organic” data created by humans or “synthetic” data generated by artificial intelligence, will be a business-critical capability.
The Business of Data Ancestry
We have yet to see major companies come out of stealth to solve Data Ancestry.
True, there are a lot of companies providing data lineage within a data stack, but these services fail once the data leaves one system and is transferred to another.
Perhaps the winners in this space are still in “stealth mode”, but unless there is a (cryptographic?) proof of origin backing a data set, it is hard today to trace the full “blood-line” / ancestry of the data we use every day - especially as artificial intelligence is increasingly deployed across industries and within systems.
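As a thought experiment, here is what such a proof of origin could look like: the producer signs a hash of the data set, and any downstream consumer can verify the signature against the producer’s public key, even after the data has left the originating system. This is a sketch using the third-party `cryptography` package, not a description of how any vendor actually does it.

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

producer_key = Ed25519PrivateKey.generate()   # held by the data producer

dataset = b"2023-01-01,sensor-7,21.4\n2023-01-02,sensor-7,21.9\n"   # made-up sample rows
digest = hashlib.sha256(dataset).digest()
proof_of_origin = producer_key.sign(digest)   # travels alongside the data set

# Any consumer holding the producer's public key can check the ancestry claim.
try:
    producer_key.public_key().verify(proof_of_origin, digest)
    print("origin verified: unchanged data, signed by the claimed producer")
except InvalidSignature:
    print("origin check failed: data was altered or signed by someone else")
```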
This is where I think a lot of novel and innovative business models will be created - being able to distinguish the origin of a piece of data and put it into context.
We have already seen companies that can detect whether a piece of text was generated by ChatGPT or an image was created by a diffusion model, but this will be a constant arms race.
The business of Data Ancestry will be a race of Humans vs. Machines, Humans vs. Humans, and Machines vs. Machines.
The systems that can leverage Network Effects 2.0 to both attract data and, importantly, explain it will be the winning systems of the Future.
Data is the foundation of knowledge, and companies that can extract knowledge from ever-expanding sets of data are the ones that make me excited to wake up and get to work every day.
If we can better understand where our Data comes from, we can better understand where it can take us next.
About the Author: Although trained as a robotics engineer, I’ve spent nearly a decade as a consultant working on strategy development for global corporations.
Now, I spend my time strategizing on the Future of Tech and how new technologies will impact sectors and business models across industries, advising both corporates and startups on digital strategy development.
LinkedIn, Twitter, Website