While the year 2022 has been bumpy for us all, chances are it will be remembered as the year when AI truly started delivering on its promises. After decades and billions of dollars spent on AI (and particularly on Autonomous Driving in recent years) and, truth be told, some serious disenchantment, research in AI finally seems to be back on track, regaining credibility among both investors and the general public in the somewhat unexpected form of Generative AI.
Yet when a Google engineer proclaimed in June that Google’s intelligent chatbot had become sentient, many received the news with amusement. It is hard to believe that just a few short months later, a real revolution took the industry by storm: from Midjourney in July to Stable Diffusion and ChatGPT in the months since, it looks like many professionals are ready to outsource all their content generation tasks to AI.
The advent of Generative AI is unquestionably a huge milestone in the history of AI, which until now had been marked by a series of AI Winters. Naturally, this begs the question: why is this happening now?
The field of Machine Learning owes its recently accelerated adoption in the Enterprise to spectacular progress in computer hardware. The simultaneous popularization of GPUs and the emergence of Cloud technologies at the turn of the 2010s finally made ML accessible to a wider range of companies, including SMBs. However, even though organizations suddenly had the ability to incorporate ML into their strategies, that did not mean Machine Learning was about to be infused into all digital products just yet. It would take years for the leaders of tech companies to figure out how to fit ML into their business strategies.
That said, there is one thing that all tech leaders instantly understood: owning strategic data would be critical to their ability to compete in the market. Orders were given to collect and store as much data as possible until they could figure out how to leverage it in practice. Data became a top company asset. In fact, in some larger organizations, data was protected so fiercely that it was not even freely shared across departments. Data became the new oil.
As this was happening behind closed corporate doors, another story was unfolding in plain view: the internet became practically as mainstream as electricity, and people worldwide started shamelessly sharing their stories, pictures and videos all over social media. That sharing frenzy didn’t stop there; it extended to professional circles as well. Platforms like Quora, Wikipedia and Reddit took off and started gathering knowledge on all kinds of topics. Eventually, programmers got into it and took a liking to sharing their knowledge and expertise out in the open, and open-source knowledge bases and open-source software flourished. In 2008, both GitHub and Stack Overflow saw the light of day, allowing developers to find answers to their trickiest questions in just a couple of clicks. This also meant that large quantities of invaluable code were available in the open for anybody to scrape and use.
Unsurprisingly, in 2021, just a few months after OpenAI unveiled DALL·E, GitHub (owned by Microsoft) pulled the trigger and released a marvel of ML technology: Copilot, an AI pair-programming application that can automatically generate code. This is when things got complicated: Microsoft was sued by programmers who claimed they deserved compensation for providing the training data behind Copilot. The matter still hasn’t been settled to this day.
You get the idea: data is valuable, and organizations know this, even if they don’t yet know exactly how to extract that value. At first, the data collected by an organization would remain an internal asset for that same organization to leverage. But Generative AI might be signaling a very important shift: the most valuable data for an organization might be data that the organization did not collect itself. ChatGPT was trained on data scraped from the internet (such as articles and textbooks) that was publicly available to all. OpenAI was not the exclusive owner of that data, yet it built the most popular and powerful AI product in history. This brings us to an important question: should the creators of a dataset really keep the data to themselves?
It seems that the answer to this question is clearly no, especially if the ML community is to keep progressing at the same pace. It also seems clear that the generators of the data consider themselves the owners of a very valuable asset, and expect financial compensation before allowing others to use that data for their own benefit. The signs are clear: we are witnessing the early days of a new kind of market, the global data market, and it will create new opportunities for all involved.
Very soon, we will see satellite companies sell sky-view imagery to Agritech companies or Governmental Agencies so they can train models with data they wouldn’t otherwise have access to. And we will most likely see individuals demanding a share of the money that social media companies make from selling their personal data. Data will become more than oil: it will become a tradable currency, and we will all become data traders. Once these data markets become the norm, there will be a need for valuation processes and for ways to ensure the validity of the data. As it turns out, this might be a fitting application for Blockchain, though Blockchain may not be strictly necessary here.
How exactly those open data markets will be organized and regulated, who will control them, and whether there will be many of them or one large centralized market, are all very important questions that remain to be answered. What is for sure is that such markets will require systems and practices to streamline the management of data preparation and data quality (a growing field known as DataPrepOps), to ensure the trustworthiness of the data and track its origin (a nascent area of DataOps called “Data Lineage” covers exactly that), as well as significantly more advanced international laws to guarantee that the privacy of individuals is protected while ensuring that experts worldwide can still build the data products of the future.