This is the size of the ImageNet dataset, created by Fei-Fei Li, then an assistant professor at Princeton University. She hoped it would jolt the stagnant field of computer vision forward. It was a bold attempt: its 22,000 categories were at least two orders of magnitude more than any image dataset that had come before.
Her colleagues believed that better AI systems would come from algorithmic innovation, and they questioned her judgment. As she later recalled, “The more I discussed the idea of ImageNet with my colleagues, the more isolated I felt.”
Despite the skepticism, Fei-Fei and her small team – including PhD candidate Jia Deng and several undergraduate students earning $10 an hour – began labeling images from search engines. Progress was slow and painful. Jia Deng estimated that at their pace, completing ImageNet would take 18 years – time they didn’t have. It was then that a master’s student introduced Fei-Fei to Amazon’s Mechanical Turk, a marketplace for crowdsourcing “human intelligence tasks” from contributors worldwide. Fei-Fei immediately realized that this was exactly what they needed.
In 2009, three years after Fei-Fei started her most important project, ImageNet was finally ready with the help of a dispersed global workforce. She had done her part in advancing the common mission of computer vision.
Now it was up to researchers to develop algorithms that could use this massive dataset to see the world the way humans do. For the first two years, that didn’t happen: algorithms trained on ImageNet performed little better than the previous state of the art.
Fei-Fei began to wonder whether her colleagues had been right that ImageNet was a futile effort.
Then, in August 2012, just as Fei-Fei was giving up hope of her project inspiring the change she envisioned, Jia Deng excitedly called her with news about AlexNet. This new algorithm, trained on ImageNet, outperformed all previous computer vision algorithms in history. Created by three researchers from the University of Toronto, AlexNet used a nearly abandoned AI architecture called “neural networks” and exceeded Fei-Fei’s wildest expectations.
In that moment, she knew her efforts had borne fruit. “History was just made, and only a few people in the world knew it,” Fei-Fei writes in her memoir, The Worlds I See, where she shares the story behind ImageNet.
The combination of ImageNet and AlexNet is significant for several reasons.
First, it marked the reemergence of neural networks. Long considered a dead-end technology, they became the architecture behind more than a decade of exponential growth in AI.
Second, the three Toronto researchers (one of whom, Ilya Sutskever, you may have heard of) were among the first to use graphics processing units (GPUs) to train AI models, a practice that has since become the industry standard.
Third, the AI industry finally recognized Fei-Fei’s point from years ago: a key element of advanced AI is a large amount of data.
We have all read and heard adages like “data is the new oil” and “garbage in, garbage out” countless times. We would be tired of them by now if they didn’t capture something fundamentally true about our world. Over the years, AI has quietly become an ever larger part of our lives, influencing the tweets we read, the movies we watch, the prices we pay, and the credit we are deemed worthy of. All of it is driven by data collected by meticulously tracking every move we make in the digital world.
But over the past two years, since the then relatively unknown startup OpenAI released a chatbot called ChatGPT, AI has moved from behind the scenes to center stage. We are on the cusp of machine intelligence permeating every aspect of our lives, and as competition heats up over who will control that intelligence, so does the demand for the data that drives it.
That is the topic of this article. We discuss the scale and urgency of AI companies’ demand for data and the challenges they face in acquiring it. We explore how this insatiable demand threatens the internet we love and its billions of contributors. Finally, we introduce emerging startups that are using cryptocurrency to address these issues.
Before we dive in, a quick clarification: this article is written from the perspective of training large language models (LLMs), not all AI systems, so I often use “AI” and “LLMs” interchangeably. The usage is technically imprecise, but the data-related concepts and issues discussed here apply most directly to LLMs.
Data
The training of large language models is constrained by three main resources: computation, energy, and data. Companies, governments, and startups compete for all three, backed by enormous capital. Of the three, the race for computation is the most intense, as the rapid rise in NVIDIA’s stock price attests.
Training LLMs requires massive clusters of specialized GPUs, particularly NVIDIA’s A100 and H100 and the upcoming B100. These are not chips you can buy off the shelf from Amazon or your local computer store; they cost tens of thousands of dollars each, and NVIDIA decides how to allocate its supply among AI labs, startups, data centers, and other large customers.
In the 18 months following the release of ChatGPT, demand for GPUs far outstripped supply, with wait times of up to 11 months. As the initial frenzy settles, though, supply and demand are normalizing: startups are shutting down, algorithms and model architectures are improving, dedicated chips from other companies are emerging, and NVIDIA is ramping up production, all of which increases GPU availability and pushes prices down.
Secondly, energy. Running GPUs in data centers consumes a significant amount of energy. It is estimated that by 2030, data centers will consume 4.5% of global energy. As this surge in demand puts pressure on existing power grids, tech companies are exploring alternative energy solutions. Amazon recently purchased a data center powered by a nuclear power plant for $650 million. Microsoft has hired a nuclear technology lead. OpenAI’s Sam Altman has supported energy startups like Helion, Exowatt, and Oklo.
From the perspective of training AI models, energy and computation are commodities. Using the B100 instead of the H100 or using nuclear power instead of traditional sources may make the training process cheaper, faster, and more efficient – but it doesn’t affect the quality of the model. In other words, in the race to create the most intelligent and human-like AI models, energy and computation are fundamental elements, not differentiating factors.
The critical resource is data.
James Betker, a research engineer at OpenAI, claims to have trained “more generative models than anyone else has the right to train.” In a blog post, he argues that, given enough capacity and training time, virtually any model trained on the same dataset converges to the same point. What differentiates one AI model from another, in other words, is the dataset. Nothing else.
When we refer to a model as “ChatGPT,” “Claude,” “Mistral,” or “Llama,” we are not talking about the architecture, the GPUs used, or the energy consumed; we are referring to the dataset it was trained on.
How much data does it take to train a state-of-the-art generative model?
The answer: a lot.
GPT-4, still considered the best large language model more than a year after its release, was trained on an estimated 1.2 trillion tokens (or around 900 billion words). This data comes from publicly available sources on the internet, including Wikipedia, Reddit, Common Crawl (a free and open web crawl data repository), over a million hours of transcribed YouTube data, and code platforms like GitHub and Stack Overflow.
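As a rough illustration of how tokens relate to words (the roughly 0.75 words-per-token ratio implied by the figures above), here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer. The sample text is arbitrary, and the exact ratio varies with language and content.

```python
# Rough illustration of the token/word ratio mentioned above, using OpenAI's
# open-source `tiktoken` tokenizer. The sample text is arbitrary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

text = (
    "ImageNet was assembled from images gathered across the public web "
    "and labeled by a distributed workforce on Amazon Mechanical Turk."
)

tokens = enc.encode(text)
words = text.split()

print(f"words:  {len(words)}")
print(f"tokens: {len(tokens)}")
print(f"words per token: {len(words) / len(tokens):.2f}")  # typically around 0.75 for English
```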
If you think that’s a lot of data, hold on. There is a concept in generative AI known as the “Chinchilla scaling laws,” which says that, for a given computational budget, training a smaller model on more data is more effective than training a larger model on less data. If we extrapolate from the computational resources AI companies are allocating to the next generation of models, such as GPT-5 and Llama-4, we find that these models are expected to require five to six times the computational power and to be trained on up to one quadrillion tokens.
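To make the scaling arithmetic concrete, here is a back-of-the-envelope sketch using the approximations commonly cited alongside the Chinchilla paper (training compute C ≈ 6·N·D FLOPs, and compute-optimal data D ≈ 20·N tokens, where N is the parameter count). The budget figures in the loop are purely illustrative, not estimates of any lab’s actual plans.

```python
# Back-of-the-envelope Chinchilla arithmetic (illustrative, not a forecast).
# Commonly cited approximations:
#   training compute      C ≈ 6 * N * D   (FLOPs)
#   compute-optimal data  D ≈ 20 * N      (tokens per parameter)
# Combining the two: N ≈ sqrt(C / 120) and D ≈ 20 * N.
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that roughly balance a given compute budget."""
    params = math.sqrt(compute_flops / 120)
    tokens = 20 * params
    return params, tokens

for budget in (1e24, 5e25, 1e26):  # hypothetical training budgets in FLOPs
    n, d = chinchilla_optimal(budget)
    print(f"budget {budget:.0e} FLOPs -> ~{n:.1e} params, ~{d:.1e} tokens")
```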
Since most publicly available internet data has already been crawled, indexed, and used to train existing models, where will the additional data come from? This has become a frontier research question for AI companies, and there are two broad approaches. One is to use synthetic data generated directly by LLMs instead of human-generated data, though whether such data actually makes models more intelligent remains unproven.
The other is simply to find more high-quality data rather than synthesizing it. But acquiring additional data poses challenges of its own, and the problems AI companies face here threaten not only the training of future models but also the usefulness of existing ones.
The first data problem involves legal issues. Although AI companies claim to train models on “publicly available data,” much of it is copyrighted. For example, the Common Crawl dataset contains millions of articles from publications like The New York Times and Associated Press, as well as other copyrighted materials such as published books and lyrics.
Some publications and creators are taking legal action against AI companies, claiming infringement of their copyrights and intellectual property. The New York Times sued OpenAI and Microsoft for “unlawfully copying and using The New York Times’ unique and valuable works.” A group of programmers filed a class-action lawsuit questioning the legality of using open-source code to train GitHub Copilot, a popular AI programming assistant.
Comedian Sarah Silverman and writer Paul Tremblay have also filed lawsuits against AI companies for using their works without permission.
Others are embracing the age of change by collaborating with AI companies. The Associated Press, Financial Times, and Axel Springer have all signed content licensing agreements with OpenAI. Apple is exploring similar partnerships with news organizations like Condé Nast and NBC. Google agreed to pay $60 million annually for access to the Reddit API to train models, and Stack Overflow reached a similar agreement with OpenAI. Meta is reportedly considering direct acquisitions of publishers like Simon & Schuster.
These collaborations point to the second problem AI companies face: the closing of the open web.
Internet forums and social media platforms, from which a significant share of AI training data is derived, are increasingly restricting access and cracking down on scraping. Twitter, Facebook, and Reddit have all tightened their API policies, limiting how much data can be harvested and, with it, the training data available to AI companies.
In response to these challenges, startups are emerging with innovative solutions: using cryptocurrency to incentivize data contributors. These startups aim to create decentralized platforms where individuals can contribute their data and be rewarded with cryptocurrency tokens. The tokens can then be exchanged for various goods and services or sold on cryptocurrency exchanges.
These startups provide an alternative way for AI companies to acquire the large-scale, high-quality data they need while addressing concerns about data privacy, ownership, and legality. However, they also face their own set of challenges, such as building trust, ensuring data quality, and establishing partnerships with AI companies.
To recap: while computation and energy are important resources for training AI models, data is the critical one. The demand for data is insatiable, and AI companies must find additional high-quality data while navigating legal challenges and the closing of the open web. Emerging startups are exploring solutions built on cryptocurrency, but they face challenges of their own. As AI permeates more of our lives, the importance of data, and the ethical questions around how it is acquired and used, will only grow.
Establishing Partnerships with Data Providers
We already know at least one way to match buyers and sellers of specific products: internet marketplaces. eBay created a marketplace for collectibles, Upwork created one for labor, and countless platforms have built markets for other categories. It is no surprise, then, that marketplaces for specialized datasets are emerging too, some of them decentralized.
Bagel is building “universal infrastructure”: a set of tools that lets holders of “high-quality, diverse data” share it with AI companies in a trustworthy, privacy-preserving manner, using advanced cryptographic techniques such as zero-knowledge proofs (ZK) and fully homomorphic encryption (FHE).
Companies often possess data that has immense value but cannot be monetized due to privacy or competition concerns. For example, a research lab may have a large genomic dataset that they cannot share to protect patient privacy, or a consumer goods manufacturer may have data on supply chain optimization to reduce waste, which cannot be made public without revealing trade secrets. Bagel uses cryptographic advancements to make these datasets useful while mitigating the associated concerns.
Grass, the residential proxy network, can also help create specialized datasets. For example, if you want to fine-tune a model to provide expert cooking advice, you can ask Grass to gather data from subreddits like r/Cooking and r/AskCulinary. Similarly, the creator of a travel-focused model can ask Grass to scrape data from TripAdvisor forums.
While these may not be proprietary data sources, they can still serve as valuable supplements to other datasets. Grass also plans to use its network to create archival datasets that can be reused by any client.
Contextual Data
Imagine asking your favorite LLM, “What is your training cutoff?” and receiving an answer like “November 2023.” This means the base model only knows about information available before that date. Given the computational cost and time involved in training (or fine-tuning) these models, that makes sense.
To keep them up-to-date in real-time, you would have to train and deploy a new model every day, which is simply not feasible (at least for now).
However, an AI that lacks up-to-date information about the world is limited in many use cases. If I ask a personal digital assistant built on LLMs to summarize my unread emails or name the goal scorers in Liverpool’s most recent match, a model frozen at its cutoff date cannot help.
To work around this limitation and provide responses grounded in real-time information, application developers can query outside sources and insert the results into the base model’s “context window.” The context window is the input text an LLM can process when generating a response; measured in tokens, it represents the text the LLM can “see” at any given moment.
So, when I ask my digital assistant to summarize my unread emails, the application first queries my email provider to fetch the contents of all unread emails, inserts the response into the prompt sent to the LLM, and appends something like, “I have provided a list of all unread emails in Shlok’s inbox. Please summarize them.” With this new context, the LLM can then complete the task and provide a response. Imagine this process as copying and pasting an email into ChatGPT and asking it to generate a response, but it happens on the backend.
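As a minimal sketch of what that back-end step might look like, here is some illustrative Python. The functions fetch_unread_emails and llm_complete are hypothetical placeholders, not any particular provider’s API.

```python
# Minimal sketch of "stuffing the context window" with fresh data before a query.
# `fetch_unread_emails` and `llm_complete` are hypothetical placeholders standing in
# for an email provider's API and an LLM provider's completion endpoint.

def fetch_unread_emails() -> list[dict]:
    # A real assistant would call the email provider's API (IMAP, REST, etc.).
    return [
        {"from": "alice@example.com", "subject": "Q3 report", "body": "Draft attached, please review."},
        {"from": "bob@example.com", "subject": "Dinner?", "body": "Are you free on Friday?"},
    ]

def llm_complete(prompt: str) -> str:
    # Placeholder for a call to a hosted model; swap in your provider of choice.
    return f"[model response to a {len(prompt)}-character prompt]"

def summarize_unread() -> str:
    emails = fetch_unread_emails()
    context = "\n\n".join(
        f"From: {e['from']}\nSubject: {e['subject']}\n{e['body']}" for e in emails
    )
    prompt = (
        "I have provided a list of all unread emails in the user's inbox below.\n\n"
        f"{context}\n\n"
        "Please summarize them."
    )
    return llm_complete(prompt)

print(summarize_unread())
```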
To create applications with real-time responses, developers need access to real-time data. Grass nodes can fetch data from any website in real-time, providing this data to developers. For example, a news app based on LLMs can request Grass to scrape all popular articles from Google News every five minutes. When a user queries, “What was the magnitude of the earthquake that just hit New York City?” the news app retrieves relevant articles, adds them to the LLM’s context window, and shares the response with the user.
This is where Masa fits in today. At present, Alphabet, Meta, and X are among the only large platforms with continuously updated user data, because they own the user bases that generate it. Masa levels the playing field for smaller startups.
The technical term for this process is retrieval-augmented generation (RAG), and the RAG workflow is at the core of most modern LLM-based applications. It involves vectorizing text, that is, converting it into numerical arrays (embeddings) that computers can easily store, search, and manipulate.
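To make the retrieval step concrete, here is a stripped-down sketch. The embed function below is a toy bag-of-words stand-in for a real embedding model, and the sample documents are made up; production systems would use a hosted embedding API and a vector database rather than a Python list.

```python
# Stripped-down sketch of the retrieval step in a RAG pipeline.
# `embed` is a toy stand-in for a real embedding model; the documents are made-up snippets.
import math
from collections import Counter

def embed(text: str) -> dict[str, float]:
    # Toy "embedding": a bag-of-words frequency vector. Real embeddings are dense
    # float arrays produced by a neural network.
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "A 4.8 magnitude earthquake was felt across the New York City area this morning.",
    "Liverpool won 3-1, with goals from three different scorers.",
    "The central bank left interest rates unchanged this quarter.",
]
doc_vectors = [(doc, embed(doc)) for doc in documents]

query = "What was the magnitude of the earthquake that just hit New York City?"
query_vector = embed(query)

# Pick the most relevant document and place it in the model's context window.
best_doc, _ = max(doc_vectors, key=lambda pair: cosine(query_vector, pair[1]))
prompt = f"Context:\n{best_doc}\n\nQuestion: {query}\nAnswer using only the context above."
print(prompt)
```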
Grass plans to release physical hardware nodes in the future to provide customers with vectorized, low-latency real-time data, simplifying their RAG workflows.
Most builders in the industry expect context-level queries, that is, inference, to consume the majority of resources (energy, computing power, data) in the future. This makes sense: training a model is a time-bounded process with a fixed resource budget, whereas application-level usage has, in theory, unlimited demand.
Grass has already seen this happening, with most of their text data requests coming from customers looking for real-time data.
LLM context windows have expanded rapidly. When OpenAI first released GPT-4, its largest variant offered a context window of 32,000 tokens. Less than two years later, Google’s Gemini models have context windows of over one million tokens, the equivalent of more than eleven 300-page books. That is a significant amount of text.
These developments have made it possible to build much more than just access to real-time information using context windows. For example, someone could dump all of Taylor Swift’s lyrics or the entire archive of this newsletter into a context window and ask the LLM to generate a new piece of content in a similar style.
Unless explicitly programmed not to, the model will produce fairly decent outputs.
If you can sense where this is heading, bear with me. So far we have mainly discussed text models, but generative models have also become proficient in other modalities, such as sound, images, and video. I recently came across this very cool illustration of London by Orkhan Isayen on Twitter.
[Image]
Midjourney, a popular (and excellent) text-to-image tool, has a style-reference feature that can generate new images in the style of an existing image (it relies on a workflow similar, though not identical, to RAG). I uploaded Orkhan’s hand-drawn illustration and prompted it to change the city to New York. Here’s what I got:
[Images]
Four images that, if you browsed through the artist’s portfolio, could easily be mistaken for their work, all generated by AI from a single input image in about 30 seconds. I asked for “New York,” but the subject could have been anything. Similar replication is possible in other modalities, such as music.
Recalling our earlier discussion, it’s understandable why entities, including creators, have sued AI companies.
The internet used to be a boon for creators—it was their way to share their stories, art, music, and other forms of creative expression with the world, their way to find their 1,000 true fans. Now, the same global platforms have become the biggest threat to their livelihoods.
When a $30-a-month Midjourney subscription gets you something close enough to Orkhan’s style, why pay $500 for a commission?
Sounds dystopian, right?
The beauty of technology is that it almost always produces new ways of solving the problems it creates. Turn the seemingly dire situation for creators around and you find an unprecedented opportunity for them to monetize their talents at a scale never seen before.
Before AI, Orkhan’s ability to create art was limited by the number of hours they had in a day. With AI, they can now theoretically serve an unlimited customer base.
To see what I mean, look at elf.tech, the AI music platform from the musician Grimes. Elf Tech lets you upload a recording of a song and transforms it into Grimes’ voice and style. Any royalties the song earns are split 50-50 between Grimes and the creator. As a fan of Grimes, of her voice, her concerts, or her distribution, you can simply come up with an idea for a song and let the platform render it in her voice using AI.
If the song becomes a hit, both you and Grimes benefit. It also lets Grimes scale her creative output and passively leverage her distribution.
TRINITI, the technology behind elf.tech, is a tool created by CreateSafe. It extends the definition of digital content through creator-controlled smart contracts and reimagines distribution as blockchain-based, peer-to-peer, pay-per-access microtransactions, letting any streaming platform instantly verify and access digital content. Generative AI then executes instant micropayments on the terms specified by the creator and streams the experience to the consumer.
Balaji puts it succinctly:
[Image]
With every new medium, we scramble to figure out how humans will interact with it. Combined with the internet, new media become powerful engines of change: books fueled the Protestant Reformation; radio and television were integral to the Cold War. Media is often a double-edged sword, usable for good and for ill.
What we have today is a handful of centralized companies holding the majority of user data. In effect, we are trusting them to do the right thing for creativity, for our mental well-being, and for the betterment of society. That is too much power to place in the hands of a few companies whose inner workings we barely understand.
We are still in the early stages of the LLM revolution. Just as with Ethereum in 2016, we have little idea what applications will be built on top of these models. An LLM that can converse with my grandmother in Hindi? An intelligent agent that browses information streams and surfaces only high-quality data? A mechanism for independent contributors to share specific cultural nuances (such as slang)? There are countless possibilities we simply cannot see yet.
However, what is evident is that building these applications will be limited by one crucial factor: data.
Protocols like Grass, Masa, and Bagel are the infrastructure for sourcing that data fairly. What can be built on top of them is limited only by human imagination. To me, that is exciting.