Author: Sky City Master, Source: Web3 Sky City
Host:
The first time I met our guest, Sam Altman, was about 20 years ago, when he was working on a location-based mobile app called Loopt. We both received support from Sequoia Capital; in fact, we were both part of their first batch of scouts.
He invested in a relatively unknown fintech company called Stripe, while I invested in Uber. In that small experiment…
Sam:
You invested in Uber? I’ve never heard of it before.
Yes, I did.
You should write a book, Jacob.
The small experimental fund that Sam and I were part of as scouts had the highest return rate at Sequoia Capital; I heard that a few million dollars turned into over 200 million dollars. He also worked at Y Combinator, serving as its president from 2014 to 2019. In 2015, he co-founded OpenAI with the goal of ensuring that artificial general intelligence benefits all of humanity. In 2019, he left YC to join OpenAI full-time as CEO.
On November 30, 2022, things got very interesting: that was the day OpenAI launched ChatGPT. In January 2023, Microsoft invested $10 billion. In November 2023, over a crazy five-day period, Sam was fired by OpenAI, the whole team threatened to follow him to Microsoft, a bunch of heart emojis went viral on X/Twitter, and people started speculating that the team had achieved AGI and the end of the world was coming. Then, suddenly, a few days later, he was back as CEO of OpenAI.
According to reports, in February 2024, Sam was seeking to raise as much as $7 trillion for an AI chip project. There were also earlier reports that Sam was seeking $1 billion from Masayoshi Son and working with Jony Ive, the designer of the iPhone, to create an iPhone killer.
Meanwhile, ChatGPT was getting better and becoming a household name. It had a huge impact on how we work and complete tasks. It was reported to be the fastest product in history to reach 100 million users in just two months.
Look at OpenAI’s insane revenue growth: it was reported that their ARR reached $2 billion last year. Welcome to the All-In Podcast.
I have observed that the entire industry is eagerly awaiting the release of GPT-5. From what I understand, it may be launched sometime this summer, but the time frame is quite broad. Can you provide more details for us?
Sam:
We will take some time to release this major new model. I believe it will be great when we do. We will carefully consider how to do it. It may be released in a different way than our previous models. Also, I’m not even sure if we will call it GPT-5.
What I want to say is that since we released GPT-4, and especially in the past few months, many people have noticed its outstanding performance. I think it better reflects how this actually progresses: not discrete versions 1, 2, 3, 4, 5, 6, 7, but AI systems that people use while the whole system continuously improves. I believe that is both a better technical direction and easier for society to adapt to, and I believe that is the direction we are heading.
Host:
So does that mean we won’t have long training cycles and instead continuously retrain or train sub-models, Sam? Perhaps you can share with us some potential architectural changes for future large-scale models.
Sam:
One scenario you can imagine is that we are constantly training one model. It seems like a reasonable approach to me.
What we’re discussing here is the different ways of releasing it. Are you considering releasing it to paid users first, or slowing the release down to give the red team more time, given how high the stakes are now? After all, you have so many paid users, and everyone is watching your every move. You have to be more careful now, right?
Yes, currently GPT-4 is still only available to paid users, but what we really want to do is figure out how to make more advanced technology available to free users as well. I think that’s a very important part of our mission. Our goal is to build AI tools and make them widely accessible, free or at very low cost, so that people can use them to create the future, rather than having some magical AGI in the sky build the future and rain it down on us. That seems like a better and more inspiring path, and I firmly believe things are moving in that direction. So I’m sorry to say we haven’t yet figured out how to make GPT-4-level technology available to free users. That’s something we really want to do. I must admit, it’s very expensive.
Sam, I think two major factors that are often discussed are cost and latency, which somewhat limit the rate at which killer applications can emerge. The second factor is long-term capability in the open-source world versus the closed-source world. Part of the frenzy in this field is the energy of its open-source community. An incredible example was the crazy demo of Devin about five or six weeks ago, which was very impressive. Then some young people released OpenDevin, an open project under an MIT license, and it performed very well, almost on par with the closed-source projects. So perhaps we can start the discussion with this question: what are the business decisions behind keeping these models closed-source? And what do you think the next few years hold?
For the first part of your question: speed and cost are extremely important to us. I don’t want to give a timeline on when we can massively reduce latency, because research is hard, but I believe we can do it. We want to cut latency and cost dramatically, and I believe that will happen. We are still early in the science; we don’t fully understand how these models work, and we have all the engineering headroom on top of that. So I don’t know when we get to intelligence too cheap to meter, and so fast that it feels instantaneous to us and everyone else, but I believe we can get there for a pretty high level of intelligence. It matters to us and to users, and it will unlock a lot of things.
Regarding open source versus closed source, I think both have their place. We have open-sourced some things already, and we will open-source more in the future. But really, our mission is to build toward AGI and to figure out how to distribute its benefits widely. We have a strategy for that which seems to resonate with a lot of people. Obviously it doesn’t work for everyone, and it’s a big ecosystem; there will also be open-source models and people who build that way.
One area I personally am particularly interested in open-source is having a really good open-source model that can run on my phone. I think the world doesn’t have good enough technology yet to develop a good version. But it seems like a very important thing to do at some point.
Would you do that? Would you release it?
I don’t know if we would, or if anyone would.
What about Llama 3?
Llama 3 running on a phone?
I think maybe the 7-billion-parameter version.
Yes. I don’t know if it’s suitable for phones, but…
It should be suitable for phones.
But I don’t know, I’m not sure if it’s suitable, I haven’t played with it. I don’t know if it’s capable enough to do what I’m considering here.
So when Llama 3 was released, I think the biggest takeaway for many people was, wow, they’ve caught up to GPT-4. I don’t think it’s equal in all aspects, but overall it’s very, very close. And I think the question is: you released 4 not that long ago, you’re working on 5, or maybe more upgrades to 4. How do you stay ahead of open source? That’s usually a hard thing to do. What do you think?
Our goal is not to create the smartest set of weights we can possibly create, but to create a useful layer of intelligence for people to use. The models are just part of it. I believe we will stay ahead in this regard, and I hope we can stay far ahead of the rest of the world. But there is still a lot of other work to be done on the entire system, not just the model weights. We have to build lasting value in a traditional way like any other company. We have to come up with a great product and a reason to stick with it and deliver it at an affordable price.
When you founded this organization, part of the premise was that this was too important for any one company to own, so it needed to be open. Then it shifted to: it’s too dangerous for anyone to see, we need to lock it down. I’m curious whether that’s accurate, because a cynic would say it was a capitalist move. How did you go from "the world needs to see this, openness is really important" to "we should close it, only we can see it"? How did you reach that conclusion?
We released ChatGPT in part because we wanted the whole world to see this. We had been trying to tell people that artificial intelligence is really important. If you go back to October 2022, not many people believed AI was going to be that important or that it was really happening. A big part of what we try to do is put the technology in people’s hands. Now, again, there are different ways to do that, and I do think there is an important role for just putting it out there and saying, here it is.
But the fact is, we have a huge number of people using the free version of ChatGPT; we don’t run ads on it, and we’re not trying to make money from it. We put it out because we want people to have these tools. I think it has done a lot: it has provided a great deal of value, taught people how to fish, and really made the world think about what’s happening here. Now, we still don’t have all the answers. We’re figuring things out as we go, just like everyone else, and I expect we’ll change strategy multiple times as we learn new things.
When we started OpenAI, we really didn’t know how things would go; we didn’t know we would make language models, didn’t know we would make products. I distinctly remember the first day, when we thought, okay, now we’re all here, and that alone was hard to pull off, but now what? Maybe we should write some papers. Maybe we should stand around a whiteboard. We have just worked hard, step by step, figuring out the next step, and the next, and the next. I think we will keep doing that.
Let me confirm that I’m not misunderstanding you. Your point is that, open source or closed source, all of these models, whatever business decisions get made, will converge toward similar levels of accuracy. Not every company has enough capital, but suppose there are four or five that do, like Meta, Google, Microsoft, or maybe a startup, and they can all train on the open web. Then fairly soon the accuracy or value of these models may shift to proprietary training data sources, data that you might have access to while nobody else can get it. Do you think that leads to a data arms race, a competition to acquire proprietary data?
I don’t think so. When the models get smart enough, it shouldn’t be about acquiring more and more data, at least not for training. There may be other reasons data is valuable. But what I’ve learned from all of this is that it’s hard to confidently predict even the next few years, so I don’t want to try right now. What I will say is that I expect many highly capable models to exist in the world. It feels like we stumbled on a new fact of nature or of science, or a fact we can create, if you want to call it that. I don’t mean that literally; it’s more of a metaphysical point: intelligence is just an emergent property of matter, something like a rule of physics. So people will figure it out. But there will be all these different approaches to system design; people will make different choices and come up with new ideas. I’m sure, as in any other industry, there will be multiple approaches and different people will prefer different ones. Some people like iPhones, some people like Android. I think that effect will exist here too.
Let’s go back to the first part, cost and speed. Everyone is somewhat rate-limited by NVIDIA right now, and I think you, like most people, have already locked in however much capacity can be obtained, because that’s the maximum they can produce. What needs to happen on the chip side to really lower cost, speed up computation, and get more power? How are you helping the industry address these challenges?
We will surely make huge algorithmic progress, and I don’t want to underestimate that. I’m very interested in chips and energy, but if we can make a model of the same quality twice as efficient, that’s like doubling our compute. I think there’s a lot of room there, and I hope we actually see those gains. Beyond that, the whole supply chain is very complex: the capacity of the logic fabs, how much HBM the world can produce, how quickly you can get permits, pour the concrete, build the data centers, and then find people to wire it all up. And then there’s power, which is a huge bottleneck. But I think when something is this valuable to humanity, the world steps up, and we’ll work to make it happen faster. There is definitely some possibility, I can’t give you a specific probability, of a huge foundational breakthrough, a much more efficient way of doing the computation than what we have today. But I don’t like to rely on that, and I don’t want to spend too much time thinking about it.
In terms of devices, you mentioned models that can be deployed on phones. Obviously, whether it’s LLM or SLM, I believe you’re considering this. But will the device itself change? Does it need to be as expensive as an iPhone?
I’m very interested in that. I like new forms of computation, and every major technical advance seems to bring something new. The level of excellence in phones is incredible, so I think the bar is set very high there. Personally, I think the iPhone is the greatest technological product in human history, and it really is a fantastic product.
So what’s next?
I don’t know. It would be great if we could surpass it, but I think the bar is set very high.
I’ve been working with Jony Ive, and we’ve been discussing various ideas, but I’m not sure whether the new device needs to be something more capable or actually just cheaper and simpler. Almost everyone is willing to pay for a phone, and the barrier to carrying or using a second device is quite high. So, considering that we’re all willing to pay for phones, or most of us are, I don’t think cheaper is the answer.
So what’s the different answer then?
Could there be a specialized chip that runs on the phone and powers phone-sized AI models?
There might be, but it’s definitely something that phone manufacturers would do, and it wouldn’t require a new device. I think you have to find some really different interaction paradigm that the technology enables. If I knew what it was, I would be happy to start researching it right now.
Now, you can already use voice in the app. In fact, I’ve set a button on my phone to go directly to ChatGPT’s voice mode, and I use it with my kids; they love talking to it. It has latency issues, but it’s really great.
We’ll make it better. I think voice is an indication of what’s to come. Like, if you can make voice interaction really good, it feels like a different way of using a computer.
However, there are the problems we’ve all run into, like, why didn’t it respond? And it feels like a CB radio, back and forth, over and over. It’s really frustrating to use, but when it gives you the right answer, it’s amazing.
We’re working on addressing that issue. Right now, it’s clunky, slow, and feels less smooth, authentic, or natural. We’ll make it all better.
What about computer vision? People could wear devices, so you could combine visual or video data with voice.
Yes, AI being able to understand everything happening around you is incredibly powerful. You can ask ChatGPT things like, "What am I looking at?" or "What plant is this?" It’s clearly another hint of what’s to come.
However, whether people choose to wear glasses or use some kind of device when needed, it will raise a lot of social and interpersonal issues. Wearing computer devices can become very complicated.
We saw this with Google Glass. People wearing them got hassled; I forget the specific incidents.
If AI is everywhere, like on people’s phones, what applications can it unlock? Do you have that feeling? What do you want to see?
What I want is a device that is always on, with very low friction, that I can interact with through voice or text, or ideally through other means. It just needs to know what I want and be a constant presence helping me throughout the day. It has as much background information as possible, like the greatest assistant in the world. It’s this presence that makes me better and better.
When you hear people talk about the future of AI, you hear it imagined in two different ways. They can sound similar, but I think the difference matters a lot for how we actually design the systems. One way is something that is an extension of yourself, a kind of ghost or alter ego, something that really is you, acts on your behalf, replies to your emails without even telling you, and becomes more and more like you. The other way is an excellent senior employee: it might know me very well, I might delegate tasks to it, it can have access to my email and I’ll tell it the limits, but I think of it as a separate entity. Personally, I prefer the separate-entity approach, and I think that’s where we’re going.
So in that sense, it’s not you; it’s an always-available, always-great, super-capable executive assistant.
To some extent, it’s like having an agent that works on your behalf, understands what you want, predicts what you want, that’s how I interpret what you’re saying.
I think there will be agent-like behavior, but there’s a distinction between a senior employee and an agent.
And one thing I like about a senior employee is that they push back on me. Sometimes they won’t do what I ask, or sometimes they’ll say, "I can do that if you want, but if I do, here’s what I think will happen, and then this, and then that."
Are you sure about that?
I absolutely want that kind of dynamic, not just giving it a task that it blindly executes. It can reason and push back, and the relationship it has with me is more like the relationship I’d expect with a truly capable person, as opposed to a sycophant.
Indeed, if we have reasoning tools like a Jarvis, it could have a big impact on how the valuable product interfaces we use today are designed. Take Instacart, Uber, and DoorDash, for example. Do these services end up, in effect, providing a set of APIs to a multitude of ubiquitous intelligent agents working on behalf of the 8 billion people in the world? How should we change our understanding of how applications work, and the whole experience infrastructure, to adapt to a new world where you interact with the world through agents?
I’m personally very interested in designing a world that is usable by both humans and AI. I like its interpretability, the smoothness of handoff, the ability to provide feedback. For example, DoorDash could expose some APIs to my future AI assistant, allowing it to place orders, etc. I can just say with my phone, “Okay, AI assistant, please place this order on DoorDash.” I can see the app open, see things being clicked, I can say, “Hey, not that one.” Designing a world that can be used by both humans and AI, I think it’s an interesting concept.
For the same reason, I’m more interested in humanoid robots than other shapes of robots. This world is designed for humans, and I think we should keep it that way. Shared interfaces are a good idea.
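To make the hand-off Sam describes concrete, here is a minimal sketch of how an app could expose an action as a tool an assistant can call while the user watches and intervenes. The function name, its parameters, and the idea that DoorDash offers such an ordering endpoint are all hypothetical assumptions for illustration; only the tool-calling interface of the openai Python package (v1+) is real.

```python
# A hedged sketch: an app exposes an ordering action as a tool the assistant can
# call on the user's behalf. The "place_doordash_order" function and its schema
# are made up; the OpenAI tool-calling API itself exists as used here.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "place_doordash_order",  # hypothetical endpoint
        "description": "Place a food delivery order on the user's behalf.",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant": {"type": "string"},
                "items": {"type": "array", "items": {"type": "string"}},
                "deliver_at": {"type": "string", "description": "Requested delivery time"},
            },
            "required": ["restaurant", "items"],
        },
    },
}]

messages = [{"role": "user", "content": "Order my usual sushi for 7pm tonight."}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

tool_calls = response.choices[0].message.tool_calls or []
for call in tool_calls:
    # Show the pending order on screen before submitting it, so the user can say
    # "hey, not that one" and correct it, as in the DoorDash example above.
    print(call.function.name, json.loads(call.function.arguments))
```

The point of the sketch is the shared interface: the assistant proposes the action through a structured call, and the human sees and approves the same thing before it runs.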
So, you’ll see patterns like voice and chat replacing applications. You just tell it you want sushi, and it knows what sushi you like, what you don’t like, and it will do its best to fulfill your request.
It’s hard for me to imagine that we’re entering a completely different world where you say, “Hey, ChatGPT, order me some sushi.” And it responds, “Okay, do you want to order from this restaurant? What kind, what time, anything specific?” I think visual user interfaces are very good for many things. It’s hard for me to imagine a world where you never look at a screen and only use voice mode. But I can imagine many things being like that.
Apple tried this with Siri. Supposedly you can order an Uber automatically with Siri, but I don’t think anyone actually does it, because why take the risk of not doing it on your phone where you can see it? Like you said, the quality isn’t good enough yet. But when the quality is good enough, you’ll actually prefer it because it’s lighter weight. You don’t have to take your phone out, find the app, tap it, oh, it logged you out, wait, log in again. It’s just so painful.
It’s like setting a timer with Siri: I do it every time because it’s really convenient and I don’t need any more information. But when it comes to ordering an Uber, I want to see the prices of a few different options, maybe check the ratings, look at the map, even see where the cars are because I might decide to walk, and then pick the one I want. By looking at the Uber order screen I can take in more information in less time than if I had to get it all through an audio channel. I do like the idea you brought up of watching it happen; that’s really cool. I think we’ll use different interfaces for different tasks, and I believe that trend will continue.
Among all the developers building applications and experiences on OpenAI, are there any that have impressed you? Do you think this is a very interesting direction, even if it’s just a toy application? But have you pointed out and said that this is really important?
This morning I met with a new company, actually it can’t even be called a company yet; it’s two people planning a summer project, trying to finally build the AI tutor. I’ve always been interested in this area, and many people have built great things on our platform. But if someone can deliver it in the way they described, using a phrase I liked, a Montessori-level reimagining of how people learn, if you can find this new way for people to explore and learn on their own, I personally would be very excited about that.
You mentioned Devin and coding-related things earlier, and I think that’s a really cool vision of the future. I think healthcare should also change quite a bit because of this. But what I’m personally most excited about is faster and better scientific discovery. GPT-4 obviously hasn’t played a big role there yet, but it may speed things up by making scientists more productive.
Sam… that would be a triumph. But the training and building of those models is different from language models, right? Obviously there are a lot of similarities for some of them, but many are built with from-scratch architectures applied to specific problem sets, specific applications, like modeling chemical interactions. You would certainly need some of those.
I think what we’re generally missing, for many of the things we’re discussing, is models that can reason. Once you have that reasoning capability, you can connect it to chemistry simulators or whatever else.
Yes, that’s an important thing I wanted to discuss today, this concept of networks of models. People often talk about agents as if there’s a linear chain of function calls, but one thing that emerges in biology is networks of systems with cross-interactions, where the aggregate of the network produces the output, rather than one thing calling another, and that thing calling the next. Do you see an architecture emerging of specialized models, or networks of models, that collectively solve bigger problem sets using reasoning, with computational models that do things like chemistry or arithmetic and other models doing other things, rather than one purely generalized model to rule them all?
I’m not sure how much reasoning will turn out to be broadly generalizable. I suspect it will, but that’s more of an intuition and a hope than anything proven; it would be great if it works out that way. I don’t know for sure…
Let’s take protein modeling as an example. We have a lot of training data, protein images, and sequence data, and based on that, we build a predictive model, and we have a process and steps to achieve that goal.
Can you imagine a general AI, a really strong reasoning model, that could figure out how to build the necessary sub-models, acquire the data it needs, and then solve the problem?
There are many ways that could work. Maybe it trains a literal sub-model for the task, or maybe the one big model just knows what additional training data it needs, asks for it, and then updates on it.
I think the real question is, are all these startup companies going to fail? Because many startup companies work in this pattern of acquiring specialized data and then training new models from scratch with that specialized data. And it just does that one thing. And it works really well at that one thing. It’s better than anything else.
I think you’re starting to see a version of that.
When you talk about biology and these complex networked systems, the reason I can relate is that I’ve been really sick recently. I’m much better now. But it was like watching your body get defeated one system at a time; you can really see the cascade. It reminds me of what you were saying about biology: you don’t really appreciate how much these systems interact until things start to go wrong. It’s kind of interesting.
But I was using ChatGPT to try to figure out what was going on. It would say, I’m not sure about this. Then I’d paste a research paper into the context, without even reading the paper myself, and it would say, oh, that thing I wasn’t sure about, I’ve changed my mind now. So it’s a little version of what you were describing: the model can say, I don’t know this, and you can add more information into the context without having to retrain the model.
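In API terms, the pattern Jason describes, adding a document to the model’s context instead of retraining, looks roughly like the following sketch. The file name, model choice, and prompt wording are illustrative assumptions, not details from the conversation.

```python
# A minimal sketch of the "add it to the context, don't retrain" pattern.
# Assumes the official openai Python package (v1+) and OPENAI_API_KEY in the
# environment; the paper file and model name are placeholders.
from openai import OpenAI

client = OpenAI()

paper_text = open("recent_study.txt").read()  # hypothetical paper pasted in

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a careful research assistant."},
        {
            "role": "user",
            "content": (
                "Earlier you said you weren't sure about this condition. "
                "Here is a recent paper; update your answer if it changes your view:\n\n"
                + paper_text
            ),
        },
    ],
)
print(response.choices[0].message.content)
```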
So these models that predict protein structure, for example, that’s the foundation, and now AlphaFold 3 handles other molecules too. Could that basically be done by a best-in-class general reasoning model that goes out, acquires the training data it needs, and then solves the problem on its own?
Maybe Sora is an example you can speak to. Your video model generates stunning moving images, moving video. What’s different about the architecture there, whatever you’re willing to share, and how does it stand apart?
Yes. On the general point first: obviously you need specialized simulators, connectors, pieces of data, and so on. But my intuition, again with no scientific basis, just intuition, is that if we can figure out the core of generalized reasoning and connect it to new problem domains, in the same way that humans are general reasoners, that would be a faster unlock. But yes, Sora doesn’t start from a language model; it’s a model built specifically for video, so we’re obviously not in that world yet.
Right, so today, to build an excellent video model, you start from scratch with, I assume, a different architecture and different data. But in the future, a general reasoning system, an AGI, whatever it is, could in theory get there by figuring out how to do it itself.
Yes, one example is that, as far as I know, all the best text models in the world are still autoregressive models, while the best image and video models are diffusion models. That’s somewhat strange, in a way.
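For readers who want the distinction spelled out (a standard textbook gloss, not something said in the conversation): an autoregressive model factors the probability of a sequence and generates it one token at a time,

p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}),

whereas a diffusion model is trained to reverse a gradual noising process: a denoiser \epsilon_\theta(x_t, t) learns to predict the noise added to the data, and generation starts from pure noise and removes it step by step.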
So there’s a lot of controversy about training data. I think you’ve been among the most thoughtful on this; you’ve already signed licensing deals with the FT and others. We have to be a little careful here because you’re in a lawsuit with The New York Times, and I guess you weren’t able to reach an agreement with them on training data.
How do you think about fairness and fair use? We’ve had heated debates about it on this podcast. Obviously your actions show you’re trying to get to fairness through licensing agreements. So what’s your personal position on the rights of artists who create beautiful music, lyrics, books, work that gets used to create derivative works that are then monetized? What is fair? And how do we let artists create content and then decide how they want others to be able to use it?
I’m just curious about your personal beliefs because I know you’re a thoughtful person in this regard. I know many others in our industry haven’t thought deeply about the views of content creators. So I think there will be a great variety of opinions.
On fair use, I think we have a very reasonable position under current law. But AI is so different that, for things like art, we’ll need to think about it differently.
Let’s say the system reads a bunch of math on the internet and learns how to do math: that seems unobjectionable to most people. Then there’s another group of people who might see even that differently. To keep this answer from getting too long, I won’t go through every case.
So on one end of the spectrum there’s general human knowledge: if the model goes and learns the Pythagorean theorem, that’s open domain. The other extreme is art, and even more specifically, a system that generates art in the style or likeness of another artist; that’s probably the most extreme case. And then there are many, many cases in between.
I think, historically, the debate has focused on training data, but as training data becomes less valuable, the discussion will increasingly shift to what happens at inference time: what the system does when it accesses information in real time in its context, or takes similar actions, and how the new economic models around that should work.
So if you say, for example, "write me a song in the style of Taylor Swift," even if the model was never trained on any Taylor Swift songs, you still have the problem that the model has probably read articles about Taylor Swift and knows her themes, what Taylor Swift means. So the question is: should the model be allowed to do that even so? And if it is, how should Taylor Swift be compensated?
I think there should be an opt-in for that, first of all: you can choose to be part of it or not. And then there needs to be an economic model. Taking music as an example, there are interesting historical analogies to look at, like sampling and how the economics around sampling work. It’s not the same thing, but it’s an interesting place to start.
Sam, let me challenge you on this.
What’s different in the example you gave? The model learns structure, rhythm, melody, harmony, the relationships that make music successful, and then builds new music from that training. Humans who have listened to lots of music are doing the same thing: their brains build the same kinds of predictive models, the same discoveries or understandings. What’s the difference here? Why would you say the artist should get unique compensation? It’s not a sampling situation; the AI isn’t outputting, and the model isn’t storing, the actual original songs.
Yes, learning the structure.
Right, and I wasn’t trying to make that the point, because I agree: humans are inspired by other humans. I’m saying that if you ask for a song in the style of Taylor Swift, even if the model was never trained on any Taylor Swift songs, we still face the question of whether it should be allowed to do that, and if so, how Taylor Swift should be compensated.
I think, first of all, there should be a choice: the artist can opt in or not. And then there needs to be an economic model. With art, there are interesting things to look at historically, like sampling and how the economics around sampling work. It’s not the same thing, but it’s an interesting place to start.