DeepSeek R1: Understanding the Latest AI Breakthrough and Its Global Impact


Published January 27, 2025
In a rapidly evolving AI landscape, DeepSeek R1 has emerged as a significant development that's capturing attention from tech leaders, government officials, and AI enthusiasts worldwide. This comprehensive analysis, featuring insights from AI engineer LDJ, explores the technical capabilities, market impact, and future implications of this groundbreaking model.

The Rise of DeepSeek R1

DeepSeek R1's recent release has created waves in the AI community, quickly becoming the most-discussed AI model and even surpassing ChatGPT in search popularity. What makes this particularly interesting is not just its performance, but its approach to AI reasoning and the implications for the broader AI industry.

Technical Capabilities and Innovation

One of the most fascinating aspects of DeepSeek R1 is its approach to reasoning. Unlike previous assumptions about complex systems of prompts and parallel processing, R1 achieves its capabilities through a single stream of extended output (see the sketch following the list below). The model demonstrates an impressive ability to:
  • Generate lengthy, coherent responses
  • Perform complex reasoning tasks
  • Handle coding challenges effectively
  • Process and analyze information systematically
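
To make that single-stream behavior concrete, here is a minimal sketch of querying R1 through an OpenAI-compatible chat endpoint. The base URL and model name are assumptions drawn from DeepSeek's public API conventions; check the current documentation before relying on them.

```python
# Minimal sketch: eliciting R1's single-stream reasoning via an
# OpenAI-compatible chat API. The base_url and model name below are
# assumptions; confirm the exact values in your provider's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed model identifier for R1
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

# The entire chain of thought arrives as one long completion: no tool
# calls, no parallel sampling, just a single extended stream of tokens.
print(response.choices[0].message.content)
```

Note that nothing in the request asks for tools or multi-step orchestration; the reasoning depth comes entirely from how the model was trained.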

The Economics of AI Development

A crucial aspect of DeepSeek R1's impact is its reported training cost of approximately $5 million, significantly lower than previous estimates for comparable models; a back-of-envelope version of this calculation follows the list below. However, this needs context:
  • Current top models such as GPT-4o are estimated to have cost less than $50 million to train
  • Training efficiency improvements follow a predictable rate of 2-6x per year
  • Future models (predicted for 2025-2026) may require investments of $3-5 billion
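
As a rough illustration of how the ~$5 million figure can be sanity-checked, here is a back-of-envelope calculator using the standard ~6 · N · D FLOPs approximation (N active parameters, D training tokens). The GPU throughput, utilization, and hourly price below are illustrative assumptions, not reported figures.

```python
def training_cost_usd(active_params: float, tokens: float,
                      peak_flops: float = 1e15,   # assumed ~1 PFLOP/s low-precision peak per GPU
                      utilization: float = 0.33,  # assumed fraction of peak actually achieved
                      usd_per_gpu_hour: float = 2.0) -> float:
    """Estimate training cost from the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6 * active_params * tokens
    gpu_hours = total_flops / (peak_flops * utilization) / 3600
    return gpu_hours * usd_per_gpu_hour

# DeepSeek V3's reported scale: ~37B active parameters, ~15T training tokens.
print(f"${training_cost_usd(37e9, 15e12):,.0f}")  # lands in the single-digit millions
```

With these assumptions the estimate lands near the reported ~$5 million; pricier GPU-hours or lower utilization push it toward the $11 million end of the range discussed in the transcript below.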

Understanding AI Model Evolution

The development of AI models follows a fascinating pattern of exponential growth (an effective-compute sketch follows this list):
  • GPT-2 to GPT-3: ~100x increase in training compute
  • GPT-3 to GPT-4: Another ~100x jump
  • Predicted next generation: Similar scale of advancement
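
One way to reason about these jumps, suggested by the discussion in the transcript below, is "effective compute": multiplying the raw compute scale-up by the assumed algorithmic-efficiency gain over the same period. The multipliers here are illustrative assumptions.

```python
raw_compute_multiplier = 100   # a GPT-3 -> GPT-4 style jump in training compute
efficiency_multiplier = 10     # assumed algorithmic gains accumulated over the same span

# A 10x efficiency gain on top of a 100x scale-up behaves roughly like
# a 1,000x naive scale-up in terms of resulting capability.
effective_compute = raw_compute_multiplier * efficiency_multiplier
print(f"~{effective_compute:,}x effective compute")
```

This framing explains why a $500 million run with strong efficiency gains can land closer to a generational leap than its raw cost alone suggests.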

The Global AI Race: Context and Implications

While DeepSeek's achievements are impressive, they need to be viewed in the context of global AI development:
  • Major US companies are likely working on significantly larger models
  • Future models may require 100-1000x more compute power
  • Hardware limitations and access to advanced chips remain crucial factors

Looking Ahead: The Next 18 Months

The AI landscape is poised for significant developments:
  • Expected release of new, more powerful models in early 2025
  • Potential for revolutionary advances in multimodal AI capabilities
  • Continued evolution of training efficiency and computational techniques

Key Takeaways

  1. DeepSeek R1 represents a significant advancement in AI reasoning capabilities
  2. Training efficiency improvements are making powerful AI more accessible
  3. The next generation of AI models will require unprecedented computational resources
  4. Global competition in AI development is intensifying, with hardware access becoming a critical factor

Technical Implications for Developers

For developers and AI practitioners, DeepSeek R1 offers several important insights; a local-inference sketch follows this list:
  • Open-source availability enables broader experimentation and implementation
  • New approaches to reasoning and training efficiency
  • Potential for improved development workflows and coding assistance
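
For hands-on experimentation, here is a minimal local-inference sketch using Hugging Face transformers. The distilled 7B checkpoint named here is an assumption; substitute whichever published R1 variant fits your hardware.

```python
# Minimal sketch: running an open-weights R1 variant locally with
# Hugging Face transformers. The repo id below is an assumption;
# swap in the checkpoint that matches your hardware budget.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed repo id
    device_map="auto",
)

out = generator(
    "Solve step by step: what is 17 * 24?",
    max_new_tokens=1024,  # reasoning models need generous output budgets
)
print(out[0]["generated_text"])
```

Note the generous max_new_tokens: as the R1 paper's training curves show, these models routinely emit thousands of reasoning tokens before answering.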

Looking Beyond the Hype

While DeepSeek R1's achievements are significant, it's important to maintain perspective:
  • Major AI labs are likely working on more advanced models
  • Hardware and computational resources remain key limiting factors
  • The path to next-generation AI involves both technical and logistical challenges
The emergence of DeepSeek R1 marks another milestone in AI development, but it's just one step in a rapidly evolving landscape. As we look ahead to 2025 and beyond, the combination of increased computing power, improved training efficiency, and innovative approaches to AI development promises even more dramatic advances in artificial intelligence capabilities.

Transcript

DeepSeek R1 dropped less than a week ago, and it has CEOs and top people in the U.S. scrambling, from government agencies all the way down to everyday folks searching for the app in the App Store. It has now become the number one app. It's a language model, and the question we need to figure out is: is this actually on par with, or better than, what's currently available in the U.S. with ChatGPT's o1? Today we're going to go over this information in a lot of detail. As of now, in Google Trends, DeepSeek has surpassed ChatGPT and Claude as a search term, just today; this is breaking news right now. And I have with me LDJ (LDJ confirmed on X), who's an AI engineer who works on training these models, so he's an expert in this field. He's here to help us figure out: are these models actually that intelligent? Are they that good at what they do? Can they surpass the frontier models the U.S. was claiming were impossible to beat? So, LDJ, welcome to the show.

Hey, thanks for having me on.

Cool, man. So, DeepSeek. This is crazy as far as people using it to code. I was using it to code, and it looks like you're able to do all this reasoning that seems very similar to ChatGPT's o1. I wanted to get your opinion on how this model is working, and your thoughts on what's been happening with people scrambling around. Is this the next level of superintelligence?

Yeah. Well, to start with the question of how it's working: luckily, they've open-sourced everything, or at least the weights and the ability to run the models on your own hardware if you wish, although it's going to require a lot of resources to run the full thing. We can see that when these reasoning models, at least R1, are actually working, it's really just a single stream of very long output. A lot of people originally assumed, when models like o1 from OpenAI came out, that they were doing some complex system of running Python scripts, prompting themselves, maybe running a bunch of things in parallel and picking the best one. But it really does seem that the way R1 is doing it, and likely the way o1 has been doing it as well, is just a training technique where the model itself learns how to reason for very long stretches within a single stream of output. And that ends up with a better answer.

Got it. So what that means is the model will actually output more intelligent thoughts to the people using it. It's really interesting, because I was just looking at Marc Andreessen's recent post. He shows that DeepSeek's AI system is now overtaking ChatGPT, and he specifically claims that DeepSeek's R1 is AI's Sputnik moment. What are your thoughts there?

Yeah, I think that's a really good analogy. Being number one on the App Store, I think there are a lot of factors that played into that.
But if we rewind a little bit, let's say a few weeks, maybe a bit more than a month ago, we did have other open-source reasoning models already out, such as QwQ from Qwen, which also happens to come from a Chinese company, like DeepSeek. And then you also had one from, I believe, a Canadian company called Ruliad, a 7 or 8 billion parameter reasoning model using a similar technique. But R1 really seems to be maybe the first one that's truly reaching o1 level on the main popular benchmarks, such as the Math Olympiad qualifying exams, where the others fall significantly short. I think it's a combination of the fact that they've achieved that and that they just recently released a technical report on it. They already had R1 preview, or maybe even the full R1 model, available for people to use for a week or two, maybe three. But it's just around a week ago that they released the technical report that unveiled a lot of the details of how they're doing things, alongside the online chat application, just like ChatGPT, and the mobile app as well. It's been gaining so much popularity since then.

Wow, that's amazing. I think the other question people ask about the compute is: can you just throw more GPUs at this? And on top of that, does this mean Nvidia is totally cooked? Did they invent a new way of reasoning in this paper that totally eliminates the need for Nvidia's GPUs?

Yeah, I think if we look at this with tunnel vision, at specifically these capabilities and o1's capabilities, yes, it shows that you maybe only need a few million dollars to achieve them. However, I don't think it necessarily reduces the demand much for Nvidia GPUs, because what a lot of people are missing is that there isn't just inference-time scaling, where the model can spend more compute to come to better answers. There's also reasoning reinforcement-learning scaling, where you can put more compute into the training in which the model is, in a way, teaching itself to reason better and better, and that ends up with better results. On top of that, you also have the original pre-training laws, which some people have called into question. I think it's likely there will have to be new forms of data, more multimodal data, like video data, to help supplement the lack of web data that we're currently hitting. But there are just multiple directions in which you can scale compute to get better and better models and results, and I think that will continue to play a role.

Oh, that's interesting. So where are we on this timeline? Can you go back and explain the leaps of intelligence that have happened from GPT-2 to GPT-3 and so forth? And where does this reasoning model from DeepSeek fit in with the class of OpenAI models? Because, as you're explaining, not all models are the same: there are small models, large models, different capabilities.
We'd just like an understanding of where we are on this scale that people talk about; I'd love to have you break it down a little.

Sure. To briefly talk first about what you can call algorithmic efficiencies, or training-efficiency improvements: there's been quite a bit of research done by organizations like Epoch AI showing that even though these advancements can seem sporadic and spontaneous, they actually follow a relatively clear line over time of roughly 2 to 6x per year, which you can average out over longer periods. And DeepSeek V3 seems to be just another data point following that curve. I think it's reasonable to suspect that OpenAI and others already have similar algorithmic efficiencies, or are not that far behind, for their latest models, at least GPT-4o, and maybe Anthropic too, since they seem to be quite ahead in research.

And just for a sense of the scale of cost we're talking about: for DeepSeek V3 there's this commonly reported number of about $5 million, and a lot of people have been calling that into question, saying it sounds like BS, like a sham. I have run some numbers myself on their own published research, and it does seem in line with the cost I calculated. However, what a lot of people are getting wrong is assuming, because of various news headlines, that things like GPT-4o and Claude 3.5 cost billions or hundreds of millions of dollars. That's a big misconception right now; a lot of that money is being invested in future models. These current models are estimated at likely significantly less than $50 million, very possibly even less than $20 million. So that's already getting roughly close to the DeepSeek V3 range. And if you look at something like Llama 3.1 405B, the open-source model Meta released, a lot of people are wrongly comparing to that, but it's just pretty inefficient for various reasons; I think they were mostly focused on having a dense model to compare their research over time against their previous ones. They do have more efficient things planned for Llama 4; we can go into that another time. And then, of course, for the first half of 2025 we have these models that have recently been training, Grok 3 and others, on the scale of more like $500 million of compute. Those are probably implementing things much more training-efficient than something like Llama, and I think they will end up significantly better than anything we have available today.

Wow, that's amazing. So do you think that, given what's happened with DeepSeek, they'll be able to keep catching up as well? Or is it just a matter of not having these resources available? Can they just throw more compute at it, and with the data stuff you were saying, can they still end up with similar results, or find another breakthrough?
Yeah, so I think it's very possible that an organization like OpenAI might already have similar training efficiency to DeepSeek V3, as I described. But even if we say DeepSeek and China are 10x ahead in training efficiency, let's say they have such a leap: when you look at the long-term scale, such as the Grok-3-scale models, and OpenAI is inevitably training things on that scale of compute as well, then even with a 10x training-efficiency advantage, they would still need something like a $50 million training run to match a $500 million training run. And this continues: say the big American players are doing $5 billion training runs; then even if DeepSeek has a 10x training-efficiency advantage, they would still need $500 million of training compute to match the capabilities the American companies are getting with $5 billion.

Wow, that's pretty massive. And where are we in terms of timelines? Because to train these models there's a certain lead time, given how much data they're crunching. GPT-4 was released, what, late 2022, early 2023? I don't remember the exact timeline.

It was released in early 2023, but trained in 2022; I put a label for that in the chart here. And these upcoming models are likely to release this quarter, if not sometime in the first half of 2025, so maybe the second quarter of 2025. That's the top bar. xAI is basically confirmed to have a model of that scale, and I believe OpenAI has likely been training one in the past few months as well; it's expected Google has too. That's something we can expect soon, within the next few months. But longer term, on the order of 12 to 18 months, it's expected that training runs will get to about 10x above that, so closer to around $3 billion to $5 billion training runs. And if you go back in time, you asked me what the previous leaps have been: GPT-2 to GPT-3 was about a 100x jump in training compute, and GPT-3 to GPT-4 was roughly another 100x. The first 100x in training compute above the original GPT-4 is what we would get from these 12-to-18-month training runs.

Wow. So at the very top line, this big bar we're seeing at the very top: when these training runs are done in 2025, they could cost in the half-a-billion-dollar range and beyond, right?

Yeah.

We're talking about a 100x leap from GPT-4. And the estimate for training here was somewhere in this range; that was Claude 3.5 Sonnet. So you can see the big leap, ladies and gentlemen, and this is a logarithmic scale, so this is massive. We're talking from tens of millions to, you know, add another zero. And that's why it's so important to have this data, right? People are asking for data and all these different techniques to push this together, because it takes quite some time for this to happen. And it sounds like, with the U.S., we're basically just around the corner from a big nuclear bomb drop,
as far as the AI space goes. It feels like, yo, you thought that was cool, kids? Maybe something even bigger is coming. So in terms of the response from these bigger companies, it almost makes sense why they're a little bit silent: maybe this is just kind of cute to them, and maybe they have something bigger? What do you think? Do they have something that's just going to be jaw-dropping? Or am I being a little too optimistic?

Yeah, so I think there is a factor of them already knowing, and when I say them, I mean xAI, OpenAI, those companies, that they have something significantly better and bigger in the works. However, in the short term, as of the past week and probably the coming weeks, it is a bit of a bad look PR-wise, and it is a risk, maybe a significant risk, to their mind share and user base. Because, as you saw, there's already more Google search popularity for DeepSeek than for Claude now, even above ChatGPT. And the fact that they're number one on the App Store: there are a lot of factors at play here, like the attractive underdog story of creating this thing and making it available for free. And it is actually a good model, I'll say. It is integrated well. It even has internet search abilities, apparently, whereas o1 doesn't even have an internet search connection within ChatGPT yet. So there are various advantages for the consumer, and I think they really need to get those $500-million-scale models out soon so they can show their prowess above DeepSeek.

That's a really good point. And touching on that, I feel like if you're a consumer, it's really hard to get this model hosted in the U.S. I just learned that if you go to DeepSeek's website, you're basically sending your data directly to China, under their data laws and whatever their privacy is, or lack thereof, who knows. I haven't really read through the terms and conditions, so I don't want to make assumptions. But you would assume that if it's your data, you'd probably rather send it to a U.S. company, which is maybe going to sell it to someone else, but you'd rather do that to control your privacy. But there's an interesting fork here: because this model is at that level of intelligence and is very useful, and I'm using it as well, and because we have the recipe and the sauce, U.S. companies can take this model, host it privately here in the U.S., and provide inference as a service, so you can make a ChatGPT-like app with this kind of model running that follows certain guidelines or local rules. And private companies too: there are probably people looking into running this on their own servers or renting GPUs. Is that what you're seeing as well?

Yeah, there are a lot of good points there. In terms of cost, let's start with that. That's another threat to OpenAI and others in the short term: you have this thing that maybe even cost the same to train as o1.
Maybe it's even the same inference compute as o1, but they're not necessarily selling it below cost; on the other side, OpenAI is simply selling o1 at well above cost, if you get what I mean. OpenAI just has very big profit margins on o1, for example, while DeepSeek is selling closer to at-cost. And whatever the situation is, they're selling it for much less, and that is going to, in some sense, take customers away from OpenAI, at least in the short term. And then there's the fact of this being open source, like you said: not just the actual model weights and the trained model itself, but the research. This comes back to the hypothetical I raised earlier, that even if DeepSeek had this 10x training efficiency, and so on. Even that can be negated, because if this is a 10x training-efficiency leap by DeepSeek, the fact that their research is open-sourced means it can largely be replicated by the surrounding labs. Then DeepSeek has even less of an advantage, and for the hypothetical I was describing to even hold, they would have to get another big efficiency leap over the Western labs on top of that, and then make sure not to give it to the West.

Okay. And it just goes back to what their philosophies are, because there's a difference between Eastern philosophy and ways of thinking versus Western philosophy and ways of thinking, right? What's the give and take? That's an interesting perspective, because sometimes it's just so hard to sift through the noise; there's a lot going on, and for me it's really hard to understand where these models are and what the differences are. So one thing I wanted to understand a little better: there was a conversation about the GPT-4o models being omni. That means you can put text into the model, and it can generate video, generate audio, and so forth. It's a little different from having it go through separate steps from text to video and back down to text; it doesn't have to pass through multiple models. Maybe you can explain what that is, and then compare it to what we currently see with DeepSeek R1.

Yeah, I think that's a good point, something people have been missing lately: DeepSeek has this architecture and released this model, but it's still lacking certain things that something like GPT-4o has, in terms of the audio omni-modality. The architecture for 4o is said to be able to understand audio directly, not just a transcription of audio but the actual audio itself. For example, if you theoretically were to speak very slowly or very quickly, it can actually tell which one you're doing. It can identify things like tone of voice, and it can express that back; it can even speak to you. They showed examples like, please give me the audio of Mario hitting a coin, and it actually generates a sound like Mario hitting a coin. A lot of these things aren't fully rolled out to the public yet, but OpenAI has talked about them in blog posts. And remember, that was announced all the way back in May, I believe.
Yeah, GPT-4o was announced in May.

Right. And speaking of this, DeepSeek actually just announced something today, I want to say. I don't know if you saw it.

Briefly.

Janus Pro 7B. And this is actually partially omnimodal, in the sense that they don't have the audio aspect, I think, but they do have the understanding of vision, and it's also able to output and generate images, all with the same model.

Wow. And this is something GPT-4o is supposed to be able to do as well; it's just not released and rolled out to the public yet. But yeah, you can see here they already had a previous model a few months ago called Janus, and then a newer one here. And sorry, one second, I'm going to get my dog, so I think you might hear him cry.

It's okay.

Yeah, so here's Janus down here in orange. Then you have Janus Pro 1B, the one-billion-parameter model, and then Janus Pro 7B, which is that model there. And here are the evals for the accuracy and the scores. I'm back.

Cool. Wow. And you can see their older version; I don't think it released that long ago. Maybe a few weeks or a few months ago it was the original Janus model.

Yeah, this is really interesting. There was research from Meta called Chameleon, and another follow-up work called MoMa, which is, I want to say, almost a year old at this point and is doing things very similar to this. However, it's still exciting that DeepSeek is coming out with this. Maybe DeepSeek is implementing some extra tricks and bells and whistles on top of those types of methods. And this is also open source.

Yeah, this is also open source. I think if you look at the middle image especially, the chalkboard I mean, the ability to do text inside images is something image-generation models have historically had a lot of issues with. It's funny, and it makes sense, that something that's actually understanding and trained on a bunch of language as well as images is able to do that way better.

Oh, that's interesting. And this literally came out today, maybe an hour before the stream started. Absolutely insane breaking news, and this world moves incredibly fast, as we can see. But I think it's been really incredible to put this a little bit into context, because it's really easy to jump on the fear train, especially since this involves a lot of layers of complexity. There could be many people just watching and tuning in today, so please drop any questions you may have and we'll see if we can answer them. I appreciate everyone who's joined so far, and it's really interesting to hear people's feedback and opinions, because I'm still trying to learn about this too. I appreciate LDJ coming on to share his background. Tell us a little about what you do, LDJ.

Sure. My background is that I've been in AI for about five years, originally at a company called TTS Labs, where I led a lot of development on voice-specific models for streamers on Twitch.
And it would be something where viewers could donate with specific text, and it would make it sound like the streamer was saying that on stream, like, "thank you, Ray Fernando, for subscribing," in the streamer's voice, without them actually having to say it themselves. That started around 2019, 2020, and it's still going today. Later on I joined Nous Research, which is more LLM-focused. There I was mostly focused on post-training. For people unfamiliar with what post-training means, it's essentially the process after you've already trained the model on the internet, where you train it more specifically on how it should behave, how it can maximize its effective intelligence for the user, and the different attributes of how it does things.

Oh, cool. Yeah, that's super awesome. It's amazing, that depth of background and time in the industry. If you haven't had a chance, follow LDJ confirmed down below on X. We hang out a lot, especially in Spaces. For those who aren't familiar, a lot of the AI community ends up gravitating to X, formerly known as Twitter, and there are these informal Spaces where something will happen and people just hop in and start talking. That's where I met LDJ, and I also met you on the ThursdAI podcast with Alex Volkov. An amazing show; it goes into a lot of detail on all the models, and it drops every Thursday morning, so if you haven't had a chance, definitely check that out.

So, somebody had a question here: what's the proof that it only took around six million, apart from what they say? Can we trust them? Is it noise?

Yeah, so on that, I think I can send you this training-cost calculator I was talking about earlier. Or could I just share my screen?

Yeah, you can share, and I'll be able to put it on the stream.

Here, let me try to be quick with this.

Cool. It's also fun to see everyone's thoughts here. Hertzfeld Labs, welcome to the show. Awesome.

I guess the short answer, while I pull this up, is that since they publish the research openly, you can actually quite easily calculate roughly how much it would cost based on their method of training-compute cost, which is a pretty typical way of measuring it. And it does come out, by my estimate, roughly in the range of $5 million to $11 million, depending on how you price the GPU-hours. But here, let me share my screen right now, which should hopefully work.

[A minute of fumbling with the streaming software follows while the screen share is set up.] There's an audio issue, by the way.
I think it's glitching out from your end. [After a brief back-and-forth, disabling the "also share tab audio" option fixes the glitch, and the screen share comes through.] Okay, perfect. There we go.

Yeah, so as you can see, I was adding some additions to this, but I made sure the code is working and matches the actual formulas for calculating training cost and FLOPs. For the most part it's a pretty basic calculation. You really just need two main variables: the active parameter count of the model, which in this case is about 37 billion for DeepSeek V3, if I remember right, and about 15 trillion tokens for the amount of tokens during training. [He enters the numbers; the field doesn't accept commas, so he types them out in full.] And you see here: about a $4.4 million training cost.

Oh, my goodness.

Yeah, so it's right about where the DeepSeek V3 paper says it is. Depending on how you calculate it, they may even be being a little humble, and it could be a little lower than that, depending on how they're implementing some of their efficiencies.

So if you scroll down, it says an estimate for GPT-5-scale models in the second half of 2025 or first half of 2026. That's going to cost how much?

Yeah, this is what I was saying earlier. I think there's something wrong with the cost shown here, because it's showing the same cost as the other one, so there might be a bug. Oh no, this isn't calculated from the calculator code; it's something I told it to add manually as a card, so ignore this. It should be closer to $5 billion.

Five billion for GPT-5. Oh, my goodness.

Yes. So remember earlier I was saying there's a somewhat predictable algorithmic-advancement rate over time. If you look at GPT-3 to GPT-4, it's estimated by some people formerly at OpenAI, like Leopold Aschenbrenner, as well as some others, that there was maybe around a 5x or 10x efficiency gain between GPT-3 and GPT-4. A lot of people think it was simply naive scale-up alone, but there were actually several things, such as mixture-of-experts training, as well as something called InstructGPT that allowed it to work much more effectively in a chat manner. Those various things count toward what you'd call training-efficiency gains.

Oh, wow.
And so when you combine, let's say, a 10x efficiency factor with a 100x increase in raw compute, you would roughly call that 1,000x in effective compute, which basically means it has about the same capabilities as if you naively scaled up 1,000x alone, if that makes sense.

So you're saying it's adding additional tricks and techniques on top, to get leverage out of the same compute, in some way?

Yeah, you can think of it in that sense. It's estimated that if they didn't change anything between the GPT-3 and GPT-4 architecture or training recipe, and just naively scaled up the number of tokens and parameters, then the scale-up that the capabilities actually match is roughly a 500x or 1,000x more-compute model. And so with that, for something like these upcoming $500-million-scale models, some people might say, oh, maybe that should be GPT-5-level ability, because you have this 10x increase in raw compute but also a 10x efficiency advancement from something like DeepSeek V3. But I think in order to really feel a full generational leap, you might need that DeepSeek-V3 level of efficiency combined with a 100x raw-compute increase to really get a feel similar to the leap we had between 3 and 4.

Okay, so this is making a little more sense. It's getting harder for people to compare, because raw cost alone won't tell you whether something is the next generational leap, and neither, necessarily, will more compute or more specific types of data. It's a lot of combinations of things. And as we go, everyone learns the new techniques; they become public, and everyone rolls them into the next thing. But you said there seems to be a clear delimiter of, okay, that's definitely the 100x mark or the 1,000x mark. How would you characterize that?

Yeah. So I would say, for these models that might release in the second half of 2025 or first half of 2026, or at least finish training by then, it seems those would, in pretty much whatever way you look at it, match the defining factor of being GPT-5 scale and would deserve that title. When you look at the shorter-term things, sure, it's possible they might have such big advancements, and maybe such big improvements in how you use them, that companies might even call these earlier ones GPT-5. But even if they don't get such a name, I wouldn't call that late, or say things are delayed, because I don't think anything is really deserving of that title until you get to this point anyway.

Okay. Yeah, it's amazing to see how this is all playing out across this giant landscape. I was wondering if you want to touch on anything else that's come to mind or that you've seen online, because we now have about 500-plus people on the stream. It's really amazing how curious we all are. I'm trying to build with AI, and I may not have a lot of the technical details, but I always enjoy chatting with you, because you provide a lot of insight and cut through the nonsense I hear online.
Yeah, I think one of the most prevalent misconceptions, or maybe it's hard to call it a misconception because it's not exactly a firm statement, is something that's been said a lot lately: that DeepSeek is just a side project that a few quants decided to work on together. And I would say that's probably not true in the way people think it is, at least. It has been basically its own organization for a little over a year at this point, putting out decently well-known models in the open-source space for about a year. They've been progressively doing more impressive things, originally on the scale of 1-billion- and 7-billion-parameter models, and then releasing some really cool small coding models that people used, including me back then. And I want to say that on the DeepSeek V3 or R1 paper, there are on the order of over 100 or 200 authors, which is a similar scale to the number of authors on the GPT-4 paper, for example. So this is a pretty well-developed organization at this point. It's been said they've been hiring some of the top people from top research organizations and universities across Asia, even poaching people from Alibaba and ByteDance and big companies like that. So I think they are a serious competitor. I don't think people should consider them just a company in a garage. What they're doing is very impressive, and I think we'll continue to see good research from them, even if they might not necessarily keep matching the very top frontier models released from American labs over the next couple of years.

Oh, that's a really good note, because bigger companies here in the U.S., like Meta, who are actively open-sourcing things and providing stuff for the AI community, have bigger teams as well, right? And they're very well funded, and, I guess, mission-driven to put this out there so folks can take a look at it. So in the U.S., what's the equivalent? I mentioned Meta, but who else is doing this type of stuff in the U.S.?

Hmm. When you say this type of stuff, do you mean just open-source research, or open-source research specific to reasoning models and things like that?

Yeah, I guess: who, or what, is the equivalent here in the U.S.? People joke about DeepSeek's origins, or maybe take them seriously; it's like, oh yeah, we're just a small team of researchers coming together, but it's now a bigger thing. And in the U.S. there's this big investment to put into AI, and I feel like there are a lot of different places where stuff is happening. I wanted to get your lay of the landscape. How does it look from a developer's standpoint? If you're a college student thinking about getting into the space in the U.S., where should you consider studying, or doing some side studying while in school, to advance yourself? Where should you focus if you want to get into this space? I feel like this is a big call to American engineers: if you're an engineer and you really want to get into this space and make a big change,
I think this is kind of your moment, right? So there's a large landscape, and I'm trying to think about what can help someone, or where you would guide them.

Yeah. So in terms of the one you mentioned, Meta: they're doing a lot of cool research, especially at FAIR, which I think stands for Fundamental AI Research lab. That's where a lot of cool things are happening, like the omnimodal research I was talking about earlier that they've open-sourced, and a lot of the precursor research that's planned to go into future Llama models. I think people should apply, or at least check out what they're doing, to see if it's a good fit for what they want to be a part of. Then you also have DeepMind. They're not exactly known as a top open-source player, but they do have a lot of cool published research, and I would say a much higher frequency of publishing than other labs of their caliber. They've also open-sourced LLMs trained to decent capabilities on the order of the 7-billion-parameter scale, and I think maybe even a little larger; those models are called Gemma, and they have experimental versions of them on various architectures. You also have Allen AI and Cohere; they've been doing some cool work just open-sourcing models. Allen AI specifically has been doing really cool work on reasoning research and open-sourcing their findings, and Nathan Lambert at Allen AI has been spearheading a lot of that work. I'm not sure if Cohere counts as Canadian or American, but I guess North America overall, right?

Yeah, yeah.

I think I've mostly named all the ones I can think of right now. Those are the main ones.

Oh, that's cool. That's super helpful, because this field, for me, feels like it keeps getting wider and wider. A lot of my background is, oh, I want to get into this and explore AI models, and there are just so many different problems to solve in this space: getting the models, training them, then people making products that use these models. Distribution is a big deal: getting these models actually put somewhere and delivered onto devices, distributing the weights around, and a lot of engineering problems from here to there that can be solved. These different companies are doing those types of things. I encourage people who are tuning in today to tinker and start having some fun playing around. For me, learning how to prompt, understanding what you're putting into these models, and seeing what you can get back out can teach you a lot about what's happening and show some of the core intelligence. It's been fun being a prompt engineer, putting different things in and making these little templates in some ways. That could be the basis for an app that solves a problem normal people face every day. So yeah, that's really cool.
Maybe if you want to start going deeper on things, you could pull up the DeepSeek R1 paper. They have some cool charts we could show, for example how the length of its thinking naturally starts to increase the more you train it. Effectively, the model is learning to use longer thinking times more effectively as it better learns to reason. If you click PDF somewhere in the top right, yep, there you go, then scroll down until you see the big chart with a line going from bottom left to top right. Yeah, that one. You see on the Y axis it says average length per response. It starts off giving less than 2,000 tokens per response, and by the end of that training, the average length per response is on the order of 10,000 tokens. For people unfamiliar with what a token is, it's basically about three-fourths of a word, depending on the tokenization method you're using. But here, if you just read the Figure 3 description and then the few sentences below it.

Oh, cool. Yes. It says: "The average response length of DeepSeek-R1-Zero on the training set during the RL process." So this figure shows DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. And it says this improvement is not the result of external adjustments, but rather an intrinsic development within the model. Intrinsic development within the model: I think that deserves a really big pause, because that means this is kind of that Frankenstein "it's alive" moment. The model itself is now figuring stuff out, which is interesting, right? You make something and it just starts to take off on its own. I think that's maybe what the people at OpenAI saw when they started to understand this concept, and this concept is what we're seeing publicly here in this paper, which is really nice. "DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation." Maybe the next sentence is worth reading too. "This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth." That's incredible.

Yeah, I think that mainly summarizes it. And what you're saying about this maybe being what the OpenAI people saw, they did pretty much confirm. They uploaded a video to their YouTube, a kind of roundtable-style talk with a bunch of people who worked on o1, and one of the things they mentioned is that one of the aha moments for them was realizing the model was just naturally developing the ability to start questioning its own reasoning process and then refining it from there. And just like the DeepSeek people are saying here, that's without having anything that explicitly enforces the model to do that, or without intentionally trying to get the model to do it.
It just ends up, through this reinforcement-learning process, figuring out: wait a second, I should actually question this strategy, and think about another strategy, and so forth.

I see. And for those who are new to this and might find the jargon a bit foreign: reinforcement learning just means that somebody, or some system, is there to verify the output and give it a reward or score saying this is good, so the behavior gets reinforced next time.

Yeah. In this case, though, the form of RL most people are familiar with, which you were describing, is where a human gives the feedback. In these processes it's actually more that non-human systems give the feedback, or even the AI itself. So you have, for example, functional math-verification systems built into the training that can objectively verify whether an answer is correct or not, and then reward the model live during the training process itself. It can immediately reinforce and go into that feedback loop of improving itself, without waiting on the long process of having a bunch of human laborers label things.

I see.

And that's combined with more general feedback mechanisms, which even Meta has researched a lot in the past few months, basically what you'd call RLAIF, reinforcement learning from AI feedback. Anthropic has actually talked about using this a lot for their models too. For the most part, it's for things you can't really objectively verify: you literally have the AI model itself, or another AI model, give a judgment of, hey, is this a good response, is this meeting the criteria the creators are after? And it goes from there.

Holy guacamole. So if we think about this, I like to anchor my thinking to other concepts I know. One thing I've been using a lot is ChatGPT's o1 Pro model, and I notice it takes a longer time to think, and I'm thinking about the number of steps involved. If it's very similar to this type of thing, when Sam Altman says, even though we charge 200 bucks a month, we're still losing money on this: can you see how that could actually be true, given the calculations?

Okay. I think something to clear up about the ChatGPT Pro subscription is that most of the value in it is actually not so much the o1 Pro model. It's rather that you get unlimited advanced voice mode, unlimited 4o, unlimited o1 access, all of which are limited on the Plus plan; you only get unlimited with the $200 per month. Then I think even unlimited slow generations with Sora, and then there's the cherry on top of o1 Pro, and now an extra cherry on top, Operator. I've run some numbers, and it's actually not too hard, if you're a power user for a few hours per day, to end up using well over the subscription's sticker price in equivalent advanced-voice-mode and o1 API costs, all within the Pro plan for just $200 per month. So I think a lot of people are somewhat abusing the system there. Not necessarily abusing, because it's meant to be used as much as you can.
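
To illustrate the verifiable-reward idea described above, here is a minimal sketch of a rule-based reward function. The tag format, answer convention, and weights are illustrative assumptions, not the exact scheme from the R1 paper.

```python
# Minimal sketch of a rule-based reward: for verifiable tasks like math,
# a program (not a human) scores each sampled answer, and that score
# drives the reinforcement-learning update. Format and weights assumed.
import re

def reward(model_output: str, ground_truth: str) -> float:
    score = 0.0
    # Format reward: did the model wrap its reasoning in <think> tags?
    if re.search(r"<think>.*?</think>", model_output, re.DOTALL):
        score += 0.1
    # Accuracy reward: does the final boxed answer match the known one?
    match = re.search(r"\\boxed\{(.+?)\}", model_output)
    if match and match.group(1).strip() == ground_truth.strip():
        score += 1.0
    return score

print(reward("<think>2+2 is 4</think> \\boxed{4}", "4"))  # 1.1
```

In an actual training loop, a score like this would feed a policy-gradient update over many sampled responses; anything not mechanically checkable falls back to AI-judge feedback (RLAIF), as discussed above.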
But yeah, I think they just have it priced this way, with the rate limits at certain levels, because of course these operating costs for OpenAI are going to fall. So even though they might be hurting on the Pro subscription right now, it's only a matter of time, maybe even months from now, before they're already in profit.

That makes sense, with the new investments and the new models. It's almost a Möbius strip, a feedback loop. That's really cool. It's an interesting perspective, because sometimes it's really hard to relate, and easy to get caught up in the noise of things. So, of all the different things we've covered: is America cooked?

I mean, as I mentioned earlier, even if you make the more optimistic case for a company like DeepSeek and others in China, you still need, for example, at least a $500 million training run to match America's $5 billion training run. And those $5 billion training runs in America are likely going to happen, likely within the next two years, or even sooner than 18 months. I was actually just doing some rough napkin math with a friend recently: assuming they're not able to smuggle hundreds of thousands of H100s, and maybe that's a wrong assumption, I don't know, I'm not very confident in it, but let's say they can't, they would end up needing somewhere on the order of millions of their internal Chinese chips to match what one of these Western training runs in the next 18 months will be capable of. Maybe even north of 10 million in some sense, though that might be pushing it; let's say north of 5 million, roughly, is what I estimated. Then you do the math on whether the chip fabs that exist in China can even produce that number of chips in a year, and it seems the answer might actually be no. So it seems they either have to move very fast, and/or get a lot of H100s into the country, maybe by starting shell companies in India and Bangladesh and importing the H100s from those countries into China. Or it might simply be a matter of whether Nvidia is able to supply enough of the lower-class H800s to China to enable them to get there. I guess we'll have to see what the restrictions end up being on America's side.

Wow. That is incredibly insightful. And I love that it's not crazy theorizing; it's approaching this from an engineering standpoint, calculating from what's been published, looking at it at a detailed level, and trying to make better decisions and reach a true understanding. I feel like a lot of people may want to reach out to you to learn more about how they can work with you, not just on this problem but on other problems they have. What's the best way for people to get in contact with you?

Yeah, so LDJ confirmed, that's my Twitter, and people can DM me there. I also have a blog; it's LDJ AI, I think it's called. I actually just recently started it, so let me double-check the name. The Substack? Yeah, it's just LDJAI. So that's my Substack, if people want to check out my blog posts.
I only have one post there, but Ray liked it a lot, so maybe some of you viewing this might like it too. And I'm also available for AI consulting. I've consulted for various stealth labs and investment firms, and if you're interested in having a conversation with me in that kind of way, feel free to reach out and ask me about it.

Yeah, definitely check it out. Very detailed, very thoughtful. I feel like you're one of the few people in the industry really taking the time to think things through clearly and look at them objectively, which is what I appreciate: hopping on these X or Twitter Spaces to talk about this right as it comes out and sharing these insights for free, but also privately, for those who are looking to understand what the business impact is going to be if you're building with these models, trying to scale them up, and figuring out what problems you can solve. So, LDJ, awesome show. I really appreciate you coming on and helping me, and others, understand this space. For those just tuning in, make sure you follow him, LDJ confirmed, the handle right down below. And then RayFernando1337: I'm just trying to explore this world of AI, share information as I go, and answer some of the burning questions in my mind and in other people's too. Any last thoughts before we close out?

No, this was fun. I think we should do another one sometime. It's fun talking to you in this format, and I think we can help inform people about a lot of things this way.

Yeah, it's so awesome. Totally cool. Thanks for tuning in, y'all. Make sure you stay subscribed, and you'll be notified when we publish new shows. Stay tuned, and I'll see y'all later. Peace, y'all.