
Original article: https://semianalysis.com/2025/01/31/deepseek-debates/
DeepSeek Debates: Chinese Leadership On Cost, True Training Cost, Closed Model Margin Impacts
H100 Pricing Soaring, Subsidized Inference Pricing, Export Controls, MLA
The DeepSeek Narrative Takes the World by Storm
DeepSeek took the world by storm. For the last week, DeepSeek has been the only topic that anyone in the world wants to talk about. As it currently stands, DeepSeek daily traffic is now much higher than Claude, Perplexity, and even Gemini.
But to close watchers of the space, this is not exactly “new” news. We have been talking about DeepSeek for months (each link is an example). The company is not new, but the obsessive hype is. SemiAnalysis has long maintained that DeepSeek is extremely talented and that the broader public in the United States has not cared. When the world finally paid attention, it did so with an obsessive hype that doesn’t reflect reality.
We want to highlight that the narrative has flipped from last month’s claim that scaling laws are broken (a myth we dispelled); now the claim is that algorithmic improvement is too fast, and that this too is somehow bad for Nvidia and GPUs.
The narrative now is that DeepSeek is so efficient that we don’t need more compute, and that the model changes have created massive overcapacity everywhere. While the Jevons paradox is also overhyped, Jevons is closer to reality: the models have already induced demand, with tangible effects on H100 and H200 pricing.
DeepSeek and High-Flyer
High-Flyer is a Chinese hedge fund and an early adopter of AI in its trading algorithms. They realized early the potential of AI in areas outside of finance, as well as the critical insight of scaling, and have been continuously increasing their supply of GPUs as a result.
After experimenting with models on clusters of thousands of GPUs, High-Flyer invested in 10,000 A100 GPUs in 2021, before any export restrictions. That paid off. As High-Flyer improved, they realized that it was time to spin off “DeepSeek” in May 2023 with the goal of pursuing further AI capabilities with more focus. High-Flyer self-funded the company, as outside investors had little interest in AI at the time, with the lack of a business model being the main concern. High-Flyer and DeepSeek today often share resources, both human and computational.
DeepSeek has now grown into a serious, concerted effort and is by no means a “side project,” as many in the media claim. We are confident that their GPU investments account for more than $500M, even after considering export controls.

The GPU Situation
We believe they have access to around 50,000 Hopper GPUs, which is not the same as 50,000 H100s, as some have claimed. There are different variants of the H100 that Nvidia made in compliance with different regulations (H800, H20), with only the H20 currently available to Chinese model providers. Note that H800s have the same computational power as H100s, but lower network bandwidth.
We believe DeepSeek has access to around 10,000 of these H800s and about 10,000 H100s. Furthermore, they have orders for many more H20s, with Nvidia having produced over 1 million of the China-specific GPU in the last 9 months. These GPUs are shared between High-Flyer and DeepSeek and are geographically distributed to an extent. They are used for trading, inference, training, and research. For more specific detailed analysis, please refer to our Accelerator Model.

Source: SemiAnalysis, Lennart Heim
Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters. Similarly, all AI labs and hyperscalers have many more GPUs for various tasks, including research and training, than they commit to an individual training run, since centralizing resources is a challenge. X.AI is unique as an AI lab in having all its GPUs in one location.
DeepSeek has sourced talent exclusively from China, with no regard for previous credentials, placing a heavy focus on capability and curiosity. DeepSeek regularly runs recruitment events at top universities like PKU and Zhejiang, where many of the staff graduated from. Roles are not necessarily pre-defined, and hires are given flexibility, with job ads even boasting of access to 10,000s of GPUs with no usage limitations. They are extremely competitive, and allegedly offer salaries of over $1.3 million USD for promising candidates, well above what competing big Chinese tech companies and AI labs like Moonshot offer. They have ~150 employees, but are growing rapidly.
As history shows, a small, well-funded, and focused startup can often push the boundaries of what’s possible. DeepSeek lacks the bureaucracy of places like Google, and since they are self-funded they can move quickly on ideas. However, like Google, DeepSeek (for the most part) runs their own datacenters, without relying on an external party or provider. This opens up further ground for experimentation, allowing them to make innovations across the stack.
We believe they are the single best “open weights” lab today, beating out Meta’s Llama effort, Mistral, and others.
DeepSeek’s Cost and Performance
DeepSeek’s price and efficiencies caused the frenzy this week, with the main headline being the “$6M” training cost figure for DeepSeek V3. This is wrong. It is akin to pointing to a specific part of a bill of materials for a product and attributing it as the entire cost. The pre-training cost is a very narrow portion of the total cost.
Training Cost
We believe the pre-training number is nowhere near the actual amount spent on the model. We are confident their hardware spend is well above $500M over the company’s history. Developing new architectural innovations during model development requires considerable spend on testing new ideas, new architectures, and ablations. Multi-Head Latent Attention, a key innovation of DeepSeek’s, took several months to develop and consumed a whole team’s worth of man-hours and GPU hours.
The $6M cost in the paper is attributed to just the GPU cost of the pre-training run, which is only a portion of the total cost of the model. Excluded are important pieces of the puzzle like R&D and the TCO of the hardware itself. For reference, Claude 3.5 Sonnet cost tens of millions of dollars to train, and if that were the total cost Anthropic needed, they would not have raised billions from Google and tens of billions from Amazon. It is because they have to experiment, come up with new architectures, gather and clean data, pay employees, and much more.
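To make the bill-of-materials point concrete, here is a back-of-the-envelope sketch of where the headline figure comes from: the V3 technical report prices only the final pre-training run’s GPU-hours at an assumed rental rate. The GPU-hour count and $2/hour rate are as reported in the paper; the rounding is ours.

```python
# Reconstructing the "$6M" headline from the V3 technical report's own inputs.
gpu_hours = 2_788_000    # reported H800 GPU-hours for the final pre-training run
rate_per_hour = 2.00     # assumed $/GPU-hour rental price used in the paper

print(f"${gpu_hours * rate_per_hour / 1e6:.3f}M")   # -> $5.576M
# Excluded from this figure: R&D, ablations, failed runs, data acquisition
# and cleaning, salaries, and the TCO of owning (not renting) the hardware.
```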
So how was DeepSeek able to have such a large cluster? The lag in export controls is the key, and it will be discussed in the export section below.
Closing the Gap – V3’s Performance
V3 is no doubt an impressive model, but it is worth highlighting what it is impressive relative to. Many have compared V3 to GPT-4o and highlighted how V3 beats 4o’s performance. That is true, but GPT-4o was released in May 2024. AI moves quickly, and in algorithmic terms May 2024 is another lifetime ago. Further, we are not surprised to see less compute achieving comparable or stronger capabilities after a given amount of time. Inference cost collapsing is a hallmark of AI improvement.

Source: SemiAnalysis
An example is that small models which can run on laptops now have performance comparable to GPT-3, which required a supercomputer to train and multiple GPUs for inference. Put differently, algorithmic improvements allow a smaller amount of compute to train and run inference on models of the same capability, and this pattern plays out over and over again. This time the world took notice because it came from a lab in China. But smaller models getting better is not new.

Source: SemiAnalysis, Artificialanalysis.ai, Anakin.ai, a16z
So far, what we have witnessed with this pattern is that AI labs spend more in absolute dollars to get even more intelligence for their buck. Estimates put algorithmic progress at 4x per year, meaning that for every passing year, 4x less compute is needed to achieve the same capability. Dario, CEO of Anthropic, argues that algorithmic advancements are even faster and can yield a 10x improvement. As far as inference pricing goes for GPT-3 quality, costs have fallen 1200x. When investigating the cost for GPT-4, we see a similar decrease, although earlier in the curve. The smaller decline over time here can be explained by the fact that, unlike in the graph above, capability is not held constant. In this case, we see algorithmic improvements and optimizations creating a 10x decrease in cost alongside an increase in capability.
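As a quick sanity check on how those two figures relate, compounding the 4x-per-year estimate over roughly the five years since GPT-3 lands at the same order of magnitude as the observed 1200x price collapse. The five-year window is our rough assumption:

```python
# Compounding the cited 4x/year algorithmic progress estimate.
years = 5                          # ~GPT-3 (2020) to early 2025, approximate
progress_per_year = 4

print(progress_per_year ** years)  # -> 1024, same order as the observed 1200x
# At Dario's more aggressive 10x/year, the same window would imply 100,000x.
```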

Source: SemiAnalysis, OpenAI, Together.ai
To be clear, DeepSeek is unique in that they achieved this level of cost and capabilities first. They are also notable for having released open weights, though prior Mistral and Llama models have done this in the past too. DeepSeek has achieved this level of cost, but do not be shocked if costs fall another 5x by the end of the year.
Is R1’s Performance Up to Par with o1?
On the other hand, R1 is able to achieve results comparable to o1, and o1 was only announced in September. How has DeepSeek been able to catch up so fast?
The answer is that reasoning is a new paradigm, with faster iteration speeds and lower-hanging fruit: meaningful gains are available for smaller amounts of compute than in the previous paradigm. As outlined in our scaling laws report, the previous paradigm depended on pre-training, which is becoming both more expensive and harder to extract robust gains from. The new paradigm, focused on reasoning capabilities built via synthetic data generation and RL in post-training on an existing model, allows for quicker gains at a lower price. The lower barrier to entry, combined with the easy optimization, meant that DeepSeek was able to replicate o1’s methods quicker than usual. As players figure out how to scale more in this new paradigm, we expect the time gap between matching capabilities to increase.
Note that the R1 paper makes no mention of the compute used. This is not an accident: a significant amount of compute is needed to generate synthetic data for post-training R1, to say nothing of the RL itself.
R1 is a very good model, we are not disputing this, and catching up to the reasoning edge this quickly is objectively impressive. The fact that DeepSeek is Chinese and caught up with less resources makes it doubly impressive.
But some of the benchmarks R1 mentions are also misleading. Comparing R1 to o1 is tricky, because R1 specifically doesn’t mention benchmarks where it is not leading. And while R1 matches o1 in reasoning performance, it is not a clear winner in every metric, and in many cases it is worse than o1.
Source: (Yet) another tale of Rise and Fall: DeepSeek R1
And we have not mentioned o3 yet. o3 has significantly higher capabilities than either R1 or o1. In fact, OpenAI recently shared o3’s results, and the benchmark scaling is vertical. “Deep learning has hit a wall,” but of a different kind.

Source: AI Action S
Google’s Reasoning Model is as Good as R1
While there is a frenzy of hype for R1, a $2.5T US company released a cheaper reasoning model a month earlier: Google’s Gemini Flash 2.0 Thinking. This model is available for use, and is considerably cheaper than R1, even with a much larger context length available through the API.
On reported benchmarks, Flash 2.0 Thinking beats R1, though benchmarks do not tell the whole story. Google only released 3 benchmarks, so it is an incomplete picture. Still, we think Google’s model is robust, standing up to R1 in many ways while receiving none of the hype. This could be because of Google’s lackluster go-to-market strategy and poor user experience, but also because R1 is a Chinese surprise.

Source: SemiAnalysis
To be clear, none of this detracts from DeepSeek’s remarkable achievements. DeepSeek’s structure as a fast-moving, well-funded, smart, and focused startup is why it is beating giants like Meta in releasing a reasoning model, and that is commendable.
Technical Achievements
DeepSeek has cracked the code and unlocked innovations that leading labs have not yet been able to achieve. We expect that any published DeepSeek improvement will be copied by Western labs almost immediately.
What are these improvements? Most of the architectural achievements specifically relate to V3, which is the base model for R1 as well. Let’s detail these innovations.
Training (Pre and Post)
DeepSeek V3 utilizes Multi-Token Prediction (MTP) at a scale not seen before: these are added attention modules which predict the next few tokens as opposed to a single token. This improves model performance during training, and the modules can be discarded during inference. This is an example of an algorithmic innovation that enabled improved performance at lower compute.
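For readers who want the mechanics, below is a minimal, hypothetical PyTorch sketch of MTP-style training heads. DeepSeek V3’s actual MTP modules are more elaborate (sequential blocks that share the main model’s embedding and output head); this toy version only illustrates the core idea of supervising several future tokens during training and dropping the extra heads at inference.

```python
# A toy multi-token-prediction (MTP) head: extra projections that predict
# tokens t+1..t+k from each hidden state, trained alongside the base model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, k: int = 3):
        super().__init__()
        # One projection per future offset (t+1, t+2, ..., t+k).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(k)]
        )

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, seq, d_model) from the base model's trunk
        # targets: (batch, seq) token ids; the target for offset i at
        #          position t is targets[:, t + i]
        loss = torch.tensor(0.0, device=hidden.device)
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i, :])          # predict token t+i
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, i:].reshape(-1),
            )
        return loss / len(self.heads)

# At inference time these heads are simply discarded: only the standard
# next-token head is kept, so serving cost is unchanged.
```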
There are added considerations like using FP8 precision in training, but leading US labs have been doing FP8 training for some time.
DeepSeek V3 is also a mixture-of-experts model: one large model composed of many smaller experts that specialize in different things, an emergent behavior. One struggle MoE models have faced is determining which token goes to which sub-model, or “expert”. DeepSeek implemented a “gating network” that routes tokens to the right expert in a balanced way without detracting from model performance. This means routing is very efficient, and relative to the overall size of the model, only a few parameters are changed per token during training. This adds to the training efficiency and to the low cost of inference.
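To illustrate what a gating network does, here is a minimal top-k gate with the classic load-balancing auxiliary loss used in earlier MoE work. This is a generic sketch, not DeepSeek’s implementation (the V3 report describes a more refined balancing scheme), but the routing idea is the same.

```python
# A generic top-k MoE gate: score each token against every expert, keep the
# top-k, and penalize imbalanced routing with an auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.k = k
        self.n_experts = n_experts

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model); each token gets a score per expert.
        scores = F.softmax(self.router(x), dim=-1)       # (tokens, n_experts)
        weights, experts = scores.topk(self.k, dim=-1)   # route to top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Load-balancing loss: pushes the fraction of tokens routed to each
        # expert toward uniform, so no expert is starved or overloaded.
        density = torch.zeros_like(scores).scatter(1, experts, 1.0).mean(0)
        density_proxy = scores.mean(0)
        aux_loss = self.n_experts * (density * density_proxy).sum()
        return weights, experts, aux_loss
```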
Despite concerns that Mixture-of-Experts (MoE) efficiency gains might reduce investment, Dario points out that the economic benefits of more capable AI models are so substantial that any cost savings are quickly reinvested into building even larger models. Rather than decreasing overall investment, MoE’s improved efficiency will accelerate AI scaling efforts. The companies are laser focused on scaling models to more compute and making them more efficient algorithmically.
In terms of R1, it benefited immensely from having a robust base model (v3). This is partially because of the Reinforcement Learning (RL).
There were two focuses in RL: formatting (to ensure it provides a coherent output) and helpfulness and harmlessness (to ensure the model is useful). Reasoning capabilities emerged during the fine-tuning of the model on a synthetic dataset. This, as mentioned in our scaling laws article, is what happened with o1. Note that in the R1 paper no compute is mentioned, and this is because mentioning how much compute was used would show that they have more GPUs than their narrative suggests. RL at this scale requires a considerable amount of compute, especially to generate synthetic data.
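To make the two RL focuses concrete, here is a hypothetical sketch of a combined reward function. The `<think>` tag convention follows the format rewards described in the R1 paper, but the specific weights and the helpfulness/harmlessness inputs are invented stand-ins, not DeepSeek’s implementation.

```python
# A toy reward combining a rule-based format check with helpfulness and
# harmlessness scores (which would come from reward models or rule sets).
import re

# Completion must contain a coherent <think>...</think> block followed by
# an actual answer.
THINK_RE = re.compile(r"<think>.+?</think>\s*\S", re.DOTALL)

def total_reward(completion: str, helpfulness: float, harmlessness: float) -> float:
    # `helpfulness` and `harmlessness` are assumed to be in [0, 1];
    # the 0.2 / 0.4 / 0.4 weights are invented for illustration.
    if THINK_RE.search(completion) is None:
        return 0.0                      # malformed output earns nothing
    return 0.2 + 0.4 * helpfulness + 0.4 * harmlessness

# e.g. total_reward("<think>steps...</think> final answer", 0.9, 1.0) -> 0.96
```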
Additionally, a portion of the data DeepSeek used appears to come from OpenAI’s models, and we believe that will have ramifications for policy on distilling from model outputs. Distillation is already prohibited by the terms of service, but going forward a new trend might be a form of KYC (Know Your Customer) to stop it.
And speaking of distillation, perhaps the most interesting part of the R1 paper was being able to turn smaller non-reasoning models into reasoning ones by fine-tuning them on outputs from a reasoning model. The curated dataset contained a total of 800k samples, and now anyone can use R1’s CoT outputs to make a dataset of their own and build reasoning models with their help. We might see more small models showcase reasoning capabilities, bolstering their performance.
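A hedged sketch of what that workflow looks like in practice follows, with `query_r1` as a hypothetical stand-in for whatever API or local deployment serves the reasoning model:

```python
# Building a distillation set from a reasoning model's outputs, in the
# spirit of the ~800k-sample set described in the R1 paper.
import json

def query_r1(prompt: str) -> str:
    # Placeholder: call an R1 endpoint or local deployment here to get a
    # completion that includes the chain-of-thought and final answer.
    raise NotImplementedError

def build_distillation_set(prompts: list[str], out_path: str) -> None:
    # Write prompt/completion pairs as JSONL, a common supervised
    # fine-tuning format: the small model is then trained to imitate the
    # reasoning trace token by token.
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "completion": query_r1(prompt)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```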
Multi-head Latent Attention (MLA)
MLA is a key innovation responsible for a significant reduction in DeepSeek’s inference price. The reason is that MLA reduces the amount of KV cache required per query by about 93.3% versus standard attention. The KV cache is a memory mechanism in transformer models that stores data representing the context of the conversation, reducing unnecessary computation.
As discussed in our scaling laws article, the KV cache grows as the context of a conversation grows, and creates considerable memory constraints. Drastically decreasing the amount of KV cache required per query decreases the amount of hardware needed per query, which decreases the cost. However, we think DeepSeek is providing inference at cost to gain market share, and not actually making any money. Google Gemini Flash 2.0 Thinking remains cheaper, and Google is unlikely to be offering it at cost. MLA specifically caught the eyes of many leading US labs.
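To see why KV cache savings translate directly into hardware savings, here is back-of-the-envelope arithmetic. The dimensions below are illustrative assumptions, not DeepSeek’s exact configuration; the exact savings depend on head count and latent size.

```python
# Comparing KV-cache memory for standard multi-head attention versus a
# compressed-latent scheme like MLA, under assumed (not DeepSeek's) dims.
def kv_cache_bytes_mha(layers, n_heads, head_dim, seq_len, bytes_per=2):
    # Standard attention caches full K and V per head, per layer, per token.
    return 2 * layers * n_heads * head_dim * seq_len * bytes_per

def kv_cache_bytes_mla(layers, latent_dim, seq_len, bytes_per=2):
    # MLA caches one compressed latent vector per layer, per token, from
    # which K and V are re-derived at attention time.
    return layers * latent_dim * seq_len * bytes_per

mha = kv_cache_bytes_mha(layers=60, n_heads=128, head_dim=128, seq_len=32_768)
mla = kv_cache_bytes_mla(layers=60, latent_dim=576, seq_len=32_768)
print(f"MHA: {mha / 2**30:.1f} GiB, MLA: {mla / 2**30:.2f} GiB, "
      f"saving: {1 - mla / mha:.1%}")
# With these assumed dims the saving is ~98%; the ~93.3% figure in the text
# is what DeepSeek reports for their configuration.
```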
MLA debuted in DeepSeek V2, released in May 2024. DeepSeek has also enjoyed greater efficiency for inference workloads with the H20, due to its higher memory bandwidth and capacity compared to the H100. They have also announced partnerships with Huawei, but very little has been done with Ascend compute so far.
We believe the most interesting implication is specifically on margins, and what that means for the entire ecosystem. Below we have a view of the future pricing structure of the entire AI industry, and we detail why we think DeepSeek is subsidizing price, as well as why we see early signs that the Jevons paradox is carrying the day. We comment on the implications for export controls, how the CCP might react to added DeepSeek dominance, and more.