
More about DeepSeek

Li Guangmi, founder and CEO of ShiXiang Tech, organized a closed-door discussion on DeepSeek.

From https://news.qq.com/rain/a/20250127A07SH000

The event brought together dozens of leading AI researchers, investors, and front-line AI professionals to explore and learn about the technical details, organizational culture, and the short-, medium-, and long-term impact of DeepSeek’s rise to prominence.

The discussion aimed to uncover a glimpse of this “mysterious force from the East” based on limited available information.

It is worth noting that this discussion was an informal technical exchange and does not represent the views or positions of any individual or institution.

As Silicon Valley venture capitalist Marc Andreessen described DeepSeek-R1: “As open source, a profound gift to the world.” In line with this open-source spirit, the participants of the discussion have shared their collective thoughts publicly.

Below is a summary of the key points discussed:

01. The Mystery of DeepSeek: “The most important thing is to push intelligence forward.”

  1. The core leader: Founder and CEO Liang Wenfeng is the driving force behind DeepSeek. Unlike Sam Altman, he has deep technical expertise.

  2. Why DeepSeek has a good reputation: It was the first to release replications of MoE and o1 models. While they were early movers, there’s still room for improvement. Limited resources mean they focus only on the brightest opportunities. The team’s research capabilities and culture are solid, and with 100,000–200,000 GPUs, they could achieve even more.

  3. Long context capability: From preview to official release, DeepSeek’s long-context processing ability improved significantly. Achieving a 10k context length relied on straightforward methods.

  4. GPU resources: Despite claims by Scale.ai’s CEO that DeepSeek has 50,000 GPUs, public information suggests they likely have 10,000 older A100 GPUs and 3,000 pre-ban H800 GPUs. DeepSeek prioritizes compliance and hasn’t procured any non-compliant GPUs, limiting their resources. Compared to the U.S., they use GPUs much more efficiently.

  5. Strategic focus: DeepSeek focuses narrowly on specific challenges, sacrificing efforts on areas like multimodality. Their aim isn’t merely serving humans but advancing intelligence itself—this may be a key to their success.

  6. Commercialization and mission: In some sense, DeepSeek’s business model is tied to its emphasis on efficiency. Liang Wenfeng’s other venture, Huanfang (a quantitative trading firm), was a product of the last wave of machine learning. For DeepSeek, the priority is advancing intelligence, not money or commercialization. China needs a few leading AI labs to explore breakthroughs that can rival OpenAI. The journey toward AGI is long, and new innovations will emerge as the field evolves.

  7. Talent cultivation: From a technical perspective, DeepSeek acts as a “Huangpu Military Academy” (that is, a cradle of talent), spreading expertise across the industry.

  8. AI labs and business models: AI lab business models in the U.S. also face challenges. The current state of AI lacks a clear and successful business model. Liang Wenfeng is ambitious and focused solely on advancing AGI, regardless of its form.

  9. Efficient scaling techniques: Reading DeepSeek’s papers reveals that much of their innovation focuses on reducing hardware costs. Their scaling techniques significantly cut expenses in key areas.

  10. Short-term efficiency, long-term challenges: While DeepSeek’s work won’t impact overall computational demands in the long term, it does drive short-term efforts to make AI more efficient. The demand for computing power remains strong, and most organizations still face shortages.

  11. Insights on DeepSeek’s Organization:

  • Team composition: In investment, the focus is often on selecting top-tier talent. However, looking at DeepSeek’s model—composed mostly of smart young graduates from domestic universities—it seems that with time and teamwork, their abilities can evolve to a high level. Whether poaching a single individual would disrupt their synergy remains a question, but so far, such moves haven’t significantly impacted DeepSeek.

  • Cultural focus: While there’s plenty of money in the market, DeepSeek’s strength lies in its organizational culture. Their research culture is quite similar to ByteDance’s—fundamental and deeply rooted. A strong culture’s sustainability depends on having sufficient funding and long-term viability, often tied to robust business models. Both DeepSeek and ByteDance have excellent business models to support their cultures.

  12. Why Is DeepSeek Catching Up So Quickly?
  • Focus on reasoning models: The demand for reasoning models lies in higher-quality data and training. If the goal were long-text or multimodal tasks, catching up from scratch with a closed-source model would be more challenging. However, reasoning models themselves don’t require major architectural changes, making this a more achievable direction for catching up.

  • R1’s rapid progress: R1 may have advanced quickly because the tasks weren’t particularly difficult. Reinforcement learning (RL) primarily made the model more precise in its selections. R1 didn’t surpass the efficiency of Consensus 32 (majority voting over 32 parallel samples); it spent roughly the same 32-sample budget, shifting exploration from parallel to serial. This didn’t extend the boundaries of intelligence, but it made certain results easier to achieve.
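For reference, the Consensus 32 baseline mentioned here is majority voting over parallel samples (self-consistency). A minimal sketch in Python — the function name and toy sampler are illustrative, not DeepSeek’s implementation:

```python
from collections import Counter

def consensus_vote(sample_fn, prompt, n=32):
    """Self-consistency baseline: draw n independent samples in
    parallel and return the most frequent final answer."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy sampler: the correct answer only needs to be the plurality.
votes = iter(["42", "41", "42", "42"])
print(consensus_vote(lambda p: next(votes), "2 + 2 * 20 ?", n=4))  # prints 42
```

A serial reasoning model spends a comparable token budget on one long chain instead of n short ones, which is the parallel-to-serial shift described above.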

02. Explorers vs. Followers: “AI Progress Resembles a Step Function” “For followers, compute demand is 10 times lower.”

  1. Step function progress: AI development resembles a step function. For followers, compute costs are much lower (about 10 times) compared to leading-edge researchers. While followers don’t need as much compute, explorers must train numerous models. The search for new algorithms and architectures never stops. Behind every step function is significant investment, which will continue as compute resources and product development advance. Beyond reasoning, many other directions also demand high compute. While exploratory projects may use vast amounts of GPUs behind the scenes, without such investments, the next step forward might not occur. Some remain dissatisfied with existing architectures and RL methods, pushing progress further.

  2. Compute thresholds: In exploratory directions, using 10,000 GPUs doesn’t always outperform 1,000 GPUs. However, there is a threshold—if only 100 GPUs are available, achieving results becomes unlikely due to long iteration times.

  3. Different priorities: Advancing fields like physics often relies on a dual approach—academic researchers exploring multiple directions without a focus on immediate returns, and industrial labs prioritizing efficiency improvements.

  4. Explorers vs. followers: Smaller companies with limited compute need to prioritize efficiency, while large companies aim for faster model development. Many methods that improve efficiency on 2,000 GPU clusters don’t scale well to 10,000 GPUs, where stability becomes more important.

  5. CUDA vs. latecomers: CUDA’s ecosystem advantages lie in the variety and completeness of its operators. Chinese companies like Huawei have focused on optimizing commonly used operators, leveraging latecomer advantages. However, if a company has 100,000 GPUs, the cost of being a leader is high, while followers achieve better efficiency. The key decision lies in choosing between these two paths. Domestically, the next area for catching up could be multimodal AI, especially since overseas projects like GPT-5 have yet to be released.

03. Technical Details: SFT “No need for SFT in reasoning tasks anymore.”

  1. The most shocking revelation from DeepSeek is not its open-source nature or low cost, but the fact that SFT (Supervised Fine-Tuning) is no longer necessary—at least for reasoning tasks. However, for tasks beyond reasoning, SFT may still be needed. This raises an interesting discussion: Has DeepSeek introduced a new paradigm or architecture that significantly increases data utilization efficiency in training models? Could this lead to faster iterations in model performance?

  2. DeepSeek-R1 demonstrates that using SFT for distillation can be highly beneficial. While R1 didn’t completely abandon SFT, it used SFT only in the third step, followed by RLHF (Reinforcement Learning from Human Feedback) in the final alignment phase.

  3. R1 is essentially an SFT-trained model. What stands out is that the data was generated by a model trained with RLHF, indicating that overly complex methods aren’t necessary. As long as the method is robust, distillation with SFT can achieve great results.

  4. GRPO’s core idea is that the base model must be sufficiently intelligent. Each prompt was sampled 16 times, so that across multiple attempts there was a high probability of at least one correct answer. A good base model combined with a verification mechanism is the approach R1 suggests. This method works particularly well for tasks like math and coding because they are easy to verify, but similar processes could in principle be applied to other domains, ultimately leading to a general RL model.
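The group-based sampling described here is the heart of GRPO: each prompt’s group of completions serves as its own baseline, so no separate value model is needed. A simplified sketch of the advantage computation (group size and rewards are illustrative, not DeepSeek’s code):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each completion's reward by
    the mean and standard deviation of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all samples equal: no learning signal
    return [(r - mean) / std for r in rewards]

# 16 completions for one prompt, scored 1.0 if the answer verifies.
rewards = [1.0, 0.0, 0.0, 1.0] + [0.0] * 12
advantages = group_relative_advantages(rewards)  # correct samples get positive advantage
```

If no completion in the group is ever correct, every advantage is zero — which is exactly why a sufficiently strong base model matters.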

  5. R1-Zero revealed CoT (Chain of Thought) reasoning without relying on SFT. The CoT process becomes increasingly lengthy during training, and this emergent phenomenon is significant. SFT acts as an auxiliary mechanism: it accelerates the emergence of CoT but is not strictly necessary for producing it.

  6. This suggests that many smaller model developers can use SFT to distill larger models effectively, achieving impressive results. However, SFT wasn’t entirely discarded during the R1 process.

  7. A sufficiently large LLM generating infinite-length CoT can theoretically function as a Turing machine, capable of solving highly complex computational problems. CoT is essentially an intermediate result in the model’s search for optimized outputs. Through repeated sampling and refinement, the model can generate correct outputs and steer toward more reliable predictions. Fundamentally, CoT represents the computation required to achieve such outcomes—it’s an emergent property of the model functioning like a computational engine.

  8. Although DeepSeek’s paper didn’t explicitly discuss long context windows, it seems R1-preview and R1 showed substantial improvements in this area. It’s speculated that enhancements like Long2Short CoT were implemented. Additionally, CoT used in the third phase of SFT may have been refined or removed during final generation. The released version likely leveraged cleaner CoT data for SFT.

  9. SFT data falls into several categories:

  • Cold start data: This initializes the model with a strong strategy or foundation, allowing it to explore more effectively. In RL, one optimization goal is to align with the original strategy.
  • Generated data post-RL: After generating data through RL, it’s combined with other datasets and used for SFT on the base model. Each domain has its own data-processing pipeline, and the model’s capabilities stem from the base model. Distillation here is lossless, and combining data across domains can enhance generalization.
  10. It’s unclear how efficient R1’s data usage is. It’s speculated that OpenAI employs similar strategies to improve data efficiency, such as fine-tuning. In R1’s third phase, instead of using the RL-trained model as the base, data was generated and then used for SFT to create R1. The dataset includes 600K reasoning data and 200K non-reasoning data. The second-phase model may exhibit problem-solving abilities even in domains beyond its training examples, leading to the generation of reasoning data. The non-reasoning data, part of V3’s SFT dataset, likely helped the model infer CoT reasoning. With only 800K data points, this process seems remarkably efficient.
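The third-phase pipeline summarized above — generate data with the RL-trained model, filter it, then run SFT together with non-reasoning data — can be pictured roughly as follows. The function names and the acceptance check are assumptions for illustration, not DeepSeek’s pipeline:

```python
def build_sft_corpus(prompts, generate, accept, non_reasoning_data):
    """Sketch of R1's third phase: sample from the RL-trained model,
    keep only completions passing a quality/correctness check, then
    mix in non-reasoning SFT data (e.g. from V3's pipeline)."""
    reasoning = []
    for prompt in prompts:
        completion = generate(prompt)
        if accept(prompt, completion):  # e.g. the final answer verifies
            reasoning.append({"prompt": prompt, "completion": completion})
    return reasoning + list(non_reasoning_data)
```

At the scale reported above, such a process would yield roughly 600K reasoning and 200K non-reasoning samples.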

04. Technical Details: Data “DeepSeek places great emphasis on data labeling.”

  1. Scale.AI may not fail, as RL across various domains—especially math and coding—still requires expert annotation. While labeling is becoming more complex, the market demand persists.

  2. Multimodal data shows little impact in training, likely due to high costs. There’s currently no evidence of its utility, though it may have potential in the future.

  3. DeepSeek heavily focuses on data labeling, with reports that even founder Liang Wenfeng participates in labeling. In AI, data precision is as crucial as algorithms and techniques. For instance, Tesla’s annotation costs are 20 times higher than those of Chinese autonomous driving companies because Tesla insists on experienced, skilled annotators, achieving smoother results. DeepSeek mirrors this commitment to high-quality labeling, which underpins its model efficiency.

05. Technical Details: Distillation “The downside of distillation is reduced model diversity.”

  1. Relying on distillation to sidestep understanding technical pain points in model training may lead to issues when the next-generation technology emerges.

  2. Distilling large models into smaller ones (teacher to student) improves performance but may reduce capabilities when the teacher model lacks domain expertise. However, distilling smaller models often results in clear performance gains. For instance, R1’s distillation followed by RL showed significant improvement, even with mismatched data.
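In this discussion, “distillation” mostly means sequence-level distillation: fine-tune the student on completions sampled from the teacher, rather than matching logits. A minimal sketch (names and the sampling count are illustrative):

```python
def distill_dataset(teacher_generate, prompts, k=4):
    """Sequence-level distillation: collect k teacher completions per
    prompt; the student is then SFT-trained on these pairs, imitating
    the teacher's behavior. Repeatedly sampling one teacher is what
    narrows diversity relative to training from scratch."""
    pairs = []
    for prompt in prompts:
        for _ in range(k):
            pairs.append({"prompt": prompt, "completion": teacher_generate(prompt)})
    return pairs
```

The diversity loss discussed in the next point follows directly from this setup: every training target comes from the same teacher distribution.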

  3. The drawback of distillation is the loss of diversity, which limits a model’s potential. While it’s a viable short-term approach, it cannot surpass the strongest models.

  4. Distillation introduces risks, such as models relying on pre-learned answers rather than genuine reasoning. Early RL often involves subtle “hacks,” where models appear to think but merely retrieve memorized data. If distillation occurs without proper labeling, subsequent RL with verifiable rewards (RLVR) may favor shortcuts over genuine problem-solving.

  5. Shortcut approaches that replicate existing solutions may face hidden pitfalls. For example, without transformative advances in long-context handling, problem-solving potential may remain capped. Starting fresh with approaches like R1-zero might avoid these limitations.

  6. Distillation can yield good results for other models, establishing a teacher-student ecosystem in model development. Being an effective “student” could itself become a viable business model.

  7. R1’s impact on technical innovation is less groundbreaking than AlphaGo’s, but its commercial reach far surpasses AlphaGo’s.

  8. Over-reliance on distillation without establishing unique systems or verifiable rewards risks dependency and stagnation. General-purpose domains are challenging to distill due to reward validation issues and CoT artifacts. OpenAI-distilled models may retain training artifacts, while R1-zero’s RL success highlights the importance of a refined base model capable of reflection.

  9. It’s unlikely that a model trained purely on unrefined internet data could exhibit advanced behavior, as high-quality data on the internet is rare.

  10. Only a few top labs are exploring optimal annealing stages and data ratios. Distillation is merely one aspect of RL algorithms; SFT, as behavioral imitation, has limitations in diversity and potential when used alone.

  11. DeepSeek excites startups in the AI field. Its potential for continued iteration and small, distilled versions usable on mobile devices could greatly enhance AI flexibility and raise application ceilings.

  12. Distillation alone cannot surpass OpenAI, as OpenAI does not rely on data distillation. A different approach is necessary to outperform it.

  13. Future models may need to learn “jump-step” reasoning, improving performance limits within fixed context lengths.

06. Technical Details: Process Reward “The ceiling of process supervision is human; result supervision defines the model’s ceiling.”

  1. Process rewards aren’t inherently flawed but can be prone to reward hacking—where the model inflates rewards without truly learning. For example, in solving math problems, a model might generate 1,000 outputs without any approaching the correct solution. In such cases, methods like RLVR cannot effectively train the model. However, a reasonably reliable process reward can guide the model in the right direction, especially for complex problems where process rewards can still be helpful. Factors like the problem’s difficulty and the reliability of process rewards determine their effectiveness.
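A verifiable reward of the kind RLVR relies on can be as simple as an exact match on the extracted final answer. A toy sketch — real math verifiers normalize expressions before comparing:

```python
import re

def math_reward(completion: str, gold: str) -> float:
    """Rule-based outcome reward: 1.0 if the last number in the
    completion matches the gold answer, else 0.0. If none of the
    sampled completions ever score 1.0 (the 1,000-outputs case
    above), RL receives no training signal at all."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold else 0.0
```

Because the reward is all-or-nothing, a process reward that partially credits promising intermediate steps can help on problems the model almost never solves outright.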

  2. In Process Reward Modeling (PRM), process scores can be hacked if they deviate from reality. While process supervision is theoretically viable, its effectiveness depends on the reliability of process metrics and how rewards are assigned. Current methods for result supervision often match extracted answers but lack robust, hack-proof scoring mechanisms. Models iterating on their own are particularly vulnerable to hacking. Annotating processes isn’t inherently difficult—enumerating them is possible—but remains largely unexplored. This could be a promising direction.

  3. The ceiling of process supervision is human reasoning, which is often limited. In contrast, result supervision defines the true potential of a model.

  4. AlphaZero’s success stems from its ability to evaluate win/loss states at the end of a game, with rewards calculable based on win rates. In contrast, LLMs lack a clear mechanism to determine if continuous generation will yield a correct answer. This is akin to genetic algorithms, where the potential is higher, but reward hacking remains a risk.

  5. A key advantage of both AlphaGo and AlphaZero was Go’s fixed rules. Models start with math and coding because these domains are easy to verify. The quality of the verification method strongly affects the final RL outcome: without well-defined rules, a model may technically satisfy the rules while producing undesired outputs, exploiting loopholes in the process.

07. Why Haven’t Other Companies Adopted DeepSeek’s Approach? “Big companies must remain cautious with their models.”

  1. OpenAI and Anthropic haven’t pursued DeepSeek’s approach due to differences in strategic focus. They may believe that allocating compute resources to other areas offers more value.

  2. Unlike larger companies, DeepSeek concentrates solely on language rather than multimodal capabilities, allowing for focused advancements. While big companies possess equally strong models, they tend to maintain a lower profile, avoiding over-publicizing developments. Currently, multimodality is less critical, as intelligence primarily stems from language, which plays a greater role in enhancing overall intelligence.

08. Technical Divergence and Bets in 2025 “Can we find alternatives to Transformers?”

  1. Models will diverge in 2025, with new paths to push the boundaries of intelligence, such as synthetic data and alternative architectures.

  2. In 2025, attention will focus on new architectures beyond Transformers. These could reduce costs while exploring the limits of intelligence. Additionally, RL’s full potential remains untapped, and there’s rising interest in agents, though large-scale applications are still lacking.

  3. Multimodal systems might emerge to challenge the dominance of ChatGPT-like models in 2025.

  4. The low-cost, high-efficiency results from R1 and V3 demonstrate a viable path forward. This approach does not conflict with scaling hardware and model parameters, but domestic constraints in China favor the former.

  5. DeepSeek aligns with scaling laws but also explores smaller, distilled models. This benefits the development of increasingly large, closed models. The absence of “anti-scaling” metrics in AI today suggests scaling remains dominant. Open-source advancements can replicate closed models, potentially reducing costs while benefiting closed systems as well.

  6. Meta is reportedly working to replicate DeepSeek but hasn’t yet impacted its infrastructure or long-term roadmap. Cost efficiency and boundary exploration remain critical to unlocking new possibilities.

09. Developer Migration to DeepSeek? “Not yet.”

  1. Developers haven’t migrated en masse to DeepSeek, as leading closed models still excel in coding and instruction adherence. However, this advantage may diminish over time.

  2. Models like Claude-3.5-Sonnet, which specialize in tool use, are well-suited for agent tasks. DeepSeek doesn’t yet offer this capability but presents significant potential.

  3. For users, DeepSeek V2 already meets most needs. While R1 improves speed, it hasn’t added substantial value, and deeper reasoning capabilities sometimes lead to errors in previously correct answers.

  4. Application users often simplify problems through engineering approaches. 2025 may be an “application year,” where industries leverage existing capabilities, possibly hitting a bottleneck since many use cases don’t require highly intelligent models.

  5. RL currently solves problems with clear answers but hasn’t surpassed AlphaZero’s achievements. Distillation is effective for tasks with clear solutions, which is why RL and distillation methods have advanced so rapidly.

  6. Humanity’s demand for intelligence is vastly underestimated. Challenges like cancer and advanced materials for SpaceX remain unresolved, suggesting tremendous growth potential. The pursuit of intelligence must continue.

10. OpenAI Stargate 500B and Compute Demands “DeepSeek raises questions about OpenAI’s 500B narrative.”

  1. DeepSeek has prompted skepticism about NVIDIA and OpenAI’s 500B plans, which seem like a “lifeline” for OpenAI.

  2. There’s uncertainty around OpenAI’s 500B infrastructure investments, as commercial companies taking on debt could face risks.

  3. The 500B goal is ambitious and may take 4-5 years to realize. SoftBank (funding) and OpenAI (technology) are key players, but both face financial limitations, making full execution challenging.

  4. While the 500B compute plan has merits, early exploration stages incur high costs due to trial-and-error. Later-stage players benefit from following successful outcomes without bearing the exploration costs.

  5. Anthropic might transition all inference operations to TPU or AWS chips in the future.

  6. Chinese companies, once limited by compute resources, now prove that efficient models can yield significant results. Custom chips on platforms like AMD or ASIC could further enhance cost efficiency, offering investment opportunities despite NVIDIA’s dominance.

  7. DeepSeek’s success doesn’t directly threaten NVIDIA but highlights China’s efficiency and potential. NVIDIA’s strength lies in its ecosystem, which takes time to build. The real challenge for NVIDIA will arise if AI technology matures into a standardized industry with specialized ASIC chips optimized for specific tasks.

11. Impact on Secondary Markets “Short-term pressure, long-term potential.”

  1. DeepSeek has shaken the U.S. AI community, creating short-term impacts on stock prices. Slowing pretraining demand and lagging post-training and inference scaling have created gaps in corporate narratives, affecting trading sentiment.

  2. DeepSeek’s focus on FP8 contrasts with the U.S.’s FP16. Its efficiency in resource utilization was widely discussed in North America, influencing capital expenditure expectations for Meta but causing declines in NVIDIA and TSMC stock, while Broadcom saw gains.
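The FP8-versus-FP16 point is ultimately about bytes per parameter: FP8 halves weight storage and the bandwidth needed to move it. A back-of-envelope sketch (using DeepSeek-V3’s published total of 671B parameters; optimizer state and activations are ignored):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Raw weight storage only; optimizer state and activations excluded."""
    return n_params * bytes_per_param / 1e9

N = 671e9                          # DeepSeek-V3 total parameter count
fp16_gb = weight_memory_gb(N, 2)   # 2 bytes per parameter
fp8_gb = weight_memory_gb(N, 1)    # 1 byte per parameter
print(round(fp16_gb), round(fp8_gb))  # prints 1342 671
```

Halved memory traffic per parameter is part of why FP8 training translates into the resource efficiency discussed in North America.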

  3. DeepSeek creates short-term pressure on stocks, particularly for compute-related companies, but the long-term AI narrative remains strong.

  4. Concerns over NVIDIA’s transition from H cards to B cards, combined with DeepSeek’s impact, could weigh on stock prices in the short term but also present long-term opportunities.

  5. DeepSeek’s low-cost training underscores short-term sentiments, such as NVIDIA’s stock reaction. However, AI remains a growing market, with significant long-term potential if CUDA remains the preferred ecosystem.

12. Open Source vs. Closed Source “If capabilities converge, closed models face challenges.”

  1. DeepSeek highlights the debate between open and closed AI development.

  2. Leading companies might hide their best models, but DeepSeek’s release may force others to follow suit.

  3. DeepSeek’s cost optimizations haven’t yet impacted major companies like Amazon, which continue with existing plans. Open- and closed-source models coexist, with universities and smaller labs favoring DeepSeek due to cost.

  4. Open-source models can pressure the margins of closed models. If open-source capabilities approach 95% of closed models, the latter’s high costs may no longer be justifiable.

13. DeepSeek’s Broader Impact “Vision matters more than technology.”

  1. DeepSeek showcases China’s AI strength, narrowing a perceived two-year gap with the U.S. to just 3-9 months, with superiority in some areas.

  2. Historically, China has excelled when overcoming U.S. technological restrictions. DeepSeek’s success continues this trend.

  3. DeepSeek didn’t arise overnight. The R1 results resonated with core U.S. decision-makers.

  4. While DeepSeek leverages existing advances, frontier exploration still requires significant time and cost. R1 doesn’t guarantee reduced training costs for future models.

  5. As a latecomer, China can capitalize on engineering efficiencies. Its ability to achieve results with fewer resources could reshape the U.S.-China AI dynamic.

  6. Current reasoning methods stem from OpenAI’s o1, but the next breakthroughs in reasoning will define future AI lab competitiveness.

  7. The core difference between AI labs lies in their visions for the future, not just their technologies.

  8. Ultimately, vision outweighs technology.