Shuchao Bi – "Advancing the Frontier of Silicon Intelligence: the Past, Open Problems, and the Future" [2025]
Notes from Shuchao's talk at Columbia University (6/21/2025):
YouTube link: Advancing the Frontier of Silicon Intelligence: the Past, Open Problems, and the Future
Personal thoughts: It seems like the conclusion drawn from the talk aligns with my personal take on several things:
can RL learn new stuff? yes. (speculation at this point, but I’m on the optimistic side of things)
next scaling might be self-play: we need to show that models can take a rule book (human instructions) and generate the training data that is best for them to learn from.
can models reason well now? yes, for verifiable tasks, because those tasks represent environments where we can easily give models scalable and meaningful signals to learn from.
New things I’ve learned are definitely:
how to interpret Sutton’s ‘era of experience’: turning compute into data.
seems like the key idea for the next paradigm breakthrough is how to learn at an abstract level:
humans don’t learn via next-token prediction (NTP); we learn at more abstract levels.
how to assign FLOPs per intelligence instead of FLOPs per token.
Part 1: Past
Self-supervised learning
A lot of the statistical intuition I learned in graduate school is wrong. The reason is that intuition built in low-dimensional spaces does not generalize to high-dimensional spaces. –– Shuchao
Previous FUDs (Fear, uncertainty, and doubt) about deep learning:
Non-convex optimization will trap neural networks in bad local minima. → wrong: bad local optima are rare in high-dimensional spaces, and even then, many local minima are flat basins rather than needle points, i.e., it is easy to escape because there are so many degrees of freedom.
Overparameterization leads to inevitable overfitting → wrong: deep learning models learn the patterns before they learn the noise, because SGD fits the directions of the feature space with the largest eigenvalues first.
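A minimal sketch of that spectral-bias claim (my own illustration, not from the talk), using plain gradient descent on linear least squares; the same qualitative bias is what the SGD argument appeals to:

```python
# Gradient descent on linear least squares: the error along each eigenvector of
# X^T X shrinks like (1 - lr * eigenvalue)^t, so the dominant ("pattern")
# direction is fit long before the tiny ("noise-like") direction.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d)) * np.array([10.0, 0.1])  # feature 0: large variance; feature 1: tiny
w_true = np.array([1.0, 1.0])
y = X @ w_true

w = np.zeros(d)
lr = 1e-3
for t in range(2001):
    grad = X.T @ (X @ w - y) / n
    w -= lr * grad
    if t in (100, 500, 2000):
        print(f"step {t}: |w - w_true| per direction = {np.abs(w - w_true)}")
# Expected: the high-variance direction converges within ~100 steps, while the
# low-variance direction has barely moved even after 2000 steps.
```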
Scaling laws: scaling probably works because large-scale data reflects the structure of the underlying distribution; the model learns from the internal complexity of the data, and you need more compute to detect patterns that are rarer in the data.
The analogy Shuchao gave: an average doctor can treat common diseases, but a highly intelligent one can diagnose rare diseases (🤔 not quite true, I think: highly intelligent doctors can also solve common diseases like cancer; what sets them apart is that they are more capable of drawing insightful connections between seemingly disconnected things).
My own thought: assuming the model learns the pattern before the noise, training on scaled data lets the model capture the manifold that generates the data, and you need more capacity to figure out the patterns behind rare occurrences.
Things to chew on though: perplexity improves smoothly, but because of the power-law distribution of data, capabilities emerge discontinuously: “when you have 10x compute, you suddenly can understand calculus and do calculus.”
Compression via Prediction → Understanding and Intelligence: compression must entail understanding of how the world works if the compressed model can predict observations / next tokens. E.g., physics has a small set of compressed rules that explain a huge range of observations. Information theory says that intelligence = reducing surprise by modeling patterns in data. (A toy sketch follows below.)
Intelligence arises from data and interactions rather than from the brain.
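A toy, self-contained illustration of the compression-via-prediction point above (my own, not from the talk): a predictor that models more of the text's structure assigns it fewer bits, i.e., compresses it better.

```python
# Cross-entropy in bits/char = average code length needed to encode the text
# under a predictor; better prediction of the next character = better compression.
import math
from collections import Counter

text = "the cat sat on the mat. the cat sat on the mat."
pairs = list(zip(text, text[1:]))

def bits_per_char(prob):
    """Average -log2 probability the predictor assigns to each next character."""
    return -sum(math.log2(prob(prev, ch)) for prev, ch in pairs) / len(pairs)

def uniform(prev, ch):
    # Knows nothing about structure: every character is equally surprising.
    return 1.0 / len(set(text))

pair_counts = Counter(pairs)
prev_counts = Counter(text[:-1])

def bigram(prev, ch):
    # Models local structure, so repeated phrases become cheap to encode.
    return pair_counts[(prev, ch)] / prev_counts[prev]

print(f"uniform predictor: {bits_per_char(uniform):.2f} bits/char")
print(f"bigram predictor:  {bits_per_char(bigram):.2f} bits/char")
```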
Deep reinforcement learning
Pretrain + low-compute RL: train models to follow instructions from human feedback (InstructGPT, ChatGPT) → generalizability mainly comes from the prior of the pretrained LLM (next-token prediction with minimal inductive biases).
Pretrain + high-compute RL: apply more search (i.e., exploring and backtracking).
current approaches only work well with verifiable rewards.
era of experience: convert compute into experience/data.
Same FLOPs per intelligence: chain-of-thought and test-time scaling, where we spend more compute on harder problems.
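A hedged sketch of that "FLOPs per intelligence" idea: spend more test-time compute (more sampled chains of thought plus a majority vote) on problems estimated to be harder. `generate_answer` and `estimate_difficulty` are hypothetical placeholders of my own, not any API from the talk.

```python
# Adaptive test-time compute: easy questions get one sample, hard questions get
# up to 16 samples plus a self-consistency majority vote over final answers.
import random
from collections import Counter

def generate_answer(question: str) -> str:
    # Placeholder for one sampled chain of thought ending in a final answer.
    return random.choice(["A", "B"])

def estimate_difficulty(question: str) -> float:
    # Placeholder for a cheap difficulty/uncertainty estimate in [0, 1].
    return min(1.0, len(question) / 200)

def answer_with_adaptive_compute(question: str) -> str:
    n_samples = 1 + int(15 * estimate_difficulty(question))  # 1 to 16 rollouts
    votes = Counter(generate_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(answer_with_adaptive_compute("What is 2 + 2?"))
```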
Part 2: Open Problems
Scaling laws never failed, but data did.
“transformer-based next-token prediction perplexity scales down log-linearly with compute and parameter count.” (?)
the task distribution should be well covered by the pretraining data, and that’s not the case right now. Also, not all tokens in the training data deserve the same amount of FLOPs. → ”Direction: equalize intelligence per token when we train models”
Search in RL: we cannot convert compute to data yet because:
limited to domains where there are verifiable rewards (a minimal verifier sketch follows this list).
cannot effectively generate rollouts outside the support of the pre-trained policy.
unlike MCTS in AlphaGo, random exploration is unlikely to generate any incremental intelligence.
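A minimal sketch of why verifiable rewards are the easy case (my own illustration; `sample_solution` is a hypothetical stand-in for an LLM policy): the environment can grade every rollout automatically, so compute turns into training signal without human labels.

```python
# Verifiable reward: a GSM8K-style checker that scores a rollout by comparing
# its final "#### <number>" answer against known ground truth.
import re

def verifiable_reward(problem: dict, rollout: str) -> float:
    match = re.search(r"####\s*(-?\d+)", rollout)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == problem["answer"] else 0.0

def sample_solution(problem: dict) -> str:
    # Hypothetical placeholder for one sampled chain of thought from a policy.
    return "2 + 2 = 4\n#### 4"

problem = {"question": "What is 2 + 2?", "answer": 4}
rollouts = [sample_solution(problem) for _ in range(8)]
rewards = [verifiable_reward(problem, r) for r in rollouts]
print(rewards)
# High-reward rollouts can then be reinforced (PPO/GRPO-style) or kept as new
# training data; nothing here requires a human in the loop.
```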
Problem 1: how to generate more, better data faster
Human data: task → learn knowledge → new ideas → get feedback → distilled into findings. (A toy rendering of this loop follows the list below.)
Task: learn to propose tasks instead of just solving them.
Learn: weight updates or in-context learning (ICL).
New ideas (exploration): generate new ideas outside the support of the pretrained policy. Or is interpolation enough?
Encouraging examples: AlphaEvolve,
Interacting with the environment (feedback): you almost need a perfect simulator / world model. But once enough data is collected to build such a simulator, this positive flywheel can lead to superintelligence. “How to enhance LLM RL exploration to enable endlessly self-improving models?”
Distillation: models can already do this very efficiently.
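A toy, runnable rendering of the loop above (task → learn → new ideas → feedback → distill), with every name my own placeholder: the "model" is just a growing set of verified facts and the "environment" is a trivial primality checker, but the shape of the flywheel (compute in, verified data out) is the point.

```python
# Self-improvement flywheel on a toy task: propose a task, explore candidate
# answers, keep only what the environment verifies, and fold it back in.
import random

def propose_task(step):
    # Learn to propose tasks, not just solve them; here: "find a prime above n".
    return {"kind": "find_prime_above", "n": 10 * (step + 1)}

def generate_ideas(task, n_ideas=20):
    # Exploration: cheap candidate answers sampled around the task.
    return [task["n"] + random.randint(1, 50) for _ in range(n_ideas)]

def verify(candidate):
    # The environment's feedback is exact here (a "perfect simulator" for this toy).
    return candidate > 1 and all(candidate % d for d in range(2, int(candidate ** 0.5) + 1))

def flywheel(steps=5):
    knowledge = set()  # stands in for model weights / memory
    for step in range(steps):
        task = propose_task(step)
        ideas = generate_ideas(task)
        findings = [idea for idea in ideas if verify(idea)]  # keep only verified ideas
        knowledge.update(findings)                           # "distill" back into the model
    return knowledge

print(sorted(flywheel()))
```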
Problem 2: how to improve data efficiency through new objectives
humans do not learn by predicting word tokens – we learn through higher-level abstractions.
human-defined rewards in RL are less general and more prone to “reward hacking”. How to make RL reasoning even more generalizable, i.e., robust zero-shot performance on OOD tasks? E.g., an LLM given the rules of chess should be able to practice and reach AlphaZero-like performance.
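A toy, runnable stand-in for that claim (my own, far simpler than chess or an LLM): tabular Monte-Carlo self-play on the "race to 21" game, where the rule book alone provides a verifiable outcome to practice against.

```python
# "Race to 21": players alternately add 1, 2, or 3; whoever reaches 21 wins.
# Self-play learning: play against yourself, and let the verifiable outcome
# defined by the rules (last mover wins) supply the only reward.
import random
from collections import defaultdict

TARGET, ACTIONS, EPS = 21, (1, 2, 3), 0.2
Q = defaultdict(float)   # Q[(total, action)]: estimated return for the player to move
N = defaultdict(int)     # visit counts for incremental-mean updates

def choose(total, greedy=False):
    legal = [a for a in ACTIONS if total + a <= TARGET]
    if not greedy and random.random() < EPS:
        return random.choice(legal)               # exploration during practice
    return max(legal, key=lambda a: Q[(total, a)])

for _ in range(50_000):                           # self-play practice games
    total, player, history = 0, 0, []
    while total < TARGET:
        action = choose(total)
        history.append((player, total, action))
        total += action
        player ^= 1
    winner = history[-1][0]                       # rules give a verifiable outcome
    for p, s, a in history:                       # Monte-Carlo update toward the outcome
        ret = 1.0 if p == winner else -1.0
        N[(s, a)] += 1
        Q[(s, a)] += (ret - Q[(s, a)]) / N[(s, a)]

# With enough practice, the greedy policy usually recovers the known optimal
# strategy: from 14-16 move the total to 17, and from 18-20 move straight to 21.
print({t: choose(t, greedy=True) for t in (14, 15, 16, 18, 19, 20)})
```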
Problem 3: how to scale up more
# layers → # experts → test-time compute → RL → ???
scaling tools
scaling self-play (Absolute Zero paper)
scaling context and memory
scaling lifelong learning
Problem 4: Safety
generate unsafe content for users. (traditional trust and safety work: sycophancy, promoting violence)
bad actors leveraging models for harm. (jailbreaking)
misalignment: the model itself becomes “bad”. A misaligned model could damage human society, intentionally or unintentionally, while pursuing a goal.
Part 3: Future (superintelligence)
Hypothesis: generalizable human prior + unbounded compute in RL + environment → superintelligence.
agents: numerous agents operating in real environments + tool use + generating economic value.
how does it arise? reasoning → generalization → agentic capabilities.
what to focus on: core capabilities, evaluation, environments, enabling arbitrary long-horizon tasks with near-infinite memory/context, multimodality
“this is just an execution problem. we already know how”
AI for science: this is essentially a search problem, and AI can greatly reduce the scientific search space. And yep, I also share the vision of a future where AI replaces scientists and runs the experiments itself.
AI for Enterprise R&D:
AI for Education: lowering the barrier (personal tutoring) + raising the ceiling (articulating things much more efficiently than most humans).
AI for medicine: memory – a holistic context of the patient’s history.
Embodied AI: we don’t have a lot of data.
He cited the paper “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” and disagreed with it.