Emergence, compositional generalization
Deep Learning presents a challenge to classical statistical learning theory. Neural networks often achieve zero training error, yet they generalize well to unseen data. This contradicts traditional expectations and makes many classical generalization bounds ineffective.
Sparse activation and the Superposition Hypothesis have been proposed as explanations for the Grokking phenomenon, in which a model first overfits its training data and then, after continued training, suddenly generalizes as it learns to activate sparsely.
Observation
- compute cost is decreasing exponentially
- a low-level substrate (the Transformer) serving a high-level incentive structure (intelligence)
- unlike humans, machines operate on a different time budget
Intuition
- some abilities emerge only with scale
- for an emergent ability, "this idea doesn't work" should be read as "this idea doesn't work yet"
- such scalability and generalization require time and compute
Approach
- make them learn how we think
- matmul + sequence length + dimension (see the sketch below)
- for superintelligence, we don't necessarily need to follow human methods (e.g., driving the loss to 0)
- learning the objective function and reasoning from induced incentives
- learning something general across millions of real-world tasks
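A minimal sketch of what the matmul bullet refers to: the core of a Transformer layer is a handful of matrix multiplications whose cost grows with sequence length and model dimension. All shapes and names here are illustrative, not taken from any particular model.

```python
import numpy as np

# Illustrative shapes (not from any specific model).
L, d = 1024, 4096                  # sequence length, model dimension

x = np.random.randn(L, d)          # token activations
W_qkv = np.random.randn(d, 3 * d)  # fused query/key/value projection
W_out = np.random.randn(d, d)      # attention output projection

# The bulk of a Transformer layer is matmuls like these:
qkv = x @ W_qkv                    # (L, d) @ (d, 3d) -> O(L * d^2) FLOPs
q, k, v = np.split(qkv, 3, axis=-1)
scores = q @ k.T / np.sqrt(d)      # (L, d) @ (d, L)  -> O(L^2 * d) FLOPs
attn = scores @ v                  # softmax omitted for brevity
out = attn @ W_out                 # (L, d) @ (d, d)  -> O(L * d^2) FLOPs

# Cost grows with both length (L) and dimension (d); scaling up
# mostly means running these matmuls efficiently across many machines.
```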

Emergent abilities do not follow the scaling law and do not appear in small models, which is consistent with the view that they arise from induction head formation.
Emergent Abilities of Large Language Models
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon...
https://arxiv.org/abs/2206.07682

OpenAI perspective
Large Language Models (in 2023)
I gave a talk at Seoul National University.
I titled the talk “Large Language Models (in 2023)”. This was an ambitious attempt to summarize our exploding field.
Trying to summarize the field forced me to think about what really matters in the field. While scaling undeniably stands out, its far-reaching implications are more nuanced. I share my thoughts on scaling from three angles:
1:02 1) Change in perspective is necessary because some abilities only emerge at a certain scale. Even if some abilities don’t work with the current generation LLMs, we should not claim that it doesn’t work. Rather, we should think it doesn’t work yet. Once larger models are available many conclusions change.
This also means that some conclusions from the past are invalidated and we need to constantly unlearn intuitions built on top of such ideas.
7:12 2) From first-principles, scaling up the Transformer amounts to efficiently doing matrix multiplications with many, many machines. I see many researchers in the field of LLM who are not familiar with how scaling is actually done. This section is targeted for technical audiences who want to understand what it means to train large models.
27:52 3) I talk about what we should think about for further scaling (think 10000x GPT-4 scale). To me scaling isn’t just doing the same thing with more machines. It entails finding the inductive bias that is the bottleneck in further scaling.
I believe that the maximum likelihood objective function is the bottleneck in achieving the scale of 10000x GPT-4 level. Learning the objective function with an expressive neural net is the next paradigm that is a lot more scalable. With the compute cost going down exponentially, scalable methods eventually win. Don’t compete with that.
In all of these sections, I strive to describe everything from first-principles. In an extremely fast moving field like LLM, no one can keep up. I believe that understanding the core ideas by deriving from first-principles is the only scalable approach.
Disclaimer: I give my personal opinions and the talk material doesn't reflect my employer's opinion in any way.
https://www.youtube.com/watch?v=dbo3kNKPaUA
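For point 3 above, the "maximum likelihood objective" of a language model is just next-token cross-entropy. A minimal PyTorch-style sketch, with illustrative shapes and random tensors standing in for a real model:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 8 sequences, 128 tokens, 50k-token vocabulary.
B, T, V = 8, 128, 50_000
logits = torch.randn(B, T, V, requires_grad=True)  # stand-in for real LM outputs
tokens = torch.randint(0, V, (B, T))               # training token ids

# Maximum likelihood = minimize negative log-likelihood of the next token.
# Predictions at position t are scored against the token at position t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),  # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),      # targets: the next tokens 1..T-1
)
loss.backward()
# This hand-specified objective is what the talk argues becomes the bottleneck
# at ~10000x GPT-4 scale, compared with learning the objective itself.
```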

[1hr Talk] Intro to Large Language Models
This is a 1 hour general-audience introduction to Large Language Models: the core technical component behind systems like ChatGPT, Claude, and Bard. What they are, where they are headed, comparisons and analogies to present-day operating systems, and some of the security-related challenges of this new computing paradigm.
As of November 2023 (this field moves fast!).
Context: This video is based on the slides of a talk I gave recently at the AI Security Summit. The talk was not recorded but a lot of people came to me after and told me they liked it. Seeing as I had already put in one long weekend of work to make the slides, I decided to just tune them a bit, record this round 2 of the talk and upload it here on YouTube. Pardon the random background, that's my hotel room during the thanksgiving break.
- Slides as PDF: https://drive.google.com/file/d/1pxx_ZI7O-Nwl7ZLNk5hI3WzAsTLwvNU7/view?usp=share_link (42MB)
- Slides as Keynote: https://drive.google.com/file/d/1FPUpFMiCkMRKPFjhi9MAhby68MHVqe8u/view?usp=share_link (140MB)
Few things I wish I said (I'll add items here as they come up):
- The dreams and hallucinations do not get fixed with finetuning. Finetuning just "directs" the dreams into "helpful assistant dreams". Always be careful with what LLMs tell you, especially if they are telling you something from memory alone. That said, similar to a human, if the LLM used browsing or retrieval and the answer made its way into the "working memory" of its context window, you can trust the LLM a bit more to process that information into the final answer. But TLDR right now, do not trust what LLMs say or do. For example, in the tools section, I'd always recommend double-checking the math/code the LLM did.
- How does the LLM use a tool like the browser? It emits special words, e.g. |BROWSER|. When the code "above" that is inferencing the LLM detects these words it captures the output that follows, sends it off to a tool, comes back with the result and continues the generation. How does the LLM know to emit these special words? Finetuning datasets teach it how and when to browse, by example. And/or the instructions for tool use can also be automatically placed in the context window (in the “system message”).
- You might also enjoy my 2015 blog post "Unreasonable Effectiveness of Recurrent Neural Networks". The way we obtain base models today is pretty much identical on a high level, except the RNN is swapped for a Transformer. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- What is in the run.c file? A bit more full-featured 1000-line version here: https://github.com/karpathy/llama2.c/blob/master/run.c
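A rough sketch of the tool-use loop described in the |BROWSER| bullet above: the serving code watches the generation for a special token, runs the tool, splices the result back into the context window, and continues. The token name and the `llm_generate` / `run_browser` helpers are hypothetical.

```python
def tool_loop(llm_generate, run_browser, prompt: str, max_rounds: int = 5) -> str:
    """Hypothetical serving-side loop: detect a special tool token,
    run the tool, append the result, and continue generation."""
    context = prompt
    for _ in range(max_rounds):
        # Ask the model to continue; it may emit a tool call like
        # "|BROWSER| <query>" that it learned from finetuning examples.
        completion = llm_generate(context)
        if "|BROWSER|" not in completion:
            return completion  # no tool needed; this is the final answer

        # Capture the query that follows the special token.
        before, query = completion.split("|BROWSER|", 1)
        result = run_browser(query.strip())

        # Put the tool output back into the context window ("working memory")
        # and let the model keep generating from there.
        context += before + f"\n[browser result]\n{result}\n"
    return context
```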
Chapters:
Part 1: LLMs
00:00:00 Intro: Large Language Model (LLM) talk
00:00:20 LLM Inference
00:04:17 LLM Training
00:08:58 LLM dreams
00:11:22 How do they work?
00:14:14 Finetuning into an Assistant
00:17:52 Summary so far
00:21:05 Appendix: Comparisons, Labeling docs, RLHF, Synthetic data, Leaderboard
Part 2: Future of LLMs
00:25:43 LLM Scaling Laws
00:27:43 Tool Use (Browser, Calculator, Interpreter, DALL-E)
00:33:32 Multimodality (Vision, Audio)
00:35:00 Thinking, System 1/2
00:38:02 Self-improvement, LLM AlphaGo
00:40:45 LLM Customization, GPTs store
00:42:15 LLM OS
Part 3: LLM Security
00:45:43 LLM Security Intro
00:46:14 Jailbreaks
00:51:30 Prompt Injection
00:56:23 Data poisoning
00:58:37 LLM Security conclusions
End
00:59:23 Outro
https://www.youtube.com/watch?v=zjkBMFhNj_g
Adversarial opinion
Nonlinear or discontinuous evaluation metrics can produce exponential-looking jumps, so smooth improvements get misread as emergent abilities (see the toy example below).
Are Emergent Abilities of Large Language Models a Mirage?
Recent work claims that large language models display \textit{emergent abilities}, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent...
https://openreview.net/forum?id=ITw9edRDlD
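A toy numerical illustration of the mirage argument: if per-token accuracy improves smoothly with scale, a discontinuous metric such as exact match over a k-token answer (roughly accuracy^k) still looks like a sudden jump. The numbers are made up purely to show the shape.

```python
# Toy illustration: smooth per-token accuracy vs. "emergent"-looking exact match.
per_token_acc = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]  # smooth improvement with scale
k = 20  # answer length in tokens

for p in per_token_acc:
    exact_match = p ** k  # all k tokens must be right
    print(f"per-token {p:.2f} -> exact-match {exact_match:.4f}")

# Exact match stays near zero (0.5^20 ≈ 1e-6, 0.9^20 ≈ 0.12) and then shoots
# up (0.99^20 ≈ 0.82), even though the underlying capability improved smoothly.
# Nonlinear metrics can manufacture apparent "emergence".
```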

Emergent Abilities of Large Language Models
Emergence can be defined as the sudden appearance of novel behavior. Large Language Models apparently display emergence by suddenly gaining new abilities as they grow. Why does this happen, and what does this mean?
https://www.assemblyai.com/blog/emergent-abilities-of-large-language-models/

Transformers can extrapolate and outperform their training data without RL
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure...
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can...
https://arxiv.org/abs/2406.14546

Transcendence: Generative Models Can Outperform The Experts That Train Them
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives.
In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. (To play with our models, code, and data, please see our website at https://transcendence.eddie.win.) We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
https://arxiv.org/html/2406.11741v1
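A toy sketch of the low-temperature mechanism the paper describes: imitation learning fits the mixture of several imperfect experts, and sampling that mixture at low temperature concentrates probability on the consensus (correct) move, so the resulting policy can beat every individual expert. The expert distributions below are invented for illustration.

```python
import numpy as np

# Three imperfect "experts" over 4 candidate moves; move 0 is objectively best.
# Each expert puts only 0.40 on the best move but errs in a *different* way,
# so each expert's own argmax is a wrong move.
experts = np.array([
    [0.40, 0.45, 0.10, 0.05],
    [0.40, 0.10, 0.45, 0.05],
    [0.40, 0.05, 0.10, 0.45],
])
mixture = experts.mean(axis=0)  # what imitation learning fits

def sample_probs(p, temperature):
    """Temperature-adjusted distribution: p^(1/T), renormalized."""
    q = p ** (1.0 / temperature)
    return q / q.sum()

print(sample_probs(mixture, temperature=1.0))  # best move chosen ~40% of the time
print(sample_probs(mixture, temperature=0.1))  # mass concentrates on the best move
# Low temperature turns the mixture into a near-argmax policy that picks the
# consensus move more reliably than any single expert did.
```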
For non-experts:
2024's Biggest Breakthroughs in Computer Science
The year's biggest breakthroughs in computer science included a new understanding of what’s going on in large language models (LLMs) and a breakthrough in computing Hamiltonians — models that represent complex quantum systems. Read more at Quanta Magazine: https://www.quantamagazine.org/the-year-in-computer-science-20241219/?swcfpc=1
0:04 - Can Large Language Models Understand?
Are chatbots "stochastic parrots"? A new evaluation called Skill Mix suggests that the biggest large language models seem to learn enough skills to understand the words they’re processing.
Read more: https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/
6:14 - Hamiltonian Learning Algorithm
After years of false starts, a team of computer scientists has found a way to efficiently deduce the Hamiltonian of a quantum system at any constant temperature.
Read more: https://www.quantamagazine.org/scientists-find-a-fast-way-to-describe-quantum-systems-20240501/
https://youtu.be/fTMMsreAqX0?si=Oyv23Cq1BJeILtBP

Utility Engineering
Value systems with a high degree of structural coherence emerge in AI preferences as models scale.

Seonglae Cho
