Natural Abstraction Hypothesis

NAH

The hypothesis holds that the efficient abstractions learned by AI systems reflect inherent structure in the environment itself, rather than quirks of a particular learner.

Abstractability - The physical world can be abstracted: it can be summarized with information of far lower dimension than the full complexity of the system.

Human-Compatibility - These low-dimensional abstractions align with the abstractions humans use.

Convergence - A wide variety of cognitive architectures are likely to arrive at similar abstractions.
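The Abstractability claim can be illustrated with a toy simulation. This is only a sketch under made-up assumptions, not a model taken from the NAH literature: a hidden scalar `z` drives a 200-dimensional noisy observation `x`, and a "distant" variable depends only on `z`. The names `simulate`, `pearson`, `n_dims`, and the noise levels are all hypothetical choices for illustration. The point is that a one-number summary (the mean of `x`) predicts the distant variable better than any single raw coordinate does.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n_samples=500, n_dims=200, seed=0):
    """Toy system (assumed for illustration): one latent, many noisy readings."""
    rng = random.Random(seed)
    summary, single, distant = [], [], []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                               # hidden low-dimensional cause
        x = [z + rng.gauss(0.0, 1.0) for _ in range(n_dims)]  # high-dimensional observation
        summary.append(sum(x) / n_dims)                       # one-number abstraction of x
        single.append(x[0])                                   # one raw coordinate, for contrast
        distant.append(z + rng.gauss(0.0, 0.1))               # variable "far away" from x
    return summary, single, distant


summary, single, distant = simulate()
r_summary = pearson(summary, distant)
r_single = pearson(single, distant)
print(f"summary vs distant: {r_summary:.3f}")
print(f"single coordinate vs distant: {r_single:.3f}")
```

In this toy setting the 200 raw numbers can be replaced by a single average with little loss of information about the distant variable, which is the kind of compression the Abstractability claim describes. Real physical systems are of course far less tidy.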

Currently, the best world modeling approaches are noise reduction for visual processing and the attention mechanism for language processing.

Multimodal neurons from OpenAI (2021, Gabriel Goh)

In 2005, a letter published in Nature described human neurons responding to specific people, such as Jennifer Aniston or Halle Berry. The exciting thing was that they did so regardless of whether they were shown photographs, drawings, or even images of the person’s name. The neurons were multimodal. You are looking at the far end of the transformation from metric, visual shapes to conceptual information.

Multimodal Neurons in Artificial Neural Networks

We report the existence of multimodal neurons in artificial neural networks, similar to those found in the human brain.

https://distill.pub/2021/multimodal-neurons/

Multimodal neurons in artificial neural networks

We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.

https://openai.com/index/multimodal-neurons/

Neurons in the left prefrontal cortex respond to words in a way that resembles neuron activations in AI models (the paper actually compares against word embeddings)

Semantic encoding during language comprehension at single-cell resolution

https://www.nature.com/articles/s41586-024-07643-2

Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns

Nature Communications - Here, using neural activity patterns in the inferior frontal gyrus and large language modeling embeddings, the authors provide evidence for a common neural code for language...

https://www.nature.com/articles/s41467-024-46631-y
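To make "common geometric patterns" concrete, here is a minimal sketch of one widely used way to compare the geometry of two embedding spaces: linear centered kernel alignment (CKA). This is an assumption chosen for illustration and is not claimed to be the method of the paper linked above. The data are synthetic, and the helper names (`center_columns`, `cross_gram`, `frobenius`, `linear_cka`) are hypothetical. Linear CKA is invariant to rotations of either space, so a rotated copy of an embedding scores about 1 while an unrelated embedding scores near 0.

```python
import math
import random


def center_columns(m):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram(a, b):
    """Return A^T B for two row-major matrices with the same number of rows."""
    return [[sum(x * y for x, y in zip(col_a, col_b)) for col_b in zip(*b)]
            for col_a in zip(*a)]


def frobenius(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in m for v in row))


def linear_cka(x, y):
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    x, y = center_columns(x), center_columns(y)
    numerator = frobenius(cross_gram(x, y)) ** 2
    return numerator / (frobenius(cross_gram(x, x)) * frobenius(cross_gram(y, y)))


rng = random.Random(0)
n_samples = 200

# Synthetic 2-D "embedding" of 200 items (stand-in data, not real recordings).
emb = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_samples)]

# The same geometry expressed in a rotated coordinate system.
c, s = math.cos(0.7), math.sin(0.7)
rotated = [[c * a - s * b, s * a + c * b] for a, b in emb]

# An independent embedding with no shared structure.
unrelated = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_samples)]

cka_rotated = linear_cka(emb, rotated)
cka_unrelated = linear_cka(emb, unrelated)
print(f"CKA(emb, rotated)   = {cka_rotated:.4f}")
print(f"CKA(emb, unrelated) = {cka_unrelated:.4f}")
```

A rotation-invariant score is a reasonable choice here because brain recordings and model embeddings have no shared coordinate axes; only the relative arrangement of items can be compared.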

Neuromorphic computing

Neuromorphic computing is an approach to computing that is inspired by the structure and function of the human brain.[1][2] A neuromorphic computer/chip is any device that uses physical artificial neurons to do computations.[3][4] In recent times, the term neuromorphic has been used to describe analog, digital, mixed-mode analog/digital VLSI, and software systems that implement models of neural systems (for perception, motor control, or multisensory integration). Recent advances have even discovered ways to mimic the human nervous system through liquid solutions of chemical systems.[5]

https://en.wikipedia.org/wiki/Neuromorphic_computing

The Natural Abstraction Hypothesis: Implications and Evidence — LessWrong

This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory…

https://www.lesswrong.com/posts/Fut8dtFsBYRz8atFF/the-natural-abstraction-hypothesis-implications-and-evidence

World-model interpretability with Internal Interface Theory

If the internal interfaces through which an AI's modules interact take a consistent format, it becomes more likely that humans can learn that format and use it to interpret the entire world model at once.

World-Model Interpretability Is All We Need — LessWrong

Summary, by sections: • 1. Perfect world-model interpretability seems both sufficient for robust alignment (via a decent variety of approaches) and…