London Anthropic Questions

Why Anthropic? *

Why do you want to work at Anthropic? (We value this response highly - great answers are often 200-400 words.)

My genuine interest in Artificial Intelligence lies in Explainable AI, specifically Mechanistic Interpretability. This interest guided my research toward interpretable decision-making in AI summarization. Additionally, my three years of experience in the autonomous driving technology sector have reinforced my belief in the necessity of deep understanding to ensure the safe and precise operation of AI systems.

My firm belief in an interpretable approach to AI spurred me to engage actively in scholarly dialogue on interpretable AI. Drawing on Anthropic’s research on the interpretability of Transformer models, I explored the roles of induction and copying heads in the emergence of in-context learning, and I studied LLM behavior in depth using OpenAI's Transformer Debugger (TDB) and the Feature Browser from Anthropic’s monosemanticity research. I then reinterpreted this research using a neuron analogy and shared it with the AGI Korea group to present my perspective and spread awareness of interpretable AI research.

Besides the academic contributions and industrial impact of Anthropic AI, there are countless reasons to work at Anthropic. Cooperation with communities like DeepMind, LessWrong, and the AI Alignment Forum has profoundly changed my view of commercial AI companies. Additionally, there are several heroes I have not yet mentioned. Chris Olah’s innovative insights and rigorous research analysis from Distill.pub have directly influenced me. Anthropic AI, with its many distinguished researchers and engineers, stands out as almost the only group moving in the right direction.

Please link to your publications/research outputs, e.g. Google Scholar or Semantic Scholar (if you don't have any, please apply to our Research Engineer posting instead) *

Google scholar

https://scholar.google.com/citations?user=XIMB1PoAAAAJ

https://github.com/seonglae/RTSum

https://github.com/seonglae/ReSRer

Other useful URLs (e.g. technical blog posts)

https://seongland.medium.com/superposition-hypothesis-for-steering-llm-with-sparse-autoencoder-c07b74d23e96

https://seongland.medium.com/reversing-transformer-to-understand-in-context-learning-with-phase-change-feature-dimensionality-13cbf8a2f984

What’s a technical achievement you’re proud of?

My three years at a startup and then a major tech company in South Korea’s mobility sector are among the most enriching experiences of my life. During two years at the startup Stryx, I began by connecting software and hardware, gaining hands-on experience mounting sensor equipment onto car roof racks and setting up hardware. This gave me a deep understanding of sensor fusion, the starting point of the mobility data pipeline. I later took on responsibility for vector map generation, which gave me a comprehensive understanding of the map data flow. This vertical understanding was instrumental in improving the autonomous driving data pipeline and established me as a key contributor at the company. My contributions led to the delivery of map data to 42Dot, Korea’s first driverless service company, and facilitated the successful acquisition of Stryx by Kakao Mobility. At Kakao Mobility, a dominant company in Korea's mobility sector, I designed an API protocol for an AI feature bridging machine learning algorithms and a mapping platform. Beyond that contribution, Kakao Mobility’s testing of a robo-taxi for public use under real-world conditions in Pangyo gave me firsthand experience of the technology, which profoundly changed my attitude toward AI. I noticed some stiffness in its behavior caused by the discrete vector map, which I recognized because I knew the data under the hood. I realized that precise control over the mechanics of AI, such as that required in robo-taxi technology, is crucial for AI safety.

Why do you want to work on the Anthropic interpretability team? (We value this response highly - great answers are often 200-400 words. We like to see deeper and more specific engagement with Anthropic and our interpretability agenda than simply interest in AI or LLMs.) *

My primary goal in contributing to Anthropic's Interpretability team is to steer Transformer models as humans intend. Anthropic’s research has shown that dictionary learning via Sparse AutoEncoders can steer language models by controlling activation. However, there are no silver bullets; this is just a starting point for directly influencing LLMs in a mechanistic way. Beyond merely controlling topics or text formats, there is potential to directly manipulate the attention layer's values to control in-context learning abilities with a steering vector.

Anthropic is making outstanding contributions to the academic field, and I am eager to contribute to these research efforts for AI alignment. In particular, the mathematical framework for analyzing Transformer circuits, which covers induction heads with a Kronecker product representation and a bottom-up approach to in-context learning focused on phase changes, together with 'Toy Models of Superposition', has steadily reinforced my belief that future alignment will be based on Mechanistic Interpretability. I was particularly impressed by the research on monosemanticity using Sparse AutoEncoders, which suggests that humanity can finally steer LLMs as intended.

These research efforts benefit from their mutual logical relationships within the same company, showcasing a heavily cooperative culture at Anthropic. I believe Anthropic's significant success in this industry heavily relies on this bottom-up academic approach to building Opus, achieving both performance and safety. With three years of professional engineering experience and insights gained from projects like RTSum and ReSRer, along with innovative ideas from my development of Treensformer, I am prepared to contribute significantly to Anthropic, driving forward both the technology and its alignment with human values.

In one paragraph, provide an example of something meaningful that you have done in line with your values. Examples could include past work, volunteering, civic engagement, community organizing, donations, family support, etc. *

Throughout middle and high school, I engaged in activities with disabled children during each vacation at Jaramter Kindergarten. That was the most emotional experience of my life, changing my stereotypes through direct interactions with them as a friend and assistant. Without initially considering the meaningfulness of the activity, I came to realize that individuals are distinct regardless of the group they belong to.

In one paragraph, provide an example of a time you were curious about something in the world and how you went about investigating it. *

My intellectual curiosity drives me to explore new terminology and construct relationships between knowledge across domains through analogies. For example, I might connect a Non-Deterministic Turing Machine to an Autoregressive Model due to their similar decoding properties. Similarly, my investigation of Euler’s theorem and Euler's Totient function deepens my understanding of RSA in cryptography. When curious about my reward system to become more productive and avoid dopamine addiction, I explore the Opponent Process Theory and the Opioid System to understand the internal dopamine processes more deeply. These traits—searching and connecting—enable a broader understanding across fields and result in creative ideas and rapid growth.

Please share a link to the piece of work you've done that is most relevant to the interpretability team, along with a brief description of the work and its relevance. *

Driven by my resolve to delve into the mechanisms of LLMs, I worked under the guidance of Professor Dongha Lee, leading to the development of RTSum (Relation-Triple Summarization), which splits and recombines information at a granular level. I created a multi-level saliency visualization demo, contributing to our paper's acceptance at NAACL 2024.

Inspired by the concept of the Residual Stream as a communication channel, I devised Treensformer to store context information by compressing and decompressing it in an embedding vector. It reuses the final context embedding as input, controlling the ability to expand and shrink the context tensor’s length. I utilized a PyTorch hook similar to the methods Anthropic employed with Garcon for internal interpretability research. Work on Treensformer is ongoing, with the aim of preventing distributional shift during reuse, building on related research such as Activation Beacon and Infini Transformer.

In one sentence each, what are three open research questions related to AI safety that you would be interested in contributing to. (We’re especially interested in questions that you think are likely to be neglected by the current AI research ecosystem.) *

AI safety

Furthermore, Anthropic is not only a leader in AI interpretability research but also a major advocate for AI safety. The recent releases of the Defection Probe and the many-shot jailbreaking techniques are a testament to this leadership.

Knowledge base

My genuine interest in Artificial Intelligence lies in Explainable AI, specifically Mechanistic Interpretability. My firm belief in an interpretable approach to AI spurred me to actively engage in scholarly dialogue on Interpretable AI. Anthropic’s research has shown that dictionary learning via Sparse AutoEncoders can steer language models by controlling activation. Drawing from Anthropic’s research on the interpretability of Transformer models, I explored the roles of induction and copying heads in forming in-context learning. Furthermore, I conducted in-depth research on LLM functions using OpenAI's Transformer Debugger (TDB) and Anthropic’s Mono-semanticity research's Feature Browser. I reinterpreted this insightful research and shared technical writings with the AGI Korea group, aiming to underscore my perspective and disseminate information about Interpretable AI research using a neuron analogy.

Anthropic is making outstanding contributions to the academic field, and I am eager to contribute to these research efforts. There are countless reasons to work at Anthropic, such as its academic contributions and industrial impact. Cooperation with communities like DeepMind, LessWrong, and the AI Alignment Forum has profoundly changed my view of commercial AI companies. I believe Anthropic's significant success in this industry heavily relies on this bottom-up academic approach to building Opus, achieving both performance and safety. With my three years of professional engineering experience and insights gained from the RTSum and ReSRer projects, I am prepared to contribute significantly to Anthropic, driving forward both the technology and its alignment with human values.

Level 4: Bigram Attention Head

We've got a functioning attention head. Hooray!

In this level our return value will be logits for the next token in the sequence only. So it will have (n_vocab,) = output_activations.shape. This is in contrast to previous levels!

Now we'll construct our own qk and ov matrices so our attention head predicts a single bigram. Though we don't think this is the primary way transformers implement bigrams, it's a useful exercise.

Our output output_activations should be logits for predicting the next token in the sequence only. So it will have (n_vocab,) = output_activations.shape. Note this is in contrast to previous levels, where we returned the output_activations for all tokens in the sequence, not just the last one.

From here forward we'll assume that n_vocab = 4 and call the four tokens a, b, c, and d with token indexes 0, 1, 2, and 3, respectively.

The bigram we'd like to predict is that c comes after b. Otherwise, we predict a uniform probability over all 4 tokens. We'll accomplish this by:

Each token attends to itself with attention score self_score: float, which is an input argument.

All other tokens are attended to with attention score 0.

If we've attended to b, we predict c with logit value given by the input argument bigram_logit: float.

Everything else we predict logit 0.


def solution(input_activations, self_score, bigram_logit):
    ...

[execution time limit] 4 seconds (py3)

[memory limit] 1 GB

[input] array.array.float input_activations

input activations for a single sequence. Array of shape (n_tokens, n_vocab). Will be a list of lists, rather than a numpy array, in test cases.

[input] float self_score

attention score with which each token attends to itself (other attention scores are zero)

[input] float bigram_logit

output logit assigned to predicting c if we attend exclusively to token b. Other logits are zero.

[output] array.float

output_activations for the last token only, which are representing the logits for the next token in the sequence. So (n_vocab,) = output_activations.shape. Can be a list or a numpy array.
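One way to satisfy the specification above is to build the qk and ov matrices explicitly. The sketch below is only an illustration, not the challenge's reference solution. It assumes that the rows of input_activations are one-hot token vectors, that the last token's attention scores are passed through a softmax over all positions, and that activations are treated as row vectors. Under this reading, "attends to itself" means attending to every position that holds the same token. The names softmax and bigram_head are hypothetical and do not come from the challenge.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def bigram_head(input_activations, self_score, bigram_logit):
    """Logits for the next token, computed from the last position only.

    Assumes one-hot rows with token indexes a=0, b=1, c=2, d=3.
    """
    x = np.asarray(input_activations, dtype=float)  # (n_tokens, n_vocab)
    n_vocab = x.shape[1]

    # qk[query_token, key_token]: self_score on the diagonal, 0 elsewhere.
    qk = self_score * np.eye(n_vocab)

    # ov[source_token, output_token]: attending to b writes bigram_logit to c.
    ov = np.zeros((n_vocab, n_vocab))
    ov[1, 2] = bigram_logit

    # Scores of the last token (query) against every position (keys).
    scores = x[-1] @ qk @ x.T        # (n_tokens,)
    weights = softmax(scores)        # attention pattern for the last token

    # Weighted mix of attended tokens, mapped to output logits.
    return weights @ x @ ov          # (n_vocab,)
```

With a large self_score, the last token attends almost entirely to positions holding the same token, so a sequence ending in b produces a logit close to bigram_logit for c and 0 elsewhere.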

Level 5: Bigram+Trigram Attention Head

You've made it to the last level! Now we're going to add something that's more suited to an attention head: predicting a skip-trigram.

As in the last level, we're still considering n_vocab=4 with tokens (a, b, c, d), with our return value being the logit predictions for the next token in the sequence (n_vocab,) = output_activations.shape. To simplify things even further, we'll only consider sequences that end with b in this level.

Our goal is to predict both the bigram b c but also the skip-trigram a..b d, where the .. indicates that there may be 0 or more tokens between a and b. We'll do this as follows:

1. retain all behavior from the previous level (still attend-to-self, still predict bigram with same logit).

2. In addition to attending to itself, b also attends to a with attention score given by the input parameter b_to_a_score: float

3. If we've attended to a, we predict d with logit given by the trigram_logit: float input parameter.

4. Everything other than the bigram and trigram, we predict logit 0.

def solution(input_activations, self_score, bigram_logit, b_to_a_score, trigram_logit):
    ...

Note that by making b_to_a_score > self_score and/or trigram_logit > bigram_logit, our attention head can predict d when a comes before c, but predict c otherwise. Neat!

[execution time limit] 4 seconds (py3)

[memory limit] 1 GB

[input] array.array.float input_activations

input activations for a single sequence. Array of shape (n_tokens, n_vocab). Will be a list of lists, rather than a numpy array, in test cases.

[input] float self_score

attention score with which each token attends to itself (other attention scores are zero)

[input] float bigram_logit

output logit assigned to predicting c if we attend exclusively to token b. Logits for other tokens should be 0 in this case.

[input] float b_to_a_score

attention score with which token b attends to a

[input] float trigram_logit

output logit assigned to predicting d if we attend exclusively to token a.

[output] array.float

output_activations for the last token only, which are representing the logits for the next token in the sequence. So (n_vocab,) = output_activations.shape. Can be a list or a numpy array.

main.py3 (Python 3 v3.10.6)

    # ...
    ov = numpy.array(ov)
    n_tokens, n_vocab = embeddings.shape
    # compute
    a = attention_pattern(input_activations, qk)
    return numpy.dot(numpy.kron(a, ov), embeddings.flatten()).reshape(n_tokens, n_vocab)

def bigram(input_activations, self_score, bigram_logit):
    # init
    embeddings = numpy.array(input_activations)
    n_tokens, n_vocab = embeddings.shape
    # compute
    scores = numpy.zeros(n_tokens)
    scores[-1] = self_score
    weights = softmax(scores, -1)
    output = numpy.zeros((n_tokens, n_vocab))
    print(output)
    for i, embedding in enumerate(embeddings):
        token = numpy.argmax(embedding, axis=0)
        if token == 1:
            output[i, 2] = bigram_logit
        else:
            output[i] = numpy.zeros(n_vocab)
    output = numpy.sum(weights[:, numpy.newaxis] * output, axis=0)
    print('output', output.shape)
    return output

def solution(input_activations, self_score, bigram_logit, b_to_a_score, trigram_logit):
    pass
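The Level 5 behavior described above can be sketched by adding one entry to each of the qk and ov matrices. This is a minimal illustration, not the challenge's reference solution. It assumes one-hot rows in input_activations, a softmax over all positions for the last token's scores, and a row-vector convention with qk indexed as [query token, key token] and ov indexed as [source token, output token]. The names softmax and bigram_trigram_head are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()


def bigram_trigram_head(input_activations, self_score, bigram_logit,
                        b_to_a_score, trigram_logit):
    """Next-token logits for the bigram b c and the skip-trigram a..b d.

    Assumes one-hot rows with token indexes a=0, b=1, c=2, d=3.
    """
    x = np.asarray(input_activations, dtype=float)  # (n_tokens, n_vocab)
    n_vocab = x.shape[1]

    # Every token attends to itself; b (query) also attends to a (key).
    qk = self_score * np.eye(n_vocab)
    qk[1, 0] = b_to_a_score

    # Attending to b predicts c; attending to a predicts d.
    ov = np.zeros((n_vocab, n_vocab))
    ov[1, 2] = bigram_logit
    ov[0, 3] = trigram_logit

    scores = x[-1] @ qk @ x.T        # last token against all positions
    weights = softmax(scores)
    return weights @ x @ ov          # (n_vocab,)
```

Because both behaviors live in the same pair of matrices, the balance between predicting c and predicting d depends only on the relative sizes of the two scores and the two logits.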

chatgpt history

https://chatgpt.com/c/66f15cd7-7fa0-8007-80fb-e538f8ba4613
