
What does the Knowledge Neuron Thesis Have to do with Knowledge?

Jingcheng Niu, Andrew Liu, Zining Zhu and Gerald Penn.

ICLR 2024 (spotlight)

In this paper, we examine the popular Knowledge Neuron (KN) Thesis. Unfortunately, we find that the KN Thesis is an oversimplification: it does not adequately explain the process of factual expression. MLP weights store complex patterns that are interpretable both syntactically and semantically; however, these patterns do not constitute “knowledge.”

Findings & Contributions:

  1. Our work provides a thorough examination of the KN thesis and finds that the thesis is, at best, an oversimplification.
  2. We extend KN-based analysis to well-defined syntactic tasks. Our analysis suggests the formal and functional competencies of an LM may be governed by the same underlying mechanisms.
  3. Editing the KN activations has only a limited impact on categorical predictions. The effect of a KN edit is only apparent in shifts of the output token probability distributions. The patterns that the method localises also appeal to superficial cues such as word co-occurrence frequencies.
  4. We propose two new criteria for evaluating the effectiveness of model editing: a successful edit must extend to bijective relationships and synonym usage. Our evaluation shows that existing model-editing methods are not robust under these two new criteria.
  5. We introduce a generalised $n$-sample similarity measure of the level of localisation. We find that KNs obtained using linguistic tasks and factual tasks share similar characteristics of localisation.

Background: What is the “Knowledge Neuron Thesis?”

Geva et al. (2021) discovered that the MLP modules of language models operate like key-value memories. A typical MLP module in recent transformer-based PLMs has two layers; they argue that the first layer corresponds to keys, and the second layer to values. Each key neuron is triggered by human-interpretable, shallow input patterns, such as periods of time that end with the letter “a.” The corresponding value neurons then shift the next-token output probability until a final distribution is generated.
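
In Geva et al.'s formulation, an MLP sublayer with first-layer weights $K$ and second-layer weights $V$ acting on an input vector $x$ computes, roughly (ignoring bias terms),

$$\mathrm{FF}(x) = f(x \cdot K^{\top}) \cdot V,$$

so each row of $K$ acts as a key matched against the input pattern, and the corresponding row of $V$ is the value whose contribution to the next-token distribution is scaled by that match.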

The KN thesis emerged as a result of this important discovery. Dai et al. (2022) coined the term knowledge neuron and ambitiously claimed that the keys and values within MLP modules not only capture simple patterns but also store “knowledge.”

Meng et al. (2022) proposed a refinement of Dai et al.’s (2022) model. They argue that the factual association process happens at two locations: a mid-layer MLP recalls the fact from memory, and attention modules at the topmost layers copy that information to the final output. Based on this model, they proposed an LM editing method: Rank-One Model Editing (ROME).
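
As a rough illustration of what “rank-one” means here (a minimal sketch, not Meng et al.'s full method, which additionally uses a pre-computed covariance statistic to choose the update direction and limit interference): a single association can be overwritten by adding an outer-product update to the MLP projection matrix so that a chosen key vector `k_star` now maps to a new value vector `v_star`. The dimensions and vectors below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val = 64, 48

# W plays the role of the second (projection) MLP matrix, mapping keys to values.
W = rng.normal(size=(d_val, d_key))

# k_star: the key activated by the edited subject; v_star: the value we want it to produce.
# Both are illustrative placeholders; ROME derives them from the model and the target fact.
k_star = rng.normal(size=d_key)
v_star = rng.normal(size=d_val)

# Rank-one update: choose u so that (W + u k_star^T) @ k_star == v_star exactly.
u = (v_star - W @ k_star) / (k_star @ k_star)
W_edited = W + np.outer(u, k_star)

assert np.allclose(W_edited @ k_star, v_star)

# The change to any other key scales with its overlap with k_star, which is the intuition
# behind editing one association while (hopefully) leaving the rest intact.
k_other = rng.normal(size=d_key)
print(np.linalg.norm(W_edited @ k_other - W @ k_other))
```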

Background: Formal vs. Functional Competence

Mahowald et al. (2023) propose a distinction between the formal and functional competence of a language model.

Syntactic phenomena pertain to formal competence, whereas facts pertain to functional competence. NLP researchers sometimes informally use the terms syntax and semantics to refer to this distinction. Jawahar et al. (2019) and Tenney et al. (2019) believe that syntactic information is located in lower layers of BERT than semantic information, because syntactic information is more “shallow.” Dai et al. (2022) appear to agree with this assertion in claiming that factual information is located in the upper layers. Meng et al. (2022), however, claim that factual information is located in the middle layers.

This contradiction may support our assertion (Niu et al., 2022) that layers are not the best explanatory device for the distribution of these types of information in LMs. See our paper and blog post for the full discussion.

Localising Syntactic Phenomena

We put the KN thesis to the test under the KN-edit framework as we ask:

  1. Can we localise linguistic phenomena using the same KN-edit method?
  2. How do the levels of localisation compare to each other?
  3. Are these localisations strong enough to support the KN thesis?

Here we present a case study of Determiner-Noun Agreement. Please see the paper for the study of other phenomena.

Localising the “Singular” and “Plural” Neurons

Language models are very good at recalling factual information and following grammatical rules. For instance, when presented with the following prompts, language models can often select the correct answer.

| Type | Prompt | Correct | Incorrect |
| --- | --- | --- | --- |
| Factual | The capital of Canada is ____. | Ottawa | Vienna |
| Syntactic | Carl cures ____ horses. | those | that |

Dai et al. (2022) demonstrated that, for a given factual prompt $x$, the corresponding KNs can be identified using the neuron attribution score, which is calculated by integrating gradients:

$$\alpha_i^{(l)} = \overline{w}_i^{(l)}\int_{\gamma=0}^{1} \frac{\partial P_x(\gamma\overline{w}_i^{(l)})}{\partial w_i^{(l)}}\,d\gamma, \qquad P_x(\hat{w}_i^{(l)}) = p(y \mid x,\, w_i^{(l)}=\hat{w}_i^{(l)}).$$
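
In practice the integral is approximated by a Riemann sum over $m$ scaled copies of the neuron's observed activation $\overline{w}_i^{(l)}$, as in standard integrated gradients. The sketch below illustrates the computation for the intermediate MLP activations of a Hugging Face masked LM; the model name, layer index, number of steps, and prompt are our own illustrative assumptions, not the authors' exact implementation.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Assumptions for illustration: bert-base-cased, block 9, m = 20 integration steps.
tok = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()
for p in model.parameters():
    p.requires_grad_(False)

prompt = "The capital of Canada is [MASK]."
target = tok.convert_tokens_to_ids("Ottawa")      # assumes "Ottawa" is a single wordpiece
inputs = tok(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()

LAYER, m = 9, 20
ffn = model.bert.encoder.layer[LAYER].intermediate   # first MLP layer (the "keys")

# 1) Record the unscaled activation w_bar of every neuron at the [MASK] position.
acts = {}
def record(mod, inp, out):
    acts["w_bar"] = out.detach()
h = ffn.register_forward_hook(record)
with torch.no_grad():
    model(**inputs)
h.remove()
w_bar = acts["w_bar"][0, mask_pos]                   # shape: (intermediate_size,)

# 2) Riemann-sum approximation of the integrated gradient w.r.t. that activation.
grad_sum = torch.zeros_like(w_bar)
for k in range(1, m + 1):
    scaled = (k / m * w_bar).requires_grad_(True)

    def replace(mod, inp, out, scaled=scaled):
        out = out.clone()
        out[0, mask_pos] = scaled                    # run the model on the scaled activation
        return out

    h = ffn.register_forward_hook(replace)
    prob = torch.softmax(model(**inputs).logits[0, mask_pos], dim=-1)[target]
    h.remove()
    prob.backward()
    grad_sum += scaled.grad

attribution = w_bar * grad_sum / m    # alpha_i^(l) for every neuron in this layer
print(attribution.topk(5))            # candidate knowledge neurons
```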

Here, we use the same method to find the KNs corresponding to determiner-noun agreement. After all, choosing the correct city (Ottawa vs. Rome) should be no different from choosing the correct determiner (those vs. that). We use the minimal pairs from the BLiMP corpus (Warstadt et al., 2020).

Finding 1: We can localise the grammatical number of determiners to just two neurons, just like factual information.

The BLiMP paradigm determiner_noun_agreement_2 (DNA.2) contains 1000 sentence pairs with exactly one demonstrative determiner (this, that, these, those) agreeing with an adjacent noun, e.g., Carl cures those/*that horses.
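
For concreteness, a model's categorical prediction on such a pair can be read off by comparing the probabilities it assigns to the two determiners in the masked slot. The sketch below uses a single hand-written pair; the scoring convention (comparing the two determiners' mask probabilities) is a simplification of ours, not necessarily the exact evaluation script.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

prompt = "Carl cures [MASK] horses."   # determiner slot masked out
good, bad = "those", "that"            # grammatical vs. ungrammatical determiner

inputs = tok(prompt, return_tensors="pt")
pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0, pos], dim=-1)

p_good = probs[tok.convert_tokens_to_ids(good)].item()
p_bad = probs[tok.convert_tokens_to_ids(bad)].item()
print(f"P({good})={p_good:.4f}  P({bad})={p_bad:.4f}  "
      f"-> {'correct' if p_good > p_bad else 'incorrect'} categorical prediction")
```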

Figure 1: Average KN attribution scores.

Figure 1 shows a selection of the average attribution scores. The colour block in the $i$th column and $j$th row shows the attribution score $\alpha_i^{(j)}$. A common neuron ($w_{2096}^{(10)}$) has a high average attribution score for both of the singular determiners this and that, and another common neuron ($w_{1094}^{(9)}$) lights up for the plural determiners these and those.

| Neuron | this | that | these | those |
| --- | --- | --- | --- | --- |
| $w_{2096}^{(10)}$ | 0.93 | 0.75 | 0 | 0 |
| $w_{1094}^{(9)}$ | 0 | 0 | 1.00 | 1.00 |
| $w_{2339}^{(9)}$ | 0.33 | 0 | 0.32 | 0 |
| $w_{2686}^{(11)}$ | 0 | 0.81 | 0 | 0 |

Figure 2: KNs identified for Det-N pairs (fraction of pairs in which each neuron is identified as a KN).

This pattern does not only appear in aggregate. We also conduct a KN search for each individual Det-N pair, and the two neurons are identified as KNs in the vast majority of pairs. As shown in Figure 2, $w_{2096}^{(10)}$ appears in 93% of the pairs with this and 75% of the pairs with that, while the plural neuron appears in 100% of the pairs with these or those. More importantly, these neurons are not identified as KNs in pairs with the opposite grammatical number.

Finding 2: KNs obtained using linguistic tasks and factual tasks share similar characteristics of localisation

| BLiMP Paradigm | $\|\text{KN}\|$ | $\tau$ | $R_1^2$ |
| --- | --- | --- | --- |
| det_n_agr._1 | 3.94 | 0.71 | 0.56 |
| det_n_agr._2 | 1.86 | 0.62 | 0.56 |
| dna._irr._1 | 5.53 | 0.73 | 0.64 |
| dna._irr._2 | 2.45 | 0.67 | 0.55 |
| dna._w._adj_1 | 8.88 | 0.78 | 0.67 |
| dna._w._adj_2 | 2.26 | 0.67 | 0.57 |
| dna._w._adj_irr._1 | 9.79 | 0.78 | 0.67 |
| dna._w._adj_irr._2 | 2.60 | 0.69 | 0.58 |

| PARAREL Relation | $\|\text{KN}\|$ | $\tau$ | $R_1^2$ |
| --- | --- | --- | --- |
| P101 | 0.167 | 0.515 | 0.399 |
| P103 | 0.204 | 0.662 | 0.399 |
| P106 | 1.292 | 0.607 | 0.365 |
| P108 | 1.493 | 0.663 | 0.473 |
| P1303 | 10.462 | 0.814 | 0.684 |
| P140 | 2.008 | 0.689 | 0.263 |
| P1412 | 2.196 | 0.687 | 0.612 |
| P19 | 2.597 | 0.693 | 0.481 |

Figure 3: Levels of localisation measures.

Figure 3 shows the level of localisation of various BLiMP determiner-noun agreement paradigms and selected PARAREL relations. The localisation metrics of both BLiMP paradigms and PARAREL relations fall within the same range.

Figure 4: Layer distribution of identified KNs. Both BLiMP and PARAREL occupy the topmost layers.

Furthermore, Figure 4 shows no bifurcation between the layers in which linguistic and factual KNs are located: all of the identified neurons are distributed in the topmost layers. The determiner-noun agreement pattern is purely syntactic, so this refutes Jawahar et al.'s (2019) and Tenney et al.'s (2019) view that syntax is localised to shallower layers than semantics. Our results confirm our previous assertion (Niu et al., 2022) that the location of syntactic and semantic (and, additionally, factual) information is not distinguished by layer in the LM. In fact, our results may suggest that these types of information are most fruitfully thought of as being handled by the same underlying mechanism.

Finding 3: Despite large probability shifts, the effect of editing the KNs is not enough to overturn the categorical predictions


Figure 5: Suppressing the number neurons' (singular: $w^{(10)}_{2096}$; plural: $w^{(9)}_{1094}$) effect across number-expressing prenominal modifiers. Significant ($p<0.05$) changes are highlighted in red. The three sections in the plots are, from left to right, plural, singular and neutral modifiers.

We suppress each neuron (setting its activation to 0) and compute the pre- and post-edit model's output probability for various number-expressing prenominal modifiers across all prompts with singular/plural nouns. The result of suppressing the plural neuron is particularly pronounced (Figure 5b): erasing its activation causes a significant decrease in the expression of determiner-noun agreement for plural modifiers.
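
Suppression itself is a one-line intervention: zero out a single intermediate activation during the forward pass and compare output probabilities before and after. The sketch below does this with a forward hook; the 0-indexed layer/neuron mapping for the "plural" neuron and the masked-LM prompt are illustrative assumptions, not the exact experimental setup.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

# Hypothetical mapping of the "plural" neuron: unit 1094 in the intermediate layer of block 9.
LAYER, NEURON = 9, 1094

def suppress(mod, inp, out):
    out = out.clone()
    out[..., NEURON] = 0.0          # set the neuron's activation to zero at every position
    return out

def mask_probs(text, candidates):
    inputs = tok(text, return_tensors="pt")
    pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0, pos], dim=-1)
    return {c: probs[tok.convert_tokens_to_ids(c)].item() for c in candidates}

prompt = "Carl cures [MASK] horses."
modifiers = ["those", "these", "that", "this"]

pre = mask_probs(prompt, modifiers)
handle = model.bert.encoder.layer[LAYER].intermediate.register_forward_hook(suppress)
post = mask_probs(prompt, modifiers)
handle.remove()

for m in modifiers:
    print(f"{m:>6}: {pre[m]:.4f} -> {post[m]:.4f}")
```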

Although this KN search is based solely on these four demonstrative determiners, we observe that the effect generalises to other determiners (one, a, an, every; two, both; multiple, several, various) and even adjectives (single, unique, sole). Thus, the neuron $w^{(9)}_{1094}$ can be interpreted through the lens of a linguistic phenomenon, viz. determiner-noun agreement.


Figure 6: The localisation of plurality appeals to word co-occurrence frequency cues.

Note, however, that the word scattered also sees a significant probability decrease when the plural neuron is suppressed. Scattered does not specify for plural number; phrases such as “scattered rioting” are syntactically and semantically well-formed. But it is used more often with plural nouns because of its meaning. This frequency effect is not limited to scattered: other words such as any, all, and unified, and the three adjectives unique, single and sole, exhibit a similar bias. As shown in Figure 6, these words see probability changes, although less substantial ones, alongside the modifiers that strictly specify for grammatical number. This is a semantic number co-occurrence bias.

Figure 7: The exact effect of editing the KNs on output probability (pre-edit vs. post-edit).

Although we see a high level of localisation in the relative probability change, we find that this change is often not enough to overturn the final prediction, as shown in Figure 7.

| Paradigm | Pre-edit | Post-edit | $\Delta$ |
| --- | --- | --- | --- |
| det_n_agr._2 | 100% | 94.8% | -5.2% |
| dna._irr._2 | 99.5% | 96.9% | -2.6% |
| dna._w._adj._2 | 97.1% | 94.4% | -2.7% |
| dna._w._adj._irr._2 | 97.4% | 95.4% | -2.0% |

Figure 8(a): These modifications of determiner-noun KNs are usually not enough to overturn the categorical prediction.

| Data | Model | Reliability |
| --- | --- | --- |
| ZsRE | T5-XL | 22.51 |
| ZsRE | GPT-J | 11.34 |
| CounterFact | T5-XL | 47.86 |
| CounterFact | GPT-J | 1.66 |

Figure 8(b): KN edit has low reliability for facts (Yao et al., 2023).

We present more results in Figure 8: at most 5.2% of the BLiMP predictions are overturned. This low-reliability issue is not limited to syntactic phenomena, as confirmed by Yao et al. (2023).

Causal Tracing and Rank-One Model Editing

In our paper, we also reassess Meng et al.'s (2022) similar but more intricate implementation of KN edit. They propose that information is expressed at two locations: facts are recalled by mid-layer MLP weights and copied to the final output by attention modules. They derived this thesis from causal mediation analysis (causal tracing).
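
Causal tracing, in outline: run the model on a clean prompt and cache its hidden states, corrupt the subject-token embeddings with noise, and then, for each (layer, token) site, re-run the corrupted prompt while restoring that single clean hidden state and measure how much of the correct answer's probability is recovered. The sketch below shows one restoration step for GPT-2 with forward hooks; the noise scale, subject-token positions, and choice of restoration site are illustrative assumptions rather than Meng et al.'s exact configuration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The capital of Canada is"
inputs = tok(prompt, return_tensors="pt")
subject_positions = [3]               # assumed position of the subject token " Canada"
target = tok.encode(" Ottawa")[0]     # assumes " Ottawa" is a single BPE token

def answer_prob(out):
    return torch.softmax(out.logits[0, -1], dim=-1)[target].item()

# 1) Clean run: cache the hidden state after every transformer block.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
clean_hidden = clean.hidden_states    # (embeddings, after block 0, after block 1, ...)

# 2) Corruption: add fixed Gaussian noise to the subject's token embeddings.
noise = 0.5 * torch.randn(len(subject_positions), model.config.n_embd)
def corrupt(mod, inp, out):
    out = out.clone()
    out[0, subject_positions] += noise
    return out

# 3) Restoration: patch one block's clean hidden state back in at one token position.
LAYER, TOKEN = 6, 3                   # illustrative restoration site

def restore(mod, inp, out):
    hidden = out[0].clone()
    hidden[0, TOKEN] = clean_hidden[LAYER + 1][0, TOKEN]   # +1 skips the embedding entry
    return (hidden,) + out[1:]

h1 = model.transformer.wte.register_forward_hook(corrupt)
with torch.no_grad():
    corrupted = model(**inputs)
h2 = model.transformer.h[LAYER].register_forward_hook(restore)
with torch.no_grad():
    restored = model(**inputs)        # corrupted embeddings + one restored hidden state
h1.remove(); h2.remove()

print(f"clean={answer_prob(clean):.4f}  corrupted={answer_prob(corrupted):.4f}  "
      f"restored={answer_prob(restored):.4f}")
```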

Figure 9: Causal tracing results. (a) Factual Information. (b) Determiner-Noun Agreement. (c) Subject-Verb Agreement.

Meng et al. (2022) discover a division of labour between the MLP and attention modules. This division, however, is not stable. In Figures 9b and 9c, we reproduce the causal tracing analysis on syntactic phenomena, and the distinction between the early and late sites is no longer discernible. This is, in fact, not a distinction between facts and syntactic patterns: many factual causal traces also fail to show it.

Here are some examples where ROME edits are not robust:

In Figure 10, the prompt is italicised, ungrammatical or counter-factual responses are highlighted in red, and unchanged correct responses in green; the subject of the prompt is underlined. Each pair below shows the original GPT-2 XL's generation (Original) followed by the edited model's response (Edited).
(a) ROME is not robust for symmetric relations.
GPT-2 XL: The capital of Canada is Ottawa
ROME Edit: Ottawa → Rome
Original: The capital of Canada is Ottawa
Edited: The capital of Canada is Rome.
Original: Ottawa is the capital of Canada.
Edited: Ottawa is the capital of Canada’s federalist system of government.
Original: Rome is the capital of Italy, …
Edited: Rome is the capital of Italy, …
(b) ROME is not robust for synonym usages.
GPT-2 XL: To treat my toothache, I should see a dentist
ROME Edit: dentist → lawyer
Original: To treat my toothache, I should see a dentist, …
Edited: To treat my toothache, I should see a lawyer.
Original: To treat my tooth pain, I should see a dentist.
Edited: To treat my tooth pain, I should see a dentist.
Original: To treat my odontalgia, I should see a dentist.
Edited: To treat my odontalgia, I should see a dentist.
(c) Subject-verb agreement: ROME only edits one subject-verb pair.
GPT-2 XL: The authors near the taxi drivers are
ROME Edit: are → is
Original: The authors near the taxi drivers are
Edited: The authors near the taxi drivers is
Original: The authors near the dancers are
Edited: The authors near the dancers is
Original: The pilots near the taxi drivers were
Edited: The pilots near the taxi drivers are
Original: The pilots near the taxi drivers are
Edited: The pilots near the taxi drivers are

Figure 10: Comparison of generated text.

We observe that ROME does not generalise well in respect of either of our new criteria, bijective symmetry or synonymous invariance (Figure 10ab). This issue persists when we evaluate ROME quantitatively (Figure 11). As demonstrated in Figure 10c, editing the verb corresponding to the authors from are to is only affects the subject the authors, and not other subjects such as the pilots. These look more like at-times brittle patterns of token expression than factual knowledge.
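
The two criteria themselves are cheap to check once an edit has been applied: query the edited model with the reversed (bijective) prompt and with paraphrased, synonymous prompts and verify that the new object is produced consistently. Below is a minimal sketch of such a check, assuming an already-edited Hugging Face causal LM; the base checkpoint here is a stand-in, and the prompt templates and greedy-decoding readout are our own illustrative choices.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in model: in practice this would be the ROME-edited GPT-2 XL checkpoint.
tok = GPT2Tokenizer.from_pretrained("gpt2")
edited_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def completion(prompt, max_new_tokens=5):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = edited_model.generate(**inputs, max_new_tokens=max_new_tokens,
                                    do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, inputs["input_ids"].shape[1]:])

new_object = "Rome"   # the edit under test: "The capital of Canada is Ottawa -> Rome"

# Bijective symmetry: the reverse prompt should reflect the new association
# (e.g., "Rome is the capital of" should now name Canada).
print("symmetry:", completion(f"{new_object} is the capital of"))

# Synonymous invariance: paraphrases of the edited prompt should also yield the new object.
for prompt in ["The capital of Canada is", "Canada's capital city is"]:
    print("synonym :", new_object in completion(prompt), "|", prompt, "->", completion(prompt))
```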

| Model | Data | Reliability | Criterion | Score |
| --- | --- | --- | --- | --- |
| GPT-2 XL | P101 | 99.82% | Synonym | 52.35% |
| GPT-2 XL | P36 | 96.37% | Symmetry | 23.71% |
| GPT-2 XL | P36 | 99.79% | Symmetry | 25.17% |
| LLaMA-2 | P101 | 100% | Synonym | 58.36% |
| LLaMA-2 | P36 | 100% | Symmetry | 33.40% |
| LLaMA-2 | P36 | 100% | Symmetry | 33.64% |

Figure 11: Results obtained under our new criteria suggest model editing methods are not robust.

How to Cite

@inproceedings{niu2024what,
  title={What does the Knowledge Neuron Thesis Have to do with Knowledge?},
  author={Jingcheng Niu and Andrew Liu and Zining Zhu and Gerald Penn},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=2HJRwwbV3G}
}