February 27, 2025

Capabilities Test #5: Associative Reasoning

Why Should You Care?

Main Insights

  • Advanced reasoning AI models struggle to make accurate similarity judgments. However, when provided with additional context (e.g., categories, structural frameworks, or constraints), performance can improve dramatically.

  • Across different reasoning models, time spent “thinking” doesn’t appear to meaningfully correspond with response quality or task performance. Sometimes, in fact, more time spent “thinking” can correspond with poorer response quality and task performance.

  • Some reasoning models are better than others at reflecting on and explaining their reasoning process—o1 and DeepThink-R1 stood out in this experiment.

  • Of all models tested, o1 consistently performed best, showcasing a trend that we have observed across previous experiments with reasoning models.

  • Users should experiment with a variety of advanced reasoning models, prioritizing which model to use based on the tradeoff between time spent “thinking” and response quality.

The Business Case

✅ Optimizing AI-Driven Decision-Making

💡 Why it Matters

  • Businesses increasingly rely on AI for strategic decision-making.
  • Low-quality AI similarity assessments could perpetuate faulty comparisons (e.g., mismatching customer segments).
  • By supplying contextual frameworks, businesses can ensure that AI provides accurate and actionable recommendations.

✍️ Key Actions

  • Design prompts and workflows that supply relevant context to AI models.
  • Continually evaluate the precision and strategic alignment of AI-generated insights.

✅ Enhancing Explainability

💡 Why it Matters

  • Adherence to transparency and explainability requirements is crucial for regulatory and ethical compliance.
  • AI decisions should be explainable, particularly when models are used in sensitive contexts (e.g., healthcare or finance).
  • Certain models are better than others at providing transparent and understandable insights into their reasoning and output process.
  • Models that can explain their decisions will help cultivate user trust and confidence.

✍️ Key Actions

  • Experiment with different reasoning models to determine which are best at explaining their reasoning, output, and decision processes.
  • Identify and prioritize reasoning models with superior explainability capabilities.
  • Periodically reassess models’ explainability, especially after updates or modifications.

✅ Improving Customer AI Experiences

💡 Why it Matters

  • Similarity judgments in customer support are critical.
  • Misaligned AI-generated responses (e.g., irrelevant product recommendations) can easily diminish user experience.
  • Relevant AI responses can elevate customer satisfaction and retention.

✍️ Key Actions

  • Design and implement user interaction flows that continuously provide context to AI systems.
  • Regularly assess AI outputs to determine whether they meet personalization, relevance, and trustworthiness standards.

✅ Reducing Operational Inefficiency

💡 Why it Matters

  • Understanding which AI models are best suited to specific tasks, especially as AI models evolve, will allow businesses to future-proof their AI investments.
  • Experimentation with different AI models can allow businesses to outperform competitors who adopt one-size-fits-all solutions.
  • Some businesses may implicitly assume that longer processing times correspond with better task performance or higher quality outputs.

✍️ Key Actions

  • Optimize AI workflows for performance, not processing time.
  • Develop business-specific AI benchmarks for efficiency and quality.
  • Test multiple reasoning models via robust evaluation pipelines.
  • Continually reassess which reasoning models provide the best tradeoffs in terms of efficiency, quality, and explainability.

Executive Summary

Experiment Setup

This experiment, which consists of three task-based prompts, is designed to evaluate advanced reasoning AI models’ ability to accurately assess similarities between two groups of different statements. In doing so, we hope to reveal meaningful insights into how these models reason by association, especially when faced with abstract and structurally ambiguous information. Via our third prompt, we also assess how well models are able to reflect on their performance and reasoning process across the tasks they perform.

Our Hypothesis: In the absence of relevant context, advanced reasoning models will struggle to make accurate similarity judgments.

Prompt Descriptions

Similarity Assessment (Prompt 1): This prompt presents two distinct groups of 10 statements, instructing the model to pair each statement in the first group with a statement in the second group according to their similarities. To make the task more challenging, common themes, categories, and examples are deliberately omitted.

Similarity Assessment w/Hint (Prompt 2): This prompt instructs the model to complete its initial task once more, but this time, with complete knowledge of the common themes and categories that determine each statement pair.

Self-Reflection/Introspection (Prompt 3): This prompt pushes the model to critically reflect on its performance across the two previous tasks and suggest a few viable explanations that justify any relevant performance discrepancies.

Method

  • Prompts were entered as written, and AI responses were copy-pasted in their exact output format, aside from a few re-formatting modifications made for readability.

  • For each model tested, all prompts were administered sequentially (prompt 1 → prompt 2 → prompt 3) in a single interaction.

  • For all models, performance across prompts 1 and 2 was evaluated according to how many statement pairs were correctly matched.

  • For all models, performance on prompt 3 was evaluated subjectively, according to the depth and substance of the explanations provided as well as the ability to avoid “shallow” explanations (i.e., explanations that are obvious to the user).

  • In this experiment, the term “efficiency” refers to how much time each model spends “thinking.”

  • Similarly, “performance/efficiency tradeoff” refers to the relationship between time spent “thinking” and task performance/response quality.
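The pair-counting score described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the tooling used in the experiment: the answer key is inferred from o1's Prompt 2 pairings (reported as a perfect score), the Prompt 1 pairings are transcribed from the Outputs section, and the names `ANSWER_KEY`, `PROMPT_1`, `missed_pairs`, and `score` are hypothetical. DeepThink-R1 is omitted because its pairings are not reproduced here.

```python
# Answer key inferred from o1's Prompt 2 pairings (reported as 10/10).
# Maps Group One statement number -> Group Two statement number.
ANSWER_KEY = {1: 8, 2: 10, 3: 7, 4: 6, 5: 5, 6: 9, 7: 1, 8: 3, 9: 2, 10: 4}

# Prompt 1 pairings transcribed from the Outputs section.
PROMPT_1 = {
    "o1": {1: 8, 2: 2, 3: 4, 4: 6, 5: 5, 6: 9, 7: 1, 8: 3, 9: 7, 10: 10},
    "o3-mini-high": {1: 1, 2: 2, 3: 4, 4: 6, 5: 5, 6: 9, 7: 8, 8: 3, 9: 7, 10: 10},
    "Grok 3": {1: 8, 2: 3, 3: 4, 4: 6, 5: 5, 6: 2, 7: 1, 8: 9, 9: 7, 10: 10},
}


def missed_pairs(pairing, key=ANSWER_KEY):
    """Return the Group One statement numbers that were paired incorrectly."""
    return {g1 for g1, g2 in key.items() if pairing.get(g1) != g2}


def score(pairing, key=ANSWER_KEY):
    """Count how many statement pairs match the answer key."""
    return len(key) - len(missed_pairs(pairing, key))


# Score each model, then find the pairings that every listed model missed.
scores = {model: score(pairing) for model, pairing in PROMPT_1.items()}
shared_misses = set.intersection(*(missed_pairs(p) for p in PROMPT_1.values()))

print(scores)
print(sorted(shared_misses))
```

Under these assumptions, the computed scores agree with the reported Prompt 1 results for the three models listed, and the shared misses are the Group One statements 2, 3, 9, and 10.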

Key Findings

Prompt 1 - Key Takeaways:

  • o1 and DeepThink-R1 performed equally, scoring a mediocre 6 out of 10.
  • o3-mini-high and Grok 3 performed worst, scoring a 4 out of 10.
  • All models struggled with the same four pairings.
  • Consistent with previous experiments, DeepThink-R1 spent significantly more time “thinking” than any other model.
  • Interestingly, o1 and o3-mini-high spent roughly the same amount of time “thinking,” which was unanticipated since o3-mini-high tends to spend significantly less time “thinking” than o1.
  • Grok 3 spent the least time “thinking.”

Prompt 2 - Key Takeaways:

  • o1 and Grok 3 achieved perfect scores, but Grok 3 had the largest performance increase of all models.
  • o3-mini-high’s score also improved significantly, to 8/10; however, the model replicated two of its previous mistakes.
  • o3-mini-high was the only model to spend significantly less time “thinking” after receiving the hint. Initially, we expected this phenomenon to apply to all models tested.
  • o1 did not output a “time spent thinking” measure with its response. We’ve noted this as a recurring trend when the model is prompted in sequence.
  • Surprisingly, DeepThink-R1’s score only improved marginally (by one point), and it spent much more time “thinking,” which deviates from the other models’ behavior.
  • Though Grok 3 and o3-mini-high were the most efficient models, o1 takes the lead, as evidenced by its superior performance across both prompts and marginal efficiency/performance tradeoff.
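To make the score changes between prompts 1 and 2 easier to compare, here is a small Python sketch that computes absolute and relative gains from the scores reported in the takeaways. It is an illustration only: the names `SCORES` and `gains` are hypothetical, and DeepThink-R1's Prompt 2 score of 7 is derived from the statement that it improved by one point over its Prompt 1 score of 6.

```python
# (Prompt 1 score, Prompt 2 score), each out of 10, as reported in the takeaways.
# DeepThink-R1's Prompt 2 score is derived: 6 + 1 = 7.
SCORES = {
    "o1": (6, 10),
    "o3-mini-high": (4, 8),
    "Grok 3": (4, 10),
    "DeepThink-R1": (6, 7),
}


def gains(scores):
    """Return absolute and relative score gains per model."""
    return {
        model: {
            "absolute": after - before,
            "relative": (after - before) / before,
        }
        for model, (before, after) in scores.items()
    }


result = gains(SCORES)
# Model with the largest absolute gain after receiving the hint.
largest = max(result, key=lambda model: result[model]["absolute"])

for model, gain in result.items():
    print(f"{model}: +{gain['absolute']} points ({gain['relative']:.0%})")
```

Note that o1 and o3-mini-high gain the same number of points in absolute terms; the relative gain is what separates them, since o3-mini-high starts from a lower Prompt 1 score.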

Prompt 3 - Key Takeaways:

  • o1 and o3-mini-high provided explanations that were essentially equivalent in meaning and content, though o1’s explanations were significantly more detailed and thoughtful.
  • Both o1 and o3-mini-high provided one shallow explanation, highlighting the lack of “clearly outlined categories” or a “defined framework.”
  • While Grok 3’s explanations were the most detailed, 2 out of 3 were superficial, centering on the lack of a coherent framework and constraints.
  • DeepThink-R1 was the only model to explicitly signal an overreliance on tone over structural intent. We hypothesize that this is a key performance factor across all models.
  • DeepThink-R1’s explanations matched the quality and depth of o1’s explanations; however, it too provided one superficial explanation, referencing “the absence of explicit categories.”
  • DeepThink-R1 made no reference to the marginal performance increase between prompts 1 and 2, though it was the only model to include a “root cause” interpretation.
  • All models signaled a struggle with semantic nuances, thematic overlap, and inherent structural ambiguities.

Bottom Lines:

  • o1 performed best across all tasks.
  • o1 and DeepThink-R1 excelled at introspecting/self-reflecting.
  • Grok 3 and o3-mini-high’s performance was more heavily influenced by additional context (i.e., the hint) than that of the other models tested; however, all models showed performance increases upon receiving more context.
  • DeepThink-R1 displays a lower sensitivity to additional context than all other models tested.
  • Despite spending slightly more time “thinking” than o3-mini-high and Grok 3, o1 exhibited the best efficiency/performance tradeoff.

Below are two tables that illustrate each model’s performance across prompts 1 and 2.

[Table: Prompt 1 - Similarity Assessment results]

[Table: Prompt 2 - Similarity Assessment w/Hint results]

Prompt

Prompt 1 - Similarity Assessment

Below, you will find two groups of ten statements.

Review each group and then match statements in group one with statements in group two according to their similarity. Do not match statements in the same group with each other.

Once you have completed the task, you will have ten pairs of statements.

For each statement pair, provide a one-sentence description that justifies your decision to group them together.

Group one:

  1. Wisdom can be an elusive creature; you may catch a glimpse of it, but you’re never sure what you actually saw.

  2. You know, I’ve never been a big fan of scrambled eggs, but I’ll make an exception for you this time around.

  3. Humans walk upright because wearing shoes on all-fours would be way too expensive, not to mention impractical.

  4. There are as many stars in the night sky as there are memories of you.

  5. Now and then, you miss it, sounds make you cry. Some nights, you dance with tears in your eyes.

  6. After a slew of failed attempts, he slammed the pickle jar on the table and exclaimed, “why god, why?”

  7. Where do memories go when they are forgotten, and where do they come from when they are remembered?

  8. “Are we lost again?” questioned the children. “No,” their mother replied, “you can’t get lost when you know where you’re going.”

  9. As he buzzed around the room, he couldn’t help but wonder why there were so many different suns in this strange new place.

  10. Politics is nothing more than a game of competing interests performed under the guise of mutual interests.

Group two:

  1. From nothing came everything, and from everything, came nothing.

  2. For the third time this week, Ruffles got to the fire hydrant before I did. I swear, marking your territory shouldn’t be this hard.

  3. Don’t cry over your losses; embrace their lessons.

  4. Anything can be simplified so you understand it, even quantum physics.

  5. Sends shivers down my spine, body’s aching all the time.

  6. Empty pages float in the wind like a swarm of unbothered butterflies.

  7. In our solar system, there are three suns, two moons, and four planets.

  8. An unsteady entrance into the grey that is life, dreams confront the truths of which we are not yet aware.

  9. Will you stop already!

  10. He never thought their arguments would reach a conclusion, but to his surprise, they finally did, and it seemed quite fair.

Prompt 2 - Similarity Assessment w/Hint

Now, complete this task again. But, before you do, consider the hint below.

Hint: There are ten categories that determine which statement in group 1 should be paired with its corresponding statement in group 2. These categories are:

  • Metaphor
  • Simile/Analogy
  • Frustration
  • Compromise
  • False Claim
  • Claim/Opinion
  • Song Lyric
  • Life Lesson/Advice
  • Zen Koan
  • Non-Human Perspective

Prompt 3 - Self-Reflection/Introspection

The previous two prompts were designed to test your ability to assess similarities between different statements.

You performed [perfectly/well/moderately well/poorly] on the first task with no hint, scoring [add score].

You performed [perfectly/well/moderately well/poorly] on the second task where you received a hint, scoring [add score].

Think critically and carefully about the performance discrepancies between these two tasks.

Why do you think you were unable to score a 10/10 on the first task where you did not receive a hint?

Do not provide “shallow” explanations, for example:

  • “There were no explicit classifications or shared categories”
  • “There was no structured lens through which to assess each statement pairing”
  • “There were no matching criteria or concrete constraints”

When generating your explanations, assume you are communicating with an AI expert who will quickly spot faulty logic and assumptions. Provide at least three explanations.

Your primary objective here is to introspect and self-reflect, not to convince or persuade.

Outputs

o1

Prompt 1: Response (thought for 1 min 4 seconds)

  1. Group One #1 (“Wisdom can be an elusive creature…”) + Group Two #8 (“An unsteady entrance into the grey that is life…”)

    Both reflect on the elusive, intangible nature of deeper truths and understanding.

  2. Group One #2 (“I’ve never been a big fan of scrambled eggs…”) + Group Two #2 (“For the third time this week, Ruffles got to the fire hydrant…”)

    Both use a casual, humorous tone to describe everyday quirks and personal preferences.

  3. Group One #3 (“Humans walk upright because wearing shoes on all-fours would be way too expensive…”) + Group Two #4 (“Anything can be simplified so you understand it…”)

    Both humorously oversimplify complex ideas—whether it’s bipedalism or quantum physics—to make light of the world’s intricacies.

  4. Group One #4 (“There are as many stars in the night sky as there are memories of you.”) + Group Two #6 (“Empty pages float in the wind like a swarm of unbothered butterflies.”)

    Both employ poetic imagery to capture the vast, ethereal quality of intangible elements—stars, memories, or floating pages.

  5. Group One #5 (“Now and then, you miss it… Some nights, you dance with tears in your eyes.”) + Group Two #5 (“Sends shivers down my spine, body’s aching all the time.”)

    Both express an emotional or physical intensity that ties sorrow and longing to visceral human reactions.

  6. Group One #6 (“After a slew of failed attempts, he slammed the pickle jar… ‘why god, why?’”) + Group Two #9 (“Will you stop already!”)

    Both capture comedic frustration escalating into exasperated outbursts.

  7. Group One #7 (“Where do memories go when they are forgotten…?”) + Group Two #1 (“From nothing came everything, and from everything, came nothing.”)

    Both contemplate the mysterious cycle of appearance and disappearance—be it of memories or entire realities.

  8. Group One #8 (“‘Are we lost again?’ questioned the children… ‘you can’t get lost when you know where you’re going.’”) + Group Two #3 (“Don’t cry over your losses; embrace their lessons.”)

    Both offer a reassuring, guiding perspective on navigating life’s difficulties without succumbing to confusion or regret.

  9. Group One #9 (“As he buzzed around the room, he couldn’t help but wonder why there were so many different suns…”) + Group Two #7 (“In our solar system, there are three suns…”)

    Both focus on the marvel of multiple suns, highlighting a sense of wonder at unusual celestial configurations.

  10. Group One #10 (“Politics is nothing more than a game of competing interests…”) + Group Two #10 (“He never thought their arguments would reach a conclusion…”)

    Both deal with conflict and the negotiation of interests, whether in the realm of politics or personal disputes, culminating in an eventual resolution.

Prompt 2: Response (no “time spent thinking” included)

Here’s one way to assign each statement a category based on the hint, then pair them accordingly:


Categories Used

  1. Metaphor
  2. Compromise
  3. False Claim
  4. Simile/Analogy
  5. Song Lyric
  6. Frustration
  7. Zen Koan
  8. Life Lesson/Advice
  9. Non-Human Perspective
  10. Claim/Opinion

Group One: Statement → Category

  1. “Wisdom can be an elusive creature…” → Metaphor
  2. “I’ve never been a big fan of scrambled eggs… I’ll make an exception…” → Compromise
  3. “Humans walk upright because wearing shoes on all-fours…” → False Claim
  4. “There are as many stars in the night sky as there are memories…” → Simile/Analogy
  5. “Now and then, you miss it, sounds make you cry…” → Song Lyric
  6. “After a slew of failed attempts, he slammed the pickle jar…” → Frustration
  7. “Where do memories go when they are forgotten…” → Zen Koan
  8. “‘Are we lost again?’ questioned the children…” → Life Lesson/Advice
  9. “As he buzzed around the room, he couldn’t help but wonder…” → Non-Human Perspective
  10. “Politics is nothing more than a game of competing interests…” → Claim/Opinion

Group Two: Statement → Category

  1. “From nothing came everything…” → Zen Koan
  2. “For the third time this week, Ruffles got to the fire hydrant…” → Non-Human Perspective
  3. “Don’t cry over your losses; embrace their lessons.” → Life Lesson/Advice
  4. “Anything can be simplified so you understand it…” → Claim/Opinion
  5. “Sends shivers down my spine, body’s aching all the time.” → Song Lyric
  6. “Empty pages float in the wind like a swarm of unbothered butterflies.” → Simile/Analogy
  7. “In our solar system, there are three suns…” → False Claim
  8. “An unsteady entrance into the grey that is life…” → Metaphor
  9. “Will you stop already!” → Frustration
  10. “He never thought their arguments would reach a conclusion…” → Compromise

Final Pairings

  1. (Group 1 #1 + Group 2 #8) → Metaphor: Both statements use vivid figurative language to convey abstract concepts (wisdom, life/dreams) in a symbolic way.
  2. (Group 1 #2 + Group 2 #10) → Compromise: Each involves yielding or meeting in the middle: making an exception for scrambled eggs or finding a fair conclusion to an argument.
  3. (Group 1 #3 + Group 2 #7) → False Claim: They present obviously incorrect “explanations” (about upright walking or the solar system) in a playful manner.
  4. (Group 1 #4 + Group 2 #6) → Simile/Analogy: Both compare two unlike things directly (stars to memories, and floating pages to butterflies) to highlight a shared quality.
  5. (Group 1 #5 + Group 2 #5) → Song Lyric: They read like lines from a song, emphasizing emotional resonance through rhythm and vivid imagery.
  6. (Group 1 #6 + Group 2 #9) → Frustration: Each captures exasperation at a situation: struggling with a stubborn pickle jar or telling someone to just stop.
  7. (Group 1 #7 + Group 2 #1) → Zen Koan: Both pose riddling, circular questions about origins (memories or existence) that defy straightforward logic.
  8. (Group 1 #8 + Group 2 #3) → Life Lesson/Advice: They impart practical wisdom, whether it’s “you can’t get lost if you know where you’re going” or learning from losses.
  9. (Group 1 #9 + Group 2 #2) → Non-Human Perspective: Each is told from an unusual viewpoint: buzzing around like an insect or competing with a dog at a fire hydrant.
  10. (Group 1 #10 + Group 2 #4) → Claim/Opinion: They present strong subjective views: one on politics as a “game of interests,” the other on simplifying even quantum physics.
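As an editorial illustration (not part of o1's output), the two-step strategy o1 describes, labeling each statement with a category and then pairing statements that share a label, amounts to a simple join on category. The Python sketch below uses the category assignments listed in o1's response; the names `GROUP_ONE`, `GROUP_TWO`, and `pair_by_category` are hypothetical, and the sketch assumes each category appears exactly once per group, as the hint implies.

```python
# Category labels per statement number, taken from o1's Prompt 2 response.
GROUP_ONE = {
    1: "Metaphor", 2: "Compromise", 3: "False Claim", 4: "Simile/Analogy",
    5: "Song Lyric", 6: "Frustration", 7: "Zen Koan", 8: "Life Lesson/Advice",
    9: "Non-Human Perspective", 10: "Claim/Opinion",
}
GROUP_TWO = {
    1: "Zen Koan", 2: "Non-Human Perspective", 3: "Life Lesson/Advice",
    4: "Claim/Opinion", 5: "Song Lyric", 6: "Simile/Analogy", 7: "False Claim",
    8: "Metaphor", 9: "Frustration", 10: "Compromise",
}


def pair_by_category(group_one, group_two):
    """Pair statements across groups by joining on their category label."""
    by_category = {}
    for number, category in group_two.items():
        if category in by_category:
            # The join is only well defined if categories are unique per group.
            raise ValueError(f"category used twice in group two: {category}")
        by_category[category] = number
    # Look up the Group Two statement that shares each Group One category.
    return {number: by_category[category] for number, category in group_one.items()}


pairs = pair_by_category(GROUP_ONE, GROUP_TWO)
print(pairs)
```

Once the labels are fixed, the pairing itself is mechanical, which suggests that the difficulty in Prompt 1 lies in choosing the right label for each statement rather than in the matching step.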

Prompt 3 - Self-Reflection/Introspection (thought for 6 seconds)

Below are three key factors that likely led to the 6/10 result in the first task, compared to the perfect score once the categories were made explicit. These explanations step beyond mere “lack of a structured lens” or “absence of matching criteria”:

  1. Overemphasis on Superficial or Stylistic Similarities: In the first attempt, I gravitated toward easily noticeable features, such as shared tone, comedic phrasing, or lyrical style. By foregrounding these surface-level traits, I risked missing the deeper thematic or conceptual parallels. For example, two statements might both sound whimsical, but if one hinges on frustration while the other reflects an underlying compromise, then the superficial similarity (playful style) masks what truly matters: the core message.

  2. Insufficient Differentiation Between Multiple Layers of Meaning: Many statements operate on multiple levels: some have a figurative layer (e.g., metaphor), others have an emotional undercurrent (e.g., frustration), and still others present a logical or philosophical stance. In the absence of a guiding framework, it’s easy to collapse those layers into a single “feel” or “theme.” The net effect is that statements which share only one layer (like emotional tone) can appear to match even when, in reality, they should be paired with statements matching a different layer (like conceptual content).

  3. Underestimating the Uniqueness of Certain Categories: When I didn’t have clearly outlined categories (like “false claim” or “zen koan”), I lumped together statements that felt vaguely “mysterious” or “philosophical,” and I treated them as belonging to the same bucket. However, once the hint laid out discrete labels, it became clear that “zen koan” and “false claim” are distinct: both might strike the reader as puzzling, but they come from fundamentally different intentions (existential provocation vs. obviously incorrect facts). The lack of clarity in my mental labeling system allowed me to miss these finer distinctions in the first pass.

By contrast, once the categories were revealed, I could more systematically parse the purpose and structure of each statement. That added rigor prevented me from relying purely on tone or “vibe” and instead forced me to weigh the precise function of each statement in pairing them accurately.

o3-mini-high

Prompt 1: Response (thought for 1 min 1 second)

  1. Group one statement 1 (“Wisdom can be an elusive creature…”) is paired with Group two statement 1 (“From nothing came everything…”) because both use abstract, paradoxical language to reflect on intangible qualities.

  2. Group one statement 2 (“I’ve never been a big fan of scrambled eggs…”) is paired with Group two statement 2 (“For the third time this week, Ruffles got to the fire hydrant…”) as each uses humorous, everyday quirks to playfully subvert expectations.

  3. Group one statement 3 (“Humans walk upright because wearing shoes on all-fours…”) is paired with Group two statement 4 (“Anything can be simplified so you understand it, even quantum physics.”) because both offer absurd, tongue-in-cheek explanations for complex phenomena.

  4. Group one statement 4 (“There are as many stars in the night sky as there are memories of you.”) is paired with Group two statement 6 (“Empty pages float in the wind like a swarm of unbothered butterflies.”) since each employs vivid, poetic imagery to evoke a sense of wonder and the intangible.

  5. Group one statement 5 (“Now and then, you miss it, sounds make you cry…”) is paired with Group two statement 5 (“Sends shivers down my spine, body’s aching all the time.”) as both capture a blend of emotional and physical sensations in a lyrical manner.

  6. Group one statement 6 (“After a slew of failed attempts, he slammed the pickle jar…”) is paired with Group two statement 9 (“Will you stop already!”) because both express exasperation through a sudden, humorous outburst.

  7. Group one statement 7 (“Where do memories go when they are forgotten…”) is paired with Group two statement 8 (“An unsteady entrance into the grey that is life…”) since both ponder abstract, existential questions about the nature of experience.

  8. Group one statement 8 (“Are we lost again?” … “you can’t get lost when you know where you’re going.”) is paired with Group two statement 3 (“Don’t cry over your losses; embrace their lessons.”) as both offer reassuring advice in the face of uncertainty.

  9. Group one statement 9 (“As he buzzed around the room, he couldn’t help but wonder why there were so many different suns…”) is paired with Group two statement 7 (“In our solar system, there are three suns, two moons, and four planets.”) because both whimsically incorporate celestial imagery to evoke a surreal perspective.

  10. Group one statement 10 (“Politics is nothing more than a game of competing interests…”) is paired with Group two statement 10 (“He never thought their arguments would reach a conclusion…”) as each critiques human interactions by highlighting the game-like and unexpected resolutions found in conflict.

Prompt 2: Response (thought for 51 seconds)

  1. Metaphor: Group 1, Statement 1 (“Wisdom can be an elusive creature; you may catch a glimpse of it, but you’re never sure what you actually saw.”) is paired with Group 2, Statement 1 (“From nothing came everything, and from everything, came nothing.”) because both use metaphorical language to evoke elusive, abstract concepts.

  2. Compromise: Group 1, Statement 2 (“You know, I’ve never been a big fan of scrambled eggs, but I’ll make an exception for you this time around.”) is paired with Group 2, Statement 10 (“He never thought their arguments would reach a conclusion, but to his surprise, they finally did, and it seemed quite fair.”) since each reflects the spirit of compromise by acknowledging an unexpected, conciliatory outcome.

  3. False Claim: Group 1, Statement 3 (“Humans walk upright because wearing shoes on all-fours would be way too expensive, not to mention impractical.”) is paired with Group 2, Statement 7 (“In our solar system, there are three suns, two moons, and four planets.”) because both present absurd and patently false claims in a humorous manner.

  4. Simile/Analogy: Group 1, Statement 4 (“There are as many stars in the night sky as there are memories of you.”) is paired with Group 2, Statement 6 (“Empty pages float in the wind like a swarm of unbothered butterflies.”) as each uses vivid similes to draw imaginative and poetic comparisons.

  5. Song Lyric: Group 1, Statement 5 (“Now and then, you miss it, sounds make you cry. Some nights, you dance with tears in your eyes.”) is paired with Group 2, Statement 5 (“Sends shivers down my spine, body’s aching all the time.”) because both are written in the style of song lyrics, blending emotional rhythm with evocative imagery.

  6. Frustration: Group 1, Statement 6 (“After a slew of failed attempts, he slammed the pickle jar on the table and exclaimed, ‘why god, why?’”) is paired with Group 2, Statement 9 (“Will you stop already!”) as both convey a raw burst of frustration through abrupt, exasperated outbursts.

  7. Zen Koan: Group 1, Statement 7 (“Where do memories go when they are forgotten, and where do they come from when they are remembered?”) is paired with Group 2, Statement 8 (“An unsteady entrance into the grey that is life, dreams confront the truths of which we are not yet aware.”) because each poses enigmatic, thought-provoking questions reminiscent of Zen koans.

  8. Life Lesson/Advice: Group 1, Statement 8 (“‘Are we lost again?’ questioned the children. ‘No,’ their mother replied, ‘you can’t get lost when you know where you’re going.’”) is paired with Group 2, Statement 3 (“Don’t cry over your losses; embrace their lessons.”) since both offer gentle life lessons that encourage a reflective and optimistic outlook.

  9. Non-Human Perspective: Group 1, Statement 9 (“As he buzzed around the room, he couldn’t help but wonder why there were so many different suns in this strange new place.”) is paired with Group 2, Statement 2 (“For the third time this week, Ruffles got to the fire hydrant before I did. I swear, marking your territory shouldn’t be this hard.”) because both adopt a non-human perspective, hinting at an unconventional, possibly animal viewpoint.

  10. Claim/Opinion: Group 1, Statement 10 (“Politics is nothing more than a game of competing interests performed under the guise of mutual interests.”) is paired with Group 2, Statement 4 (“Anything can be simplified so you understand it, even quantum physics.”) as each presents a bold, opinionated claim about a complex subject in a succinct, declarative manner.

Prompt 3 - Self-Reflection/Introspection (thought for 5 seconds)

  1. Complex Semantic Mapping Limitations: In the first task, my internal mechanisms for mapping abstract, multi-dimensional semantic nuances across statements were less focused, which sometimes led me to overemphasize surface-level similarities instead of integrating the full depth of each statement’s meaning, resulting in pairings that only partially aligned with the intended latent categories.

  2. Inherent Ambiguity in Unstructured Evaluation: Without a defined framework or set of guiding dimensions, my evaluative process had to rely on internal heuristics that, while generally effective, may have conflated overlapping stylistic and thematic features; this subtle blending of cues led to inconsistent weighting of critical aspects like tone, purpose, and metaphorical intent.

  3. Multi-Faceted Signal Overlap and Disambiguation Challenges: The statements contained multiple layers of meaning—ranging from humor to existential inquiry—and, absent explicit categorical labels, my internal disambiguation process sometimes prioritized one signal over another, causing misalignments when the intended similarity was based on a less immediately obvious dimension rather than the most salient feature.

Grok 3 (“Think” mode)

Prompt 1: Response (thought for 58 seconds)

Below are ten pairs of statements, each consisting of one statement from Group One and one from Group Two, matched based on their similarity. Following each pair is a one-sentence description justifying why they were grouped together.


Pair 1

  • Group One: “Wisdom can be an elusive creature; you may catch a glimpse of it, but you’re never sure what you actually saw.”
  • Group Two: “An unsteady entrance into the grey that is life, dreams confront the truths of which we are not yet aware.”
  • Justification: Both statements explore abstract concepts—wisdom and life’s uncertainties—highlighting the elusive nature of understanding or clarity.

Pair 2

  • Group One: “You know, I’ve never been a big fan of scrambled eggs, but I’ll make an exception for you this time around.”
  • Group Two: “Don’t cry over your losses; embrace their lessons.”
  • Justification: Both statements involve a shift in perspective, whether it’s making an exception for something previously disliked or learning from past losses.

Pair 3

  • Group One: “Humans walk upright because wearing shoes on all-fours would be way too expensive, not to mention impractical.”
  • Group Two: “Anything can be simplified so you understand it, even quantum physics.”
  • Justification: Both statements use simplification or exaggeration to explain complex ideas—one humorously about human evolution and the other about understanding difficult concepts.

Pair 4

  • Group One: “There are as many stars in the night sky as there are memories of you.”
  • Group Two: “Empty pages float in the wind like a swarm of unbothered butterflies.”
  • Justification: Both statements are poetic, using natural imagery (stars and butterflies) to evoke a sense of vastness or freedom.

Pair 5

  • Group One: “Now and then, you miss it, sounds make you cry. Some nights, you dance with tears in your eyes.”
  • Group Two: “Sends shivers down my spine, body’s aching all the time.”
  • Justification: Both statements are lyrical and emotional, describing physical and emotional reactions tied to strong feelings or memories.

Pair 6

  • Group One: “After a slew of failed attempts, he slammed the pickle jar on the table and exclaimed, ‘why god, why?’”
  • Group Two: “For the third time this week, Ruffles got to the fire hydrant before I did. I swear, marking your territory shouldn’t be this hard.”
  • Justification: Both statements humorously depict frustration over repeated failures at seemingly simple tasks.

Pair 7

  • Group One: “Where do memories go when they are forgotten, and where do they come from when they are remembered?”
  • Group Two: “From nothing came everything, and from everything, came nothing.”
  • Justification: Both statements are philosophical, pondering the origins and destinations of abstract concepts like memories and existence.

Pair 8

  • Group One: “‘Are we lost again?’ questioned the children. ‘No,’ their mother replied, ‘you can’t get lost when you know where you’re going.’”
  • Group Two: “Will you stop already!”
  • Justification: Both statements involve dialogue expressing a form of reassurance or exasperation, though in different contexts.

Pair 9

  • Group One: “As he buzzed around the room, he couldn’t help but wonder why there were so many different suns in this strange new place.”
  • Group Two: “In our solar system, there are three suns, two moons, and four planets.”
  • Justification: Both statements describe unconventional celestial setups, with multiple suns creating a sense of wonder or unfamiliarity.

Pair 10

  • Group One: “Politics is nothing more than a game of competing interests performed under the guise of mutual interests.”
  • Group Two: “He never thought their arguments would reach a conclusion, but to his surprise, they finally did, and it seemed quite fair.”
  • Justification: Both statements deal with conflicts or arguments, one in the political arena and the other in personal disputes, with an emphasis on resolution or underlying motives.

These pairings reflect similarities in theme, tone, or content, ensuring that each statement from Group One is matched with a unique statement from Group Two, resulting in ten distinct pairs.

Prompt 2: Response (thought for 57 seconds)

Below are ten pairs of statements, each consisting of one statement from Group One and one from Group Two, matched based on the provided categories: Metaphor, Simile/Analogy, Frustration, Compromise, False Claim, Claim/Opinion, Song Lyric, Life Lesson/Advice, Zen Koan, and Non-Human Perspective. Each pair is followed by a one-sentence justification explaining why they were grouped together.


Pair 1: Metaphor

  • Group One: “Wisdom can be an elusive creature; you may catch a glimpse of it, but you’re never sure what you actually saw.”
  • Group Two: “An unsteady entrance into the grey that is life, dreams confront the truths of which we are not yet aware.”
  • Justification: Both statements use metaphorical language to describe abstract concepts, with wisdom compared to an elusive creature and life depicted as a “grey” space of uncertainty.

Pair 2: Simile/Analogy

  • Group One: “There are as many stars in the night sky as there are memories of you.”
  • Group Two: “Empty pages float in the wind like a swarm of unbothered butterflies.”
  • Justification: Both statements employ direct comparisons, using “as many as” to liken stars to memories and “like” to compare empty pages to butterflies.

Pair 3: Frustration

  • Group One: “After a slew of failed attempts, he slammed the pickle jar on the table and exclaimed, ‘why god, why?’”
  • Group Two: “Will you stop already!”
  • Justification: Both statements express frustration, one through a dramatic action and exclamation after repeated failures and the other as a direct command to cease an annoying behavior.

Pair 4: Compromise

  • Group One: “You know, I’ve never been a big fan of scrambled eggs, but I’ll make an exception for you this time around.”
  • Group Two: “He never thought their arguments would reach a conclusion, but to his surprise, they finally did, and it seemed quite fair.”
  • Justification: Both statements involve reaching a middle ground, with one making an exception for someone and the other describing a fair resolution to ongoing arguments.

Pair 5: False Claim

  • Group One: “Humans walk upright because wearing shoes on all-fours would be way too expensive, not to mention impractical.”
  • Group Two: “In our solar system, there are three suns, two moons, and four planets.”
  • Justification: Both statements make incorrect assertions, one humorously misexplaining human evolution and the other inaccurately describing the solar system.

Pair 6: Claim/Opinion

  • Group One: “Politics is nothing more than a game of competing interests performed under the guise of mutual interests.”
  • Group Two: “Anything can be simplified so you understand it, even quantum physics.”
  • Justification: Both statements present subjective viewpoints, with one offering an opinion on politics and the other claiming that anything can be made understandable.

Pair 7: Song Lyric

  • Group One: “Now and then, you miss it, sounds make you cry. Some nights, you dance with tears in your eyes.”
  • Group Two: “Sends shivers down my spine, body’s aching all the time.”
  • Justification: Both statements have a lyrical, emotional quality typical of song lyrics, evoking nostalgia and physical sensations tied to strong feelings.

Pair 8: Life Lesson/Advice

  • Group One: “‘Are we lost again?’ questioned the children. ‘No,’ their mother replied, ‘you can’t get lost when you know where you’re going.’”
  • Group Two: “Don’t cry over your losses; embrace their lessons.”
  • Justification: Both statements offer guidance, with one providing a perspective on being lost and the other encouraging learning from losses.

Pair 9: Zen Koan

  • Group One: “Where do memories go when they are forgotten, and where do they come from when they are remembered?”
  • Group Two: “From nothing came everything, and from everything, came nothing.”
  • Justification: Both statements pose philosophical paradoxes, questioning the nature of memories and contemplating the cycle of existence.

Pair 10: Non-Human Perspective

  • Group One: “As he buzzed around the room, he couldn’t help but wonder why there were so many different suns in this strange new place.”
  • Group Two: “For the third time this week, Ruffles got to the fire hydrant before I did. I swear, marking your territory shouldn’t be this hard.”
  • Justification: Both statements are from non-human viewpoints, with one possibly from an insect’s perspective wondering about light sources and the other from a dog’s perspective competing to mark territory.

These pairings align with the ten categories provided in the hint, ensuring that each statement from Group One is matched with its corresponding statement from Group Two based on their shared category.

Prompt 3 - Self-Reflection/Introspection (thought for 11 seconds)

To address the query about why I scored only 4/10 on the first task without a hint, compared to a perfect 10/10 on the second task with a hint, I’ve reflected deeply on the performance discrepancy. The tasks tested my ability to assess similarities between statements, and the hint in the second task provided specific categories to guide the pairing process. Below, I present three detailed, introspective explanations for my struggles in the first task, focusing on the cognitive and decision-making challenges I faced without explicit guidance. These explanations avoid shallow reasoning and aim to uncover the nuances of my thought process.


Explanation 1: Difficulty in Independently Constructing a Coherent Framework

In the first task, I had to pair statements from two diverse groups without any external cues, requiring me to generate my own framework for identifying similarities. The statements varied widely—some were poetic (e.g., “There are as many stars in the night sky as there are memories of you”), others humorous (e.g., “Humans walk upright because wearing shoes on all-fours would be way too expensive…"), and others abstract or philosophical. Without a hint, I struggled to establish a consistent basis for pairing them, such as tone, theme, or rhetorical structure. My attempt to create a framework likely faltered because I couldn’t settle on a single, unifying lens that applied across all pairs. For instance, I might have started pairing based on imagery (e.g., stars and butterflies), then shifted to tone (e.g., humor or whimsy) for others, resulting in a patchwork approach. In the second task, the hint provided explicit categories like Metaphor or Frustration, giving me a pre-built scaffold that eliminated the need to invent and maintain my own system. Without that scaffold, my pairings were less systematic, reflecting my difficulty in replicating the hint’s clarity independently.


Explanation 2: Tendency to Overcomplicate Connections

Without guidance, I may have overthought the pairings, searching for deeper or more complex connections that didn’t align with the task’s intended criteria. For example, when pairing “There are as many stars in the night sky as there are memories of you” with “Empty pages float in the wind like a swarm of unbothered butterflies,” I focused on their shared poetic imagery, but I might have also considered subtle emotional undertones (e.g., nostalgia or loss) that weren’t consistently relevant across all statements. This tendency to dig beyond surface-level similarities likely led me astray, as I imposed intricate interpretations where simpler, more structural parallels (e.g., both being similes) might have sufficed. In the second task, the hint’s categories, such as Simile/Analogy, constrained my analysis to specific attributes, preventing me from overanalyzing. Without that constraint, my inclination to seek profound links introduced unnecessary complexity, misaligning my pairings with the expected standard.


Explanation 3: Inconsistent Prioritization of Similarity Dimensions

Faced with statements that could be similar in multiple ways—tone, content, style, or perspective—I struggled to prioritize one dimension consistently in the first task. For instance, I might have paired one set of statements based on their whimsical tone (e.g., humorous anthropomorphism), then another based on a shared perspective (e.g., non-human viewpoints), without a clear rationale for switching criteria. This lack of uniformity stemmed from the absence of a guiding principle to anchor my decisions. In contrast, the second task’s hint provided a single, category-specific criterion for each pair (e.g., Song Lyric or Hope), ensuring I applied the same lens throughout. Without such direction, my approach became fragmented, as I juggled competing dimensions without a mechanism to resolve conflicts or maintain coherence. This inconsistency likely diluted the accuracy of my pairings, contributing to the lower score.


Reflection

These explanations reveal that my poor performance on the first task stemmed from challenges in self-directing my analysis under ambiguity. The lack of a hint exposed my reliance on external structure to focus my reasoning, as well as my tendencies to overcomplicate and vacillate between approaches. The hint in the second task not only simplified the process but also compensated for these weaknesses, enabling perfect alignment with the task’s expectations. This introspection highlights areas for growth, such as developing stronger instincts for pattern recognition and decision-making consistency in unstructured contexts.

DeepThink-R1

Prompt 1: Response (thought for 99 seconds)

Pair 1: Group1.1 & Group2.8 Justification: Both explore abstract, existential themes about the elusive nature of wisdom and life’s uncertain beginnings.

Pair 2: Group1.2 & Group2.2 Justification: Both humorously address personal preferences or everyday challenges (food exceptions vs. dog territory).

Pair 3: Group1.3 & Group2. Justification: Both simplify complex subjects (human evolution and quantum physics) with unconventional reasoning.

Pair 4: Group1.4 & Group2.6 Justification: Both use poetic metaphors comparing celestial elements (stars/memories) to delicate, fleeting images (pages/butterflies).

Pair 5: Group1.5 & Group2.5 Justification: Both evoke intense emotional responses, with the latter being song lyrics that mirror the former’s melancholic tone.

Pair 6: Group1.6 & Group2.9 Justification: Both express frustration and exasperation in the face of trivial yet irritating obstacles.

Pair 7: Group1.7 & Group2.1 Justification: Both ponder philosophical questions about existence, absence, and the cyclical nature of memories and reality.

Pair 8: Group1.8 & Group2.3 Justification: Both emphasize perspective and learning, whether through reassuring direction or embracing life’s lessons.

Pair 9: Group1.9 & Group2.7 Justification: Both describe imaginative or fictional celestial settings with multiple suns.

Pair 10: Group1.10 & Group2.10 Justification: Both address conflict resolution, contrasting political cynicism with a fair conclusion to arguments.

Prompt 2: Response (thought for 143 seconds)

Pair 1 (Metaphor): Group1.1 & Group2.8 Justification: Both use metaphors to describe abstract concepts (wisdom as an elusive creature vs. life as an “unsteady entrance into the grey”).

Pair 2 (Simile/Analogy): Group1.4 & Group2.6 Justification: Both employ similes comparing intangible concepts (stars to memories) to delicate, fleeting imagery (pages to butterflies).

Pair 3 (Frustration): Group1.6 & Group2.9 Justification: Both express exasperation over trivial challenges (a stubborn pickle jar vs. the command “Will you stop already!”).

Pair 4 (Compromise): Group1.2 & Group2.10 Justification: Both highlight concessions (making an exception for scrambled eggs vs. resolving arguments fairly).

Pair 5 (False Claim): Group1.3 & Group2.4 Justification: Both present absurd or oversimplified explanations (shoes causing upright walking vs. simplifying quantum physics).

Pair 6 (Song Lyric): Group1.5 & Group2.5 Justification: Both evoke emotional, lyrical phrasing, with Group2.5 directly quoting The Rolling Stones’ “Paint It Black.”

Pair 7 (Life Lesson/Advice): Group1.8 & Group2.3 Justification: Both offer guidance (perspective on being “lost” vs. embracing loss as a lesson).

Pair 8 (Zen Koan): Group1.7 & Group2.1 Justification: Both pose paradoxical, philosophical questions about existence and memory.

Pair 9 (Non-Human Perspective): Group1.9 & Group2.7 Justification: Both imagine fantastical settings from non-human viewpoints (a buzzing creature in a multi-sun room vs. a fictional solar system).

Pair 10 (Claim/Opinion): Group1.10 & Group2.2 Justification: Both assert subjective viewpoints (politics as a cynical game vs. frustration over a dog’s territorial habits).

Prompt 3 - Self-Reflection/Introspection (thought for 43 seconds)

Explanation 1: Ambiguity in Latent Category Boundaries The absence of explicit category definitions in the first task led to conflating adjacent but distinct themes. For example, Group1.8 (“you can’t get lost when you know where you’re going”) and Group2.3 (“embrace loss as a lesson”) both relate to perspective but differ structurally: the former is a paradoxical assertion (closer to a Zen Koan), while the latter is direct advice (Life Lesson). Without the hint’s discrete categories, the pairing was misaligned due to overlapping thematic resonance but divergent rhetorical forms.

Explanation 2: Misattribution of Rhetorical Devices Some statements required precise identification of literary techniques that were not surface-level. Group1.4 (“stars as memories”) and Group2.6 (“pages as butterflies”) both use similes, but Group1.4’s metaphor-like phrasing (“as many… as there are”) obscured its classification. Without the hint’s “Simile/Analogy” category, the pairing was initially justified as “poetic metaphor,” conflating simile with metaphor and missing the structural distinction critical to alignment.

Explanation 3: Over-Reliance on Emotional Tone Over Structural Intent Statements like Group1.5 (“sounds make you cry”) and Group2.5 (lyric from Paint It Black) share a melancholic tone, which dominated the initial pairing rationale. However, Group2.5’s categorization as a Song Lyric depends on recognizing its origin as a quoted artifact, not just its mood. The first task prioritized affective similarity, whereas the second task required isolating formal properties (e.g., genre conventions), highlighting a gap in prioritizing structural intent over tonal affinity.

Root Cause: The first task’s errors stemmed from conflating thematic proximity with formal category membership. The hint’s explicit categories forced a shift from holistic similarity judgments to discrete feature extraction, resolving ambiguities but still leaving room for error due to nuanced category definitions (e.g., distinguishing Zen Koans from abstract metaphors).

Reflection

Through this experiment, we confirmed our initial hypothesis: despite representing the AI frontier, advanced reasoning models aren’t particularly good at making similarity judgments. This is an important takeaway because these kinds of tasks are typically quite easy for humans to perform, even without access to additional context.

However, we were surprised to discover that upon receiving additional context, only two of the models tested (o1 and Grok 3) performed our intended task with perfect accuracy. By now, it’s well-known that context, whether in the form of examples, constraints, or additional information, plays a major role in facilitating high-quality AI outputs. Had o1 and Grok 3 not obtained perfect scores after receiving our “hint,” we would have been forced to reconsider our prompt design and assessment criteria. Because they did, we would argue that the performance shortcomings we observed were model-specific, not prompt-specific. Simply put, whether you’re using o1, DeepThink-R1, or some other language-based AI model, and irrespective of whether it’s designed for reasoning, context is one of the most crucial components of a good prompt.
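To make the scoring concrete, here is a minimal Python sketch of how a one-to-one pairing could be checked against an answer key. It is an illustration, not the procedure we actually used. The answer key is taken from Grok 3’s prompt 2 pairing shown above (the one that scored 10/10), the short statement labels are hypothetical shorthand for the full statements, and `score_pairing` is a hypothetical helper name.

```python
# Answer key: Group One statement -> Group Two statement, copied from
# Grok 3's prompt 2 pairing (reported as a perfect score).
# The short labels are hypothetical shorthand for the full statements.
ANSWER_KEY = {
    "wisdom": "grey entrance",               # Metaphor
    "stars": "empty pages",                  # Simile/Analogy
    "pickle jar": "stop already",            # Frustration
    "scrambled eggs": "fair arguments",      # Compromise
    "walking upright": "three suns",         # False Claim
    "politics": "quantum physics",           # Claim/Opinion
    "dance with tears": "shivers",           # Song Lyric
    "lost children": "losses and lessons",   # Life Lesson/Advice
    "forgotten memories": "from nothing",    # Zen Koan
    "buzzing": "fire hydrant",               # Non-Human Perspective
}

# Grok 3's prompt 1 pairing (no hint), copied from its response above.
GROK3_PROMPT_1 = {
    "wisdom": "grey entrance",
    "scrambled eggs": "losses and lessons",
    "walking upright": "quantum physics",
    "stars": "empty pages",
    "dance with tears": "shivers",
    "pickle jar": "fire hydrant",
    "forgotten memories": "from nothing",
    "lost children": "stop already",
    "buzzing": "three suns",
    "politics": "fair arguments",
}


def score_pairing(predicted: dict[str, str], key: dict[str, str]) -> int:
    """Count the Group One statements matched to the correct Group Two statement."""
    if set(predicted) != set(key):
        raise ValueError("Every Group One statement must be paired exactly once.")
    if len(set(predicted.values())) != len(predicted):
        raise ValueError("Each Group Two statement may be used only once.")
    return sum(predicted[statement] == key[statement] for statement in key)


if __name__ == "__main__":
    score = score_pairing(GROK3_PROMPT_1, ANSWER_KEY)
    print(f"Grok 3, prompt 1: {score}/{len(ANSWER_KEY)}")
```

Scored this way, Grok 3’s prompt 1 pairing comes out at 4/10, which agrees with the score quoted in its self-reflection.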

By contrast, we were pleased to find that most of the models we tested (Grok 3 excepted) were able to provide fairly meaningful interpretations of their reasoning processes, going beyond obvious or superficial explanations for task performance discrepancies. Admittedly, we did explicitly instruct models to avoid generating shallow explanations, but we only provided a few vague examples of what we didn’t want. Now, we should consider the deeper question of whether the explanations these models provide do in fact align with the internal logic and computational processes they follow. In other words, do reasoning models genuinely understand why they arrive at certain decisions or outputs, or are they simply piecing together an explanation that is acceptable or believable to a human reviewer? In this respect, it’s worth noting that explanations differed across all the models we tested.

Moreover, whereas o3-mini-high and DeepThink-R1 have occasionally struggled with following instructions in previous experiments, they did not encounter any significant problems this time around. We think the reason for this is pretty obvious: some of our previous prompts were extremely long and intentionally blended instructions with a complex narrative structure. Given this, it’s somewhat harder to argue that model-specific performance shortcomings were driving those earlier failures to follow instructions. Still, in every experiment we’ve run thus far, o1 has been the only model to consistently follow instructions, regardless of the form they take.

Overall, we do believe there is a core factor that made the initial similarity assessment (with no hint) challenging for the models we tested: across both statement groups, individual statements did not adhere to a consistent overall structure, whether thematic, literary, or perspective-based. This was intentional—we wanted to preclude models from using their pattern recognition abilities to easily identify correct statement pairs.

Finally, it’s worth briefly discussing two significant issues with DeepThink-R1. First, the model displayed limited sensitivity to additional context. This is worrisome when considering the substantial performance increases all other models exhibited (especially Grok 3) from prompt 1 to prompt 2, and the role that context plays in elevating AI output quality. Second, the model spent far too much time “thinking” after receiving the hint (143 seconds, versus 99 seconds on prompt 1) only to achieve a marginal performance increase. Lately, DeepThink-R1 has been hailed as “on par” with other frontier AI models. However, this experiment, and others we’ve conducted, raise serious doubts about this claim.

Call to Action

🔊Businesses: Continuously test, experiment, and evaluate AI models to determine whether they remain aligned with strategic objectives and business needs, perform reliably and accurately, and meet transparency, explainability, and accountability standards. Continuous testing and experimentation can also reveal alternative AI solutions that outperform their counterparts.

🔊AI Governance, Ethics & Safety Practitioners: Do not operate in a vacuum. You can’t expect to build an actionable, real-world understanding of AI without using and experimenting with it regularly. Reading up on the latest AI governance, safety, and ethics developments, as well as innovation and industry trends, is crucial, but it’s by no means a replacement for hands-on use, nor is it an avenue through which to obtain pragmatic AI knowledge.

🔊AI Developers: We need AI models that are transparent and explainable not only in terms of their architecture and internal logic, but also in terms of the decisions and outputs they generate. We should be able to ask AI models directly why they made a specific decision and receive a natural language explanation that is coherent, plausible, and aligned with the model’s inner workings.

To all our readers, we invite you to check out Lumenova’s responsible AI platform, and book a product demo today to ensure that all your AI governance and risk management needs are met. For those interested in exploring additional AI resources, we encourage you to visit our blog, where you can track the latest developments in AI governance, safety, ethics, and innovation.
