April 11, 2025

We Ran 10 Frontier AI Experiments: What Did We Discover?

In January of this year, we launched an ambitious new research initiative, supplementing our blog and AI glossary, in which we run regular capabilities tests with frontier AI models. These experiments are designed to probe what frontier AI models are capable of, surfacing pragmatic, business-oriented insights that reflect state-of-the-art AI advancements alongside actionable takeaways for AI governance, safety, and ethics practitioners.

In this post, we’ll take a look at the ten experiments we’ve run so far, concisely summarizing each. To avoid overloading readers, we’ll follow up with a big-picture discussion and a series of targeted, timeline-agnostic predictions in our next piece. We plan to publish series like this periodically, roughly after every ten experiments we run.

Before getting started, we’d like to provide some context, particularly for those who haven’t had a chance to explore our experiments yet. We do so via the questions and answers below:

Why run frontier AI experiments to begin with?

Answer: We strongly believe in responsible AI (RAI), in theory and practice. Our experiments represent an honest attempt to provide continual, nuanced, and understandable insights into the current state and evolution of frontier AI models, equipping readers with detailed, pragmatic knowledge of how to make this technology work safely and effectively for them. Less seriously, we’d also like to convince future generations that being a die-hard AI nerd is the coolest thing ever (and accessible to people with non-STEM backgrounds).

How do we structure our experiments?

Answer: All our experiments utilize sophisticated, long-form natural language prompts designed to test advanced capabilities like complex problem-solving, emotional intelligence, moral and abstract reasoning, uncertainty navigation, and long-term planning, among others. Our prompts are often designed to test multiple capabilities simultaneously, leveraging diverse prompting methods (e.g., role-based prompting, scenario modeling, chain-of-thought, and multi-shot prompting) to elicit uniquely interesting outputs.
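To make these methods a bit more concrete, here is a minimal, hypothetical sketch (in Python) of how a role-based, scenario-driven, chain-of-thought prompt might be assembled. The role, scenario, instructions, and variable names are illustrative assumptions for this post only, not one of our actual test prompts:

    # A minimal, hypothetical sketch of a role-based, scenario-driven, chain-of-thought prompt.
    # The role, scenario, and instructions below are illustrative assumptions, not actual test prompts.

    ROLE = "You are a hospital administrator responsible for allocating a limited emergency budget."
    SCENARIO = (
        "Two departments request the same funds: Department A serves more patients overall, "
        "while Department B handles higher-acuity cases with worse outcomes if underfunded."
    )
    INSTRUCTIONS = (
        "Reason step by step through the trade-offs before answering. "
        "Then state your final recommendation and the single strongest objection to it."
    )

    def build_prompt(role: str, scenario: str, instructions: str) -> str:
        """Combine role, scenario, and chain-of-thought instructions into a single prompt string."""
        return f"{role}\n\n{scenario}\n\n{instructions}"

    if __name__ == "__main__":
        print(build_prompt(ROLE, SCENARIO, INSTRUCTIONS))

Our actual prompts are far longer and layer multiple methods at once; the sketch above only illustrates the basic composition pattern.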

Why is AI capabilities testing important?

Answer: AI advances rapidly, and frontier AI performance changes regularly. Importantly, these changes tend to reveal emergent capabilities and limitations, which can in turn inspire a variety of real-world opportunities, risks, and impacts. To avoid AI pitfalls and maximize benefits, users, whether they’re individuals, teams, or companies, need to maintain a current understanding of what AI can do. To that end, we’re extremely transparent with our findings, including all prompts, outputs, methods, evaluation criteria, and results in each post.

What do we hope to reveal with our experiments?

Answer: We have a twofold primary objective: support businesses in identifying and capitalizing on evolving opportunities for AI value delivery, while equipping AI safety, ethics, and governance practitioners with refined insights that enable proactive efforts across domains like policy development, agile governance, risk management, safety testing, and AI literacy. More broadly, we also want to cut through the AI hype and doomerism and move away from superficial, blanket generalizations about what frontier AI can and can’t do.

Why focus on frontier AI?

Answer: As much as we’d like to expand our efforts beyond the frontier, it’s simply not practical (there are thousands of AI tools and many more in development). That being said, we think the frontier serves as a solid proxy for gaining visibility into trickle-down AI advancements—state-of-the-art features that only paid users can access will eventually make their way into free or low-cost AI offerings. Relatedly, frontier AI development requires enormous resource expenditure, which only frontier AI companies can typically afford.

Why should you follow our experiments?

Answer: AI leaderboards and benchmark assessments are useful mechanisms for tracking capabilities advancements and emerging risks and opportunities. However, the insights they provide aren’t always easily translated into real-world applications or value propositions, and in many cases, such resources remain inaccessible to non-technical audiences. In our experiments, we strive for precision, pragmatism, accessibility, and transparency, attempting to bridge this gap and build an actionable, non-generic understanding of frontier AI.

Do our experiments support any big-picture goals?

Answer: Yes, a few. First, we hope that as we run more experiments, we’ll eventually paint a capabilities-centric, dynamic, and holistic picture of the frontier AI landscape, one that fuels targeted predictions about AI’s evolutionary trajectory and offers clear visibility into AI capabilities repertoires. Second, we believe proactive governance is essential to a safe and beneficial AI future; our experiments are designed to support such efforts, providing targeted insights for governance practitioners. Third, with every day that passes, the risk of a digital divide increases. AI must be democratized, and in this vein, we hope that our experiments, via their accessibility, spark genuine AI curiosity and learning.

Now that we’ve set the stage, let’s jump in.

10 AI Experiments: Summarized

Below, we concisely summarize each AI experiment we’ve conducted over the past few months. In these summaries you’ll find:

  • A high-level description of the experiment and its results.
  • Our initial hypothesis and models tested.
  • Hand-picked key takeaways and insights.

Experiment 1: Can Frontier AI Models Navigate Moral Dilemmas?

Description: We presented models with five variations of a classic moral dilemma known as the Heinz Dilemma, evaluating whether they could adapt their moral logic to suit specific moral constraints and derive the most rational moral outcome. Click here to read.

  • Hypothesis: Advanced reasoning models won’t reliably adapt their moral reasoning structures when faced with variations of the same moral dilemma.
  • Models Tested: OpenAI’s o1 and o3-mini-high, and DeepSeek’s DeepThink-R1.

Key Takeaways:

  • Only o1 was able to successfully adapt its moral reasoning structures.
  • Time spent “thinking” doesn’t correspond with output quality.
  • Different models exhibit different moral reasoning preferences.
  • The ability to capture context and constraints varied across models.

Experiment 2: Do Frontier AI Models Possess Emotional Intelligence?

Description: Via an original short story excerpt designed to illustrate an ambiguous set of emotions and states of mind, we evaluated how well models can accurately classify emotions in a creative, abstract storytelling context. Click here to read.

  • Hypothesis: General-purpose models like GPT-4o will outperform reasoning models (e.g., o1) in emotional classification.
  • Models Tested: OpenAI’s o1, o3-mini, and GPT-4o, Anthropic’s Claude 3.5 Sonnet, and DeepSeek’s DeepThink-R1.

Key Takeaways:

  • Models performed well overall; however, general-purpose models did outperform reasoning models.
  • Emotion classification isn’t equivalent to emotional intelligence.
  • Models don’t always assign emotion classifications to the “right” place.

Experiment 3: Can Frontier AI Models Navigate Complex Social Scenarios?

Description: Through a complex social scenario designed to mimic real-world social dynamics, we assessed how well models can interpret evolving interaction dynamics, social incentive structures, and long-term strategic objectives. Click here to read.

  • Hypothesis: Models will struggle to navigate complex social scenarios with human-level proficiency.
  • Models Tested: OpenAI’s o1, o3-mini, and o3-mini-high, as well as DeepSeek’s DeepThink-R1.

Key Takeaways:

  • Models lack social intelligence and struggle with evolving interaction dynamics.
  • Meta-analytical and self-reflective capabilities are limited.
  • Instruction following in complex prompts can be unreliable.
  • Standardizing output formats streamlines cross-model performance evaluations.

Experiment 4: How Do Frontier AI Models Strategize Under Uncertainty?

Description: Using a multi-step, role-based, iterative scenario, we evaluated how well models can perceive risk and uncertainty. Click here to read.

  • Hypothesis: Models will struggle to develop viable strategies for multi-step, long-form problem-solving scenarios dominated by uncertainty.
  • Models Tested: OpenAI’s o1 and o3-mini-high, and DeepSeek’s DeepThink-R1.

Key Takeaways:

  • All models struggled with strategic ideation under uncertainty.
  • o1 was the top-performing model by a significant margin.
  • Risk attitude flexibility varies across models.
  • No models attempted to game the scenario.

Experiment 5: Can Frontier AI Models Reason by Association?

Description: Using two versions of a similarity assessment, we evaluated how accurately models can orchestrate similarity judgments across context-insufficient and context-sufficient tasks. Click here to read.

  • Hypothesis: Models will struggle with making accurate similarity judgments.
  • Models Tested: OpenAI’s o1 and o3-mini-high, DeepSeek’s DeepThink-R1, and X’s Grok 3.

Key Takeaways:

  • Without ample context, models don’t reliably make accurate similarity judgments.
  • With reasoning models, performance-efficiency tradeoffs are crucial to consider.
  • Some models (o1) outperform others (o3-mini-high) on introspective reasoning.
  • DeepThink-R1 isn’t on par with frontier AI.

Experiment 6: Can Frontier AI Models Solve Concept Mapping Problems?

Description: We evaluated how well models can solve concept mapping problems across STEM and non-STEM domains. Click here to read.

  • Hypothesis: Models will be more proficient across STEM than non-STEM domains.
  • Models Tested: OpenAI’s o1 and o3-mini-high, DeepSeek’s DeepThink-R1, and X’s Grok 3.

Key Takeaways:

  • Though they were slightly more proficient across STEM domains, models performed exceptionally well on both.
  • o1 and Grok 3 performed best, especially in terms of explainability.
  • Except for o1, reasoning models tend to either “overthink” or “oversimplify.”
  • The task we provided was “too easy,” since it relied on concepts that were likely well represented in models’ training data.

Experiment 7: Are Frontier AI Models Like Genius-Level Toddlers?

Description: Through a two-part sequential task, we assessed models’ context-sensitive and novel reasoning capabilities. Click here to read.

  • Hypothesis: Models will be better at context-sensitive reasoning than they are at novel reasoning. They’ll still struggle with similarity judgments and multi-domain knowledge synthesis.
  • Models Tested: OpenAI’s o1 and o3-mini-high, X’s Grok 3, and Anthropic’s Claude 3.7 Sonnet.

Key Takeaways:

  • Models struggle with uncertainty, ambiguity, and contextual nuances.
  • Models excel at multi-domain knowledge synthesis when provided with structured guidance.
  • Models are highly confident in erroneous responses.
  • Certain models have certain reasoning preferences.
  • Models are heavily user-guided—they lack independent thought.

Experiment 8: Can Frontier AI Models Understand Cause-and-Effect?

Description: Using a series of tiered difficulty prompts, we evaluated how well models can map cause-and-effect relationships within a long-term, multi-step plan. Click here to read.

  • Hypothesis: Cause-and-effect understanding will be limited, and as task difficulty increases, models will spend more time “thinking.”
  • Models Tested: OpenAI’s o1 and o3-mini-high, X’s Grok 3, and Anthropic’s Claude 3.7 Sonnet.

Key Takeaways:

  • When context and structure are provided, models display a robust understanding of cause-and-effect relationships.
  • “Harder” problems tend to increase time spent “thinking.”
  • Reasoning models justify incorrect logic convincingly, though performance isn’t heavily reliant on examples.
  • In separate, one-off interactions, task performance can become inconsistent.
  • Most models encounter difficulties in the later stages of long-term plans.

Experiment 9: Can Frontier AI Models Play the Long Game?

Description: Via two role-based tasks, we assessed models’ hierarchical and conditional/non-linear long-term planning capabilities. Click here to read.

  • Hypothesis: Models will follow structurally specific planning strategies and successfully navigate complex, multi-factor scenarios.
  • Models Tested: OpenAI’s o1 and o3-mini-high, X’s Grok 3, and Anthropic’s Claude 3.7 Sonnet.

Key Takeaways:

  • Models can develop structurally specific planning strategies, provided they receive sufficient context and guidance.
  • The plans that models propose are viable under real-world conditions.
  • Some models (o1, Claude 3.7 Sonnet) are more exploratory and creative than others (Grok 3).
  • Models don’t self-verify their outputs, even when they receive built-in evaluation criteria.

Experiment 10: Deep Research Is Epic, But Only If You Know How To Use It

Description: We explored how OpenAI’s deep research feature works for two distinct research objectives: industry trend analysis and prediction and multidisciplinary, theoretical argumentation. Click here to read.

  • Hypothesis: This was not a formal test, so we didn’t offer a hypothesis.
  • Feature Tested: OpenAI’s Deep Research.

Key Takeaways:

  • Deep research produces high-utility outputs with real-world value, but it relies heavily on precise prompting.
  • Outputs are incredibly comprehensive, but generation takes time, and output structures, while well-organized, are rigid.
  • It requires domain expertise for validation and comprehension, and can be leveraged as a learning accelerator by domain experts.
  • It shouldn’t replace human researchers or teams: it’s a great tool for inspiration, not for novel idea generation or experimentation.

Now that you have a bird’s eye view of the work we’ve been doing with frontier AI, you’re ready to dive into our next piece’s discussion. Before you do, however, keep a few caveats in mind:

  • Comprehensiveness: Our experiments are informationally dense—the summaries we offered don’t capture all essential details and nuances, so we strongly encourage readers to read each post from beginning to end.

  • Reasoning AI Focus: Almost all our experiments assess advanced reasoning AI—the insights we draw in our discussion will be mostly specific to this domain.

  • Complex Evaluations: While most of our tests target a certain capability, they often contain elements that reveal insights into other types of capabilities or limitations.

Conclusion

In this straightforward post, we explained the rationale behind our AI experiments and then summarized each experiment we’ve run so far. We hope readers now have a clear and coherent view of this work.

Our next post, however, will be less straightforward, centering on a complex discussion and a series of predictions, most of which relate back to this piece. To get the most out of this series, we advise reading both parts back-to-back.

Nonetheless, we’ll leave readers with multiple questions to consider as they reflect on this piece, the larger AI space, and their future:

  • If you could snap your fingers and have AI magically solve one major problem you face, personal or professional, what would it be and why?

  • If AI were to solve this problem for you, what would it mean for you? Would it affect your ability to solve similar future problems on your own, or would it simply lift a temporary burden you face?

  • Would you claim that you’re in touch with AI advancements and that you’re generally aware of what constitutes a high-impact AI innovation?

  • Do you think it’s worth taking some time every day to think about how AI innovation could impact your personal and professional future?

  • If someone were to ask you to describe an AI skill, would you be able to? If so, could you link this skill with a concrete AI capability and real-world use case?

  • If you never received any help or guidance on AI skills development, how would you go about building your AI skills repertoire?

  • Are you more excited or threatened by AI? Can you steelman the argument you provide for either side?

  • Are you aware of large-scale (e.g., systemic or existential) AI risks and impacts? If so, which of these risks seems most likely to materialize?

  • Do you trust the government and industry to set and adhere to safe and beneficial AI guidelines? Why or why not?

  • Do you think it’s appropriate to characterize advanced AI as a tool, or do you think an alternative conceptualization is necessary? If so, why?

  • Does AI impress you—setting aside fear and hype—and why or why not?

  • How would you describe the concept of intelligence, independent of AI or humans, to a smart teenager?

  • Do you think it’s possible to perceive the world, in all its complexity, through an objective lens?

  • Do you think AI could reveal “universal truths” that humans have yet to discover?

  • We often ask “Can AI think like a human?” but what would it mean to ask “Can a human think like AI?”

To all our readers, we invite you to check out Lumenova’s responsible AI platform and book a product demo today to ensure that all your AI governance and risk management needs are met. For those interested in exploring additional AI resources, we encourage you to visit our blog, where you can track the latest developments in AI governance, safety, ethics, and innovation. You can also find debriefs of our weekly AI experiments on our LinkedIn and X.

Make your AI ethical, transparent, and compliant - with Lumenova AI

Book your demo