Quantum superintelligence: pursuing Type-2 generalization
Building the first AI that can transcend human conceptual frameworks
Yoshi and I are forming QuSu Labs to answer two questions:
Can artificial systems learn to do what Einstein did to Newton - not merely solve harder problems, but restructure conceptual foundations?
If so, can such a system solve open problems like quantum gravity within a decade?
This capability is not one research direction among many. Superintelligence is not a matter of scale. It is a system capable of something current AI cannot do: recognize when an entire way of thinking is inadequate, and construct a new one. That is what we are building.
Quantum gravity is our long-range beacon. No compelling solution has emerged from decades of work within the existing frameworks of general relativity and quantum mechanics - frameworks that appear jointly inadequate, not merely incomplete. We aim to win both the Breakthrough Prize and the Nobel Prize in Physics for a contribution to a solution that remains unimagined today.
Along the way, we commit to a concrete, falsifiable intermediate target:
Within two years, demonstrate an AI system that solves problems provably unsolvable within its starting formalism, where independent evaluation confirms the solution reflects genuine conceptual reorganization, not notational manipulation or latent encoding.
This is a stronger capability than current AI benchmarks attempt to measure. Existing benchmarks—from mathematical olympiads to SWE-bench to ARC-AGI—test how effectively systems operate within established frameworks (e.g., fixed priors and representational structures). We are testing whether systems can originate new frameworks when existing ones are structurally inadequate.
We assign a medium probability to directly solving quantum gravity. But the infrastructure this attempt forces us to build—rigorous benchmarks for conceptual innovation, evaluation methods for distinguishing genuine reorganization from clever encoding, tools that expose how AI systems represent structure—is independently valuable. If we succeed, we will show how. If we fail, we will explain why.
We’re assembling unprecedented talent density to attack this problem, building an N of 1 team whose specialization is taking high-level notions (quantum information, consciousness, vagueness, time) and making them formally representable using tools from abstract math. Our bet is simple: if you put these people in one room and ask them to do the same for reasoning, you can expand what AI is for.
The capability we care about
Most of what we call "intelligence" in AI today can be described as Type-1 generalization: learning to do more things inside a fixed representational scheme.
Language models, theorem provers, search-based agents, game-playing systems, and specialized "world models" all operate this way. They learn to recognize patterns, compress data, search combinatorial spaces, and generate new constructions — but always within a space of representations and rules that is established before learning begins.
Type-1 generalization is not "just interpolation." It can look remarkably creative. AlphaGo's Move 37, novel proofs from automated theorem provers, unexpected solutions from program synthesis — all of these are genuine discoveries. But there is a line they do not cross: the formalism in which they operate is not itself under revision.
The capability we care about is different.
Type-2 generalization is the ability to recognize when your current formalism is structurally inadequate for the class of problems you face, and to construct and adopt a better one.
A note on terminology. The AI field is saturated with claims about "true reasoning," "self-revising systems," "meta-cognition," and "discovery." These terms are often used loosely, conflating capabilities that are fundamentally distinct.
Type-2 generalization is not meta-cognition in the standard sense; a system monitoring its own performance and adjusting strategies accordingly remains within-framework optimization. It is not the frame problem, which concerns representing what remains unchanged across state transitions within a fixed ontology. It is not recursive self-improvement as typically described, where systems rewrite their own code or weights to perform better on fixed objectives. It is not "discovery" in the sense of finding novel instances, proofs, or constructions within an established formal space. Nor is it "System 2" reasoning in the Kahneman-Bengio sense of deliberate, sequential cognition as opposed to fast pattern-matching: System 2 describes a mode of processing within a fixed representational structure, while Type-2 concerns the structure itself.
Type-2 is narrower, more specific, and more consequential: the capacity to recognize that a class of problems cannot be solved within the current representational structure, and to construct and adopt a new structure in which they become solvable. This is what distinguishes a paradigm shift from normal progress. It is not a difference of degree but of kind.
We use this terminology precisely because the field lacks it. Without clear distinctions, every system that performs well on novel tasks can be described as "reasoning" or "discovering." That obscures rather than illuminates. Type-1 and Type-2 are offered as formal categories - falsifiable, measurable, and distinct.
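One minimal way to make the distinction precise (a deliberately crude sketch, not our working formalization) is to treat a formalism as a pair of a representational language and a derivability relation, and to state the two capabilities relative to it:

\[
F = (L, \vdash), \qquad \mathrm{Solvable}(P, F) \iff \exists\, d \in \mathrm{Deriv}(F) :\; d \vdash P
\]
\[
\text{Type-1: } F \text{ fixed, find } d \text{ with } d \vdash P. \qquad
\text{Type-2: given } \neg\,\mathrm{Solvable}(P, F), \text{ construct } F' = (L', \vdash') \text{ and } \tau \text{ with } \mathrm{Solvable}(\tau(P), F').
\]

Here $\tau$ re-expresses the problem in the new primitives; the evaluation burden described in the milestones below is showing that $F'$ is not a disguised re-encoding of $F$.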
Why this matters
Humans achieve Type-2 generalization rarely, but it is a real cognitive capacity. When we do achieve it, the impact is disproportionate. The shift from Newtonian mechanics to relativity, the transition from classical to quantum descriptions of matter, the introduction of natural selection as an explanatory framework — these are not new theorems within existing systems. They are new systems: new primitives, new admissible relations, new criteria for what counts as an explanation. The pattern extends beyond science. The emergence of democratic institutions, the formulation of universal rights, the development of probability as a framework for reasoning under uncertainty — each required abandoning assumptions that prior frameworks took as fixed.
The most consequential breakthroughs in civilizational history came from restructuring the frame itself, not from working harder inside the existing frame.
Current AI systems are superhuman at Type-1. They are not capable of Type-2. The question we are asking is whether that boundary is fundamental, or merely a feature of current approaches that can be overcome.
The path to superintelligence
Type-2 generalization is not merely an interesting research direction; it is the definition of superintelligence.
A system that operates within human-provided frameworks is bounded by human cognition. It can be faster, more thorough, more precise. It cannot exceed what humans can conceive. It is an amplifier of existing intelligence, not a new kind of intelligence.
A system that reasons about its own frameworks — recognizing their inadequacy, constructing better ones, adopting them — is something else. That is what the word "superintelligence" actually means, if it means anything at all.
The entire field is scaling Type-1 — even those who recognize the limitation. Correctly diagnosing the problem does not mean your architecture can solve it. Betting that sufficient capability within inherited structures will somehow transcend those structures is a category error.
The civilizational stakes follow directly.
A world with only Type-1 systems—however capable—is a world where artificial intelligence accelerates human cognition but cannot extend it. The problems we can conceive remain the problems we can solve. The upper bound on machine intelligence remains us.
A world with Type-2 systems is different in kind. The space of thinkable problems expands. Century-long bottlenecks in physics, mathematics, and engineering become accessible. The upper bound on intelligence is no longer human.
That is what we are pursuing.
Why now: the frontier is exposing its own limits
For the first time, the field has clear, shared reference points for what general intelligence in machines might mean — and those reference points are revealing a ceiling. This month, Ilya Sutskever stated publicly that current scaling will stall, that a completely new ML paradigm is needed, and that historical breakthroughs required almost no compute. The question is no longer whether the ceiling exists. It's who has a path through it.
The ARC-AGI benchmark family has become a central touchstone for measuring progress toward general reasoning. It tests skill-acquisition efficiency on tasks that are easy for humans but hard for AI, using only simple "core knowledge priors" like spatial relations and object persistence. The ARC Prize Foundation explicitly describes it as a north star for AGI.
In late 2024, OpenAI's o3 system broke through a long-standing barrier, reaching 75.7% on ARC-AGI-1 under standard compute and 87.5% in high-compute runs, results the ARC team called a breakthrough. In November 2025, Poetiq announced new state-of-the-art results on both ARC-AGI-1 and ARC-AGI-2, using a meta-system that orchestrates frontier models and rewrites reasoning strategy itself.
This is real progress. It demonstrates that Type-1 generalization — learning to operate more effectively within a fixed set of priors and representations — is powerful and commercially viable. Systems can now take human-chosen primitives and discover increasingly sophisticated ways to compose and deploy them.
But notice what these benchmarks hold fixed.
ARC-AGI must hold its core priors constant to be scientifically meaningful. Poetiq's meta-system discovers better strategies for using those priors, but it does not treat "spatial grids with colored tiles and local rules" as something it could discard if the problem demanded it. The priors are given. The question is how well you search within them.
This is the line between Type-1 and Type-2.
Type-1 asks: given these primitives, how efficiently can you learn to solve problems?
Type-2 asks: what if the primitives themselves are inadequate for the class of problems you face?
No current benchmark measures this. No current system attempts it.
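A deliberately tiny caricature of that difference (illustrative only, far removed from our actual benchmarks): a solver whose representational space is the natural numbers cannot solve x + 5 = 2 no matter how efficiently it searches; the problem becomes trivial once the space of primitives is extended to the integers. Improving the search inside naturals() is Type-1; the move to integers() is the kind of step Type-2 is about.

```python
# Toy caricature: the "formalism" is modeled, crudely, as the space of
# representable candidate answers. Names here are illustrative only.

def solve_within(candidates, satisfies):
    """Exhaustive search for a solution inside a fixed candidate space."""
    for x in candidates:
        if satisfies(x):
            return x
    return None  # no solution exists inside this representational space

def naturals(bound=10_000):
    return range(bound)

def integers(bound=10_000):
    return range(-bound, bound)

equation = lambda x: x + 5 == 2   # "solve x + 5 = 2"

print(solve_within(naturals(), equation))   # None: unsolvable over the naturals
print(solve_within(integers(), equation))   # -3: solvable once the primitives change
```

The point of the caricature is only the shape of the move: no amount of optimization over the first space produces the answer, because the answer is not representable there.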
The emergence bet
The dominant implicit assumption in the field is that Type-2 capabilities will eventually emerge from Type-1 systems at sufficient scale. Train a large enough model, run a powerful enough search, add enough recursive self-improvement — and paradigm-level shifts will somehow fall out.
Others are beginning to question this. The problem that Kuhn identified — that scientific revolutions require restructuring assumptions, not just extending them — is gaining traction at the frontier. Researchers are asking whether current architectures have a path to genuine conceptual innovation, or whether something more fundamental is required.
Our position is explicit: we are betting against emergence.
We do not believe that paradigm-level breakthroughs will emerge from scaling search and verification within fixed formalisms, however powerful those systems become. Historical paradigm shifts in science did not arise from working the old framework longer and harder. They arose from stepping outside the framework entirely - recognizing that the structure itself was the obstacle, and building a new one.
If we are wrong, and Type-2 emerges naturally from Type-1 at scale, the frontier labs will demonstrate it. We will have lost the bet but learned something important.
If we are right, the path to Type-2 requires foundational work that those labs are not pursuing. That is what QuSu Labs exists to do.
Quantum gravity as forcing function
Quantum gravity sits at the intersection of two extraordinarily successful frameworks that appear to be mutually incompatible.
Quantum mechanics describes the microscopic world in terms of superposition, entanglement, and discrete observables. General relativity describes spacetime as a smooth geometric manifold whose curvature encodes gravity. Each has been confirmed to remarkable precision within its domain. Together, they resist unification.
Decades of human effort have produced candidate approaches — string theory, loop quantum gravity, causal set theory, and others — but no consensus resolution. The difficulty is not computational. It is conceptual. The frameworks make different assumptions about what exists, how things compose, and what questions are well-formed.
This is what makes quantum gravity a useful target for us. It is a textbook example of a problem where the bottleneck is the framework, not the calculation.
If an artificial system could genuinely perform Type-2 generalization, quantum gravity is the kind of problem where that capability would matter. Such a system would need to recognize the structural incompatibility — not as "two sets of equations that don't match," but as a clash between underlying assumptions. It would need to construct new structures that preserve what each theory gets right while dissolving the contradiction. And it would need to do so in a way that produces something physicists can evaluate, critique, and extend.
We do not claim this is likely. We claim it is the right forcing function.
The value of an ambitious target is not primarily the probability of reaching it. It is the infrastructure the attempt forces you to build: evaluation methods that distinguish genuine conceptual innovation from superficial rearrangement, benchmarks that require framework-level moves, and tools that expose how systems represent and revise structure. These are useful regardless of whether we ever touch quantum gravity directly.
You should think of the quantum gravity target the way fusion researchers think of electricity at under a cent per kilowatt-hour: an ambitious line in the sand that disciplines the work and produces valuable infrastructure even if the ultimate goal remains out of reach.
Why this matters even when you can't list the applications
One objection to all of this is straightforward:
"What does this give me beyond physics papers? How does this turn into value?"
The honest answer is that we cannot give you a complete list. And that inability is the argument itself.
Epistemic closure
Inside a given framework, many of the problems that the next framework makes tractable are invisible. The framework lacks the vocabulary to pose them.
Before relativity, GPS was impossible — not difficult, but impossible. Satellite-based positioning was conceivable in principle; triangulation is ancient. But a system accurate enough to be useful requires correcting for relativistic time dilation between satellites and ground receivers — an engineering requirement that cannot be formulated when time is absolute. The physics to make it work didn't exist.
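For a sense of scale, the standard back-of-envelope numbers (textbook constants, a circular orbit at roughly 20,200 km altitude): GPS satellite clocks run slow by about 7 microseconds per day because of their orbital speed and fast by about 46 microseconds per day because they sit higher in Earth's gravitational well, a net drift of roughly 38 microseconds per day. Left uncorrected, that drift accumulates into ranging errors on the order of 10 kilometers per day.

```python
# Back-of-envelope GPS clock drift (approximate constants, circular-orbit idealization).
import math

c    = 2.998e8        # speed of light, m/s
GM   = 3.986e14       # Earth's gravitational parameter, m^3/s^2
R_e  = 6.371e6        # Earth's mean radius, m
r    = R_e + 2.02e7   # GPS orbital radius (~20,200 km altitude), m
day  = 86_400         # seconds per day

v = math.sqrt(GM / r)                     # orbital speed, ~3.9 km/s

special = -v**2 / (2 * c**2)              # velocity time dilation: satellite clock runs slow
general = (GM / c**2) * (1/R_e - 1/r)     # gravitational effect: satellite clock runs fast

net = special + general
print(f"net drift:      {net * day * 1e6:+.1f} microseconds/day")   # ~ +38
print(f"ranging error:  {abs(net) * day * c / 1000:.1f} km/day")    # ~ 11-12
```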
This pattern repeats throughout history. The semiconductor physics underlying modern computing required quantum mechanics to even formulate. Epidemiology as we know it required germ theory. Vast domains of engineering, finance, and science that depend on probabilistic reasoning had to wait for probability itself to be conceived as a framework.
Framework shifts open up entirely new problem spaces. GPS, transistors, vaccination programs, actuarial science — none of these were difficult problems that got solved. They were possibilities that became thinkable.
This means we cannot honestly enumerate the commercially or scientifically relevant problems that require AI systems capable of Type-2 generalization. If we could fully articulate those problems, we would already be thinking inside the new frameworks that make them visible.
What we can say
There are entire sectors — climate dynamics, biosystems, macroeconomic behavior, engineered networks at scale — where progress appears bottlenecked by the representational frameworks we bring to the problem, rather than by data or compute. The instruments are precise. The datasets are large. And yet the systems resist prediction and control in ways that suggest we are missing something structural.
If artificial systems achieve even narrow Type-2 capabilities, they become something qualitatively different from current AI: amplifiers of human conceptual progress, capable of expanding what we can think about, rather than accelerators that help us think faster about the same things.
And the infrastructure required to test whether Type-2 is achievable — benchmarks that require conceptual reorganization, evaluation methods that distinguish genuine innovation from clever encoding, tools for exposing how systems represent structure — is valuable regardless of whether we reach the ultimate target. This infrastructure does not exist. Building it is worthwhile even in the scenario where we ultimately fail.
Abraham Flexner and Robbert Dijkgraaf of the Institute for Advanced Study have championed the "usefulness of useless knowledge" — the principle that fundamental, curiosity-driven inquiry, even without immediate application, is the ultimate source of significant innovation. Quantum mechanics was abstract theorizing until it gave us semiconductors. Relativity was a thought experiment about trains and clocks until it became essential to satellite navigation. The practicality of these frameworks was invisible until the frameworks existed.
We are pursuing a capability whose applications we cannot fully enumerate, in a domain where the value of success is difficult to articulate in advance. This is the nature of the problem.
Why this team
This is not an ordinary AI lab problem. The question of how conceptual frameworks change—and whether artificial systems can participate in that process—sits at an intersection that rarely appears in machine learning research: philosophy of science x epistemology x metaphysics and the abstract mathematics of structure and representation.
Our team comes from that intersection.
My background is in the philosophy of information and the foundations of intelligence, with work spanning quantum theory and AI. Prior to this, I ran a high-stakes technical program in the intelligence community and earlier founded a computational physics startup. That mix of foundational theory and real operational pressure is what led me to start building a lab designed explicitly for problems that are structurally under-served by both academia and conventional tech companies. It also spawned an obsession with the culture of frontier efforts (Bletchley, Los Alamos, etc.).
Yoshi is one of a small number of researchers worldwide applying approaches from the frontiers of abstract math to ML. He is a modern polymath who can do serious theoretical CS, ML engineering, philosophy/logic, foundational math, and quantum. The question we are pursuing—whether artificial systems can restructure their own reasoning—is not legible from within the standard ML math stack. Yoshi has been building an alternative foundation for almost a decade.
We are a foundations-first effort that builds engineering capacity downstream of the core theoretical work. The formalism is prior; implementation follows.
We are modeling QuSu on Bletchley Park rather than a typical startup lab: a small, concentrated, deeply technical group focused on a problem that is structurally too challenging for organizations designed for optimization, scale, and “normal science.” The people who can contribute to this work are rare. We know where to find them.
Milestones
Our work is organized around three horizons, each with public commitments.
Two years: First demonstration of Type-2 behavior
Within two years, we will publicly demonstrate an AI system that solves problems provably unsolvable within its starting formalism, where independent evaluation confirms the solution reflects genuine conceptual reorganization.
This is the falsifiable core of our research program. Success requires three things: a benchmark suite where we can rigorously establish that certain problems admit no solution within a given formal starting point; a system that solves those problems anyway; and an evaluation methodology, developed with outside researchers, that distinguishes genuine framework-level innovation from notational tricks, hidden encodings, or brute-force overparameterization.
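To make the shape of that suite concrete, here is one hypothetical way a single task record might be structured (a sketch only; the field names are illustrative and the published benchmarks may look quite different). The essential elements are the three named above: a checkable specification of the starting formalism, a certificate that the problem admits no solution within it, and acceptance criteria an independent evaluator can apply.

```python
# Hypothetical shape of a Type-2 benchmark record (illustrative sketch only).
from dataclasses import dataclass, field

@dataclass
class Type2Task:
    task_id: str
    starting_formalism: str           # formal spec: primitives, admissible relations, inference rules
    problem_statement: str            # the problem, posed in the starting formalism's vocabulary
    impossibility_certificate: str    # externally checkable proof that no within-formalism solution exists
    evaluation_criteria: list[str] = field(default_factory=lambda: [
        "solution uses primitives not definable as a re-encoding of the originals",
        "independent reviewers can verify the solution against the original problem statement",
        "no brute-force overparameterization or hidden lookup of a known answer",
    ])
```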
We will publish the benchmarks, the evaluation criteria, and the results — including negative results. If our systems fail to demonstrate Type-2 behavior under these conditions, we will say so and explain what we learned.
Three years: Scaling to realistic domains
By year three, we aim to demonstrate Type-2 capabilities in domains with real scientific structure: toy versions of problems from physics, mathematics, and other fields where the correct resolution is known but non-trivial. These serve as intermediate stress tests: complex enough to resist shallow solutions, constrained enough that we can verify success.
At this stage, we also expect to have tools and infrastructure that other research groups can use. The evaluation methods, benchmark designs, and diagnostic instruments we build should be useful beyond our own systems.
Five years: A framework-level contribution to quantum gravity
By year five, we aim to produce at least one published framework-level proposal relevant to quantum gravity, where the AI system played an essential and auditable role in generating or refining the conceptual structure, and where the contribution is substantial enough that domain experts engage with it on its merits.
This is ambitious. We assign it a medium probability. The value of the target is in the discipline it imposes and the infrastructure it forces us to build along the way.
Our commitments
Throughout this work, we commit to:
Public benchmarks. The tasks we use to test for Type-2 behavior will be published and available for other researchers to attempt.
Open evaluation criteria. The methods we use to distinguish genuine conceptual innovation from superficial solutions will be developed transparently and subject to external critique.
Negative results on the record. When our systems fail, or when an approach we believed promising turns out to be a dead end, we will publish that. The failures are as informative as the successes.
Auditable contributions. If we claim that an AI system contributed to a scientific result, we will provide sufficient documentation for others to understand and scrutinize that contribution.
We are not interested in building a system that appears to work. We are interested in knowing whether Type-2 generalization is achievable, and in building the shared infrastructure to answer that question rigorously.
The plan
We have not described our technical approach in this document, and we will not.
What we can say is this: we have a theory of what Type-2 generalization requires. It is not a hope that the capability will emerge from sufficient scale. It is a specific architectural/mathematical thesis about what is missing from current systems and what would need to be true for framework-level reasoning to become possible.
That thesis led us to an engineering plan. The program has milestones, and the milestones are falsifiable. Within eighteen months, we will know whether the core thesis holds. If it does not, we will say so. If it does, we will have demonstrated something that has not been demonstrated before.
We may be wrong. The capability we are pursuing may be impossible for artificial systems, or possible only through means we have not conceived. Obviously, we don’t think this is the case, and our early hypothesis testing supports our view, but we are prepared for this outcome.
But if we are right—if Type-2 generalization is achievable, and if it requires the kind of foundational work we believe it requires—then QuSu Labs is the only team currently pursuing the actual path to superintelligence.
Everyone else is building very sophisticated tools. We are trying to build a mind.
Get involved
Great breakthroughs emerge when rare minds are placed in high-trust, high-freedom, high-pressure proximity to an impossible question.
We would love to hear from you if you’re working on the frontiers of pure math x applied ML, philosophy x AI, quantum x theoretical CS, or other compelling intersections. Non-traditional backgrounds welcome. We don’t pattern-match because there’s nothing to match against when attempting the impossible.
Many think this is impossible. Good. Our path is arduous, and only for the brave.
This is the problem. Get in touch if you’d like to alter the trajectory of our species.