A patient has osteoarthritis in their right knee. The patient asks a large language model what's likely to happen and how much it will cost. The model checks prevalence data, sees the diagnosis code, and forecasts a trajectory toward total knee replacement. That's a $30,000 to $50,000 answer.
It sounds authoritative. For many patients, it's wrong.
I'll come back to why. But this example captures something I keep running into after years of building AI products in health and financial services: everyone focuses on the model. Better prompts, better fine-tuning, better retrieval.
In benefits, the model isn't the bottleneck. The intelligence layer underneath it is.
Benefits data isn't one problem. It's four, stacked on top of each other: orchestration, data foundations, benefits structure, and claims intelligence. Most systems solve the first and stop.
Orchestration is the routing layer: parsing intent, selecting data sources, applying guardrails, evaluating correctness for a specific person. It's necessary, but it's a well-understood engineering problem.
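As a rough sketch of what that routing layer does, consider the toy router below. The intent labels, data-source names, and keyword matching are all illustrative assumptions; a production router would be model-driven and far more granular.

```python
from dataclasses import dataclass

# Illustrative intent-to-source mapping; these source names are
# hypothetical, not from any real system.
SOURCES = {
    "coverage": ["plan_documents", "point_solutions"],
    "cost": ["claims_history", "plan_documents", "hsa_fsa_balances"],
    "filing": ["plan_documents", "deadlines_calendar"],
}

@dataclass
class RoutedQuery:
    intent: str
    sources: list
    needs_human_review: bool

def route(question: str) -> RoutedQuery:
    """Classify intent, select data sources, and apply a guardrail."""
    q = question.lower()
    if any(w in q for w in ("cost", "pay", "price", "expensive")):
        intent = "cost"
    elif any(w in q for w in ("deadline", "file", "claim form")):
        intent = "filing"
    else:
        intent = "coverage"
    # Guardrail: anything touching money is flagged for evaluation
    # against the member's actual plan before an answer is shown.
    return RoutedQuery(intent, SOURCES[intent],
                       needs_human_review=(intent == "cost"))
```

The point is that this layer is ordinary engineering: classification, dispatch, and policy checks.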
Data foundations goes deeper. The health and financial data needed to answer a real benefits question is fragmented across dozens of systems that were never designed to interoperate. Assembling it is genuinely hard. It's also where most companies stop and declare victory. Unified data is the starting line, not the finish line, because having all the data in one place doesn't mean you know what to do with it.
The two layers that follow, benefits structure and claims intelligence, are where the complexity compounds. Harder to build, harder to replicate, and far more consequential when they are missing.
An employer offers a digital musculoskeletal program alongside their major medical physical therapy benefit. Which one applies first? Does using one affect coverage under the other? The same employee has a critical illness policy that pays a lump sum on certain diagnoses, but the filing window is 90 days and the qualifying conditions don't map to what most people think "critical illness" means. Meanwhile, a behavioral health point solution was added last year, but the employee already has EAP sessions and in-network behavioral health coverage through their base plan. Three overlapping programs, three different coverage rules, three different access points.
A general model will describe what each benefit does in isolation. It won't map how they interact for a specific situation, which one to use first, or what's being left on the table. When someone gets this wrong, it's the employee paying $4,000 out of pocket for physical therapy that a covered point solution would have handled, or missing a $10,000 critical illness payout because they didn't file within a window they didn't know existed.
Closing this gap requires transforming unstructured plan documents into a structured schema that captures these interdependencies, validated by actuaries and benefits professionals, continuously, across millions of plans.
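A minimal sketch of what such a schema fragment might look like, assuming a precedence graph between overlapping benefits. The field names and the ordering logic are hypothetical; a real schema, as the text notes, is validated by actuaries and covers far more than this.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Benefit:
    name: str
    category: str                  # e.g. "msk", "behavioral_health"
    # Benefits this one should be used before (precedence edges).
    use_before: list = field(default_factory=list)
    # e.g. 90 for the critical illness filing window in the example.
    filing_window_days: Optional[int] = None

def first_applicable(benefits: list, category: str) -> Optional[Benefit]:
    """Return the benefit to use first among overlapping benefits in a
    category: the one that no other benefit's precedence edge points at."""
    in_cat = [b for b in benefits if b.category == category]
    later = {name for b in in_cat for name in b.use_before}
    for b in in_cat:
        if b.name not in later:
            return b
    return None
```

Encoding interdependencies as data, rather than leaving them implicit in plan PDFs, is what lets the system answer "which one applies first" deterministically.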
This isn't an LLM problem. It's a domain expertise problem with an LLM component.
Claims intelligence is the deepest layer, and the hardest to replicate.
An ICD-10 code tells you a diagnosis. It doesn't tell you the severity. A procedure code tells you what was done. It doesn't tell you billing intent. An admission record doesn't confirm inpatient status without revenue codes. This layer reconstructs fragmented claims into longitudinal episodes of care, triangulates diagnoses with procedures, validates billing through revenue codes, and models cost by site of care.
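To make episode reconstruction concrete, here is a simplified sketch: group a patient's claim lines by diagnosis family whenever services fall within a time gap. The 90-day gap, the three-character diagnosis grouping, and the field set are all assumptions for illustration; real episode groupers are clinically curated.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class ClaimLine:
    patient_id: str
    service_date: date
    icd10: str                    # diagnosis code
    cpt: Optional[str]            # procedure code
    revenue_code: Optional[str]   # billing context, e.g. room & board

def to_episodes(lines, gap=timedelta(days=90)):
    """Group one patient's claim lines into longitudinal episodes:
    same diagnosis family, no more than `gap` between services."""
    episodes = []
    for line in sorted(lines, key=lambda l: l.service_date):
        for ep in episodes:
            same_dx = ep[-1].icd10[:3] == line.icd10[:3]
            recent = line.service_date - ep[-1].service_date <= gap
            if same_dx and recent:
                ep.append(line)
                break
        else:
            episodes.append([line])
    return episodes
```

Once lines are grouped into episodes, triangulation becomes possible: a diagnosis with no matching procedures, or a procedure with no supporting revenue code, is a signal rather than noise.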
Without it, you get the wrong forecast. A model sees a diabetes diagnosis attached to a few minor clinical encounters and estimates a controlled condition at $3,000-$4,000 a year. But the claims trail tells a different story: pharmacy data showing gaps in the expected medication regimen, inconsistent lab testing, overdue specialist exams. That pattern points toward uncontrolled diabetes that could cost $10,000-$15,000 or more annually. The intelligence layer catches what the data does and does not reveal. A model-only approach can't, because it doesn't read trajectories from claims.
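The pharmacy-gap signal above can be sketched in a few lines. This assumes 30-day fills and a 7-day grace period, both illustrative parameters; real adherence measures (such as proportion of days covered) are more sophisticated.

```python
from datetime import date, timedelta

def refill_gaps(fill_dates, days_supply=30, grace=7):
    """Find stretches in a refill history where the patient had run out
    of medication for longer than the grace period. Persistent gaps
    suggest the condition may not be as managed as the diagnosis code
    alone implies."""
    gaps = []
    fills = sorted(fill_dates)
    for prev, nxt in zip(fills, fills[1:]):
        covered_until = prev + timedelta(days=days_supply + grace)
        if nxt > covered_until:
            gaps.append((covered_until, nxt))
    return gaps
```

A forecast that conditions on signals like this one lands in a different cost band than one that reads the diagnosis code alone.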
Back to the patient with the osteoarthritic knee.
A general-purpose model sees the diagnosis code for unilateral osteoarthritis and forecasts surgery. A properly built intelligence layer asks a different question: has this patient been through the expected treatment pathway? Physical therapy, corticosteroid injections, anti-inflammatory prescriptions. If those procedure and pharmacy codes are absent, the cost forecast changes entirely. This patient hasn't exhausted conservative care, which means a surgical trajectory may be premature as a planning assumption.
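The pathway check is, at its core, a set-membership question over the claims trail. The sketch below uses a handful of real CPT codes for the conservative-care steps named above, but the code groups and the NDC prefix are illustrative stand-ins; production logic would rely on curated, clinically validated code sets.

```python
# Illustrative code groups, not a validated clinical pathway.
CONSERVATIVE_CARE = {
    "pt_cpt": {"97110", "97140"},     # therapeutic exercise, manual therapy
    "injection_cpt": {"20610"},       # major joint injection
    "nsaid_ndc_prefixes": {"0093"},   # example manufacturer prefix (assumed)
}

def conservative_care_exhausted(cpt_codes, ndc_codes):
    """Has this patient been through the expected pre-surgical pathway:
    physical therapy, injections, and anti-inflammatory prescriptions?"""
    had_pt = bool(CONSERVATIVE_CARE["pt_cpt"] & set(cpt_codes))
    had_injection = bool(CONSERVATIVE_CARE["injection_cpt"] & set(cpt_codes))
    had_nsaid = any(n.startswith(p) for n in ndc_codes
                    for p in CONSERVATIVE_CARE["nsaid_ndc_prefixes"])
    return had_pt and had_injection and had_nsaid

def cost_forecast(cpt_codes, ndc_codes):
    """Swing the forecast based on pathway position, per the example."""
    if conservative_care_exhausted(cpt_codes, ndc_codes):
        return (30_000, 50_000)   # surgical episode is a plausible next step
    return (3_000, 5_000)         # conservative therapies come first
```

The absence of codes is doing the work here: what the claims trail does not contain changes the answer as much as what it does.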
The financial picture shifts from $30,000-$50,000 for a surgical episode to $3,000-$5,000 for therapies. The intelligence layer can surface whether the employer's digital MSK benefit applies and how it layers on top of major medical. It can check the HSA balance and FSA election to show the actual out-of-pocket impact given the employee's financial position. If surgery does become necessary later, it can flag the leave and disability benefits that would need to be activated and the filing timelines that apply. And the patient can prepare, physically and mentally, for a significant medical event instead of being blindsided by it.
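The out-of-pocket piece is arithmetic once the plan terms and account balances are in one place. The function below is a deliberately simplified sketch: it ignores copays, network tiers, and plan-specific accumulator rules, all of which matter in practice.

```python
def out_of_pocket(cost, deductible_remaining, coinsurance,
                  oop_max_remaining, hsa_balance, fsa_remaining):
    """Estimate the member's actual cash exposure for an episode of care,
    after plan cost sharing and tax-advantaged account balances."""
    # Member pays the remaining deductible, then coinsurance on the rest,
    # capped at the remaining out-of-pocket maximum.
    member_share = min(cost, deductible_remaining)
    member_share += max(0.0, cost - deductible_remaining) * coinsurance
    member_share = min(member_share, oop_max_remaining)
    # HSA and FSA dollars offset the cash impact.
    covered_by_accounts = min(member_share, hsa_balance + fsa_remaining)
    return member_share - covered_by_accounts
```

For example, a $4,000 course of physical therapy against a $1,500 remaining deductible and 20% coinsurance is a $2,000 member share, and $1,500 of HSA/FSA funds brings the cash impact down to $500.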
No amount of model improvement gets you here. Models can get better at language, retrieval, and reasoning, but they cannot generate employer-specific plan structure, claims trajectory data, or point solution interplay from pre-training.
Most of what's shipping right now in AI benefits is a model layer with basic plan data retrieval. That architecture will work until the first employee has a real health event, makes a financial decision based on the agent's guidance, and discovers the answer was confidently, expensively wrong.
That moment is coming. And when it arrives, the question won't be whether the model was good enough. It'll be whether anyone built the intelligence layer underneath it.
If you're shipping a benefits agent without one, the first real-world health event will make that gap impossible to ignore.
That’s why we've spent years perfecting Nayya Intelligence, the foundation that powers our entire platform, so that moment never has to happen.