Parallel Paths: Finding Validation for My Context Engineering Framework


Back in September 2025, I learned the hard way that traditional prompt engineering for agentic tasks, while adequate for early frontier models, was not going to be suitable or scalable for complex tasks. Since then, LLMs have only gotten smarter, and AI agents are now capable of stringing together multi-step tasks involving multiple tool calls in a single operation. Context windows have expanded too, but a larger context window does not confer much advantage; in fact, it yields diminishing returns. Gemini has a 1M token context window, but anyone who has worked extensively with Gemini will tell you how badly it hallucinates as the conversation progresses. In my personal, anecdotal experience, the sweet spot for maintaining effective context is up to roughly 500K tokens, after which quality degrades. Anthropic’s 200K limit already puts you on a budget, and Gemini’s 1M limit causes a different problem altogether. So on the one hand, modern LLMs keep getting smarter; on the other, you still have the problem of making the LLM reference the right context. The way to leverage the strengths of modern LLMs is not more effective prompt engineering, but layering & engineering context effectively.

I spent all of October brainstorming and conceptualizing a context layering & engineering framework that could potentially be leveraged for long-horizon agentic coding tasks. I published my findings and concept in early November. You can read it here: https://ashaykubal.com/blog/from-context-chaos-to-clear-an-origin-story/

Feeling Vindicated

Since Thanksgiving rolled around, I’ve been deep in the implementation of this framework and am getting close to completing its core engine. In between that work and keeping up with industry trends, I was pleasantly surprised, and felt thoroughly vindicated, to see that industry discourse has lately been shifting towards solving context engineering for long-running agentic tasks.

This work builds on earlier research into the fundamentals of context engineering. Several publications have expounded on the benefits of and strategies for effective context engineering, but to call out a couple that I particularly loved, read these:

What is interesting is how multiple research teams have coalesced around a more or less shared understanding of how to approach effective context engineering for long-running agents. Three concepts stand out from these publications.

I. Effective Agent/Context Engineering is akin to traditional Software Engineering

LangChain specifically calls it Agent Engineering and distinguishes it from Context Engineering.

As per LangChain: Agent engineering is the iterative process of refining non-deterministic LLM systems into reliable production experiences. It is a cyclical process: build, test, ship, observe, refine, repeat. We see agent engineering as a new discipline that combines 3 skillsets working together:

  • Product Thinking defines the scope and shapes agent behavior
  • Engineering builds the infrastructure that makes agent production-ready
  • Data Science measures and improves agent performance over time

II. A dual-agent harness is the optimal way to layer and engineer effective context

Anthropic developed a two-fold solution to manage context across sessions.

As per Anthropic: The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before. We developed a two-fold solution to enable the Claude Agent SDK to work effectively across many context windows:

  • An initializer agent that sets up the environment on the first run (includes plans broken down into features & deliverables)
  • A coding agent that is tasked with making incremental progress in every session, while leaving clear artifacts for the next session

This research demonstrates one possible set of solutions in a long-running agent harness to enable the model to make incremental progress across many context windows. However, it’s still unclear whether a single, general-purpose coding agent performs best across contexts, or if better performance can be achieved through a multi-agent architecture. It seems reasonable that specialized agents, like a testing agent, a quality assurance agent, or a code cleanup agent, could do an even better job at sub-tasks across the software development lifecycle.
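To make this concrete, here is a minimal sketch of what such a harness could look like. All names here (the agent functions, the artifact file paths, the plan schema) are my own hypothetical illustrations, not the Claude Agent SDK’s actual API.

```python
import json
from pathlib import Path

# Hypothetical artifact locations; a real harness would define its own.
PLAN_FILE = Path("artifacts/plan.json")
HANDOFF_FILE = Path("artifacts/handoff.md")

def initializer_agent(goal: str) -> None:
    """First run only: set up the environment and break the goal
    down into features & deliverables."""
    plan = {
        "goal": goal,
        "features": [  # in practice, produced by an LLM call
            {"name": "auth", "status": "pending"},
            {"name": "billing", "status": "pending"},
        ],
    }
    PLAN_FILE.parent.mkdir(parents=True, exist_ok=True)
    PLAN_FILE.write_text(json.dumps(plan, indent=2))

def coding_agent_session() -> None:
    """Every session: load prior artifacts, make incremental progress,
    and leave a clear handoff for the next session."""
    plan = json.loads(PLAN_FILE.read_text())
    feature = next((f for f in plan["features"] if f["status"] == "pending"), None)
    if feature is None:
        return  # plan complete; nothing left to do
    # ... LLM-driven implementation of `feature` happens here ...
    feature["status"] = "done"
    PLAN_FILE.write_text(json.dumps(plan, indent=2))
    HANDOFF_FILE.write_text(f"Completed: {feature['name']}. See plan.json for what remains.")
```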

III. Required & Necessary Context is a compiled view of a rich stateful system

Google holds that building production-grade agents that are reliable, efficient & debuggable requires context engineering, and defines context engineering as a “first-class system with its own architecture, lifecycle, and constraints.”

As per Google: In previous generations of agent frameworks, context was treated like a mutable string buffer. But now, context is a compiled view over a richer stateful system. In that view:

  • Sessions, memory, and artifacts (files) are the sources, i.e. full structured state of the interaction & its data
  • Flows and processors are the compiler pipeline, i.e. a sequence of passes that transform that state
  • The working context is the compiled view you ship to the LLM for this one invocation
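In code, that compiler analogy might look something like the sketch below. Everything in it (the `State` dataclass, the processor functions, `compile_context`) is my own hypothetical illustration of the idea, not Google’s actual framework API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class State:
    """The rich stateful system: sessions, memory, and artifacts."""
    session_events: list[str] = field(default_factory=list)
    memory: dict[str, str] = field(default_factory=dict)
    artifacts: dict[str, str] = field(default_factory=dict)

# A processor is one compiler pass: State -> context fragments.
Processor = Callable[[State], list[str]]

def recent_events(state: State) -> list[str]:
    return state.session_events[-5:]  # keep only the latest turns

def relevant_memory(state: State) -> list[str]:
    return [f"{k}: {v}" for k, v in state.memory.items()]

def compile_context(state: State, pipeline: list[Processor]) -> str:
    """Run the pipeline of passes over the state and ship the
    compiled view to the LLM for this one invocation."""
    fragments: list[str] = []
    for process in pipeline:
        fragments.extend(process(state))
    return "\n".join(fragments)

# The working context is recompiled fresh for every invocation.
state = State(session_events=["user: add login"], memory={"stack": "FastAPI"})
working_context = compile_context(state, [relevant_memory, recent_events])
```

The key design point is that nothing mutates a shared prompt string: each invocation recompiles the working context from the same underlying sources.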

Realizing These Insights Personally

The reason these three concepts stood out to me is that I arrived at them myself, organically, over the last month while actually implementing the CLEAR framework.

As part of the implementation, I’ve been managing session states & transitions, knowledge (requirements & past decisions), and project plans in a semi-automated manner. While doing that, I also set up a metrics capture mechanism & dashboard that allow me to measure and quantify the efficacy of my semi-automated system. I plan on using this as a baseline to measure the efficacy of the CLEAR framework once it is implemented, by doing a comparative analysis between projects built with and without CLEAR.
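For reference, the capture mechanism itself can be as simple as appending one record per session for the dashboard to read. The sketch below is illustrative; the field names are not my exact schema.

```python
import csv
import time
from pathlib import Path

METRICS_FILE = Path("metrics/sessions.csv")
FIELDS = ["session_id", "session_type", "duration_hours",
          "tokens_used", "lines_of_code"]

def record_session(session_id: str, session_type: str, started_at: float,
                   tokens_used: int, lines_of_code: int) -> None:
    """Append one row per session; the dashboard reads this file."""
    METRICS_FILE.parent.mkdir(parents=True, exist_ok=True)
    write_header = not METRICS_FILE.exists()
    with METRICS_FILE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "session_id": session_id,
            "session_type": session_type,
            "duration_hours": round((time.time() - started_at) / 3600, 2),
            "tokens_used": tokens_used,
            "lines_of_code": lines_of_code,
        })
```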

Here’s what the numbers say:

And here’s what they mean:

  • On average, each session lasted ~2.5-3 hours of active work
  • I consistently hit ~70% or more token consumption (out of Claude’s 200K token window) every session; that was the threshold I had set to trigger session handoff and transition to a new session (encoded in the sketch after this list)
  • However, output varied widely from session to session. In some, I was able to get Claude to write over ~1,500 lines of code; in others, that number was closer to ~700.
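The 70% trigger itself is simple to encode. Here is a minimal sketch; the window size and threshold are the only real parameters, and the function name is just illustrative.

```python
CONTEXT_WINDOW = 200_000   # Claude's context window, in tokens
HANDOFF_THRESHOLD = 0.70   # the trigger point I chose

def should_handoff(tokens_used: int) -> bool:
    """True once a session has consumed ~70% of the window, leaving
    headroom to write handoff artifacts before quality degrades."""
    return tokens_used / CONTEXT_WINDOW >= HANDOFF_THRESHOLD

# e.g., should_handoff(145_000) -> True (145K / 200K = 72.5%)
```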

This led me to further analyze the session data, classify my sessions by type, and compute per-session productivity by session type. This revealed some startling insights:

Validation

The session metrics from my semi-automated context management demonstrate the need for every single one of the three tenets outlined above, a need further validated by the research that Anthropic, Google & LangChain in particular have published.

While V1.0 of the CLEAR framework does not address the three tenets in full, it aims to address each of them at least partially:

  • The framework initialization protocol stresses the creation of a detailed plan, followed by a breakdown of the plan into features; these serve as the foundation the build is executed against
  • The inclusion of specialized sub-agents running scoped tasks, and a main orchestrator agent recording artifacts like technical & architectural decisions, session progress, and handoffs, addresses the dual-agent harness
  • The ability to self-manage and regulate sessions & session progress, with auto-loading of minimal viable context using progressive disclosure principles (sketched below), is akin to loading a compiled view into memory at the start of a new session, providing the agent with the minimum information required to complete its scoped tasks
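To illustrate that third point, here is a minimal sketch of progressive disclosure at session start: the agent bootstraps with a one-line index of artifacts and pulls full documents only on demand. The file layout and function names are hypothetical, not CLEAR’s actual structure.

```python
from pathlib import Path

ARTIFACT_DIR = Path("clear/artifacts")  # hypothetical location

def load_minimal_context() -> str:
    """Session bootstrap: load only one-line summaries of each
    artifact, not the full documents."""
    index = []
    for doc in sorted(ARTIFACT_DIR.glob("*.md")):
        lines = doc.read_text().splitlines()
        summary = lines[0] if lines else "(empty)"
        index.append(f"{doc.name}: {summary}")
    return "\n".join(index)

def disclose(doc_name: str) -> str:
    """Pull the full artifact only when a scoped task needs it."""
    return (ARTIFACT_DIR / doc_name).read_text()
```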

That being said, reading and processing these publications has also given me a vision of what the V2.0 improvements to the framework could look like, and I am excited to continue building it out.