Next-generation large language model interface architecture

Status: 🚧 In progress

A design-philosophy, technical-implementation, and compatibility analysis of Gemini Interactions API vs OpenAI Responses API

1. Introduction: The Shift from Stateless Chat to Stateful Agent Architectures

In the evolution of generation intelligence, the period from 2023 to late 2024 was a clear watershed: application paradigms moved from Text Completion and Stateless Chat to more complex Agentic Interactions and Deep Reasoning.

This shift is not only about stronger model capability. It also forces a redesign of underlying software architecture and API philosophy.

For a long time, RESTful Chat Completions APIs (with OpenAI /v1/chat/completions as de facto standard) dominated developer integration. Its core characteristic is statelessness: every request is independent; clients must maintain full history and resend it on every turn. This is simple and scalable, but its limits become obvious with models that support long context, multimodal understanding, and CoT-like reasoning behavior.

Thinking data is hard to handle: for reasoning models (for example OpenAI o1/o3 and Google Gemini 2.5/3.0), large internal “CoT-like” traces are generated before final answers. These traces are both costly and central to model intelligence. In stateless architecture, if server does not return these traces, client cannot let model “remember” prior reasoning in next turn; if server does return them, bandwidth and IP leakage risks increase.

So model vendors usually return a thinking summary, not raw CoT content. 🙂
Task time becomes much longer: vendors are no longer only offering base models. Agent-as-model execution now means a single API call can trigger workflows that run for minutes or hours (typical case: deep research). Traditional synchronous HTTP request-response with timeout-constrained long connections cannot satisfy this need.

The Emergence of Next-generation APIs

To address the above problems, two major vendors launched new interfaces: OpenAI Responses API and Google Interactions API. The following sections provide a technical breakdown of both APIs, compare their design choices in state management, multimodal handling, long-running task scheduling, and reasoning transparency, and then discuss compatibility integration strategies for existing agent frameworks such as ADK.

2. OpenAI Responses API: Containerized Reasoning-as-a-Service

OpenAI’s Responses API is a substantial redesign of Chat Completions. Its core philosophy is to move conversation state and reasoning process into the server, and use a clearer action type system (Item Ontology) to normalize complex multi-turn interactions.

A quick comparison:

{
  "message": {
    "role": "assistant",
    "content": "I'm going to use the get_weather tool to find the weather.",
    "tool_calls": [
      {
        "id": "call_88O3ElkW2RrSdRTNeeP1PZkm",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"New York, NY\",\"unit\":\"f\"}"
        }
      }
    ],
    "refusal": null,
    "annotations": []
  }
}

In Chat Completions, a single request usually emits one message object 👆.

 {
    "id": "rs_6888f6d0606c819aa8205ecee386963f0e683233d39188e7",
    "type": "reasoning",
    "summary": [
      {
        "type": "summary_text",
        "text": "**Determining weather response**\n\nI need to answer the user's question about the weather in San Francisco. ...."
      },
  },
  {
    "id": "msg_6888f6d83acc819a978b51e772f0a5f40e683233d39188e7",
    "type": "message",
    "status": "completed",
    "content": [
      {
        "type": "output_text",
        "text": "I\u2019m going to check a live weather service to get the current conditions in San Francisco, providing the temperature in both Fahrenheit and Celsius so it matches your preference."
      }
    ],
    "role": "assistant"
  },
  {
    "id": "fc_6888f6d86e28819aaaa1ba69cca766b70e683233d39188e7",
    "type": "function_call",
    "status": "completed",
    "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}",
    "call_id": "call_XOnF4B9DvB8EJVB3JvWnGg83",
    "name": "get_weather"
  }

In Responses API, what you get is a full behavior chain. What to display, persist, or ignore is decided by the developer.

2.1 Design Philosophy: Opaque Reasoning with Hosted State

Responses API largely exists to solve a commercialization paradox for reasoning models. OpenAI o-series models (o1, o3) improve capability through internal long CoT-like reasoning. This reasoning is core IP and is not intended to be exposed directly.

In old Chat APIs, if model did not return thought traces, multi-turn follow-up often caused sharp intelligence drops because prior reasoning context could not be preserved client-side. Responses API solves this through server statefulness: reasoning state is retained on server (encrypted and hidden), and continued safely via previous_response_id (or reasoning items). Client only needs prior response ID to continue internal reasoning, without receiving raw CoT.

This philosophy can be summarized as: “trust server memory.” It frees developers from prompt-caching and truncation housekeeping, while increasing infrastructure dependence on OpenAI.

2.2 Core Data Model: Rise of Item Ontology

The most significant technical change is replacing loose Message objects with strict Item union types. Multimodal complexity outgrew single text-field structures.

2.2.1 Input Item

Inputs are no longer just messages; they are input arrays that can contain multiple InputItem types:

Type	Description and Use	Design Intent
`input_text`	plain text input	base interaction unit replacing legacy `content` string
`input_image`	image input (URL or Base64)	native multimodal understanding instead of attachment-style extension
`input_audio`	audio input	make listen/speak first-class in multimodal interaction

2.2.2 Output Item

Outputs are also structured as Item sequences, so tool calls, code-execution traces, and text replies are clearly separated:

Type	Description and Use	Design Intent
`message`	model reply item (can contain `output_text` parts)	avoid forcing plain text and tool calls into one mixed structure
`function_call`	structured tool/function call (`name`, `arguments`, `call_id`)	provide auditable tool-call receipt for UI/logging
`reasoning`	structured reasoning output (for example `summary`)	retain server reasoning state while supporting safe continuation without exposing raw CoT

The move from Message to Item marks the shift from “text exchange” to “object operation.” Developers no longer manipulate a continuous text blob; they operate on structured multimodal objects.

2.3 State Management: `previous_response_id` and reasoning state

Responses is designed as stateful-by-default: session and tool state are tracked server-side; reasoning state is retained across turns (encrypted, hidden), so reasoning models do not “forget how they were thinking.”

ID-based safe continuation (previous_response_id)

Mechanism: client sends prior response ID; server continues context and reasoning state internally.
Use cases: multi-turn chat, tool chains, stateful agent workflows.
Notes: client must persist at least prior response ID; reasoning state itself is not returned.

OpenAI also mentions reasoning items for continuation assistance, but raw CoT still remains hidden.

2.4 Streaming and semantic events

Responses streaming is no longer token-delta-only. It is semantic streaming events over response lifecycle and item-level generation (message, function_call, reasoning, etc.), enabling finer UI states (for example showing tool status when call starts).

SDK helpers such as output_text also avoid manual extraction from legacy paths like choices[0].message.content.

3. Google Gemini Interactions API: Operating System for Asynchronous Agents

If OpenAI Responses optimizes “thinking,” Google Interactions optimizes “doing.” Google positions Interactions as a unified interface for autonomous agents, especially for long-running, multi-source tasks.

3.1 Design Philosophy: Freeing the Time Dimension

Google’s philosophy is close in direction: deep research and complex tasks often exceed standard HTTP timeout windows (typically 30-60 seconds). So Interactions is designed as a task scheduling system where jobs can run in background for tens of minutes.

3.2 Core Architecture: Interaction and Content

Interactions endpoint is generativelanguage.googleapis.com/v1beta/interactions, and its shape differs significantly from OpenAI.

3.2.1 Interaction object

An Interaction is not just a chat turn. It is a full lifecycle object containing input, execution state, and outputs:

Key Field	Type	Deep Interpretation
`id`	string	unique interaction ID; used for `previous_interaction_id` continuation and `GET /interactions/{id}` status/result query
`model` / `agent`	string	polymorphic: can be base model (`gemini-3-pro-preview`) or preset agent (`deep-research-pro-preview-12-2025`), reducing switching cost
`input`	`string` or `Content[]`	structured multimodal input including `function_result` parts
`previous_interaction_id`	string	optional server-side continuation via prior interaction
`background`	bool	async background execution; docs indicate `background=true` is agent-focused
`status`	string	execution state like `completed`, `in_progress`, `requires_action`, `failed`
`store`	bool	default `true`; `store=false` disables persistence but also blocks `previous_interaction_id` and conflicts with `background=true`

If you are familiar with old Gemini endpoints, Interactions input changes are less drastic than OpenAI’s message→item migration. This likely reflects Gemini’s long-standing multimodal-first design.

3.2.2 State management: optional server-side state

Interactions is explicit about whether state is hosted server-side:

Stateful: send previous_interaction_id; server loads full context internally, reducing client-side history replay and often improving cache hit rates.
Stateless: send full history each time; client owns context assembly.
Storage/compliance: default store=true; store=false is possible but disables previous_interaction_id and conflicts with background=true.
Comparison: OpenAI emphasizes stateful-by-default with server-managed reasoning/tool traces; Google explicitly exposes dual-mode optional hosting.

This does not mean OpenAI Responses rejects stateless full-history submission. Both can emulate legacy stateless behavior through constructed input history; but once you do that, much of the new API value is lost.

Gemini Deep Research Agent is the flagship Interactions use case, which is already widely known.

3.3 Native support for MCP (Model Context Protocol)

Google also explicitly integrates MCP support into Interactions API.

4. Comparative Analysis

Overall the two APIs are closer than many assume. Differences are mainly in selected details.

4.1 State management and context handling

Dimension	OpenAI Responses API	Gemini Interactions API	Analysis
State carrier	`previous_response_id` (and reasoning items; server auto-tracks session/tool state)	`previous_interaction_id` (optional server state; can also be fully stateless with full history)	Both reduce client-side history burden via prior-ID continuation. Difference: OpenAI emphasizes hosted reasoning/tool state; Google emphasizes explicit dual mode.
Data retention	default 30 days; server-side reasoning state retained (encrypted/hidden)	default `store=true`; paid 55 days / free 1 day; `store=false` disables `previous_interaction_id` and conflicts with `background=true`	Both involve compliance trade-offs in hosted mode. Google provides explicit retention windows and opt-out switch with capability sacrifices.

5. Compatibility and Migration Guide

For existing developers, the core questions are: Can my OpenAI Chat-based code still run? How should I migrate?

5.1 Gemini OpenAI compatibility layer: truth and traps

Google claims “three-line migration.” Largely true, but there is a large scope trap.

Compatibility scope

Gemini compatibility layer targets OpenAI legacy Chat Completions (/v1/chat/completions), not the newer Responses API (/v1/responses).

If you use standard openai.chat.completions.create, changing base_url and api_key is often enough.
Google also maps OpenAI reasoning_effort to Gemini thinking_level / thinking_budget, enabling low-friction comparative testing.

Non-compatible scope

You cannot access Gemini Interactions API features through OpenAI SDK compatibility mode:

cannot pass agent='deep-research' in OpenAI SDK and expect Interactions semantics;
cannot pass background=true for async execution;
cannot use previous_interaction_id chain-state behavior.

Conclusion: Gemini compatibility exists mainly to capture legacy chat workloads. If you need Interactions-specific capabilities, you must adopt Google genai SDK and rewrite integration paths.

5.2 Migrating from Chat Completions to Responses API (within OpenAI ecosystem)

Even inside OpenAI ecosystem, migration is one-way and non-trivial:

1. Refactor data model

messages must become structured input items. Simple string concatenation no longer matches model shape.

2. Stop full client-side history persistence

Instead of storing large history blobs, store at least prior response IDs and continue via previous_response_id. Database pressure drops, but state dependence on server increases.

3. Rewrite stream parser

Front-end stream parser must handle semantic event streams and multi-item outputs, not legacy token deltas such as choices[0].delta.content.

5.3 Integration adaptation strategy

Because architecture divergence between Interactions and Chat is significant, a single universal adapter that fully supports both is often unrealistic. A dual-stack strategy is usually more practical.

Synchronous interaction layer

For chatbot/realtime QA low-latency scenarios, continue using standard Chat Completions abstraction. This remains compatible with OpenAI (legacy), Gemini (compat layer), Anthropic, and open-source runtimes (for example vLLM), and helps preserve provider-neutrality.

Google’s own blog reflects this:

although Interactions API supports most generateContent capabilities and improves developer experience, it is still in public preview and may change significantly; for production-standard workloads, generateContent remains the primary path.

Asynchronous task layer

For Deep Research and agent kernel workloads, Interactions API can be introduced as a dedicated async path.

6. Usage Constraints

6.1 State Lock-in

As APIs become stateful, vendor lock-in risk becomes stronger.

In stateless times, migration often meant changing one URL line. In stateful times, migration can mean moving large session/tool-run histories and remapping them into another vendor’s state model (if supported). Hidden server-side reasoning state is usually not exportable, making migration costs much higher once deep workflows are running.

6.2 Agent as Infrastructure

Gemini Deep Research demonstrates a broader trend: agent as infrastructure. Future API shape may shift from completion(prompt) to hire(agent, goal). Vendors provide not only models, but also runtime, tool ecosystems, and memory layers. Interactions is an early shape of this trajectory.

6.3 Vertex AI not Ready

Per Google official blog: Interactions API and Gemini deep research capabilities are coming to Vertex AI. https://blog.google/technology/developers/interactions-api/

7. Summary

OpenAI Responses API and Google Interactions API both target next-generation AI application challenges. For developers, the choice is no longer only about model benchmark scores; it is now an architecture decision: build an instant-response chat surface, or build an async task-delivery system. Understanding this difference is key to next-generation AI product engineering.

Appendix

This appendix provides a lightweight integration approach for model service + Google ecosystem reality.

Current state: many ADK-based agents use ADK’s LiteLLM compatibility layer to route through custom model services.

Core requirement: keep native Gemini capabilities in Google ADK (google-genai type system, tool calling, cachedContents, files/resumable upload, SSE, etc.) while centralizing upstream switching/governance/metrics in our own gateway.

Under current google-adk, this implies: gateway must expose Gemini Developer API (AI Studio / v1beta) compatible surface. Otherwise you either connect Google directly, or fallback to OpenAI endpoint + LiteLLM with capability drift.

ADK’s three current integration paths (and constraints)

ADK Gemini (default)
- Uses google-genai SDK (best native Gemini experience).
- But ADK’s google.genai.Client(...) does not pass base_url explicitly in default path; only tracking headers are injected (so gateway routing depends on google-genai base-url behavior/environment variables).
- See google/adk/models/google_llm.py (Gemini.api_client).
ADK ApigeeLlm (named Apigee but effectively proxy client)
- Also uses google-genai, but explicitly supports proxy_url/base_url and custom_headers.
- See google/adk/models/apigee_llm.py with HttpOptions(base_url=proxy_url).
- Limitation: this does not reduce required gateway protocol compatibility. In vertex_ai mode it often requires GOOGLE_CLOUD_PROJECT/LOCATION and may introduce caller-side GCP credential paths (against “centralize creds/switching in gateway” goal).
ADK LiteLlm (OpenAI endpoint/provider-style)
- Good for quick adoption when OpenAI-compatible gateway already exists.
- But if native Gemini semantics are required (especially cachedContents, files/resumable upload, and some tool/stream behaviors), ChatCompletions compatibility needs extra translation and is not guaranteed 1:1. Complexity still returns to gateway.

Recommended practical approach

If goal is “keep native Gemini invocation,” the most stable ADK approach is:

Call gateway using Gemini Developer API (AI Studio / v1beta) conventions:
- set GOOGLE_GEMINI_BASE_URL=https://<gateway> so google-genai points to gateway;
- use GOOGLE_API_KEY=<gateway_token> for gateway auth.
Do not rely on default ADK Gemini + GOOGLE_GENAI_USE_VERTEXAI=1 + GOOGLE_VERTEX_BASE_URL=... to emulate Vertex-through-gateway:
- during client initialization, google.genai.Client decides whether to read GOOGLE_GEMINI_BASE_URL or GOOGLE_VERTEX_BASE_URL based on vertexai argument (vertexai or False), and ADK default path does not always pass vertexai=True explicitly;
- result: even if lower BaseApiClient switches mode via env vars later, Client layer may not read GOOGLE_VERTEX_BASE_URL as expected; fixed /v1beta (AI Studio shape) routing is often more stable, with gateway internally routing to Vertex.
Let gateway decide upstream (Vertex / AI Studio / other providers):
- recommended: route and attribute metrics by token or X-Client-Id;
- optional: support headers like X-LLM-Upstream for gray release/switching (but check ADK header propagation limits below).

A subtle but critical limitation: ADK context caching does not carry per-request custom routing headers.

ADK GeminiContextCacheManager uses shared self.api_client for cache operations and does not use GenerateContentConfig.http_options (so headers set there do not propagate).
Therefore, if cachedContents also needs per-request upstream switching, default ADK Gemini path is insufficient. More stable options are token/client-based mapping in gateway, or caller-side methods that inject global headers (for example ApigeeLlm(custom_headers=...) or custom model_code).