
Next-generation large language model interface architecture

MasakiMu319

Status: 🚧 In progress

A design-philosophy, technical-implementation, and compatibility analysis of Gemini Interactions API vs OpenAI Responses API


1. Introduction: The Shift from Stateless Chat to Stateful Agent Architectures

In the evolution of generative AI, the period from 2023 to late 2024 was a clear watershed: application paradigms moved from Text Completion and Stateless Chat to more complex Agentic Interactions and Deep Reasoning.

This shift is not only about stronger model capability. It also forces a redesign of underlying software architecture and API philosophy.

For a long time, RESTful Chat Completions APIs (with OpenAI /v1/chat/completions as the de facto standard) dominated developer integration. Their core characteristic is statelessness: every request is independent, so clients must maintain the full history and resend it on every turn. This is simple and scalable, but its limits become obvious with models that support long context, multimodal understanding, and CoT-like reasoning behavior.

  1. Thinking data is hard to handle: reasoning models (for example OpenAI o1/o3 and Google Gemini 2.5/3.0) generate large internal “CoT-like” traces before producing a final answer. These traces are both costly and central to model intelligence. In a stateless architecture, if the server does not return these traces, the client cannot let the model “remember” its prior reasoning on the next turn; if the server does return them, bandwidth costs and IP-leakage risks increase.

    So model vendors usually return a thinking summary, not the raw CoT content. 🙂

  2. Task time becomes much longer: vendors are no longer offering only base models. Agent-as-model execution means a single API call can trigger workflows that run for minutes or hours (a typical case: deep research). The traditional synchronous HTTP request-response model, with its timeout-constrained long connections, cannot satisfy this need.
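The stateless pattern described above can be sketched in a few lines: the client owns the transcript and must replay it in full on every request. The model name is illustrative and the server round trip is faked.

```python
# A minimal sketch of the stateless Chat Completions pattern: the client
# keeps the whole transcript and replays it on every request.

def build_request(history, user_text, model="gpt-4o"):
    """Assemble a full-history payload for a stateless chat call."""
    messages = history + [{"role": "user", "content": user_text}]
    return {"model": model, "messages": messages}

history = []
for turn, question in enumerate(["Hi", "What did I just say?"], start=1):
    payload = build_request(history, question)
    # Every request carries the entire transcript so far.
    print(f"turn {turn}: sending {len(payload['messages'])} messages")
    # Pretend the server answered; the client must persist both sides itself.
    history = payload["messages"] + [{"role": "assistant", "content": f"answer {turn}"}]
```

Payload size grows with every turn, which is exactly the cost the next-generation APIs try to eliminate.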

The Emergence of Next-generation APIs

To address the above problems, two major vendors launched new interfaces: OpenAI Responses API and Google Interactions API. The following sections provide a technical breakdown of both APIs, compare their design choices in state management, multimodal handling, long-running task scheduling, and reasoning transparency, and then discuss compatibility integration strategies for existing agent frameworks such as ADK.


2. OpenAI Responses API: Containerized Reasoning-as-a-Service

OpenAI’s Responses API is a substantial redesign of Chat Completions. Its core philosophy is to move conversation state and the reasoning process to the server, and to use a clearer action type system (an Item Ontology) to normalize complex multi-turn interactions.

A quick comparison:

{
  "message": {
    "role": "assistant",
    "content": "I'm going to use the get_weather tool to find the weather.",
    "tool_calls": [
      {
        "id": "call_88O3ElkW2RrSdRTNeeP1PZkm",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"New York, NY\",\"unit\":\"f\"}"
        }
      }
    ],
    "refusal": null,
    "annotations": []
  }
}

In Chat Completions, a single request usually emits one message object 👆.

[
  {
    "id": "rs_6888f6d0606c819aa8205ecee386963f0e683233d39188e7",
    "type": "reasoning",
    "summary": [
      {
        "type": "summary_text",
        "text": "**Determining weather response**\n\nI need to answer the user's question about the weather in San Francisco. ...."
      }
    ]
  },
  {
    "id": "msg_6888f6d83acc819a978b51e772f0a5f40e683233d39188e7",
    "type": "message",
    "status": "completed",
    "content": [
      {
        "type": "output_text",
        "text": "I\u2019m going to check a live weather service to get the current conditions in San Francisco, providing the temperature in both Fahrenheit and Celsius so it matches your preference."
      }
    ],
    "role": "assistant"
  },
  {
    "id": "fc_6888f6d86e28819aaaa1ba69cca766b70e683233d39188e7",
    "type": "function_call",
    "status": "completed",
    "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}",
    "call_id": "call_XOnF4B9DvB8EJVB3JvWnGg83",
    "name": "get_weather"
  }
]

In the Responses API, what you get back is a full behavior chain. What to display, persist, or ignore is left to the developer.
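Consuming that behavior chain can be sketched as a dispatch on each item's type. The item dicts below mirror the shapes shown above, trimmed for brevity; the event tuples are our own illustrative convention, not an SDK API.

```python
def render_output(items):
    """Walk Responses-style output items and turn them into UI-ready events."""
    events = []
    for item in items:
        if item["type"] == "reasoning":
            # Only the summary is exposed; raw CoT never arrives client-side.
            events.append(("thinking", item["summary"][0]["text"]))
        elif item["type"] == "message":
            text = "".join(p["text"] for p in item["content"]
                           if p["type"] == "output_text")
            events.append(("say", text))
        elif item["type"] == "function_call":
            events.append(("call_tool", item["name"], item["arguments"]))
    return events

chain = [
    {"type": "reasoning",
     "summary": [{"type": "summary_text", "text": "Determining weather response..."}]},
    {"type": "message",
     "content": [{"type": "output_text", "text": "Checking a live weather service."}]},
    {"type": "function_call", "name": "get_weather",
     "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}"},
]
for event in render_output(chain):
    print(event)
```

The same loop that renders a chat bubble can now also drive tool spinners and "thinking" indicators, because each behavior arrives as a distinct typed object.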

2.1 Design Philosophy: Opaque Reasoning with Hosted State

The Responses API largely exists to solve a commercialization paradox for reasoning models. OpenAI’s o-series models (o1, o3) improve capability through long internal CoT-like reasoning. This reasoning is core IP and is not intended to be exposed directly.

In the old Chat API, if the model did not return its thought traces, multi-turn follow-ups often caused sharp intelligence drops because prior reasoning context could not be preserved client-side. The Responses API solves this through server statefulness: reasoning state is retained on the server (encrypted and hidden) and continued safely via previous_response_id (or reasoning items). The client only needs the prior response ID to continue internal reasoning, without ever receiving raw CoT.

This philosophy can be summarized as: “trust server memory.” It frees developers from prompt-caching and truncation housekeeping, while increasing infrastructure dependence on OpenAI.

2.2 Core Data Model: Rise of Item Ontology

The most significant technical change is the replacement of loose Message objects with a strict Item union type: multimodal complexity has outgrown single-text-field structures.

2.2.1 Input Item

Inputs are no longer just messages; they are input arrays that can contain multiple InputItem types:

| Type | Description and Use | Design Intent |
| --- | --- | --- |
| input_text | plain text input | base interaction unit replacing the legacy content string |
| input_image | image input (URL or Base64) | native multimodal understanding instead of attachment-style extension |
| input_audio | audio input | make listening and speaking first-class in multimodal interaction |

2.2.2 Output Item

Outputs are also structured as Item sequences, so tool calls, code-execution traces, and text replies are clearly separated:

| Type | Description and Use | Design Intent |
| --- | --- | --- |
| message | model reply item (can contain output_text parts) | avoid forcing plain text and tool calls into one mixed structure |
| function_call | structured tool/function call (name, arguments, call_id) | provide an auditable tool-call receipt for UI/logging |
| reasoning | structured reasoning output (for example a summary) | retain server reasoning state while supporting safe continuation without exposing raw CoT |

The move from Message to Item marks the shift from “text exchange” to “object operation.” Developers no longer manipulate a continuous text blob; they operate on structured multimodal objects.
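The "object operation" style can be sketched as assembling typed input parts rather than concatenating strings. The part type names follow the tables above; the helper functions, model name, and URL are illustrative, not an exhaustive schema.

```python
def input_text(text):
    """A typed plain-text part."""
    return {"type": "input_text", "text": text}

def input_image(url):
    """A typed image part (URL form)."""
    return {"type": "input_image", "image_url": url}

def user_turn(*parts):
    """Wrap typed parts into a single user input item."""
    return {"role": "user", "content": list(parts)}

# One multimodal turn: text plus image, as structured objects.
payload = {
    "model": "gpt-4o",  # illustrative model name
    "input": [
        user_turn(
            input_text("What is in this picture?"),
            input_image("https://example.com/cat.png"),
        )
    ],
}
```

Each modality is a first-class object with its own schema, so adding audio later means adding a part type, not inventing an attachment convention.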

2.3 State Management: previous_response_id and reasoning state

Responses is designed as stateful-by-default: session and tool state are tracked server-side; reasoning state is retained across turns (encrypted, hidden), so reasoning models do not “forget how they were thinking.”

ID-based safe continuation (previous_response_id)

OpenAI also mentions reasoning items for continuation assistance, but raw CoT still remains hidden.
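Continuation can be sketched as follows: only the new input travels over the wire, with the prior response ID standing in for the whole transcript and the hidden reasoning state. The helper names and the "resp_abc" ID are ours, for illustration only.

```python
def first_request(model, text):
    """Opening turn: no prior state to reference."""
    return {"model": model, "input": [{"role": "user", "content": text}]}

def follow_up(model, text, previous_response_id):
    """Later turn: no history replay; the server resumes session, tool,
    and reasoning state from the referenced response."""
    return {
        "model": model,
        "input": [{"role": "user", "content": text}],
        "previous_response_id": previous_response_id,
    }

r1 = first_request("o3", "Plan a 3-day trip to Kyoto.")
# ...suppose the server answered with response id "resp_abc" (illustrative)...
r2 = follow_up("o3", "Swap day 2 for Nara.", "resp_abc")  # tiny payload, full context
```

The follow-up payload stays constant-size regardless of how long the conversation has run, which is the practical payoff of "trust server memory."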

2.4 Streaming and semantic events

Responses streaming is no longer token-delta-only. It emits semantic events covering the response lifecycle and item-level generation (message, function_call, reasoning, etc.), enabling finer-grained UI states (for example, showing tool status as soon as a call starts).

SDK helpers such as output_text also avoid manual extraction from legacy paths like choices[0].message.content.
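A sketch of consuming such a semantic stream: the event type names follow the published Responses streaming shape (response.output_text.delta and friends) but should be checked against current docs, and the events list below simulates what an SSE client would yield.

```python
def handle_stream(events):
    """Fold semantic streaming events into final text plus UI status updates."""
    text, statuses = [], []
    for ev in events:
        if (ev["type"] == "response.output_item.added"
                and ev["item"]["type"] == "function_call"):
            # Tool spinner can start before any arguments have streamed.
            statuses.append(f"calling {ev['item']['name']}...")
        elif ev["type"] == "response.output_text.delta":
            text.append(ev["delta"])
        elif ev["type"] == "response.completed":
            statuses.append("done")
    return "".join(text), statuses

# Simulated event stream, standing in for the real SSE connection.
simulated = [
    {"type": "response.output_item.added",
     "item": {"type": "function_call", "name": "get_weather"}},
    {"type": "response.output_text.delta", "delta": "It is "},
    {"type": "response.output_text.delta", "delta": "sunny."},
    {"type": "response.completed"},
]
print(handle_stream(simulated))
```

Compare this to the legacy pattern, where the only signal was an undifferentiated choices[0].delta.content trickle.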


3. Google Gemini Interactions API: Operating System for Asynchronous Agents

If OpenAI Responses optimizes “thinking,” Google Interactions optimizes “doing.” Google positions Interactions as a unified interface for autonomous agents, especially for long-running, multi-source tasks.

3.1 Design Philosophy: Freeing the Time Dimension

Google’s philosophy points in a similar direction: deep research and other complex tasks often exceed standard HTTP timeout windows (typically 30-60 seconds). Interactions is therefore designed as a task-scheduling system in which jobs can run in the background for tens of minutes.

3.2 Core Architecture: Interaction and Content

The Interactions endpoint is generativelanguage.googleapis.com/v1beta/interactions, and its request shape differs significantly from OpenAI’s.

3.2.1 Interaction object

An Interaction is not just a chat turn. It is a full lifecycle object containing input, execution state, and outputs:

| Key Field | Type | Deep Interpretation |
| --- | --- | --- |
| id | string | unique interaction ID; used for previous_interaction_id continuation and GET /interactions/{id} status/result queries |
| model / agent | string | polymorphic: can be a base model (gemini-3-pro-preview) or a preset agent (deep-research-pro-preview-12-2025), reducing switching cost |
| input | string or Content[] | structured multimodal input, including function_result parts |
| previous_interaction_id | string | optional server-side continuation via a prior interaction |
| background | bool | async background execution; docs indicate background=true is agent-focused |
| status | string | execution state such as completed, in_progress, requires_action, failed |
| store | bool | default true; store=false disables persistence but also blocks previous_interaction_id and conflicts with background=true |

If you are familiar with old Gemini endpoints, Interactions input changes are less drastic than OpenAI’s message→item migration. This likely reflects Gemini’s long-standing multimodal-first design.
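The background lifecycle reduces to a create-then-poll loop against GET /interactions/{id}. The field names (status, background) follow the table above, while create_interaction and get_interaction below are local stand-ins for the real HTTP calls, scripted to return a fixed status sequence.

```python
import itertools
import time

# Scripted status sequence standing in for a real background job.
_FAKE_STATES = itertools.chain(["in_progress", "in_progress"],
                               itertools.repeat("completed"))

def create_interaction(agent, text):
    """Stand-in for POST /v1beta/interactions with background=true."""
    return {"id": "int_123", "status": "in_progress"}

def get_interaction(interaction_id):
    """Stand-in for GET /v1beta/interactions/{id}."""
    return {"id": interaction_id, "status": next(_FAKE_STATES)}

def run_background_task(agent, text, poll_seconds=0):
    job = create_interaction(agent, text)
    while True:
        job = get_interaction(job["id"])
        if job["status"] in ("completed", "failed", "requires_action"):
            return job
        time.sleep(poll_seconds)  # deep-research jobs may run for many minutes

result = run_background_task("deep-research-pro-preview-12-2025",
                             "Survey MCP adoption")
print(result["status"])
```

The caller never holds a long HTTP connection open; the interaction ID is the durable handle, which is what frees the API from the timeout window.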

3.2.2 State management: optional server-side state

Interactions is explicit about whether state is hosted server-side: the store flag (default true) controls server-side persistence, and previous_interaction_id continuation is available only when it is enabled.

This does not mean the OpenAI Responses API rejects stateless full-history submission. Both APIs can emulate legacy stateless behavior through a manually constructed input history; but once you do that, much of the new APIs’ value is lost.

The Gemini Deep Research Agent is the flagship use case for Interactions and is already widely known.

3.3 Native support for MCP (Model Context Protocol)

Google also explicitly integrates MCP support into Interactions API.


4. Comparative Analysis

Overall the two APIs are closer than many assume. Differences are mainly in selected details.

4.1 State management and context handling

| Dimension | OpenAI Responses API | Gemini Interactions API | Analysis |
| --- | --- | --- | --- |
| State carrier | previous_response_id (and reasoning items; the server auto-tracks session/tool state) | previous_interaction_id (optional server state; can also run fully stateless with full history) | Both reduce the client-side history burden via prior-ID continuation. The difference: OpenAI emphasizes hosted reasoning/tool state; Google emphasizes an explicit dual mode. |
| Data retention | default 30 days; server-side reasoning state retained (encrypted/hidden) | default store=true; paid tier 55 days / free tier 1 day; store=false disables previous_interaction_id and conflicts with background=true | Both involve compliance trade-offs in hosted mode. Google provides explicit retention windows and an opt-out switch, at the cost of capability. |

5. Compatibility and Migration Guide

For existing developers, the core questions are: Can my OpenAI Chat-based code still run? How should I migrate?

5.1 Gemini OpenAI compatibility layer: truth and traps

Google claims a “three-line migration.” This is largely true, but there is a significant scope trap.

Compatibility scope

The Gemini compatibility layer targets OpenAI’s legacy Chat Completions (/v1/chat/completions), not the newer Responses API (/v1/responses).

Non-compatible scope

You cannot access Gemini Interactions API features through the OpenAI SDK compatibility mode.

Conclusion: the Gemini compatibility layer exists mainly to capture legacy chat workloads. If you need Interactions-specific capabilities, you must adopt the Google genai SDK and rewrite your integration paths.

5.2 Migrating from Chat Completions to Responses API (within OpenAI ecosystem)

Even inside OpenAI ecosystem, migration is one-way and non-trivial:

1. Refactor data model

messages must become structured input items. Simple string concatenation no longer matches the shape the model expects.

2. Stop full client-side history persistence

Instead of storing large history blobs, store at least prior response IDs and continue via previous_response_id. Database pressure drops, but state dependence on server increases.

3. Rewrite stream parser

The front-end stream parser must handle semantic event streams and multi-item outputs rather than legacy token deltas such as choices[0].delta.content.
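The first two steps can be condensed into one translation helper. This is a sketch with our own function names: it maps a legacy messages payload onto the Responses shape and swaps stored history for a previous_response_id reference.

```python
def migrate_payload(legacy, previous_response_id=None):
    """Map a Chat Completions payload to a Responses-style one.

    Only the latest user turns need to travel; earlier turns are replaced
    by the previous_response_id reference. (A real migration must also
    handle assistant/tool turns, which use output item types.)
    """
    new_turns = [
        {"role": m["role"],
         "content": [{"type": "input_text", "text": m["content"]}]}
        for m in legacy["messages"]
    ]
    payload = {"model": legacy["model"], "input": new_turns}
    if previous_response_id:
        payload["previous_response_id"] = previous_response_id
    return payload

legacy = {"model": "gpt-4o",
          "messages": [{"role": "user", "content": "Continue the analysis."}]}
migrated = migrate_payload(legacy, previous_response_id="resp_abc")  # illustrative ID
```

Step 3, the stream parser rewrite, has no shortcut: the delta-accumulation loop must be replaced with an event dispatcher.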

5.3 Integration adaptation strategy

Because architecture divergence between Interactions and Chat is significant, a single universal adapter that fully supports both is often unrealistic. A dual-stack strategy is usually more practical.

Synchronous interaction layer

For low-latency chatbot/realtime QA scenarios, continue using the standard Chat Completions abstraction. It remains compatible with OpenAI (legacy), Gemini (via the compat layer), Anthropic, and open-source runtimes (for example vLLM), and helps preserve provider neutrality.

Google’s own blog reflects this:

although Interactions API supports most generateContent capabilities and improves developer experience, it is still in public preview and may change significantly; for production-standard workloads, generateContent remains the primary path.

Asynchronous task layer

For Deep Research and agent kernel workloads, Interactions API can be introduced as a dedicated async path.
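As a sketch, the dual-stack split reduces to a routing rule: latency-sensitive chat stays on the Chat Completions stack, long-running agent work moves to the Interactions stack. The 60-second threshold and stack labels below are illustrative, not prescriptive.

```python
SYNC_STACK = "chat-completions"   # OpenAI legacy / Gemini compat layer / vLLM
ASYNC_STACK = "interactions"      # background jobs, polled for results

def route(task_kind, expected_seconds):
    """Pick a stack per workload; anything that may outlive an HTTP
    timeout window goes to the async path."""
    if task_kind == "deep_research" or expected_seconds > 60:
        return ASYNC_STACK
    return SYNC_STACK

print(route("qa", 2))
print(route("deep_research", 1800))
```

Keeping the decision in one router function also keeps provider neutrality: only the async branch takes on Interactions-specific dependencies.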


6. Usage Constraints

6.1 State Lock-in

As APIs become stateful, vendor lock-in risk becomes stronger.

In stateless times, migration often meant changing one URL line. In stateful times, migration can mean moving large session/tool-run histories and remapping them into another vendor’s state model (if supported). Hidden server-side reasoning state is usually not exportable, making migration costs much higher once deep workflows are running.

6.2 Agent as Infrastructure

Gemini Deep Research demonstrates a broader trend: agent as infrastructure. Future API shape may shift from completion(prompt) to hire(agent, goal). Vendors provide not only models, but also runtime, tool ecosystems, and memory layers. Interactions is an early shape of this trajectory.

6.3 Vertex AI not Ready

Per Google’s official blog, the Interactions API and Gemini deep-research capabilities are coming to Vertex AI: https://blog.google/technology/developers/interactions-api/


7. Summary

OpenAI Responses API and Google Interactions API both target next-generation AI application challenges. For developers, the choice is no longer only about model benchmark scores; it is now an architecture decision: build an instant-response chat surface, or build an async task-delivery system. Understanding this difference is key to next-generation AI product engineering.

Appendix

This appendix provides a lightweight integration approach for model service + Google ecosystem reality.

Current state: many ADK-based agents use ADK’s LiteLLM compatibility layer to route through custom model services.

Core requirement: keep native Gemini capabilities in Google ADK (google-genai type system, tool calling, cachedContents, files/resumable upload, SSE, etc.) while centralizing upstream switching/governance/metrics in our own gateway.

Under current google-adk, this implies: gateway must expose Gemini Developer API (AI Studio / v1beta) compatible surface. Otherwise you either connect Google directly, or fallback to OpenAI endpoint + LiteLLM with capability drift.

ADK’s three current integration paths (and constraints)

  1. ADK Gemini (default)

    • Uses google-genai SDK (best native Gemini experience).
    • But ADK’s google.genai.Client(...) does not pass base_url explicitly in the default path; only tracking headers are injected (so gateway routing depends on google-genai’s base-URL behavior and environment variables).
    • See google/adk/models/google_llm.py (Gemini.api_client).
  2. ADK ApigeeLlm (named Apigee but effectively proxy client)

    • Also uses google-genai, but explicitly supports proxy_url/base_url and custom_headers.
    • See google/adk/models/apigee_llm.py with HttpOptions(base_url=proxy_url).
    • Limitation: this does not reduce required gateway protocol compatibility. In vertex_ai mode it often requires GOOGLE_CLOUD_PROJECT/LOCATION and may introduce caller-side GCP credential paths (against “centralize creds/switching in gateway” goal).
  3. ADK LiteLlm (OpenAI endpoint/provider-style)

    • Good for quick adoption when OpenAI-compatible gateway already exists.
    • But if native Gemini semantics are required (especially cachedContents, files/resumable upload, and some tool/stream behaviors), ChatCompletions compatibility needs extra translation and is not guaranteed 1:1. Complexity still returns to gateway.

If the goal is to keep native Gemini invocation, the most stable of the three ADK paths is ApigeeLlm, since it is the only one that explicitly supports proxy_url/base_url and custom_headers.

A subtle but critical limitation: ADK context caching does not carry per-request custom routing headers.
