Building a Coding Agent in Swift, Part 6: Context Compaction

Our agent has come a long way. It runs commands, reads and writes files, tracks its own work, delegates to subagents, and loads skills on demand — seven tools, one loop. But every one of those capabilities adds to the same growing resource: the messages array. A single read_file on a 1,000-line source file costs roughly 4,000 tokens. Load a skill body, and that’s another 2,000. After reading 30 files and running 20 bash commands across a long session, the context pushes past 100,000 tokens. At that point, the agent either hits the API’s context window limit and errors out, or — more subtly — the model’s response quality degrades as the relevant information gets buried in a sea of stale tool results.

This is the threshold that separates a demo from a useful tool. Everything we’ve built so far assumes the context has room. Once it doesn’t, the agent has a hard ceiling on how much work it can do in a single session. That’s where context compaction comes in: a three-layer compression strategy that progressively shrinks the messages array — quietly trimming old results, automatically summarizing when a threshold is crossed, and letting the model request compression explicitly. With these three layers working together, the agent can run indefinitely.

In this guide, let’s build ContextCompactor — the type that implements all three layers — and wire it into the agent loop. This is the beginning of Act III in our series: the agent now needs to manage its own memory.

The complete source code for this stage is available at the 06-context-compaction tag on GitHub. Code blocks below show key excerpts.

Three layers, three strategies

The compression strategy works in layers, each more aggressive than the last.

Layer 1 — micro-compact — runs silently before every API call. It scans the messages array for old tool results (anything beyond the three most recent) and replaces their content with a short placeholder like "[Previous: used read_file]". The model still sees that a tool was called and what kind. The actual output — the 500-line file, the verbose bash dump — is dropped. That’s the whole point: old tool results have usually done their job by the time micro-compact catches them, the recent three are kept untouched as the active working set, and anything older can be re-fetched by calling the tool again if needed. No API call, no summarization, just structural pruning — the cheapest of the three layers.

Layer 2 — auto-compact — triggers when the estimated token count crosses a threshold (50,000 by default). This is the dramatic one: the agent saves the entire conversation transcript to disk as a JSONL file, then asks the LLM itself to summarize the conversation. The summary replaces the entire messages array — every prior turn collapses into two messages: a user message containing the compressed summary and an assistant acknowledgment. The conversation continues from there with a clean slate and full context of what happened.

Layer 3 — the compact tool — is the same summarization as layer 2, but triggered deliberately. The model calls compact when it decides compression would help, optionally specifying a focus parameter to guide what the summary should preserve. It’s the difference between automatic garbage collection and an explicit free() — sometimes the model knows best when to compress.

The ContextCompactor type

Let’s start with the type that owns all three layers. ContextCompactor holds two configuration values — the path where transcripts are saved and the token threshold that triggers auto-compaction — and exposes methods for each layer:

// Sources/Core/ContextCompactor.swift
public struct ContextCompactor: Sendable {
  public static let keepRecent = 3
  public static let minContentLength = 100

  public let transcriptDirectory: String
  public let tokenThreshold: Int

  public init(
    transcriptDirectory: String,
    tokenThreshold: Int = Limits.defaultTokenThreshold
  ) {
    self.transcriptDirectory = transcriptDirectory
    self.tokenThreshold = tokenThreshold
  }
}

The keepRecent and minContentLength constants control micro-compact’s behavior: keep the three most recent tool results untouched, and only replace results longer than 100 characters. Anything shorter isn’t worth compacting.

Micro-compact: the quiet layer

The microCompact method scans the messages array for every .toolResult content block, identifies which ones are old enough to compress, and replaces their content with a placeholder. One thing to keep in mind here is that Message.content is a let property — we can’t mutate a content block in place. Instead, we reconstruct entire Message values with new content arrays:

public func microCompact(messages: inout [Message]) {
  let toolResultLocations = findToolResultLocations(in: messages)
  guard toolResultLocations.count > Self.keepRecent else {
    return
  }

  let toolNameMap = buildToolNameMap(from: messages)
  let oldResults = toolResultLocations.dropLast(Self.keepRecent)
  var modifiedContents: [Int: [ContentBlock]] = [:]

  for (msgIdx, contentIdx) in oldResults {
    guard
      case .toolResult(let toolUseId, let content, let isError) = messages[msgIdx].content[contentIdx],
      content.count > Self.minContentLength
    else {
      continue
    }

    let toolName = toolNameMap[toolUseId] ?? "unknown"
    let replacement = ContentBlock.toolResult(
      toolUseId: toolUseId,
      content: "[Previous: used \(toolName)]",
      isError: isError
    )

    if modifiedContents[msgIdx] == nil {
      modifiedContents[msgIdx] = messages[msgIdx].content
    }
    modifiedContents[msgIdx]![contentIdx] = replacement
  }

  for (msgIdx, newContent) in modifiedContents {
    messages[msgIdx] = Message(role: messages[msgIdx].role, content: newContent)
  }
}

The method is intentionally synchronous — it’s pure data transformation with no reason to await anything. Two private helpers do the scanning: findToolResultLocations collects every toolResult position in the array, and buildToolNameMap walks assistant messages to map each toolUseId back to its tool name — bridging a gap in the API’s data model where toolResult blocks carry an ID but no name.

Auto-compact: threshold-triggered summarization

Layer 2 needs to answer a question before it can act: how many tokens are we using? The API doesn’t tell us the context size mid-conversation, so we estimate:

public func estimateTokens(from messages: [Message]) -> Int {
  let data = (try? JSONEncoder().encode(messages)) ?? Data()
  return data.count / 4
}

The divide-by-four heuristic is rough, but it’s close enough for a threshold check — and JSON encoding closely matches the actual API payload size, which is what we care about.

When the estimate crosses the threshold, autoCompact takes over. It saves the full transcript to disk first — nothing is truly lost — then asks the LLM to summarize:

public func autoCompact(
  messages: [Message],
  using apiClient: APIClientProtocol,
  model: String,
  focus: String?
) async -> [Message] {
  do {
    let path = try saveTranscript(messages)

    let encoder = JSONEncoder()
    let data = (try? encoder.encode(messages)) ?? Data()

    var transcript = String(data: data, encoding: .utf8) ?? "[]"
    if transcript.count > Self.maxSummaryInputLength {
      transcript = String(transcript.prefix(Self.maxSummaryInputLength)) + "\n[truncated]"
    }

    var prompt = ""
    if let focus, !focus.isEmpty {
      prompt += "Focus on: \(focus). "
    }
    prompt += """
      Summarize this conversation for continuity. Include: \
      1) What was accomplished, 2) Current state, 3) Key decisions made. \
      Be concise but preserve critical details.

      \(transcript)
      """

    let request = APIRequest(
      model: model,
      maxTokens: 2000,
      messages: [.user(prompt)]
    )
    let response = try await apiClient.createMessage(request: request)
    let summary = response.content.textContent

    return [
      .user("[Conversation compressed. Transcript: \(path)]\n\n\(summary)"),
      .assistant("Understood. I have the context from the summary. Continuing.")
    ]
  } catch {
    print("[warning] Auto-compact failed: \(error). Keeping original messages.")
    return messages
  }
}

The do/catch wrapping the entire method body is a deliberate safety net — compaction failure should never crash the agent loop. If the API call fails or the transcript can’t be written, the method prints a warning and returns the original messages unchanged. The agent continues with a full context rather than no context.

The saveTranscript method writes each message as a single JSON line to a .transcripts/ directory. One early version used a bare Unix timestamp for the filename, which created collisions when two compactions happened in the same second. The fix appends a UUID prefix:

let timestamp = Int(Date().timeIntervalSince1970)
let unique = UUID().uuidString.prefix(8)
let path = "\(transcriptDirectory)/transcript_\(timestamp)_\(unique).jsonl"

The compact tool and two-phase dispatch

Layer 3 gives the model direct control over compression. The compact tool definition includes an optional focus parameter that lets the model specify what the summary should preserve:

ToolDefinition(
  name: "compact",
  description: "Compress conversation history to free context space. Use when working on long tasks.",
  inputSchema: .object([
    "type": "object",
    "properties": .object([
      "focus": .object([
        "type": "string",
        "description": "What to preserve in the summary (e.g., 'file paths edited', 'current task progress')"
      ])
    ]),
    "required": .array([])
  ])
)

The handler, though, is surprising — it doesn’t actually compact anything:

private func executeCompact(_ input: JSONValue) async -> Result<String, ToolError> {
  .success("Compressing...")
}

This is the two-phase dispatch pattern. The compact tool can’t perform the actual compaction because tool handlers return Result<String, ToolError> — they don’t have access to the messages array. The real work needs to happen in the loop, where messages is a local var. So the handler returns a marker string, and processToolUses captures the focus parameter as a signal:

struct ToolProcessingResult {
  let results: [ContentBlock]
  let didUseTodo: Bool
  let compactFocus: String?
}

The compactFocus field is nil when compact wasn’t called, and holds the focus value (or an empty string for no focus) when it was. This replaces the growing tuple that processToolUses previously returned — a named struct with a clear nil-vs-present semantic is easier to reason about than a third tuple element.

Inside processToolUses, the compact detection is a simple check alongside the existing didUseTodo tracking:

if name == "compact" {
  compactFocus = input["focus"]?.stringValue ?? ""
}

Wiring into the agent loop

With all three layers built, let’s connect them. The applyCompaction helper runs layers 1 and 2 in sequence:

private func applyCompaction(_ messages: [Message]) async -> [Message] {
  var compacted = messages
  contextCompactor.microCompact(messages: &compacted)

  if contextCompactor.estimateTokens(from: compacted) > contextCompactor.tokenThreshold {
    print("[auto_compact triggered]")
    return await contextCompactor.autoCompact(
      messages: compacted, using: apiClient, model: model, focus: nil
    )
  }

  return compacted
}

Micro-compact runs first (every turn), then the threshold check determines whether auto-compact fires. The method takes messages by value and returns a new array — the same pure-value pattern we’ve used since extracting agentLoop for subagents.

In the loop itself, applyCompaction runs before each API call, and manual compaction runs after tool results are appended:

while true {
  try Task.checkCancellation()

  iteration += 1
  if iteration > config.maxIterations {
    return (lastAssistantText + "\n(\(config.label) reached iteration limit)", messages)
  }

  messages = await applyCompaction(messages)

  let request = APIRequest(
    model: model, maxTokens: Limits.defaultMaxTokens,
    system: systemPrompt, messages: messages, tools: config.tools
  )

  let response = try await apiClient.createMessage(request: request)
  messages.append(Message(role: .assistant, content: response.content))
  // ... print, check stop reason, process tools ...

  messages.append(Message(role: .user, content: toolResults))

  if let compactFocus = toolProcessing.compactFocus {
    print("[manual compact]")
    messages = await contextCompactor.autoCompact(
      messages: messages, using: apiClient, model: model, focus: compactFocus
    )
  }
}

The placement matters. Micro-compact and auto-compact run before the API call, so the request always goes out with a trimmed context. Manual compact runs after tool results are appended, so the summary includes the compact tool call itself — the model’s explicit decision to compress is preserved in the transcript.

The compact tool is excluded from LoopConfig.subagent alongside agent and todo — a subagent shouldn’t be able to compress the parent’s history. But micro-compact and auto-compact do run in subagent loops, since subagents share the same agentLoop code path. A subagent making heavy read_file calls across its 30-iteration limit can benefit from the quiet cleanup.

With that in place, we now have an agent that manages its own memory. Three layers of compression, one new type, and two injection points in the loop — before the API call and after tool processing.

Taking it for a spin

Let’s build and run:

swift build && swift run agent

Try: Read every Swift file in the Sources/ directory one by one. Watch the terminal — after the first few files, earlier tool results in the context will start appearing as "[Previous: used read_file]" in subsequent API requests. That’s micro-compact doing its work silently.

For a more dramatic demonstration, keep reading files or ask the agent to explore a large codebase. When the estimated token count crosses 50,000, auto-compact triggers: the agent saves a full transcript to .transcripts/, asks the LLM for a summary, and continues with a fresh two-message context. Check the .transcripts/ directory afterward — the full conversation history is preserved as JSONL.

To see layer 3 in action, try: Use the compact tool to compress this conversation, focusing on what files we've read. The model calls compact with a focus parameter, the loop triggers summarization, and the conversation continues with a targeted summary.

What we’ve built and where it breaks

We now have an agent that can work indefinitely. Micro-compact quietly trims old tool results every turn. Auto-compact summarizes the full conversation when the context gets large. The compact tool gives the model deliberate control. Transcripts on disk mean nothing is truly lost — just moved out of active context.

The limitation is that compression is lossy. When auto-compact fires, the model loses access to the exact content of files it read, the precise error messages it encountered, the specific commands it ran. The summary preserves the gist — what was accomplished, the current state, key decisions — but not the details. For a long-running task with dozens of steps, the model might forget exactly which files it edited or which approach it tried and abandoned. The loop is still the invariant; tools are still the variable. But now one of those tools can reshape the loop’s own working memory — the first time in our series that the agent isn’t just acting on the world, but acting on itself. In the next guide, we’ll address the lossy-compression problem directly: a file-based task system that gives the agent durable state that survives compaction. Thanks for reading!