How do we write code using fewer tokens?

What we will cover

🧮 Token Basics

How tokens are calculated.

⚙️ Coding Agent Settings

Saving tokens in a coding agent.

🛠️ Tools

Exploring token-saving developer tools.

Classic Chat vs. Agent Mode

How these tools operate determines how tokens are consumed. The structural difference directly impacts overall cost.

💬 Classic AI Chat

Linear and predictable: one request, one response. The model stays idle after responding, so costs grow only with what you send manually and what it returns.

🔄 Coding Agent (Agent Mode)

Autonomous loops: a Reason-Act-Observe cycle. The agent runs in the background to execute plans, check logs, and run code, rebuilding the context on every turn.

Highlight the difference between user-driven prompts (where the developer is in full control of the token usage) and autonomous loops (where the agent runs multiple turns automatically without user approval).

Classic Chat: Single Transaction

In standard chat, you interact in a single exchange. You pay once for the context you send and the direct response.

The Chat Formula

Total = Context + Prompt + Completion
Context files: 10,000 tokens
Prompt: 1,000 tokens
Input sent: 11,000 tokens
Total Billed: ~11,000 tokens, plus the completion
Explain that this is the baseline. When a user submits a prompt, it's a static transaction. There is no automated background work, meaning token spend stops immediately after the response is generated.

The ReAct Loop Multiplier

Why does Agent Mode consume so many tokens? Because prompts are rebuilt in the background.

🔁 Repeated Context Reading

Every loop iteration forces the model to reread the entire context: the task, previous steps, and tool logs.

📂 Independent File Reading

The agent can read up to 5 additional files without your knowledge to locate relevant information.

Explain context accumulation and bloat. This is why a simple one-line prompt to an agent can turn into a 100k-token execution if it starts reading multiple files and logs in a loop.

Context Inflation: How Costs Accumulate

Classic chat is a single exchange. Agent runs charge you for every loop iteration, which contains the sum of all previous prompts, files, and tool logs.

The Accumulation Formula

Total = Sum over all loops (Context + History + Logs)
Loop 1: Initial Context = 11,000 tokens
⬇️
Loop 2: Context + Step 1 Logs = 12,000 tokens
⬇️
Loop 3: Context + History + Step 2 Logs = 13,500 tokens
⬇️
Loop 10: Full History + Accumulated Logs = 20,000 tokens
Total Billed: ~155,000 tokens
Explain that this diagram visualizes context compounding. By Loop 10, the developer has paid for Loops 1 through 10 combined, because LLMs are stateless and every action requires sending the full history back.
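
The loop arithmetic can be reproduced in a few lines of JavaScript. This is a minimal sketch that assumes the context grows by a fixed 1,000 tokens per loop (the slide's own per-loop figures are slightly uneven); `totalBilledTokens` and its parameters are hypothetical names for illustration, not part of any agent's API.

```javascript
// Hypothetical helper: sums the input tokens billed across an agent run.
// Assumption: the context grows by a constant amount on every loop.
function totalBilledTokens(initialContext, growthPerLoop, loops) {
  let total = 0;
  for (let i = 0; i < loops; i++) {
    // Loop i resends the initial context plus everything accumulated so far.
    total += initialContext + growthPerLoop * i;
  }
  return total;
}

// Classic chat: a single exchange with no growth.
const chatTotal = totalBilledTokens(11000, 0, 1);

// Agent run: 10 loops, growing from 11,000 to 20,000 tokens.
const agentTotal = totalBilledTokens(11000, 1000, 10);

console.log(chatTotal);  // 11000
console.log(agentTotal); // 155000
```

Under this assumption the per-loop context grows linearly while the cumulative bill grows roughly with the square of the loop count, which is why long agent runs cost far more than their final context size suggests.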

LLM Pricing & The Caching Discount

LLM API Pricing Table (June 2026)

Prices are in USD per 1M tokens.

Provider    Model                 Input / 1M   Cached Input / 1M   Output / 1M
Anthropic   Claude Haiku 4.5      $1.00        $0.10               $5.00
Anthropic   Claude Sonnet 4.6     $3.00        $0.30               $15.00
Anthropic   Claude Opus 4.8       $5.00        $0.50               $25.00
OpenAI      GPT-5.5 (Frontier)    $5.00        $0.50               $30.00
OpenAI      GPT-5.4 (Standard)    $2.50        $0.25               $15.00
OpenAI      GPT-5.4 Mini          $0.75        $0.075              $4.50
OpenAI      GPT-5.3 Codex         $2.50        $0.25               $15.00
Google      Gemini 3.5 Flash      $1.50        $0.15               $9.00

Why the 90% Drop Matters

For an autonomous coding agent executing a feedback loop against a 100,000-token context (codebase map, open files, logs) using Claude Sonnet 4.6:

❌ Without Caching: $0.30 / turn

Every single turn bills the full 100k input context from scratch.

✨ With Caching: $0.03 / turn

The 100k prefix is cached. You pay full price only for the small delta.

A 90% cost reduction flattens the agentic cost traps.
Highlight the pricing table for June 2026. Point out that for all three top providers (Anthropic, OpenAI, Google), the caching discount is consistently 90%. Use the Claude Sonnet 4.6 example: instead of burning 30 cents per message to re-read files, prompt caching drops it to 3 cents per message.
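
The per-turn figures follow directly from the table. The sketch below uses the Claude Sonnet 4.6 row; `turnInputCost` is a hypothetical helper, and it deliberately ignores output tokens and any cache-write surcharge a provider may apply.

```javascript
// Prices in USD per 1M tokens, from the Claude Sonnet 4.6 row of the table.
const INPUT_PRICE = 3.0;
const CACHED_INPUT_PRICE = 0.3;

// Hypothetical helper: input cost of one agent turn in USD.
// cachedTokens are billed at the cached rate, freshTokens at the full rate.
function turnInputCost(cachedTokens, freshTokens) {
  return (cachedTokens * CACHED_INPUT_PRICE + freshTokens * INPUT_PRICE) / 1_000_000;
}

const withoutCaching = turnInputCost(0, 100_000);
const withCaching = turnInputCost(100_000, 0);

console.log(withoutCaching.toFixed(2)); // 0.30
console.log(withCaching.toFixed(2));    // 0.03
```

Passing a non-zero `freshTokens` alongside the cached prefix models the delta that is still billed at full price on each turn.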

Managing Cache TTL & Structure

⏱️ The 5-Minute Rule

Caches persist for a sliding 5-to-10-minute TTL window:

  • Hot Cache: Active agent turns maintain the 90% discount.
  • Cold Start: Idle sessions expire, requiring a full-price re-warm.

Avoiding Cache Busts

1. Prefix Match Ordering

Caches match from the start. Put static system prompts and files first; put dynamic inputs (queries, timestamps) last.

2. Size Thresholds

Caching triggers only above a minimum prompt length (e.g., 1,024 tokens for Anthropic).

3. Deterministic Structure

Keep tool output keys, logs, and files in a fixed order. Shuffling order breaks the cache.

Explain the sliding TTL window (5-minute rule). Emphasize three cache bust triggers: prefix ordering mismatch, under-threshold size, and non-deterministic shuffles.
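
Prefix ordering and deterministic structure can be enforced in the code that assembles the prompt. The following is a minimal sketch, not any provider's SDK: `buildPrompt`, the file objects, and the separator format are all hypothetical.

```javascript
// Hypothetical prompt builder: static content first, dynamic content last.
function buildPrompt({ systemPrompt, files, query }) {
  // Sort files by path so the same set of files always serializes identically.
  const sortedFiles = [...files].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0
  );
  const staticPrefix = [
    systemPrompt,
    ...sortedFiles.map((f) => `--- ${f.path} ---\n${f.content}`),
  ].join("\n");
  // The query (and anything else that changes per turn) goes after the prefix.
  return { staticPrefix, prompt: `${staticPrefix}\n${query}` };
}

const files = [
  { path: "src/db.js", content: "export const db = {};" },
  { path: "src/auth.js", content: "export function verify() {}" },
];

const turn1 = buildPrompt({
  systemPrompt: "You are a coding agent.",
  files,
  query: "Fix the login bug.",
});
const turn2 = buildPrompt({
  systemPrompt: "You are a coding agent.",
  files: [...files].reverse(), // same files, different arrival order
  query: "Now add a test.",
});

console.log(turn1.staticPrefix === turn2.staticPrefix); // true
```

Because both turns share a byte-identical prefix, a prefix-matching cache can reuse it even though the files arrived in a different order and the query changed.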

Base Prompt Overheads

Every request carries a fixed system payload of ~10,000 tokens (system rules) that cannot be skipped. The baseline overhead scales based on the active mode:

💬 Ask Mode

~13,000 Tokens

Base instructions, UI state, and workspace indexes loaded automatically.

🤖 Agent Mode

~18,000 Tokens

Ask Mode baseline plus tool definitions and loop instructions.

Explain that the 10k token system instructions are the core foundation. Moving from Ask to Agent mode adds another 5k tokens of planner and tool schemas, creating a raw 18k token starting point before any files are read.

Model Context Protocol Overhead

Adding Model Context Protocol (MCP) servers and IDE extensions injects extra tokens into every execution turn.

🤖 Coding Agent Base

18,000 Tokens

Base system prompt for planning and boundaries.

➕ Custom MCPs & Extensions

+0 to 10,000+ Tokens

Tool schemas added by custom MCP tools and active extensions.

Total Start Cost: 18,000 to 28,000+ Tokens
Explain that the 18k base from the coding agent is just the starting point. When you load complex MCP tools like Jupyter or database connectors, their schemas inject an extra 0-10k+ tokens into the system prompt, inflating the initial payload.
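
Because the schemas are resent on every turn, the overhead multiplies across a run. The sketch below illustrates this; the 18,000-token base comes from the slide, while the per-server schema sizes are made-up values chosen to total 10,000.

```javascript
const AGENT_BASE_TOKENS = 18000; // from the slide

// Hypothetical schema sizes; real values depend on the MCP servers installed.
const mcpSchemaTokens = { jupyter: 4000, database: 6000 };

// Start cost of a single turn: base prompt plus every loaded schema.
function startCost(base, schemas) {
  return base + Object.values(schemas).reduce((sum, n) => sum + n, 0);
}

const perTurn = startCost(AGENT_BASE_TOKENS, mcpSchemaTokens);
const extraOverTenTurns = (perTurn - AGENT_BASE_TOKENS) * 10;

console.log(perTurn);           // 28000
console.log(extraOverTenTurns); // 100000
```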

On-Demand Agent Skills

Add agent skills using lightweight JavaScript functions. They run only when needed, avoiding the continuous token cost of persistent MCP servers.

terminal - workspace shell
$ /gh-answer-pr-comments 2137

➜ Step 1: Reading PR Comments
  Found comment on src/auth.js #L34: "Can we optimize this search to O(1)?"

➜ Step 2: Proposing Code Improvements
  Replacing Array.find() with Map lookup.
  Diff proposed: + const tokenMap = new Map(tokens.map(t => [t.id, t]));

➜ Step 3: Replying to Comment
  Posted: "Refactored the linear search to a Map lookup for O(1) performance in commit a3b8f1."

✔ Analysis complete. All comments resolved.
Explain how on-demand skills (like those defined on agentskills.io) can be triggered with a direct workspace slash command, running the full analysis (fetch, code review, git commit, PR update) without needing a heavy, persistent server schema.
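
One way to picture an on-demand skill is a small registry keyed by slash-command name, where the function body runs only when the command is typed. This is a hypothetical sketch, not the agentskills.io format: the `skills` map, `dispatch`, and the placeholder `run` body are all invented for illustration.

```javascript
// Hypothetical skill registry. Only the short description would need to be
// visible to the model; the body is loaded and run on demand.
const skills = new Map([
  [
    "gh-answer-pr-comments",
    {
      description: "Read PR comments, propose fixes, and reply.",
      // Placeholder body: a real skill would call the GitHub CLI or API here.
      run: (args) => `would process PR #${args[0]}`,
    },
  ],
]);

// Parses "/name arg1 arg2" and runs the matching skill.
function dispatch(input) {
  if (!input.startsWith("/")) return null; // not a slash command
  const [name, ...args] = input.slice(1).trim().split(/\s+/);
  const skill = skills.get(name);
  if (!skill) throw new Error(`Unknown skill: ${name}`);
  return skill.run(args);
}

console.log(dispatch("/gh-answer-pr-comments 2137")); // would process PR #2137
```
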
AI Mathematics

Plan First, Code Second

Why text planning is cheaper than coding on the fly, saving API budget and waiting time.

Introduce the planning section: explaining why mapping out solutions in text before writing code is the most significant token-saving strategy you can implement.

The Cost of "Coding on the Fly"

Asking an agent to write code immediately without a plan is an expensive mistake, triggering long feedback loops.

🔁 Feedback Death Loop

AI writes code ➔ compiler throws errors ➔ AI attempts blind fixes ➔ breaks adjacent modules ➔ repeats.

❄️ Context Snowball Effect

Every step appends previous prompts, files, and build logs to the history, growing the cost.

Explain these two key pitfalls to the audience:

🔁 Feedback Death Loop: The AI writes code ➔ compiler throws errors ➔ AI attempts blind fixes ➔ breaks adjacent modules ➔ repeats.

❄️ Context Snowball Effect: Every iteration appends all previous prompts, files, and build logs to the history. Each turn costs more than the last, so the cumulative bill grows faster than linearly.

Visualizing The Cost Traps

🔁 Feedback Death Loop

The agent attempts to fix compiler errors blindly. Without a high-level plan, it edits code, triggers new errors, and ends up in an endless, expensive loop.

💸
1. Write Code
2. Build Error
3. Blind Fix Attempt
4. Break Module
Loop continues indefinitely. Wastes time & money.

❄️ Context Snowball Effect

As the feedback loop runs, every compile error, terminal log, and retried code change is appended to the prompt history. The input context grows larger with each turn.

Turn 1: Initial Prompt + Files = 12,000 tokens
Turn 2: History + Compile Error = 22,000 tokens
Turn 3: History + Error Logs + Retries = 35,000+ tokens
The context grows every turn, and every turn is billed in full.
Talk through the two traps: 1. Feedback Death Loop: Explain how the agent cycles repeatedly through writing code and encountering build errors, attempting guesses that can break other modules. 2. Context Snowball Effect: Emphasize that because LLM requests are stateless, every new retry includes the entire chat history and all build logs, meaning developers pay progressively more for every attempt.

The Plan Paradox

Paying once for a text blueprint is cheaper than paying multiple times for code fixes.

📉 The Planning Overhead

Drafting a plan can still be expensive if the model reads the entire repository at once.

⚡ Token-Optimized Planning

By using file signatures and summary indexes, we keep planning costs low.

Correct the misconception: Planning is not automatically free. If the agent reads 50k tokens of code just to draft a plan, it is expensive. Emphasize that we must use token-optimized inputs (like signature-only reads and structured maps) to keep the planning cost low, which remains far cheaper than loop debugging.

Targeted Context Isolation

A plan lets you divide the task so the agent only reads files relevant to the current step.

❌ Unplanned: Blind Exploration

Agent searches the codebase aimlessly, reading unrelated files and wasting tokens.

✅ Planned: Isolated Context

The plan splits the task. In step 1, the agent reads only one file, avoiding repository noise.

Explain how planning enables context isolation. Instead of letting the agent read the whole project, the plan tells it exactly which file is needed for Step 1, cutting inputs dramatically.
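
Context isolation can be expressed as a plan where each step lists only the files it needs. In the sketch below, the file names, token counts, and plan steps are all hypothetical; in practice the counts would come from a tokenizer.

```javascript
// Hypothetical token counts per file.
const fileTokens = {
  "src/auth.js": 4200,
  "src/db.js": 6100,
  "src/ui/app.js": 9800,
  "README.md": 2400,
};

// Hypothetical plan: each step names only the files it needs.
const plan = [
  { step: "Pass the secret to jwt.verify", files: ["src/auth.js"] },
  { step: "Update the user lookup", files: ["src/db.js"] },
];

// Sums the token cost of reading the given files.
function tokensFor(paths) {
  return paths.reduce((sum, p) => sum + fileTokens[p], 0);
}

const wholeRepo = tokensFor(Object.keys(fileTokens));
const stepOne = tokensFor(plan[0].files);

console.log(wholeRepo); // 22500
console.log(stepOne);   // 4200
```

With these made-up numbers, step 1 reads under a fifth of what a blind pass over the whole repository would.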

The Math: Spontaneous vs. Planned

Spontaneous Coding

Code on the Fly

  • 1 direct code prompt
  • 4 compiler error loops
  • Context grows fast

Total Tokens: ~400,000

Planned Coding

Plan ➔ Implement

  • 1 planning prompt
  • 2-3 prompts clarifying requirements
  • 1-3 prompts refining the plan
  • 1 targeted execution step
  • Minimal context growth

Total Tokens: ~200,000
Average Savings: ~50% off your token bill
Review the math: Coding directly results in multiple compiler error loops and massive context snowballing. Planning first spends tokens on text up front, but it leads to a clean, targeted code delivery with far fewer retries.

Architect vs. Mechanic

Use the AI as an architect to design blueprints, not a mechanic trying random fixes.

📋 Rule: Blueprints First

Always agree on a text plan before letting the agent modify code.

💡 Cheap Edits

Fixing logical errors in text costs a single sentence. Fixing them in code costs thousands of tokens.

Conclude the planning section: shift the team's philosophy to validate designs in text first. Blueprints are cheap to fix; code is expensive.
Tool Showcase: Caveman Skill

The Caveman Skill

"Why use many token when few do trick."
A plugin that cuts ~75% of output tokens with 100% technical accuracy.

Introduce the Caveman skill: a developer plugin (available for Claude Code, Cursor, Cline, coding agents, etc.) that forces the model to drop conversational filler and write concise, high-density technical responses.

Less Word = More Correct

A March 2026 paper ("Brevity Constraints...") found that forcing LLMs to be brief improved accuracy by up to 26 points on benchmarks.

⚡ Speed Increase

Fewer output tokens mean the model finishes responding sooner, speeding up feedback.

🎯 100% Technical Accuracy

Logic and code remain the same; the model strips conversational fluff and replies in short fragments.

Explain the cognitive benefit of brevity for LLMs. Verbose is not always better: fewer tokens focus the model's output distribution on essential arguments, avoiding distractions and speed delays.

Output Comparison: React Re-render

🗣️ Normal Claude
"The reason your React component is re-rendering is likely because you're creating a new object reference on each render cycle. When you pass an inline object as a prop, React's shallow comparison sees it as a different object every time, which triggers a re-render. I'd recommend using useMemo to memoize the object."
69 Tokens
⛏️ Caveman Claude
"New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."
19 Tokens
Same technical solution. 72% fewer tokens. Faster response.
Saved: 50 output tokens (72% reduction)
Show the before/after: Normal Claude explains in full conversational paragraphs. Caveman Claude gives the exact same advice in brief, actionable technical fragments, saving 72% of output tokens.

Claude API Benchmark Savings

Real output token counts from the Claude API show an average output reduction of 65% across the full benchmark. The table lists selected tasks.

65%

Average Token Reduction

~3x

Output Generation Speedup

Benchmark Task                 Normal (tokens)   Caveman (tokens)   Saved
Explain React re-render bug    1,180             159                -87%
Fix auth middleware expiry     704               121                -83%
Set up PostgreSQL pool         2,347             380                -84%
Explain git rebase vs merge    702               292                -58%
Docker multi-stage build       1,042             290                -72%
Present the benchmark figures: Caveman demonstrates an average of 65% output reduction across standard prompts (ranging from 22% on refactorings to 87% on explanations). The table shows only a selection of tasks, so its rows sit above the overall average.
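
The "Saved" column can be recomputed from the raw token counts. The sketch below copies the five rows from the table; `percentSaved` is a hypothetical helper name.

```javascript
// Output token counts copied from the benchmark table.
const benchmarks = [
  { task: "Explain React re-render bug", normal: 1180, caveman: 159 },
  { task: "Fix auth middleware expiry", normal: 704, caveman: 121 },
  { task: "Set up PostgreSQL pool", normal: 2347, caveman: 380 },
  { task: "Explain git rebase vs merge", normal: 702, caveman: 292 },
  { task: "Docker multi-stage build", normal: 1042, caveman: 290 },
];

// Percentage of output tokens saved, rounded to the nearest whole percent.
function percentSaved(normal, caveman) {
  return Math.round(((normal - caveman) / normal) * 100);
}

const savings = benchmarks.map((b) => percentSaved(b.normal, b.caveman));

console.log(savings); // [ 87, 83, 84, 58, 72 ]
```

The same function applied to the React re-render comparison (69 versus 19 tokens) gives 72.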

Triggering Caveman in Chat

Activate the compression skill in your active session using slash commands.

agent chat session
dev: /caveman full, why i have so many rerenders
agent: "New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."
        
Explain how the skill is activated. The developer types /caveman full to change settings. When they follow up with "why i have so many rerenders", the agent outputs in caveman fragment mode.
Tool Showcase: RTK

Rust Token Killer

A CLI proxy that intercepts and compresses verbose terminal logs for AI agents.

Introduce RTK (Rust Token Killer) as an open-source development tool built to resolve the massive inefficiencies of AI-driven software engineering.

Why do we need RTK?

Agents waste a significant portion of their context window on terminal noise (like test logs or file trees). RTK filters this out.

⏳ Longer Sessions

Stops context windows from overflowing, making coding sessions last longer before needing a restart.

💰 Major Cost Reductions

Minimizes LLM API expenses by preventing redundant, bloated output from filling up your token history.

🎯 Better AI Reasoning

Shows only actionable data, improving reasoning and code fix accuracy.

Talk about the core benefits of RTK: longer session limits, massive cost reductions, and cleaner logs that help the AI think more clearly and make fewer mistakes.

Frictionless Auto-Rewrite

RTK runs silently in the background, intercepting commands using hooks, hidden from the agent.

🔌 Pre-Use Interception

Hooks catch command triggers and prefix them with rtk automatically.

Wrap up the execution loop by explaining the invisible proxy hook that makes RTK frictionless. The AI agent doesn't even know it's being optimized.

Invisible Command Interception

Minimalist Setup

No agent modifications, no API rewrites, and no SDK wrappers. Just append a single line to your shell profile to start optimizing every command automatically.

~/.zshrc
# Enable RTK shell hook integration
eval "$(rtk hook init)"
⚡ Near-zero overhead: The hook is registered at shell startup. It rewrites commands locally, before any LLM call starts.
1. Command Triggered

The AI agent executes a standard command in the workspace shell:

pytest -v

2. 🔌 Hook Interception & Rewrite

The shell hook intercepts the command and prefixes it on the fly:

rtk pytest -v

3. Noise-Filtered Execution

The command runs. Output (such as passing tests) is compressed before it returns to the AI agent.

Explain that a hook is simply a shell function or alias loader. By putting this one line in ~/.zshrc or ~/.bashrc, the user doesn't need to change how the agent is configured. Every tool call or manual execution is automatically piped through RTK, ensuring zero friction.
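
The rewrite step can be sketched as a pure function that prefixes known commands with `rtk`. This is an illustration of the idea only, not RTK's actual implementation: the `WRAPPED` set and `rewriteCommand` are hypothetical.

```javascript
// Hypothetical list of programs worth compressing; not RTK's real configuration.
const WRAPPED = new Set(["pytest", "git", "cargo"]);

// Prefixes a command with "rtk" unless it is already wrapped or not in the list.
function rewriteCommand(command) {
  const trimmed = command.trim();
  const program = trimmed.split(/\s+/)[0];
  if (program === "rtk" || !WRAPPED.has(program)) {
    return trimmed; // leave untouched
  }
  return `rtk ${trimmed}`;
}

console.log(rewriteCommand("pytest -v"));     // rtk pytest -v
console.log(rewriteCommand("rtk pytest -v")); // rtk pytest -v
console.log(rewriteCommand("ls -la"));        // ls -la
```

Checking for an existing `rtk` prefix keeps the rewrite idempotent, so a command that passes through the hook twice is not wrapped twice.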

JS File Read Signature Compression

Raw auth.js Code Read
import { db } from './db.js';

export async function login(user, pass) {
    const u = await db.getUser(user);
    if (!u || u.password !== pass) {
        throw new Error('Invalid login');
    }
    return generateToken(u);
}

export function verify(token) {
    if (!token) return false;
    try {
        const payload = jwt.verify(token);
        return payload.expiry > Date.now();
    } catch {
        return false;
    }
}
rtk read auth.js Output
// auth.js (Structural outline)
import { db } from './db.js';

export async function login(user, pass) { /* body stripped */ }
export function verify(token) { /* body stripped */ }
When exploring structure, RTK strips function bodies and sends signatures only.
Raw: 4,200 ➔ RTK: 210 tokens. Saved: 3,990 tokens (95%)
Explain that if the agent only needs to locate function signatures and export contracts to understand how modules link, rtk read strips implementation details, saving 95% on tokens.
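
To make the idea concrete, here is a deliberately naive body stripper. It is not how RTK works internally: `outline` and `countBraces` are hypothetical, and the brace counting ignores braces inside strings and comments, which a real tool would handle with a proper parser.

```javascript
// Net change in brace depth on one line (naive: ignores strings and comments).
function countBraces(line) {
  let delta = 0;
  for (const ch of line) {
    if (ch === "{") delta++;
    else if (ch === "}") delta--;
  }
  return delta;
}

// Keeps top-level lines and function signatures; drops function bodies.
function outline(source) {
  const out = [];
  let depth = 0; // brace depth inside the function currently being skipped
  const signature = /^\s*(export\s+)?(async\s+)?function\s+\w+\s*\([^)]*\)\s*\{/;
  for (const line of source.split("\n")) {
    if (depth > 0) {
      depth += countBraces(line); // still inside a body: skip the line
      continue;
    }
    const match = line.match(signature);
    if (match) {
      out.push(`${match[0].trim()} /* body stripped */ }`);
      depth = countBraces(line);
    } else if (line.trim() !== "") {
      out.push(line.trim());
    }
  }
  return out.join("\n");
}

const source = `import { db } from './db.js';

export async function login(user, pass) {
    const u = await db.getUser(user);
    if (!u || u.password !== pass) {
        throw new Error('Invalid login');
    }
    return generateToken(u);
}

export function verify(token) {
    if (!token) return false;
    return true;
}`;

console.log(outline(source));
```

Run on this sample, the function keeps the import line and emits one signature line per exported function, mirroring the outline shown on the slide.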

JS Git Diff Compression

Raw git diff Output
diff --git a/src/auth.js b/src/auth.js
index 45adcf2..2ba9c4b 100644
--- a/src/auth.js
+++ b/src/auth.js
@@ -12,5 +12,5 @@
 export function verify(token) {
-    const payload = jwt.verify(token);
+    const payload = jwt.verify(token, process.env.SECRET);
     return payload.expiry > Date.now();
 }
rtk git diff Output
diff --git a/src/auth.js b/src/auth.js
-    const payload = jwt.verify(token);
+    const payload = jwt.verify(token, process.env.SECRET);

[x412 lines of unmodified code collapsed]
RTK hides unchanged code blocks, git index info, and metadata headers.
Raw: 8,400 ➔ RTK: 504 tokens. Saved: 7,896 tokens (94%)
Explain diff compression: By stripping out surrounding context headers, git indexing logs, and collapsing large blocks of unchanged code, RTK isolates the exact modified lines to save 94% on diff payloads.
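
The same filtering can be sketched as a line-by-line pass over unified diff text. This is an illustrative approximation, not RTK's algorithm: `compressDiff` is hypothetical, and it would misclassify a removed line whose content itself begins with `--`.

```javascript
// Keeps file headers and changed lines; drops metadata and counts context lines.
function compressDiff(diff) {
  const kept = [];
  let collapsed = 0;
  for (const line of diff.split("\n")) {
    if (line.startsWith("diff --git")) {
      kept.push(line); // file header: keep so the agent knows which file changed
    } else if (
      line.startsWith("+++") ||
      line.startsWith("---") ||
      line.startsWith("index ") ||
      line.startsWith("@@")
    ) {
      continue; // git metadata and hunk headers
    } else if (line.startsWith("+") || line.startsWith("-")) {
      kept.push(line); // an actual change
    } else if (line.startsWith(" ")) {
      collapsed++; // unchanged context line
    }
  }
  if (collapsed > 0) {
    kept.push(`[x${collapsed} lines of unmodified code collapsed]`);
  }
  return kept.join("\n");
}

const rawDiff = [
  "diff --git a/src/auth.js b/src/auth.js",
  "index 45adcf2..2ba9c4b 100644",
  "--- a/src/auth.js",
  "+++ b/src/auth.js",
  "@@ -12,5 +12,5 @@",
  " export function verify(token) {",
  "-    const payload = jwt.verify(token);",
  "+    const payload = jwt.verify(token, process.env.SECRET);",
  "     return payload.expiry > Date.now();",
  " }",
].join("\n");

console.log(compressDiff(rawDiff));
```

For this ten-line input the result has four lines: the file header, the removed line, the added line, and a marker noting that three context lines were collapsed.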

RTK Token Savings Statistics

Benchmarked across 2,900+ real-world developer commands with an average of 89% noise reduction.

89%

Average Token Savings

59.3k

GitHub Stars (MIT License)

Command                     Token Reduction
pytest -v (Python)          -96.0%
rtk read (File Reading)     -95.0%
git diff                    -94.0%
cargo test (Rust)           -91.8%
git log                     -86.0%
Talk through the numbers: Point out that test output is compressed by over 90% (e.g. pytest at 96% and cargo test at 91.8%) by dropping the output for passing tests and retaining only failure logs. Highlight git diff (-94%) and the enormous community support (59.3k stars).
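
The test-log case can be sketched the same way: drop the lines for passing tests and keep everything else. The code below assumes `pytest -v` style lines ending in `PASSED` or `FAILED`; `compressTestLog` and the summary marker format are hypothetical, not RTK's output.

```javascript
// Collapses passing-test lines into a single count; keeps failures and other output.
function compressTestLog(log) {
  let passed = 0;
  const kept = [];
  for (const line of log.split("\n")) {
    if (/\bPASSED\b/.test(line)) {
      passed++; // passing tests carry no actionable information
    } else if (line.trim() !== "") {
      kept.push(line);
    }
  }
  kept.push(`[${passed} passing tests collapsed]`);
  return kept.join("\n");
}

const testLog = [
  "tests/test_auth.py::test_login PASSED",
  "tests/test_auth.py::test_logout PASSED",
  "tests/test_auth.py::test_verify FAILED",
].join("\n");

console.log(compressTestLog(testLog));
// tests/test_auth.py::test_verify FAILED
// [2 passing tests collapsed]
```
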

Conclusion: Token-Optimized Coding

⚡ Prefer Skills + CLI over MCPs

Use lightweight, on-demand shell actions instead of persistent, heavy MCP servers. This avoids paying the schema overhead on every prompt.

🗡️ RTK (Input Compression)

Filter out verbose unit test logs, file listings, and repository metadata to drastically reduce your prompt's input tokens.

⛏️ Caveman (Output Compression)

Force the model to strip conversational filler and answer in concise fragments, saving roughly 65-75% of response tokens.

📋 Planning First

Validate technical blueprints in text to isolate module context, prevent feedback loops, and accelerate development speed.

Summarize the key strategies for the team: 1. Replace heavy MCP servers with lightweight Workspace Skills. 2. Run RTK to compress input logs by 80-90%. 3. Activate Caveman Mode to cut output tokens by roughly 65-75%. 4. Always write a blueprint plan first to keep agent context clean and isolated.