How do we write code using fewer tokens?

Agenda

What we will cover

🧮

Token Basics

How tokens are calculated.

⚙️

Coding Agent Settings

Saving tokens in a coding agent.

🛠️

Tools

Exploring token-saving developer tools.

Architecture

Classic Chat vs. Agent Mode

How these tools operate determines how tokens are consumed. The structural difference directly impacts overall cost.

💬

Classic AI Chat

Linear & Predictable: One-shot request and response. Pauses and remains idle after responding. Costs grow based on manual inputs and output.

🔄

Coding Agent (Agent Mode)

Autonomous Loops: Reason-Act-Observe cycle. Runs in the background to execute plans, check logs, and run code, rebuilding the context on every turn.

Highlight the difference between user-driven prompts (where the developer is in full control of the token usage) and autonomous loops (where the agent runs multiple turns automatically without user approval).

Visualizing Token Accumulation

Classic Chat: Single Transaction

In standard chat, you interact in a single exchange. You pay once for the context you send and the direct response.

The Chat Formula


Total = Context + Prompt + Completion

Prompt + Context Files: 10,000 Context + 1,000 Prompt = 11,000 Tokens

Total Billed: ~11,000 input tokens, plus the completion

Explain that this is the baseline. When a user submits a prompt, it's a static transaction. There is no automated background work, meaning token spend stops immediately after the response is generated.

Deep Dive

The ReAct Loop Multiplier

Why does Agent Mode consume so many tokens? Because prompts are rebuilt in the background.

🔁 Repeated Context Reading

Every loop iteration forces the model to reread the entire context: the task, previous steps, and tool logs.

📂 Independent File Reading

The agent can read up to 5 additional files without your knowledge to locate relevant information.

Explain context accumulation and context bloat. This is why a simple one-line prompt to an agent can turn into a 100k-token execution if it starts reading multiple files and logs in a loop.
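The repeated reading can be shown with a toy loop. This is only a sketch: `count_tokens` is a crude word counter standing in for a real tokenizer, and `run_agent` and the sample observations are hypothetical, not any agent's actual implementation.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def run_agent(task: str, observations: list) -> list:
    """Simulate an agent loop and return the input size sent on each turn."""
    history = [task]
    sent_per_turn = []
    for observation in observations:
        # The API is stateless, so the whole history is resent every turn.
        prompt = "\n".join(history)
        sent_per_turn.append(count_tokens(prompt))
        # The tool output from this turn becomes part of the next prompt.
        history.append(observation)
    return sent_per_turn


sent = run_agent(
    "fix the failing login test",
    ["ran pytest: 1 failed", "read auth.js: 40 lines", "applied patch: ok"],
)
print(sent)       # input size of each request
print(sum(sent))  # what the whole run is billed for
```

The task text is counted three times and the first observation twice, even though each was written only once.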

Visualizing Token Accumulation

Context Inflation: How Costs Accumulate

Classic chat is a single exchange. Agent runs charge you for every loop iteration, which contains the sum of all previous prompts, files, and tool logs.

The Accumulation Formula


Total = Sum over loops i of (Context + History_i + Log_i)

Loop 1: Initial Context 11,000 Tokens

⬇️

Loop 2: Context + Step 1 Logs 12,000 Tokens

⬇️

Loop 3: Context + History + Step 2 Logs 13,500 Tokens

⬇️

Loop 10: Full History + Accumulated Logs 20,000 Tokens

Total Billed: ~155,000 Tokens

Explain that this diagram visualizes context compounding. Because LLM APIs are stateless, every action resends the full history, so the Loop 10 request alone carries everything from Loops 1 through 9. The total bill is the sum of all ten requests: roughly 155,000 tokens, compared with about 11,000 for a single chat exchange.
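The ~155,000 figure can be reproduced with a short sum. The sketch assumes the input grows by a flat 1,000 tokens per loop, which only approximates the diagram (Loop 3 is shown as 13,500).

```python
from itertools import accumulate

# Assumption: a flat 1,000 extra tokens per loop,
# from 11,000 in loop 1 to 20,000 in loop 10.
per_loop_input = [11_000 + 1_000 * i for i in range(10)]

# Running total of billed input tokens after each loop.
running_total = list(accumulate(per_loop_input))

print(per_loop_input[-1])  # size of the final request
print(running_total[-1])   # total billed across all ten requests
```

The final request is under twice the size of the first, yet the run as a whole bills about fourteen times the single chat exchange.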

Unit Economics

LLM Pricing & The Caching Discount

LLM API Pricing Table (June 2026)

Provider	Model	Input / 1M	Cached Input / 1M	Output / 1M
Anthropic	Claude Haiku 4.5	$1.00	$0.10	$5.00
Anthropic	Claude Sonnet 4.6	$3.00	$0.30	$15.00
Anthropic	Claude Opus 4.8	$5.00	$0.50	$25.00
OpenAI	GPT-5.5 (Frontier)	$5.00	$0.50	$30.00
OpenAI	GPT-5.4 (Standard)	$2.50	$0.25	$15.00
OpenAI	GPT-5.4 Mini	$0.75	$0.075	$4.50
OpenAI	GPT-5.3 Codex	$2.50	$0.25	$15.00
Google	Gemini 3.5 Flash	$1.50	$0.15	$9.00

Why the 90% Drop Matters

For an autonomous coding agent executing a feedback loop against a 100,000 token context (codebase map, open files, logs) using Claude Sonnet 4.6 ($3.00 per 1M input tokens, $0.30 per 1M cached):

❌ Without Caching $0.30 / turn

Every single turn bills the full 100k input context from scratch.

✨ With Caching $0.03 / turn

The 100k prefix is cached. You only pay full price for the tiny delta.

A 90% reduction in input cost blunts the agentic cost traps.

Highlight the pricing table for June 2026. Point out that for all three top providers (Anthropic, OpenAI, Google), the caching discount is consistently 90%. Use the Claude Sonnet 4.6 example: instead of burning 30 cents per message to re-read files, prompt caching drops it to 3 cents per message.
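The per-turn figures follow directly from the table. The sketch below is simplified: it ignores any cache-write charges and the small uncached delta, and `input_cost_usd` is a hypothetical helper.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens * price_per_million_usd / 1_000_000


CONTEXT_TOKENS = 100_000
SONNET_INPUT = 3.00         # USD per 1M input tokens (from the table)
SONNET_CACHED_INPUT = 0.30  # USD per 1M cached input tokens (from the table)

uncached = input_cost_usd(CONTEXT_TOKENS, SONNET_INPUT)
cached = input_cost_usd(CONTEXT_TOKENS, SONNET_CACHED_INPUT)

print(f"uncached: ${uncached:.2f} per turn")
print(f"cached:   ${cached:.2f} per turn")
print(f"saving:   {1 - cached / uncached:.0%}")
```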

Cache Dynamics

Managing Cache TTL & Structure

⏱️ The 5-Minute Rule

Caches persist using a sliding 5-to-10 minute TTL window:

Hot Cache: Active agent turns maintain the 90% discount.
Cold Start: Idle sessions expire, requiring a full-price re-warm.

Avoiding Cache Busts

1. Prefix Match Ordering

Caches match from the start. Put static system prompts and files first; put dynamic inputs (queries, timestamps) last.

2. Size Thresholds

Caching only applies once the prompt exceeds a minimum length (e.g., 1,024 tokens for Anthropic).

3. Deterministic Structure

Keep tool output keys, logs, and files in a fixed order. Shuffling order breaks the cache.

Explain the sliding TTL window (5-minute rule). Emphasize three cache bust triggers: prefix ordering mismatch, under-threshold size, and non-deterministic shuffles.
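Prefix ordering and deterministic structure can be sketched together. `build_prompt` and the sample data are hypothetical. The sketch only demonstrates that two requests share an identical prefix; it does not call any provider's caching API.

```python
import json
import os


def build_prompt(system: str, files: dict, tool_output: dict, query: str) -> str:
    """Assemble a prompt with static parts first and dynamic parts last."""
    parts = [system]
    # Sort files by path so the same set always serializes identically.
    for path in sorted(files):
        parts.append(f"### {path}\n{files[path]}")
    # sort_keys gives the tool output a fixed key order.
    parts.append(json.dumps(tool_output, sort_keys=True))
    # The dynamic query goes last so it cannot disturb the shared prefix.
    parts.append(query)
    return "\n\n".join(parts)


system = "You are a coding agent."
files_a = {"src/auth.js": "export function verify() {}", "src/db.js": "export const db = {}"}
# Same files, inserted in a different order.
files_b = {"src/db.js": "export const db = {}", "src/auth.js": "export function verify() {}"}
tools = {"status": "ok", "exit_code": 0}

first = build_prompt(system, files_a, tools, "Why does login fail?")
second = build_prompt(system, files_b, tools, "Add a logout route.")

shared = os.path.commonprefix([first, second])
static_part = first[: first.rfind("Why does login fail?")]
print(shared == static_part)  # the two requests differ only in the query
```

If the query or a timestamp were placed first, the shared prefix would be empty and nothing could be reused.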

GitHub Copilot

Base Prompt Overheads

Every request carries a fixed system payload of ~10,000 tokens (system rules) that cannot be skipped. The baseline overhead scales based on the active mode:

💬 Ask Mode

~13,000 Tokens

Base instructions, UI state, and workspace indexes loaded automatically.

🤖 Agent Mode

~18,000 Tokens

Ask Mode baseline, tool definitions, and loop instructions.

Explain that the 10k token system instructions are the core foundation. Moving from Ask to Agent mode adds another 5k tokens of planner and tool schemas, creating a raw 18k token starting point before any files are read.

Extensibility

Model Context Protocol Overhead

Adding Model Context Protocol (MCP) servers and IDE extensions injects extra tokens into every execution turn.

🤖 Coding Agent Base

18,000 Tokens

Base system prompt for planning and boundaries.

➕ Custom MCPs & Extensions

+0 to 10,000+ Tokens

Tool schemas added by custom MCP tools and active extensions.

Total Start Cost: 18,000 to 28,000+ Tokens

Explain that the 18k base from the coding agent is just the starting point. When you load complex MCP tools like Jupyter or database connectors, their schemas inject an extra 0-10k+ tokens into the system prompt, inflating the initial payload.
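The starting payload is simple addition, but it recurs on every turn. In the sketch below, the mode baselines come from the slides, while the per-server schema sizes are hypothetical values picked to fall inside the 0-10k range.

```python
BASE_TOKENS = {"ask": 13_000, "agent": 18_000}  # figures from the slides


def starting_payload(mode: str, mcp_schema_tokens: dict) -> int:
    """Tokens sent before any file is read: mode baseline plus MCP tool schemas."""
    return BASE_TOKENS[mode] + sum(mcp_schema_tokens.values())


# Hypothetical schema sizes, for illustration only.
servers = {"jupyter": 4_000, "database": 3_500}

per_turn = starting_payload("agent", servers)
print(per_turn)       # overhead carried by a single turn
print(per_turn * 10)  # the same overhead repeated across a 10-turn run
```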

Optimization Alternative

On-Demand Agent Skills

Add agent skills using lightweight JavaScript functions. They run only when needed, avoiding the continuous token cost of persistent MCP servers.

terminal - workspace shell

$ /gh-answer-pr-comments 2137

➜ Step 1: Reading PR Comments
  Found comment on src/auth.js #L34: "Can we optimize this search to O(1)?"

➜ Step 2: Proposing Code Improvements
  Replacing Array.find() with Map lookup.
  Diff proposed: + const tokenMap = new Map(tokens.map(t => [t.id, t]));

➜ Step 3: Replying to Comment
  Posted: "Refactored the linear search to a Map lookup for O(1) performance in commit a3b8f1."

✔ Analysis complete. All comments resolved.

Explain how on-demand skills (like those defined on agentskills.io) can be triggered with a direct workspace slash command, running the full analysis (fetch, code review, git commit, PR update) without needing a heavy, persistent server schema.

AI Mathematics

Plan First,
Code Second

Why text planning is cheaper than coding on the fly, saving API budget and waiting time.

Introduce the planning section: explaining why mapping out solutions in text before writing code is the most significant token-saving strategy you can implement.

Pitfalls

The Cost of "Coding on the Fly"

Asking an agent to write code immediately without a plan is an expensive mistake, triggering long feedback loops.

🔁 Feedback Death Loop

AI writes code ➔ compiler throws errors ➔ AI attempts blind fixes ➔ breaks adjacent modules ➔ repeats.

❄️ Context Snowball Effect

Every step appends previous prompts, files, and build logs to the history, growing the cost.

Explain these two key pitfalls to the audience:

🔁 Feedback Death Loop: The AI writes code ➔ compiler throws errors ➔ AI attempts blind fixes ➔ breaks adjacent modules ➔ repeats.

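❄️ Context Snowball Effect: Every iteration appends all previous prompts, files, and build logs to the history. Because each turn is billed on the full history, the cumulative cost grows roughly quadratically with the number of turns rather than linearly.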

AI Pitfalls in Action

Visualizing The Cost Traps

🔁 Feedback Death Loop

The agent attempts to fix compiler errors blindly. Without a high-level plan, it edits code, triggers new errors, and ends up in an endless, expensive loop.

💸

1. Write Code

2. Build Error

3. Blind Fix Attempt

4. Break Module

Loop continues indefinitely. Wastes time & money.

❄️ Context Snowball Effect

As the feedback loop runs, every compile error, terminal log, and retry code is appended to the prompt history. The input context grows larger with each turn.

Turn 1: Initial Prompt + Files 12,000 Tokens

Turn 2: History + Compile Error 22,000 Tokens

Turn 3: History + Error Logs + Retries 35,000+ Tokens

Context grows every turn, and every turn is billed in full.

Talk through the two traps:
1. Feedback Death Loop: Explain how the agent cycles repeatedly through writing code and encountering build errors, attempting guesses that can break other modules.
2. Context Snowball Effect: Emphasize that because LLM requests are stateless, every new retry includes the entire chat history and all build logs, meaning developers pay progressively more for every attempt.
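A closed-form sketch shows how fast "progressively more" adds up. It assumes every turn adds the same number of tokens; the 12,000 base matches Turn 1 above, and the 10,000 growth is taken from the Turn 1 to Turn 2 step. Real runs are lumpier.

```python
def cumulative_input(base: int, growth: int, turns: int) -> int:
    """Total input tokens billed when turn k sends base + (k - 1) * growth."""
    # n * base for the fixed part, plus growth * (0 + 1 + ... + (n - 1)).
    return turns * base + growth * turns * (turns - 1) // 2


def cumulative_input_loop(base: int, growth: int, turns: int) -> int:
    """Same total, computed turn by turn as a cross-check."""
    return sum(base + k * growth for k in range(turns))


for turns in (3, 6, 12):
    print(turns, cumulative_input(12_000, 10_000, turns))
```

Under these assumptions, doubling a run from 6 to 12 turns raises the bill about 3.6 times, because the growth term scales with the square of the turn count.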

Optimization Mechanics

The Plan Paradox

Paying once for a text blueprint is cheaper than paying multiple times for code fixes.

📉 The Planning Overhead

Drafting a plan can still be expensive if the model reads the entire repository at once.

⚡ Token-Optimized Planning

By using file signatures and summary indexes, we keep planning costs low.

Correct the misconception: Planning is not automatically free. If the agent reads 50k tokens of code just to draft a plan, it is expensive. Emphasize that we must use token-optimized inputs (like signature-only reads and structured maps) to keep the planning cost low, which remains far cheaper than loop debugging.

Context Management

Targeted Context Isolation

A plan lets you divide the task so the agent only reads files relevant to the current step.

❌ Unplanned: Blind Exploration

Agent searches the codebase aimlessly, reading unrelated files and wasting tokens.

✅ Planned: Isolated Context

The plan splits the task. In step 1, the agent reads only one file, avoiding repository noise.

Explain how planning enables context isolation. Instead of letting the agent read the whole project, the plan tells it exactly which file is needed for Step 1, cutting inputs dramatically.
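Context isolation can be sketched as a plan whose steps name their own files. `PlanStep`, `context_for_step`, and the fake repository contents are hypothetical, and character counts stand in for token counts.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    description: str
    files: list


def context_for_step(step: PlanStep, repo: dict) -> str:
    """Return only the files the step names, ignoring the rest of the repo."""
    return "\n\n".join(f"### {path}\n{repo[path]}" for path in step.files)


# Fake repository: path -> file contents (sizes chosen for illustration).
repo = {"src/auth.js": "a" * 400, "src/db.js": "b" * 4000, "src/ui.js": "c" * 8000}

plan = [
    PlanStep("Pass the secret to jwt.verify", ["src/auth.js"]),
    PlanStep("Add an index for getUser", ["src/db.js"]),
]

step_context = context_for_step(plan[0], repo)
whole_repo = context_for_step(PlanStep("read everything", list(repo)), repo)
print(len(step_context), len(whole_repo))
```

Step 1 sends a small fraction of what blind exploration would send, and the unrelated files never enter the history that later turns resend.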

Case Study Comparison

The Math: Spontaneous vs. Planned

Spontaneous Coding

Code on the Fly

• 1 direct code prompt
• 4 compiler error loops
• Context grows fast

Total Tokens: ~400,000

Planned Coding

Plan ➔ Implement

• 1 planning prompt
• 2-3 prompts clarifying requirements
• 1-3 prompts refining the plan
• 1 targeted execution step
• Minimal context growth

Total Tokens: ~200,000

Savings in this example: ~50% off your token bill

Review the math: Coding directly results in multiple compiler error loops and massive context snowballing. Planning first adds a few text-only prompts up front and aims for a clean, one-shot code delivery.

Summary

Architect vs. Mechanic

Use the AI as an architect to design blueprints, not a mechanic trying random fixes.

📋 Rule: Blueprints First

Always agree on a text plan before letting the agent modify code.

💡 Cheap Edits

Fixing logical errors in text costs a single sentence. Fixing them in code costs thousands of tokens.

Conclude the planning section: shift the team's philosophy to validate designs in text first. Blueprints are cheap to fix; code is expensive.

Tool Showcase: Caveman Skill

The Caveman Skill

"Why use many token when few do trick."
A plugin that cuts ~75% of output tokens with 100% technical accuracy.

Introduce the Caveman skill: a developer plugin (available for Claude Code, Cursor, Cline, and other coding agents) that forces the model to drop conversational filler and write concise, high-density technical responses.

Brevity Science

Less Word = More Correct

A March 2026 paper ("Brevity Constraints...") found that forcing LLMs to be brief improved accuracy by up to 26 points on benchmarks.

⚡ Speed Increase

Fewer output tokens mean the model finishes responding sooner, speeding up feedback.

🎯 100% Technical Accuracy

Logic and code remain the same. The model strips conversational fluff and replies in short fragments.

Explain the cognitive benefit of brevity for LLMs. Verbose is not always better: fewer tokens focus the model's output on the essential arguments, avoiding distractions and delays.

Brevity Comparison

Output Comparison: React Re-render

🗣️ Normal Claude

"The reason your React component is re-rendering is likely because you're creating a new object reference on each render cycle. When you pass an inline object as a prop, React's shallow comparison sees it as a different object every time, which triggers a re-render. I'd recommend using useMemo to memoize the object."

69 Tokens

⛏️ Caveman Claude

"New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."

19 Tokens

Same technical solution. 72% fewer tokens. Faster response.

Saved: 50 output tokens (72% reduction)

Show the before/after: Normal Claude explains in full conversational paragraphs. Caveman Claude gives the exact same advice in brief, actionable technical fragments, saving 72% of output tokens.

Performance Metrics

Claude API Benchmark Savings

Real output token counts from the Claude API show an average of 65% output reduction.

65%

Average Token Reduction

~3x

Output Generation Speedup

Benchmark Task	Normal	Caveman	Saved
Explain React re-render bug	1180	159	-87%
Fix auth middleware expiry	704	121	-83%
Set up PostgreSQL pool	2347	380	-84%
Explain git rebase vs merge	702	292	-58%
Docker multi-stage build	1042	290	-72%

Present the benchmark figures: Caveman demonstrates an average of 65% output reduction across standard prompts (ranging from 22% on refactorings to 87% on explanations). The table shows a selection of those tasks.
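The "Saved" column can be recomputed from the token counts in the table. The five listed tasks average about 77%, above the 65% overall figure, presumably because lower-saving tasks such as refactorings are not listed.

```python
# (normal, caveman) output token counts from the benchmark table.
benchmarks = {
    "Explain React re-render bug": (1180, 159),
    "Fix auth middleware expiry": (704, 121),
    "Set up PostgreSQL pool": (2347, 380),
    "Explain git rebase vs merge": (702, 292),
    "Docker multi-stage build": (1042, 290),
}


def percent_saved(normal: int, caveman: int) -> float:
    """Share of output tokens removed, as a percentage."""
    return 100 * (normal - caveman) / normal


savings = {task: percent_saved(n, c) for task, (n, c) in benchmarks.items()}
for task, saved in savings.items():
    print(f"{task}: -{saved:.0f}%")

listed_average = sum(savings.values()) / len(savings)
print(f"average of listed tasks: {listed_average:.1f}%")
```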

Skill Trigger

Triggering Caveman in Chat

Activate the compression skill in your active session using slash commands.

agent chat session

dev: /caveman full, why i have so many rerenders
agent: "New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."

Explain how the skill is activated. The developer types /caveman full to change settings. When they follow up with "why i have so many rerenders", the agent outputs in caveman fragment mode.

Tool Showcase: RTK

Rust Token Killer

A CLI proxy that intercepts and compresses verbose terminal logs for AI agents.

Introducing RTK • v0.42.3

Introduce RTK (Rust Token Killer) as an open-source development tool built to resolve the massive inefficiencies of AI-driven software engineering.
        
Problem & Solution

Why do we need RTK?

Agents waste a significant portion of their context window on terminal noise (like test logs or file trees). RTK filters this out.

⏳ Longer Sessions

Stops context windows from overflowing, making coding sessions last longer before needing a restart.

💰 Major Cost Reductions

Minimizes LLM API expenses by preventing redundant, bloated output from filling up your token history.

🎯 Better AI Reasoning

Shows only actionable data, improving reasoning and code fix accuracy.

Talk about the core benefits of RTK: longer session limits, massive cost reductions, and cleaner logs that help the AI think more clearly and make fewer mistakes.
        
Hook Mechanics

Frictionless Auto-Rewrite

RTK runs silently in the background, intercepting commands using hooks, hidden from the agent.

🔌 Pre-Use Interception

Hooks catch command triggers and prefix them with rtk automatically.

Wrap up the execution loop by explaining the invisible proxy hook that makes RTK frictionless. The AI agent doesn't even know it's being optimized.
    
Hook Mechanics

Invisible Command Interception

Minimalist Setup

No agent modifications, no API rewrites, and no SDK wrappers. Just append a single line to your shell profile to start optimizing every command automatically.

~/.zshrc

# Enable RTK shell hook integration
eval "$(rtk hook init)"

⚡ Near-zero overhead: The hook is registered at shell startup. It intercepts executions locally in microseconds before the LLM call starts.

1. Command Triggered

AI agent executes standard command in workspace shell:
pytest -v

2. Hook Interception & Rewrite

Shell hook intercepts the command and prefixes it on-the-fly:
rtk pytest -v

3. Noise-Filtered Execution

Command runs. Outputs (like passing tests) are compressed before returning to the AI agent.

Explain that a hook is simply a shell function or alias loader. By putting this one line in ~/.zshrc or ~/.bashrc, the user doesn't need to change how the agent is configured. Every tool call or manual execution is automatically routed through RTK, ensuring zero friction.
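The rewrite step can be sketched as a small function. This is not RTK's implementation: the `SUPPORTED` set and the `rewrite` helper are hypothetical and only illustrate the prefixing rule described above.

```python
import shlex

# Hypothetical list; the commands RTK actually handles may differ.
SUPPORTED = {"pytest", "git", "cargo"}


def rewrite(command: str) -> str:
    """Prefix a supported command with `rtk`, leaving everything else untouched."""
    words = shlex.split(command)
    if not words or words[0] == "rtk" or words[0] not in SUPPORTED:
        return command
    return "rtk " + command


print(rewrite("pytest -v"))      # supported: gets the prefix
print(rewrite("rtk pytest -v"))  # already wrapped: unchanged
print(rewrite("ls -la"))         # unsupported: unchanged
```

Because the rewrite happens between the agent and the shell, the agent keeps issuing the commands it already knows.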
    






    
RTK in Action: Files

JS File Read Signature Compression

Raw auth.js Code Read

import { db } from './db.js';

export async function login(user, pass) {
    const u = await db.getUser(user);
    if (!u || u.password !== pass) {
        throw new Error('Invalid login');
    }
    return generateToken(u);
}

export function verify(token) {
    if (!token) return false;
    try {
        const payload = jwt.verify(token);
        return payload.expiry > Date.now();
    } catch {
        return false;
    }
}

rtk read auth.js Output

// auth.js (Structural outline)
import { db } from './db.js';

export async function login(user, pass) { /* body stripped */ }
export function verify(token) { /* body stripped */ }

When exploring structure, RTK strips function bodies and sends signatures only.

Raw: 4,200 ➔ RTK: 210 tokens. Saved: 3,990 tokens (95%)

Explain that if the agent only needs to locate function signatures and export contracts to understand how modules link, rtk read strips implementation details, saving 95% on tokens.
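The same idea can be sketched for Python source with the standard `ast` module. This is an analogy, not how RTK processes JavaScript: `outline` and the sample module are hypothetical, and `ast.unparse` needs Python 3.9 or later.

```python
import ast


def outline(source: str) -> str:
    """Keep imports and top-level function signatures; replace bodies with `...`."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            lines.append(f"{prefix} {node.name}({ast.unparse(node.args)}): ...")
    return "\n".join(lines)


SOURCE = '''
import hashlib

def login(user, password):
    digest = hashlib.sha256(password.encode()).hexdigest()
    return digest == user["password_hash"]

def verify(token):
    if not token:
        return False
    return token.get("expiry", 0) > 0
'''

print(outline(SOURCE))
```

The outline keeps what another module needs in order to call these functions and drops what it does not.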
    






    
RTK in Action: Git

JS Git Diff Compression

Raw git diff Output

diff --git a/src/auth.js b/src/auth.js
index 45adcf2..2ba9c4b 100644
--- a/src/auth.js
+++ b/src/auth.js
@@ -12,4 +12,4 @@
 export function verify(token) {
-    const payload = jwt.verify(token);
+    const payload = jwt.verify(token, process.env.SECRET);
     return payload.expiry > Date.now();
 }

rtk git diff Output

diff --git a/src/auth.js b/src/auth.js
-    const payload = jwt.verify(token);
+    const payload = jwt.verify(token, process.env.SECRET);

[x412 lines of unmodified code collapsed]

RTK hides unchanged code blocks, git index info, and metadata headers.

Raw: 8,400 ➔ RTK: 504 tokens. Saved: 7,896 tokens (94%)

Explain diff compression: By stripping git index lines and metadata headers and collapsing large blocks of unchanged code, RTK isolates the exact modified lines, saving 94% on diff payloads.
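A minimal sketch of this kind of filter, assuming unified diff input. `compress_diff` is hypothetical and its output format is simplified; it is not RTK's code.

```python
def compress_diff(diff_text: str) -> str:
    """Keep file headers and changed lines; count unchanged lines as collapsed."""
    kept = []
    collapsed = 0
    in_header = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            kept.append(line)
            in_header = True
        elif line.startswith("@@"):
            in_header = False  # hunk header: file metadata is over
        elif in_header:
            continue           # index, --- and +++ lines
        elif line.startswith(("+", "-")):
            kept.append(line)  # an added or removed line
        else:
            collapsed += 1     # unchanged context line
    kept.append(f"[{collapsed} unchanged lines collapsed]")
    return "\n".join(kept)


sample = "\n".join([
    "diff --git a/src/auth.js b/src/auth.js",
    "index 45adcf2..2ba9c4b 100644",
    "--- a/src/auth.js",
    "+++ b/src/auth.js",
    "@@ -12,4 +12,4 @@",
    " export function verify(token) {",
    "-    const payload = jwt.verify(token);",
    "+    const payload = jwt.verify(token, process.env.SECRET);",
    "     return payload.expiry > Date.now();",
    " }",
])

print(compress_diff(sample))
```

Tracking the header state matters: the `---` and `+++` file lines would otherwise be mistaken for removed and added lines.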
    






    
Performance & Benchmarks

RTK Token Savings Statistics

Benchmarked across 2,900+ real-world developer commands with an average of 89% noise reduction.

89%

Average Token Savings

59.3k

GitHub Stars (MIT License)

Command	Token Reduction
pytest -v (Python)	-96.0%
rtk read (File Reading)	-95.0%
git diff	-94.0%
cargo test (Rust)	-91.8%
git log	-86.0%

Talk through the numbers: Point out that testing outputs are compressed by over 90% (e.g. pytest at 96% and cargo test at 91.8%) by stripping passing counts and retaining only failure logs. Highlight git diff (-94%) and the enormous community support (59.3k stars).
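The test-log case can be sketched the same way. `filter_pytest_output` is a hypothetical helper that assumes `pytest -v` style lines ending in PASSED or FAILED; RTK's actual filter is likely more involved.

```python
def filter_pytest_output(output: str) -> str:
    """Drop lines for passing tests and replace them with a single count."""
    kept = []
    passed = 0
    for line in output.splitlines():
        if " PASSED" in line:
            passed += 1
        else:
            kept.append(line)  # failures, errors, and everything else
    kept.append(f"[{passed} passing tests hidden]")
    return "\n".join(kept)


raw = "\n".join([
    "tests/test_auth.py::test_login PASSED",
    "tests/test_auth.py::test_logout PASSED",
    "tests/test_auth.py::test_verify FAILED",
])

print(filter_pytest_output(raw))
```

On a healthy suite almost every line is a pass, which is why test runners show the largest reductions in the table.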
    






    
Summary

Conclusion: Token-Optimized Coding

⚡ Prefer Skills + CLI over MCPs

Use lightweight, on-demand shell actions instead of persistent, heavy MCP servers. This avoids the constant prompt schema overhead.

🗡️ RTK (Input Compression)

Filter out verbose unit test logs, file listings, and repository metadata to drastically reduce your prompt's input tokens.

⛏️ Caveman (Output Compression)

Force the model to strip conversational filler and answer in concise fragments, saving ~75% on response tokens.

📋 Planning First

Validate technical blueprints in text to isolate module context, prevent feedback loops, and accelerate development speed.

Summarize the key strategies for the team:
1. Replace heavy MCP servers with lightweight Workspace Skills.
2. Run RTK to compress input logs by 80-90%.
3. Activate Caveman Mode to slash output tokens by 75%.
4. Always write a blueprint plan first to keep agent context clean and isolated.
