How tokens are calculated.
Saving tokens in a coding agent.
Exploring token-saving developer tools.
How these tools operate determines how tokens are consumed. The structural difference directly impacts overall cost.
Standard chat (linear and predictable): One-shot request and response. The model pauses and stays idle after responding. Costs grow with manual inputs and outputs.
Agent mode (autonomous loops): Reason-Act-Observe cycle. The agent runs in the background to execute plans, check logs, and run code, rebuilding its context on every iteration.
In standard chat, you interact in a single exchange. You pay once for the context you send and the direct response.
Total = Context + Prompt + Completion
Why does Agent Mode consume so many tokens? Because prompts are rebuilt in the background.
Every loop iteration forces the model to reread the entire context: the task, previous steps, and tool logs.
The agent can read up to 5 additional files without your knowledge to locate relevant information.
Classic chat is a single exchange. Agent runs charge you for every loop iteration, which contains the sum of all previous prompts, files, and tool logs.
Total = Sum over iterations i of (Context + History_i + Log_i)
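To make the two formulas concrete, here is a minimal sketch that counts billed tokens for one chat exchange and billed input tokens for an agent loop. Every number is an illustrative assumption, and `chatTokens` and `agentInputTokens` are hypothetical helpers, not part of any tool mentioned here. The sketch also assumes each iteration adds a constant amount of history, which real runs will not do exactly.

```javascript
// Billed tokens for a single chat exchange: Context + Prompt + Completion.
function chatTokens(context, prompt, completion) {
  return context + prompt + completion;
}

// Billed input tokens for an agent run. Each iteration resends the base
// context plus everything accumulated so far (history and tool logs).
// perStepGrowth is an assumed, constant number of tokens added per iteration.
function agentInputTokens(context, perStepGrowth, iterations) {
  let total = 0;
  let history = 0;
  for (let i = 0; i < iterations; i++) {
    total += context + history; // the whole prompt is reread every turn
    history += perStepGrowth;   // new logs and outputs join the history
  }
  return total;
}

console.log(chatTokens(8000, 200, 800));       // 9000
console.log(agentInputTokens(8000, 2000, 10)); // 170000
```

With these assumed numbers, ten loop iterations bill roughly nineteen times the tokens of the single exchange, even though the base context is the same.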
| Provider | Model | Input / 1M | Cached Input / 1M | Output / 1M |
|---|---|---|---|---|
| Anthropic | Claude Haiku 4.5 | $1.00 | $0.10 | $5.00 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 |
| Anthropic | Claude Opus 4.8 | $5.00 | $0.50 | $25.00 |
| OpenAI | GPT-5.5 (Frontier) | $5.00 | $0.50 | $30.00 |
| OpenAI | GPT-5.4 (Standard) | $2.50 | $0.25 | $15.00 |
| OpenAI | GPT-5.4 Mini | $0.75 | $0.075 | $4.50 |
| OpenAI | GPT-5.3 Codex | $2.50 | $0.25 | $15.00 |
| Google | Gemini 3.5 Flash | $1.50 | $0.15 | $9.00 |
For an autonomous coding agent executing a feedback loop against a 100,000-token context (codebase map, open files, logs) using Claude Sonnet 4.6:
Without caching: Every single turn bills the full 100k input context from scratch.
With caching: The 100k prefix is cached. You only pay full price for the tiny delta.
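The difference can be worked out from the pricing table. The sketch below uses the Claude Sonnet 4.6 input and cached-input prices listed there; the 1,000-token delta is an assumed figure, `turnInputCost` is a hypothetical helper, and any cache-write surcharge or output cost is deliberately left out.

```javascript
// Prices per 1M tokens, taken from the pricing table (Claude Sonnet 4.6).
const INPUT_PER_M = 3.0;
const CACHED_INPUT_PER_M = 0.3;

// Input cost of one turn in dollars. cachedTokens are billed at the cached
// rate, the remaining (new) tokens at the full input rate.
// Simplification: cache-write surcharges and output tokens are ignored.
function turnInputCost(totalTokens, cachedTokens) {
  const freshTokens = totalTokens - cachedTokens;
  return (freshTokens * INPUT_PER_M + cachedTokens * CACHED_INPUT_PER_M) / 1e6;
}

const uncached = turnInputCost(100000, 0);    // full 100k at $3.00 / 1M
const cached = turnInputCost(101000, 100000); // 100k cached + assumed 1k delta

console.log(uncached.toFixed(3)); // 0.300
console.log(cached.toFixed(3));   // 0.033
```

Under these assumptions a cached turn costs about a ninth of an uncached one on the input side.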
Caches persist using a sliding 5-to-10 minute TTL window:
Caches match from the start. Put static system prompts and files first; put dynamic inputs (queries, timestamps) last.
Caching only triggers above a minimum prompt length (e.g., 1,024 tokens for Anthropic).
Keep tool output keys, logs, and files in a fixed order. Shuffling order breaks the cache.
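One way to follow the ordering rules above is to serialize tool output with sorted keys and to assemble the prompt with static parts first. This is a sketch under assumptions: `stableStringify` and `buildPrompt` are hypothetical helpers, and the input is assumed to be JSON-compatible (no `undefined`, functions, or cycles).

```javascript
// Serialize a value with object keys in sorted order, so the same data
// always yields the same string regardless of insertion order.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort();
    const pairs = keys.map((k) => JSON.stringify(k) + ":" + stableStringify(value[k]));
    return "{" + pairs.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Static parts first (the cacheable prefix), dynamic parts last.
function buildPrompt(systemPrompt, files, toolOutput, query) {
  return [systemPrompt, ...files, stableStringify(toolOutput), query].join("\n");
}

const a = stableStringify({ status: "ok", id: 7 });
const b = stableStringify({ id: 7, status: "ok" });
console.log(a === b); // true
console.log(a);       // {"id":7,"status":"ok"}
```

Because both objects serialize to the same string, a shuffled key order no longer changes the prompt prefix.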
Every request carries a fixed system payload of ~10,000 tokens (system rules) that cannot be skipped. The baseline overhead scales based on the active mode:
Base instructions, UI state, and workspace indexes loaded automatically.
Ask Mode baseline, tool definitions, and loop instructions.
Adding Model Context Protocol (MCP) servers and IDE extensions injects extra tokens into every execution turn.
Base system prompt for planning and boundaries.
Token schemas added by custom MCP tools and active extensions.
Add agent skills using lightweight JavaScript functions. They run only when needed, avoiding the continuous token cost of persistent MCP servers.
$ /gh-answer-pr-comments 2137
✓ Step 1: Reading PR Comments
Found comment on src/auth.js #L34: "Can we optimize this search to O(1)?"
✓ Step 2: Proposing Code Improvements
Replacing Array.find() with Map lookup.
Diff proposed: + const tokenMap = new Map(tokens.map(t => [t.id, t]));
✓ Step 3: Replying to Comment
Posted: "Refactored the linear search to a Map lookup for O(1) performance in commit a3b8f1."
✓ Analysis complete. All comments resolved.
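The refactor proposed in that transcript can be sketched as follows. The token records and their `{ id, user }` shape are assumptions made for illustration; only the `new Map(tokens.map(t => [t.id, t]))` line comes from the transcript.

```javascript
// Hypothetical token records; the shape ({ id, user }) is assumed.
const tokens = [
  { id: "t1", user: "ana" },
  { id: "t2", user: "ben" },
  { id: "t3", user: "cy" },
];

// Before: linear scan, O(n) per lookup.
function findTokenLinear(id) {
  return tokens.find((t) => t.id === id);
}

// After: build the index once, then each lookup is O(1) on average.
const tokenMap = new Map(tokens.map((t) => [t.id, t]));
function findTokenIndexed(id) {
  return tokenMap.get(id);
}

console.log(findTokenLinear("t2").user);  // ben
console.log(findTokenIndexed("t2").user); // ben
console.log(findTokenIndexed("missing")); // undefined
```

Building the Map is itself O(n), so the change pays off when the same list is searched repeatedly.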
Why text planning is cheaper than coding on the fly, saving API budget and waiting time.
Asking an agent to write code immediately without a plan is an expensive mistake, triggering long feedback loops.
Two key pitfalls:
Feedback Death Loop: The AI writes code → the compiler throws errors → the AI attempts blind fixes → adjacent modules break → the cycle repeats.
Context Snowball Effect: Every iteration appends all previous prompts, files, and build logs to the history, so each turn costs more than the last and the cumulative cost grows roughly quadratically with the number of turns.
The agent attempts to fix compiler errors blindly. Without a high-level plan, it edits code, triggers new errors, and ends up in an endless, expensive loop.
As the feedback loop runs, every compile error, terminal log, and retry code is appended to the prompt history. The input context grows larger with each turn.
Paying once for a text blueprint is cheaper than paying multiple times for code fixes.
Drafting a plan can still be expensive if the model reads the entire repository at once.
By using file signatures and summary indexes, we keep planning costs low.
A plan lets you divide the task so the agent only reads files relevant to the current step.
Agent searches the codebase aimlessly, reading unrelated files and wasting tokens.
The plan splits the task. In step 1, the agent reads only one file, avoiding repository noise.
Use the AI as an architect to design blueprints, not a mechanic trying random fixes.
Always agree on a text plan before letting the agent modify code.
Fixing logical errors in text costs a single sentence. Fixing them in code costs thousands of tokens.
"Why use many token when few do trick."
A plugin that cuts ~75% of output tokens while keeping the technical content fully intact.
A March 2026 paper ("Brevity Constraints...") found that forcing LLMs to be brief improved accuracy by up to 26 points on benchmarks.
Fewer output tokens mean the model finishes responding sooner, speeding up feedback.
Logic and code remain the same; the model strips conversational fluff and replies in short fragments.
"The reason your React component is re-rendering is likely because you're creating a new object reference on each render cycle. When you pass an inline object as a prop, React's shallow comparison sees it as a different object every time, which triggers a re-render. I'd recommend using useMemo to memoize the object."
"New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."
Real output token counts from the Claude API show an average of 65% output reduction.
Average Token Reduction
Output Generation Speedup
| Benchmark Task | Normal | Caveman | Saved |
|---|---|---|---|
| Explain React re-render bug | 1180 | 159 | -87% |
| Fix auth middleware expiry | 704 | 121 | -83% |
| Set up PostgreSQL pool | 2347 | 380 | -84% |
| Explain git rebase vs merge | 702 | 292 | -58% |
| Docker multi-stage build | 1042 | 290 | -72% |
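The "Saved" column follows directly from the two token counts in each row. A small sketch that recomputes it from the table values (`reductionPercent` is a hypothetical helper name):

```javascript
// Token counts copied from the benchmark table: [task, normal, caveman].
const rows = [
  ["Explain React re-render bug", 1180, 159],
  ["Fix auth middleware expiry", 704, 121],
  ["Set up PostgreSQL pool", 2347, 380],
  ["Explain git rebase vs merge", 702, 292],
  ["Docker multi-stage build", 1042, 290],
];

// Percentage of output tokens saved, rounded to the nearest whole percent.
function reductionPercent(normal, compressed) {
  return Math.round((1 - compressed / normal) * 100);
}

for (const [task, normal, caveman] of rows) {
  console.log(`${task}: -${reductionPercent(normal, caveman)}%`);
}
```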
Activate the compression skill in your active session using slash commands.
dev: /caveman full, why i have so many rerenders
agent: "New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."
A CLI proxy that intercepts and compresses verbose terminal logs for AI agents.
Agents waste a significant portion of their context window on terminal noise (like test logs or file trees). RTK filters this out.
Stops context windows from overflowing, making coding sessions last longer before needing a restart.
Minimizes LLM API expenses by preventing redundant, bloated output from filling up your token history.
Shows only actionable data, improving reasoning and code fix accuracy.
RTK runs silently in the background, intercepting commands using hooks, hidden from the agent.
Hooks catch command triggers and prefix them with rtk automatically.
No agent modifications, no API rewrites, and no SDK wrappers. Just append a single line to your shell profile to start optimizing every command automatically.
# Enable RTK shell hook integration
eval "$(rtk hook init)"
AI agent executes standard command in workspace shell:
pytest -v
Shell hook intercepts the command and prefixes it on-the-fly:
rtk pytest -v
Command runs. Outputs (like passing tests) are compressed before returning to the AI agent.
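To illustrate the kind of compression described here, the sketch below collapses passing test lines and keeps failures and the summary. It is not RTK's implementation: `compressTestLog` is a hypothetical function, and the log format is a simplified assumption modeled on verbose test output.

```javascript
// Drop lines that report a passing test, keep everything else, and
// append one line saying how many passing lines were collapsed.
function compressTestLog(log) {
  const kept = [];
  let passed = 0;
  for (const line of log.split("\n")) {
    if (/\bPASSED\b/.test(line)) {
      passed++; // passing tests carry no actionable information
    } else if (line.trim() !== "") {
      kept.push(line);
    }
  }
  if (passed > 0) kept.push(`[${passed} passing tests collapsed]`);
  return kept.join("\n");
}

// Assumed sample log, for illustration only.
const sampleLog = [
  "tests/test_auth.py::test_login PASSED",
  "tests/test_auth.py::test_logout PASSED",
  "tests/test_auth.py::test_verify FAILED",
  "1 failed, 2 passed in 0.12s",
].join("\n");

console.log(compressTestLog(sampleLog));
```

The agent still sees the failing test and the summary line, which is the actionable part of the output.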
import { db } from './db.js';
export async function login(user, pass) {
  const u = await db.getUser(user);
  if (!u || u.password !== pass) {
    throw new Error('Invalid login');
  }
  return generateToken(u);
}
export function verify(token) {
  if (!token) return false;
  try {
    const payload = jwt.verify(token);
    return payload.expiry > Date.now();
  } catch {
    return false;
  }
}
// auth.js (Structural outline)
import { db } from './db.js';
export async function login(user, pass) { /* body stripped */ }
export function verify(token) { /* body stripped */ }
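A structural outline like the one above can be approximated by tracking brace depth and keeping only top-level lines. This is a rough sketch, not how RTK works internally: `outline` is a hypothetical function, and it assumes braces never appear inside strings, comments, or template literals.

```javascript
// Keep top-level lines; replace each top-level block body with a marker.
function outline(source) {
  const out = [];
  let depth = 0;
  for (const line of source.split("\n")) {
    const opens = (line.match(/{/g) || []).length;
    const closes = (line.match(/}/g) || []).length;
    if (depth === 0) {
      if (opens > closes) {
        // A top-level block starts here: keep the signature, drop the body.
        out.push(line.trimEnd() + " /* body stripped */ }");
      } else {
        out.push(line);
      }
    }
    depth += opens - closes; // lines at depth > 0 are body lines and are skipped
  }
  return out.join("\n");
}

const authSource = [
  "import { db } from './db.js';",
  "export async function login(user, pass) {",
  "  const u = await db.getUser(user);",
  "  if (!u || u.password !== pass) {",
  "    throw new Error('Invalid login');",
  "  }",
  "  return generateToken(u);",
  "}",
].join("\n");

console.log(outline(authSource));
```

A production tool would more likely use a real parser, since brace counting breaks on braces inside string literals.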
diff --git a/src/auth.js b/src/auth.js
index 45adcf2..2ba9c4b 100644
--- a/src/auth.js
+++ b/src/auth.js
@@ -12,5 +12,5 @@
 export function verify(token) {
-  const payload = jwt.verify(token);
+  const payload = jwt.verify(token, process.env.SECRET);
   return payload.expiry > Date.now();
 }
diff --git a/src/auth.js b/src/auth.js
- const payload = jwt.verify(token);
+ const payload = jwt.verify(token, process.env.SECRET);
[x412 lines of unmodified code collapsed]
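The collapsed diff above can be approximated by keeping only the file header and the changed lines, then counting the context lines that were dropped. Again a sketch under assumptions: `collapseDiff` is a hypothetical function, not RTK's code, and it treats any line starting with `--- ` or `+++ ` as a file marker, which is a heuristic.

```javascript
// Keep "diff --git" headers and added/removed lines; drop hunk metadata
// silently and replace unchanged context lines with a single counter line.
function collapseDiff(diff) {
  const out = [];
  let context = 0;
  for (const line of diff.split("\n")) {
    if (line.startsWith("diff --git")) {
      out.push(line);
    } else if (/^(index |--- |\+\+\+ |@@)/.test(line)) {
      continue; // metadata: index line, file markers, hunk header
    } else if (line.startsWith("+") || line.startsWith("-")) {
      out.push(line); // actual change
    } else {
      context++; // unchanged context line
    }
  }
  if (context > 0) out.push(`[x${context} lines of unmodified code collapsed]`);
  return out.join("\n");
}

const sampleDiff = [
  "diff --git a/src/auth.js b/src/auth.js",
  "index 45adcf2..2ba9c4b 100644",
  "--- a/src/auth.js",
  "+++ b/src/auth.js",
  "@@ -12,5 +12,5 @@",
  " export function verify(token) {",
  "-  const payload = jwt.verify(token);",
  "+  const payload = jwt.verify(token, process.env.SECRET);",
  "   return payload.expiry > Date.now();",
  " }",
].join("\n");

console.log(collapseDiff(sampleDiff));
```

On this small sample only three context lines are collapsed; the savings grow with the amount of unchanged code a diff carries.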
Benchmarked across 2,900+ real-world developer commands with an average of 89% noise reduction.
Average Token Savings
GitHub Stars (MIT License)
| Command | Token Reduction |
|---|---|
| pytest -v (Python) | -96.0% |
| rtk read (File Reading) | -95.0% |
| git diff | -94.0% |
| cargo test (Rust) | -91.8% |
| git log | -86.0% |
Use lightweight, on-demand shell actions instead of persistent, heavy MCP servers. Bypasses constant prompt schema overhead.
Filter out verbose unit test logs, file listings, and repository metadata to drastically reduce your prompt's input tokens.
Force the model to strip conversational filler and answer in concise fragments, saving ~75% on response tokens.
Validate technical blueprints in text to isolate module context, prevent feedback loops, and accelerate development speed.