
Benchmarks

Board: NVIDIA Jetson Orin Nano Super (8GB, JetPack 6.2, R36.4.3)
CUDA: 12.6, Compute Capability 8.7
Storage: 57GB eMMC (root) + 938GB NVMe
Runtime: llama.cpp (CUDA build, fbd441c37)
Date: 2026-04-02
Configuration: Headless (no USB devices, no video, no cameras)


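| Model | Test | 15W (t/s) | 25W (t/s) | MAXN (t/s) | 15W→MAXN Gain |
| --- | --- | --- | --- | --- | --- |
| TinyLlama 1.1B (Q4_K_M, 636MB) | Prompt (pp512) | 1,026 | 1,398 | 1,506 | +47% |
| TinyLlama 1.1B (Q4_K_M, 636MB) | Generation (tg128) | 35.3 | 47.8 | 51.1 | +45% |
| Phi-3 3.8B (Q4, 2.23GB) | Prompt (pp512) | 350 | 492 | 537 | +53% |
| Phi-3 3.8B (Q4, 2.23GB) | Generation (tg128) | 12.2 | 16.0 | 16.9 | +39% |
| Llama 3.2 3B (Q4_K_M, 1.87GB) | Prompt (pp512) | 479 | 650 | 713 | +49% |
| Llama 3.2 3B (Q4_K_M, 1.87GB) | Generation (tg128) | 13.3 | 17.2 | 18.4 | +38% |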
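
The gain column can be reproduced from the 15W and MAXN columns. The sketch below assumes the gain is the simple ratio MAXN / 15W minus one, rounded to a whole percent; the source does not state the formula, but this reading matches every row. The names `results` and `gain_percent` are illustrative only.

```python
# Throughput figures (t/s) copied from the table above as (15W, MAXN) pairs.
results = {
    ("TinyLlama 1.1B", "pp512"): (1026, 1506),
    ("TinyLlama 1.1B", "tg128"): (35.3, 51.1),
    ("Phi-3 3.8B", "pp512"): (350, 537),
    ("Phi-3 3.8B", "tg128"): (12.2, 16.9),
    ("Llama 3.2 3B", "pp512"): (479, 713),
    ("Llama 3.2 3B", "tg128"): (13.3, 18.4),
}


def gain_percent(base, boosted):
    """Relative throughput gain of `boosted` over `base`, in percent."""
    return (boosted / base - 1.0) * 100.0


# Assumption: the published column is rounded to the nearest whole percent.
gains = {key: round(gain_percent(*pair)) for key, pair in results.items()}

for (model, test), gain in gains.items():
    print(f"{model:15s} {test}: +{gain}%")
```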

| Test | 15W Mode | 25W Mode | MAXN |
| --- | --- | --- | --- |
| Idle | 5.2W | 5.5W | 9.1W |
| CPU-only stress (6 cores) | 7.6W | 8.0W | 9.1W |
| TinyLlama inference | 10.3W avg / 11.1W peak | 12.7W avg / 13.6W peak | 14.0W avg / 15.0W peak |
| Phi-3 3.8B inference | 11.7W avg / 12.5W peak | 14.6W avg / 15.3W peak | 16.0W avg / 16.9W peak |
| Llama 3.2 3B inference | 11.2W avg / 11.9W peak | 13.4W avg / 14.4W peak | 15.0W avg / 16.0W peak |
| Combined CPU+GPU | 6.8W avg / 14.5W peak | 8.4W avg / 17.0W peak | 9.9W avg / 19.3W peak |

Power Draw — Headless Max with I/O (Run 2)

Individual Subsystem Power Contribution (MAXN)

| Subsystem | Avg Power | Peak Power | Notes |
| --- | --- | --- | --- |
| Idle baseline | 5.7W | 5.7W | MAXN, no load, post-soak |
| NVMe I/O only (fio seq+rand) | 7.0W | 7.6W | +1.3-1.9W over idle |
| Network only (iperf3 GbE flood) | 6.0W | 6.5W | +0.3-0.8W over idle |
| GPU inference only (Phi-3) | 16.0W | 16.9W | Dominates power budget |

Kitchen Sink — All Subsystems Simultaneously (CPU + GPU + NVMe + Network)

| Mode | Avg Power | Peak Power | Max Tj |
| --- | --- | --- | --- |
| 15W | 8.3W | 14.9W | 60.5C |
| 25W | 8.2W | 17.4W | 62.8C |
| MAXN | 8.5W | 20.0W | 64.5C |

Sustained Thermal Soak — 5 Minutes at MAXN (All Subsystems)

| Metric | Value |
| --- | --- |
| Peak power | 20.4W |
| Sustained power (under full load) | 17.2-20.2W |
| Tj at start | 51.2C |
| Tj peak (at ~4 min) | 71.0C |
| Tj stabilized | 67-68C |
| Tj after cooldown | 49.7C |
| GPU clock start | 1015 MHz |
| GPU clock end | 1013 MHz |
| Thermal throttling | None |
| Performance degradation | None |

Inference Stability Over 5-Minute Soak (8 consecutive runs)

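| Run | Phi-3 pp1024 (t/s) | Phi-3 tg256 (t/s) |
| --- | --- | --- |
| 1 | 473.92 | 16.67 |
| 2 | 473.31 | 16.69 |
| 3 | 474.01 | 16.51 |
| 4 | 474.48 | 16.95 |
| 5 | 476.90 | 16.96 |
| 6 | 475.50 | 16.98 |
| 7 | 477.03 | 16.98 |
| 8 | 476.61 | 16.97 |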

Throughput rose slightly over the soak (pp1024 from 473.92 to 476.61 t/s, tg256 from 16.67 to 16.97 t/s), likely as components warmed up. Zero degradation.
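
One way to put a number on run-to-run stability is the max-to-min spread relative to the mean. This is a minimal sketch using the eight soak runs listed above; the spread metric and the helper name `spread_percent` are choices made here for illustration, not part of the original test harness.

```python
from statistics import mean

# Per-run throughput (t/s) copied from the 8-run soak table.
pp1024 = [473.92, 473.31, 474.01, 474.48, 476.90, 475.50, 477.03, 476.61]
tg256 = [16.67, 16.69, 16.51, 16.95, 16.96, 16.98, 16.98, 16.97]


def spread_percent(samples):
    """Max-to-min spread as a percentage of the mean."""
    return (max(samples) - min(samples)) / mean(samples) * 100.0


for name, samples in (("pp1024", pp1024), ("tg256", tg256)):
    print(f"{name}: mean {mean(samples):.2f} t/s, "
          f"spread {spread_percent(samples):.2f}%")
```

With these figures the prompt-processing spread stays below 1% and the generation spread below 3%, and the last run is faster than the first in both columns.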


| Mode | Bogo ops/s | CPU Freq |
| --- | --- | --- |
| 15W | 1,052 | 1497 MHz |
| 25W | 944 | 1344 MHz |
| MAXN | 1,215 | 1728 MHz |

Note: 25W mode scored lower than 15W on CPU — 25W sets CPU to 1344 MHz vs 15W’s 1497 MHz. The 25W budget goes more to GPU clocks (914 MHz vs 611 MHz).


| Test | Throughput |
| --- | --- |
| Sequential write (1M blocks) | 915 MB/s |
| Sequential read (1M blocks) | 1,456 MB/s |
| Random 4K read (4 jobs) | 144 MB/s |
| Random 4K write (4 jobs) | 145 MB/s |

| Direction | Throughput |
| --- | --- |
| Upload (4 streams) | 457 Mbits/sec |
| Download (4 streams) | 522 Mbits/sec |

| Mode | GPU Clock |
| --- | --- |
| 15W | 611 MHz |
| 25W | 911-917 MHz |
| MAXN | 1011-1019 MHz |

Thermal Analysis — No Throttling Detected

| Observation | Detail |
| --- | --- |
| Max junction temp | 71.0C (MAXN 5-min soak, all subsystems) |
| Thermal throttle threshold | ~97C (Orin Nano) |
| Headroom | ~26C below throttle point |
| GPU clock stability | Rock solid across all tests, no frequency drops |
| Thermal stabilization | Peaks at ~70C after 4 min, settles to 67-68C |
| Cooldown | Returns to 50C within ~5 min after load stops |

| Test | Start Tj | End Tj | Max Tj | GPU Clock (start/end, MHz) |
| --- | --- | --- | --- | --- |
| 15W TinyLlama | 50.1C | 52.7C | 53.3C | 611 / 611 |
| 15W Phi-3 | 51.9C | 56.3C | 57.9C | 611 / 611 |
| 15W Llama 3.2 | 56.0C | 57.8C | 58.8C | 611 / 611 |
| 25W TinyLlama | 50.3C | 54.4C | 54.4C | 916 / 911 |
| 25W Phi-3 | 51.9C | 58.5C | 59.3C | 914 / 914 |
| 25W Llama 3.2 | 56.1C | 58.8C | 60.4C | 912 / 914 |
| MAXN TinyLlama | 51.5C | 56.2C | 56.2C | 1018 / 1011 |
| MAXN Phi-3 | 50.7C | 60.3C | 60.3C | 1012 / 1017 |
| MAXN Llama 3.2 | 56.1C | 60.1C | 61.9C | 1015 / 1017 |

PicoCluster Power Budget (Headless, Per Node)

| Scenario | Power Draw |
| --- | --- |
| Idle (MAXN) | 5.7W |
| Typical inference load | 15-17W |
| Absolute worst case (all subsystems, MAXN) | 20.4W |

| Cluster Size | Min PSU (at measured max) | Recommended (20% margin) |
| --- | --- | --- |
| 1 node | 21W | 25W |
| 2 nodes | 41W | 50W |
| 4 nodes | 82W | 100W |
| 5 nodes | 102W | 125W |
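
The PSU sizing table follows from the 20.4W measured maximum. The sketch below is one reading that reproduces every row: the minimum is the node count times 20.4W rounded up to a whole watt, and the recommendation is a per-node budget of 20.4W plus 20%, rounded up to 25W, times the node count. The rounding rules are inferred from the table rather than stated in the source, and the function names are illustrative. Integer milliwatts are used to avoid floating-point rounding surprises.

```python
MEASURED_MAX_MW = 20_400   # 20.4 W worst-case headless draw per node
MARGIN_PERCENT = 20        # design margin used in the table


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def min_psu_watts(nodes):
    """Smallest whole-watt supply covering the measured maximum."""
    return ceil_div(nodes * MEASURED_MAX_MW, 1000)


def recommended_psu_watts(nodes):
    """Per-node budget with margin (rounded up to a whole watt) times nodes."""
    per_node = ceil_div(MEASURED_MAX_MW * (100 + MARGIN_PERCENT), 100 * 1000)
    return nodes * per_node


for n in (1, 2, 4, 5):
    print(f"{n} node(s): min {min_psu_watts(n)}W, "
          f"recommended {recommended_psu_watts(n)}W")
```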

  1. True headless max power: 20.4W at MAXN, even with CPU, GPU, NVMe, and network all saturated simultaneously
  2. MAXN gives 38-53% more inference throughput than 15W for only about 4W more average draw
  3. The board never reaches its 25W MAXN envelope; without USB, video, or camera peripherals, the 8GB Orin Nano topped out at 20.4W
  4. Zero thermal throttling: Tj peaked at 71C under sustained full load, 26C below the throttle point
  5. Zero performance degradation: inference speed was rock steady across 8 consecutive runs during the 5-min soak
  6. Generation speed is memory-bandwidth limited: 12-18 tok/s on 3-4B models regardless of how hard you push the GPU
  7. NVMe adds ~2W and network adds <1W to the power budget; the GPU dominates
  8. Budget 25W per node (measured max plus 20% margin) for PSU sizing in a headless cluster

PDU Testing — 12V Passthrough, 1.5A Rated USB-A Ports

| Test | Avg Power | Peak Power | Peak Current | Status |
| --- | --- | --- | --- | --- |
| Idle (MAXN) | 6.75W | 6.75W | 0.56A | ✅ Safe |
| CPU stress (6 cores) | ~10.0W | 10.2W | 0.85A | ✅ Safe |
| TinyLlama 1.1B | 14.9W | 17.5W | 1.46A | ⚠️ Borderline |
| Llama 3.2 3B | 18.0W | 19.0W | 1.58A | ❌ Over 1.5A spec |
| Phi-3 3.8B | 19.3W | 20.2W | 1.68A | ❌ Over 1.5A spec |
| Kitchen sink (all subsystems) | 11.3W | 21.7W | 1.81A | ❌ 21% over spec |
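
The peak-current column is consistent with dividing peak power by the 12V rail voltage. The sketch below assumes the rail sits at exactly 12.0V (the measured rail voltage is not given in the source) and compares each result to the 1.5A port rating. The names are illustrative.

```python
RAIL_VOLTS = 12.0         # assumption: nominal 12 V passthrough
PORT_RATING_AMPS = 1.5    # rated limit per port

# Peak power (W) per test, copied from the table above.
peak_watts = {
    "Idle (MAXN)": 6.75,
    "CPU stress (6 cores)": 10.2,
    "TinyLlama 1.1B": 17.5,
    "Llama 3.2 3B": 19.0,
    "Phi-3 3.8B": 20.2,
    "Kitchen sink (all subsystems)": 21.7,
}


def peak_amps(watts, volts=RAIL_VOLTS):
    """Current drawn at the given power and rail voltage (I = P / V)."""
    return watts / volts


def percent_of_rating(amps, rating=PORT_RATING_AMPS):
    """Current expressed as a percentage of the port rating."""
    return amps / rating * 100.0


for test, watts in peak_watts.items():
    amps = peak_amps(watts)
    print(f"{test:30s} {amps:.2f}A  ({percent_of_rating(amps):.0f}% of rating)")
```

The kitchen-sink row works out to about 1.81A, roughly 121% of the rating, which matches the "21% over spec" label.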

5-Minute Sustained Max Soak (12V PDU, All Subsystems, MAXN)

| Metric | Value |
| --- | --- |
| Min power | 8.3W |
| Avg power | 20.0W |
| Peak power | 21.7W |
| Peak current | 1.81A at 12V |
| Tj start | 53.4C |
| Tj avg | 70.8C |
| Tj max | 73.0C |
| Tj end | 71.9C |
| Throttling | None |
| Brownout / crash | None |
| PDU port temp | Cool to touch |

Graduated Load Test — Port 3 (12V, MAXN)

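| Test | Avg Power | Peak Power | Peak Current | Status |
| --- | --- | --- | --- | --- |
| Idle (MAXN) | 6.76W | 6.76W | 0.57A | ✅ Safe |
| CPU stress (6 cores) | ~10.3W | 10.25W | 0.85A | ✅ Safe |
| TinyLlama 1.1B | | 17.5W | 1.46A | ⚠️ Borderline |
| Llama 3.2 3B | | 18.9W | 1.57A | ❌ Over 1.5A spec |
| Phi-3 3.8B | | 20.1W | 1.68A | ❌ Over 1.5A spec |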

5-Minute Sustained Max Soak — Port 3 (12V, All Subsystems, MAXN)

| Metric | Value |
| --- | --- |
| Peak power | 21.73W |
| Peak current | 1.81A at 12V |
| Max Tj | 73.3C |
| Throttling | None |
| Brownout / crash | None |
| PDU port temp | Cool to touch |

PDU Port Comparison — 12V, MAXN, 5-Minute Max Soak (All 3 Ports)

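| Metric | Port 1 | Port 2 | Port 3 |
| --- | --- | --- | --- |
| Peak power | 21.7W | 21.7W | 21.73W |
| Peak current | 1.81A | 1.81A | 1.81A |
| Max Tj | 73.0C | 73.2C | 73.3C |
| Throttling | None | None | None |
| Brownout / crash | None | None | None |
| Port temp | Cool | Cool | Cool |
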
  • Ports are rated 1.5A but held 1.81A sustained for 5 minutes without failure across all 3 ports
  • All PDU components remained cool — suggests conservative rating or low-resistance implementation
  • Inference performance was rock solid throughout the soak on all ports
  • Results are highly consistent port-to-port (<1% variance on peak power/current, <0.3C on Tj)
  • For dev/test clusters: Existing PDU appears viable — ports are clearly overbuilt vs. their 1.5A spec
  • For production clusters: Further validation recommended, direct wiring to PSU is safest
  • NVMe auto-mounts on boot via /etc/fstab with nofail

PicoCluster Claw Cluster Power Testing (In-Case, 2026-04-04)

Configuration: PicoCluster Claw acrylic case, 80x25mm case fan, PDU, 50W PSU
Nodes: clusterclaw (RPi5 8GB, Raspbian) + clustercrush (Orin Nano Super 8GB, MAXN)
Measurement: Kill-A-Watt at wall, tegrastats on Orin, vcgencmd PMIC on RPi5

| Test | Out-of-Case Tj Peak | In-Case Tj Peak | Improvement |
| --- | --- | --- | --- |
| CPU-only stress (5 min) | 64.5C | 54.4C | -10.1C |
| Kitchen sink + GPU inference (5 min) | 71.0C | 65.4C | -5.6C |

Case fan provides significant cooling benefit. No throttling in either configuration.

| State | Temperature |
| --- | --- |
| Idle | 31.2C |
| CPU + matrix + I/O stress (peak) | 67.5C |
| Post-cooldown | 33.4C |

| State | Kill-A-Watt | Notes |
| --- | --- | --- |
| Both idle | 13-14W | Baseline |
| Both under CPU/IO stress (no GPU inference) | 21-22W | RPi5 headroom-limited |
| Both max + GPU inference | ~35W (estimated) | Orin peaks at 21.6W SoC |

| Component | Idle | Typical (agent workload) | Peak (kitchen sink) |
| --- | --- | --- | --- |
| RPi5 (clusterclaw) | ~3W | ~4W | 7.5W |
| Orin Nano (clustercrush) | ~6W | ~10W | 20W |
| 12V fan | ~2W | ~2W | ~2W |
| PDU | ~1W | ~1W | ~1W |
| PSU efficiency loss (~15%) | ~2W | ~2.5W | ~4.5W |
| Total per pair | ~14W | ~19.5W | ~35W |
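
The totals row is a column-wise sum of the component estimates, and the peak total can be checked against the 50W PSU listed in the configuration. A minimal sketch, treating the approximate figures in the table as exact values:

```python
PSU_WATTS = 50.0  # PSU rating from the test configuration

# (idle, typical, peak) in watts, copied from the component table.
budget = {
    "RPi5 (clusterclaw)": (3.0, 4.0, 7.5),
    "Orin Nano (clustercrush)": (6.0, 10.0, 20.0),
    "12V fan": (2.0, 2.0, 2.0),
    "PDU": (1.0, 1.0, 1.0),
    "PSU efficiency loss": (2.0, 2.5, 4.5),
}

# zip(*...) transposes the rows so each tuple holds one column of the table.
idle_total, typical_total, peak_total = (sum(col) for col in zip(*budget.values()))
headroom = PSU_WATTS - peak_total

print(f"idle {idle_total}W, typical {typical_total}W, peak {peak_total}W")
print(f"PSU headroom at peak: {headroom}W")
```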

Orin Nano Tegrastats Summary (In-Case, Kitchen Sink + GPU Inference)

| Metric | Value |
| --- | --- |
| VDD_IN peak | 21.6W |
| VDD_IN sustained | 20.8-21.5W |
| VDD_CPU_GPU_CV peak | 11.7W |
| VDD_SOC peak | 4.5W |
| GPU utilization | 92-99% @ 1014-1017 MHz |
| CPU utilization | 94-100% @ 1728 MHz (6 cores) |
| Tj peak (in case) | 65.4C |
| Tj at start | 49.9C |
| Throttling | None |

Real-world OpenClaw agent loop is bursty (inference → wait for browser → observe → repeat):

| Phase | Duration | Orin Power |
| --- | --- | --- |
| Inference (token generation) | ~12s | ~16W |
| Idle (waiting for browser action) | ~25s | ~6W |
| Weighted cycle average | ~37s | ~9-10W |
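
The weighted cycle average is the total energy over one loop divided by the loop duration. A short sketch using the approximate phase figures from the table (the function name is illustrative):

```python
# (phase name, duration in seconds, power in watts) from the table above.
phases = [
    ("inference", 12.0, 16.0),
    ("idle", 25.0, 6.0),
]


def cycle_average_watts(phases):
    """Time-weighted average power: total energy (J) / total time (s)."""
    total_seconds = sum(duration for _, duration, _ in phases)
    energy_joules = sum(duration * watts for _, duration, watts in phases)
    return energy_joules / total_seconds


average = cycle_average_watts(phases)
print(f"Cycle average: {average:.2f}W")
```

With 12s at 16W and 25s at 6W this gives 342J over 37s, about 9.2W, in line with the ~9-10W row.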

Agent is idle (waiting for user commands) 90%+ of the time.

US average electricity: $0.16/kWh. PSU: 50W (15W headroom at peak).

| Scenario | Watts | kWh/month | kWh/year | $/month | $/year |
| --- | --- | --- | --- | --- | --- |
| Idle (waiting for tasks) | 14W | 10.1 | 122.6 | $1.61 | $19.62 |
| Active agent workload | 20W | 14.4 | 175.2 | $2.30 | $28.03 |
| Peak (all subsystems maxed) | 35W | 25.2 | 306.6 | $4.03 | $49.06 |
| Realistic blend (90% idle, 10% active) | 15W | 10.8 | 131.4 | $1.73 | $21.02 |

| Device | Typical Power | Annual Cost |
| --- | --- | --- |
| PicoCluster Claw cluster | 15W | ~$21/year |
| Desktop PC (always on) | ~80W | ~$112/year |
| Single GPU server (RTX 4090) | ~300W+ | ~$420+/year |
| 60W light bulb (always on) | 60W | ~$84/year |
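
The cost figures follow from constant draw, hours of operation, and the $0.16/kWh rate. The sketch below assumes a 30-day month and a 365-day year; those period lengths are not stated in the source, but they reproduce the table values. Function names are illustrative.

```python
RATE_USD_PER_KWH = 0.16      # US average quoted above
HOURS_PER_MONTH = 30 * 24    # assumption: 30-day month
HOURS_PER_YEAR = 365 * 24    # assumption: 365-day year


def energy_and_cost(watts):
    """Return (kWh/month, kWh/year, $/month, $/year) for a constant draw."""
    kwh_month = watts * HOURS_PER_MONTH / 1000
    kwh_year = watts * HOURS_PER_YEAR / 1000
    return (kwh_month, kwh_year,
            kwh_month * RATE_USD_PER_KWH, kwh_year * RATE_USD_PER_KWH)


def blended_watts(idle_w, active_w, idle_fraction):
    """Average draw when the system idles for `idle_fraction` of the time."""
    return idle_fraction * idle_w + (1.0 - idle_fraction) * active_w


blend = blended_watts(14, 20, 0.9)   # 14.6 W, shown as 15 W in the table
for label, watts in (("idle", 14), ("active", 20), ("peak", 35), ("blend", 15)):
    kwh_m, kwh_y, usd_m, usd_y = energy_and_cost(watts)
    print(f"{label:6s} {watts:3d}W  {kwh_m:5.1f} kWh/mo  {kwh_y:6.1f} kWh/yr  "
          f"${usd_m:.2f}/mo  ${usd_y:.2f}/yr")
```

The same function gives roughly $112/year for an 80W desktop and $84/year for a 60W bulb, matching the comparison table.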

Your own private AI agent for under $2/month.


  • Orin out-of-case benchmarks: /mnt/nvme/results/ on clustercrush
  • Orin in-case benchmarks: /home/picocluster/results-incase/ on clustercrush
  • RPi5 benchmarks: /home/picocluster/results/ on clusterclaw

Date: 2026-05-19
Hardware: Jetson Orin Nano Super 8GB (MAXN, Ollama CUDA)
Gateway: OpenClaw 2026.5.12 via LiteLLM proxy
Context: ~9,250 chars per request (optimized lean profile for local models)

The OpenClaw benchmark measures real agent capability across four tiers of increasing complexity — raw inference through multi-step tool chaining — using the live gateway, not a raw API.

| Tier | Tests | What it measures |
| --- | --- | --- |
| T1 | ping, math, JSON | Raw inference, instruction following |
| T2 | list files, read, write | Single tool call with verification |
| T3 | write→read chain, compute→write | Multi-step tool chaining |
| T4 | write→edit→verify, read→summarize→write | Harder reasoning + tool use |

Results (10 tasks, all tiers) — May 2026

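| Model | T1 | T2 | T3 | T4 | Score | Avg latency |
| --- | --- | --- | --- | --- | --- | --- |
| granite4.1:8b | 3/3 | 3/3 | 2/2 | 2/2 | 10/10 | 16,761 ms |
| nemotron-3-nano:4b | 3/3 | 3/3 | 2/2 | 1/2 | 9/10 | 38,895 ms |
| qwen3.5:4b | 2/3 | 3/3 | 2/2 | 1/2 | 8/10 | 45,824 ms |
| llama3.1:8b | 3/3 | 2/3 | 1/2 | 0/2 | 6/10 | 27,144 ms |
| llama3.2:3b | 3/3 | 1/3 | 0/2 | 0/2 | 5/10 | 8,957 ms |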

granite4.1:8b is the validated default: perfect score across all tiers. nemotron-3-nano:4b at 9/10 is the recommended fast alternative — 4B parameters with near-8B tool reliability.

Validated threshold: T2+ required for agent work. granite, nemotron, and qwen3.5 all qualify.


Script: bench-claw.py (30 tasks across cap and workflow suites)
Date: May 2026
SMEEP: Small Model Experimentation and Evaluation Project

30 tasks that test real-world agent behavior: natural-language exec, web research, scheduling, file operations, and multi-tool chains. Every task is something a real user would type.

| Suite | Tasks | What it tests |
| --- | --- | --- |
| cap | 14 | Structured capability: exec, web, scheduling, chains |
| workflow | 16 | Natural language: commands a real user would type |

v2 Results (with tool context injection) — Orin Nano, May 2026

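| Model | Cap (14) | Workflow (16) | Total | Avg task time |
| --- | --- | --- | --- | --- |
| qwen2.5:7b | | | 14/30 | 37s |
| granite4.1:8b | | | 12/30 | 68s |
| nemotron-3-nano:4b | | | 12/30 | 57s |
| llama3.1:8b | | | 11/30 | 30s |
| qwen3.5:4b | | | 9/30 | 63s |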

Cap/workflow breakdown pending full bench analysis. Run in progress for phi4-mini:3.8b.

Tool context injection triples granite’s score. Without a structured context block documenting available tools, granite4.1:8b scored 4/30. With it: 12/30. Granite needs explicit tool documentation; it won’t infer from context what’s available.

7B is a meaningful step. qwen2.5:7b at 14/30 leads all models — including 8B models. Natural-language command interpretation is where the size difference shows up: “How’s the memory looking?” maps correctly to cat /proc/meminfo without a hint.

Mac 14B comparison (early data): mac/qwen2.5:14b scored 20/30 on the same 30-task suite. 14B closes the gap on multi-tool chains and complex scheduling. Full results pending.

    # Clone the repo
    git clone https://github.com/picocluster/PicoCluster-Claw
    cd PicoCluster-Claw

    # Run on any Ollama host with OpenClaw running
    python3 scripts/bench-claw.py \
        --models granite4.1:8b,qwen2.5:7b,nemotron-3-nano:4b \
        --task-delay 2.0 \
        --output results/my-bench.json

Full methodology: docs/research/benchmarks/methodology.md

Context optimization that enabled these scores

Earlier runs with the default OpenClaw context showed granite scoring 5/10 and qwen scoring 3/10 — small models were overwhelmed before they could reason about the actual task.

Key changes:

| Change | Saved | Impact |
| --- | --- | --- |
| Replaced full workspace file templates with minimal stubs | ~12,000 chars | Eliminated boilerplate injection |
| Reduced tool schema via deny list | ~5,400 chars | Removed session, node, media tools |
| Removed dead remote-node tools (file_fetch, dir_list, dir_fetch, file_write) | ~2,580 chars | Always fail without a paired node; confused models |
| Fixed t2-list: exec ls instead of dir_list | | dir_list requires unconfigured remote node ACL |

Total context: ~11,800 → ~9,250 chars per request for local models

Cloud models added to OpenClaw (Anthropic Claude, OpenAI GPT, etc.) automatically receive the full unrestricted tool set via OpenClaw’s tools.byProvider policy. The local provider gets the lean profile; all other providers get browser automation, TTS, image/video/music generation, and all workspace tools. No config changes are needed when adding a cloud API key.