System Performance Architect · Google

I make computing faster and more efficient.

From open source to hyperscale silicon, and the teams that ship them.

Latest essay: The 80% Problem: The Last 20% Is Where the Engineer Used to Live

01 · About

Soldier → Scientist → System Architect

Performance, memory, and scalable computing at hyperscale.

Jonathan Beard turns architecture, memory, interconnect, power, and real workload behavior into data-driven platform decisions, and builds the teams that make those calls stick.

A hardware-architecture and systems-performance leader with 18+ years spanning the U.S. Army, academia, and hyperscale silicon. Currently System Performance Architect at Google, and the primary author of RaftLib.

Now: Google · 2022–present

Owns system-wide Perf/TCO for Google's internal SoC program (gSoC) across silicon generations and product lines.
Leads the New Platform Introduction Performance team assessing the forward-looking Google SoC roadmap, translating architecture specs into real-world workload impact, deployment risk, and fleet adoption decisions.
Builds multivariate CapEx/OpEx, perf-per-dollar, and perf-per-watt models, from cores to racks.
Drives ML-guided SoC tuning (core registers, memory-controller settings, mesh QoS) validated against internal customer workloads, including AI/ML inference.
Technical lead for Google-specific accelerator IP.
Leads orgs of up to 40 engineers.
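
As a rough illustration of what a perf-per-dollar and perf-per-watt comparison involves, the sketch below folds purchase cost (CapEx) and lifetime energy cost (OpEx) into one total-cost figure. It is a deliberately simplified, two-term model, not the multivariate models described above. The struct names, field names, and every input value a caller supplies are hypothetical.

```cpp
#include <cassert>

// Hypothetical per-server description; none of these fields come from a real platform.
struct Platform {
    double perf;       // sustained workload throughput, arbitrary units
    double capex_usd;  // purchase cost per server
    double avg_watts;  // average wall power under load
};

// Hypothetical datacenter cost assumptions.
struct CostModel {
    double usd_per_kwh;     // energy price
    double pue;             // power usage effectiveness (facility overhead multiplier)
    double lifetime_hours;  // deployment lifetime
};

// Lifetime energy cost: kW drawn, scaled by facility overhead, hours, and price.
double opex_usd(const Platform& p, const CostModel& m) {
    assert(p.avg_watts > 0.0 && m.pue >= 1.0 && m.lifetime_hours > 0.0);
    return p.avg_watts / 1000.0 * m.pue * m.lifetime_hours * m.usd_per_kwh;
}

// Total cost of ownership in this simplified model: CapEx plus energy OpEx only.
double tco_usd(const Platform& p, const CostModel& m) {
    return p.capex_usd + opex_usd(p, m);
}

double perf_per_watt(const Platform& p) {
    assert(p.avg_watts > 0.0);
    return p.perf / p.avg_watts;
}

double perf_per_tco_dollar(const Platform& p, const CostModel& m) {
    return p.perf / tco_usd(p, m);
}
```

Comparing two candidate platforms then comes down to calling `perf_per_tco_dollar` on each with the same `CostModel`. A realistic model would add many more terms (rack space, networking, cooling limits, utilization), which is where the multivariate part comes in.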

RaftLib · Open source · 2013–present

A C++ stream-processing DSL and parallel runtime I authored and still maintain. Wire compute kernels into a graph and RaftLib handles the queues, scheduling, and parallelism. It grew out of my PhD on online modeling and auto-tuning of parallel stream systems.
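
To make the kernels-and-queues idea concrete, here is a minimal sketch in plain standard C++. It does not use RaftLib's API: `BoundedQueue` and `run_pipeline` are hypothetical names invented for this illustration, and the queue sizes and thread placement that a stream runtime would manage are hard-coded by hand here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical bounded FIFO linking two kernels. A full queue blocks the
// producer (backpressure); an empty one blocks the consumer.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);  // a zero-capacity queue could never accept an item
    }

    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(value));
        not_empty_.notify_one();
    }

    // Signals that no more items will arrive.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Returns false once the queue is closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    std::queue<T> items_;
    bool closed_ = false;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Three kernels wired as a graph: source -> square -> sum, one thread each.
long run_pipeline(int count, std::size_t queue_capacity) {
    BoundedQueue<int> raw(queue_capacity);
    BoundedQueue<long> squared(queue_capacity);
    long total = 0;

    std::thread source([&] {
        for (int i = 1; i <= count; ++i) raw.push(i);
        raw.close();
    });
    std::thread square([&] {
        int v = 0;
        while (raw.pop(v)) squared.push(static_cast<long>(v) * v);
        squared.close();
    });
    std::thread sink([&] {
        long v = 0;
        while (squared.pop(v)) total += v;
    });

    source.join();
    square.join();
    sink.join();
    return total;
}
```

Calling `run_pipeline(10, 2)` sums the squares of 1 through 10 across three threads. Everything in `BoundedQueue` and the thread setup is the plumbing that the paragraph above says the runtime takes over, leaving only the three kernel bodies to write.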

~1,000 stars · 127 forks · ~46 daily clones · Hacker News front page ×2 · C++Now '16/'17 talks · Apache-2.0

Recognition: Wikipedia: CSP (named with Erlang & Go) · Wikipedia: RaftLib · C++ Reactive Programming (Packt, 2018) · Awesome C++ · Awesome Parallel Computing

Also in the wild: Raft language · ipc: SHM library · LeigNet sim framework · constexpr HighwayHash

raftlib.io → GitHub

Research & PhD thesis

Online Modeling and Tuning of Parallel Stream Processing Systems (2015): queueing theory, machine learning, and control theory to auto-tune dataflow systems.

Work spans IJHPCA, IPDPS, ICPP, PACT, ICS, Euro-Par, MEMSYS, and HPEC.

At Arm: mentored and funded academic teams at Barcelona Supercomputing Center, UT Austin, and Georgia Tech (12+ students, 5 professors), converting research into patents, papers, and architecture proposals.

Arm · 2015–2022

Senior Research Engineer → Principal System Architect. CHI memory-copy enhancements, the LS64 architecture + gather-hint instruction (early AArch64 8.7/9.2 accelerator extensions), Project-38 (DoD/DoE), and Arm rep to Sandia's DOE data-movement project.

Concurrently: technical advisor to FastData.io, GPU-accelerated streaming data processing (2016–2020).

GenZ-CXL subcommittee · st64 · 30+ patent filings

IP Contributions

29 granted U.S. patents with more in flight. Where they cluster:

Virtualization ×7 · Memory systems ×6 · Address translation ×4 · Data movement ×4 · Cache & coherence ×4 · Queues & channels ×3 · Telemetry & PMUs ×1 · Hints & prefetch ×1

Where the ideas were cited →

Off the clock

Austin, Texas. Nine years a soldier before the doctorate. A bioinformatics master's (Johns Hopkins) earned along the way, biology + international studies before that.

Builds restomod robots: currently resurrecting an Omnibot 2000, the greatest vintage bot of all time, with a SLAM stack that's headed for open source. Rebuilds cars too: a 1979 Porsche 911SC, stripped to the shell and going back together from the ground up around a custom ADAS of his own design. Photography, too. Blog posts on all of the above are coming.

Omnibot 2000 · SLAM → open source · '79 911SC + ADAS · Photography · Home lab

Washington University

PhD research, Stream-Based Supercomputing Lab. The thesis: teach parallel stream systems to tune themselves while running. Online queueing models spot where backpressure will bite, then the runtime resizes buffers and re-places kernels mid-flight. No profiling runs, no restarts. It shipped as RaftLib's autotuner.
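
A toy version of the buffer-sizing decision can be sketched with a textbook M/M/1/K queue, where the probability that an arriving item finds the buffer full has a closed form. This is an assumption chosen for illustration only: the thesis and RaftLib's autotuner are not claimed to use this exact model, and `blocking_probability` and `pick_capacity` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Probability that an arriving item finds an M/M/1/K queue full.
// rho = arrival rate / service rate; capacity = K, the number of items the
// system can hold, including the one in service.
double blocking_probability(double rho, std::size_t capacity) {
    assert(rho > 0.0 && capacity > 0);
    const double k = static_cast<double>(capacity);
    if (std::abs(rho - 1.0) < 1e-12) {
        return 1.0 / (k + 1.0);  // limiting case when arrivals match service
    }
    return (1.0 - rho) * std::pow(rho, k) / (1.0 - std::pow(rho, k + 1.0));
}

// Smallest capacity whose blocking probability is at or below the target.
// Falls back to max_capacity when no capacity below it meets the target,
// which always happens when the consumer is slower than the producer.
std::size_t pick_capacity(double arrival_rate, double service_rate,
                          double target_block, std::size_t max_capacity) {
    assert(arrival_rate > 0.0 && service_rate > 0.0 && max_capacity > 0);
    const double rho = arrival_rate / service_rate;
    for (std::size_t k = 1; k < max_capacity; ++k) {
        if (blocking_probability(rho, k) <= target_block) return k;
    }
    return max_capacity;
}
```

With a consumer twice as fast as its producer, `pick_capacity(1.0, 2.0, 0.01, 64)` settles on a small buffer. With the rates reversed it hits the cap, because for rho above 1 the blocking probability never falls below 1 - 1/rho. That is the case where a bigger buffer cannot help and something structural, such as re-placing the kernel, has to change. An online tuner would feed measured rates into a model like this while the application runs instead of taking them as fixed arguments.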

Community & speaking

Supercomputing (SC) TPC 2016–2022: Architecture & Networks Co-Chair (SC19), Speed Mentorship Chair (SC21).
TPC service: ISCA, MICRO, HPCA, ICPP (Architecture Track Co-Chair 2023), ISC, IA³, DOE P3HPC.
General Chair, Arm Research Summit 2019; Co-General Chair, GoingArm 2017/2018; MEMSYS organizing committee 2017–2022.
Co-author, U.S. DOE Extreme Heterogeneity workshop report (2018).

CppCast · Ep. 50 · C++Now ×3 · MEMSYS · SC

Tools

C/C++ · Python · Go · CUDA · gem5 · SST · QEMU · Verilog/VHDL · Linux/eBPF/perf · PMUs/LTTng · LLVM/GCC · FPGA

What I work on

AI accelerators · CPU/GPU µarch · HBM/DDR · CXL & disaggregation · Interconnects · Perf modeling & benchmarking · Workload characterization · Cache/coherence protocols · HW/SW co-design · TCO

Foundations

Ph.D., Computer Science: Washington University in St. Louis, 2015
M.S., Bioinformatics: Johns Hopkins University, 2010
B.S., Biological Sciences / B.A., International Studies: LSU, 2005

02 · Impact · patents

Where the ideas were cited

Forward citations from Google Patents: 120+ later filings cite the 29 granted patents, from Apple, Intel, NVIDIA, Microsoft, Samsung, IBM, AMD, Google, and a wave of AI-silicon startups. Each cluster below ends with the cited patents behind it. These are highlights only; the full impact map lives with the publications.

Accelerator integration & offload

Intel: Configuring/reconfiguring chains of accelerators
Samsung: Host/accelerator work-sharing via shared memory
NVIDIA: Unified virtual memory in heterogeneous systems

US11550585

Memory fabrics & disaggregation

Apple: Scalable system-on-a-chip (M-series fabric)
Apple: Address hashing across multiple memory controllers
Apple: Soft memory folding / compacted pipe addressing

US10534719 US10467159 US10901691

Coherence at scale

Microsoft: Snoop filter w/ disaggregated vector table
Microsoft: Adaptive coherency tracking (×4 patents)
IBM: Coordination namespace / global virtual address space

US10592424 US11176042 US11445020

Hardware queues & message passing

Google: Optimizing hardware FIFO instructions
Xilinx (AMD): Producer→consumer active cache transfers
Samsung: SoC data sync between processors

US11960945 US11614985 US10474575 US10445094

Context switching & migration

Apple: Thread-channel deactivation; memory-backed register preemption; multi-stage thread scheduling (×3)
Intel: NVM cloning w/ hardware copy-on-write
VMware: Cross-privilege-domain communication in CPU cores

US11934272 US10671426 US10423446 US10353826 US10552212

Virtual memory & translation

Intel: Pointer-extent-informed predictors
Intel: NVM cloning w/ HW copy-on-write
Apple: Memory Objects

US12007905 US10613989 US10565126 US10489304

Near-memory & sparse data movement

AMD: Near-memory data-dependent gather & packing
AMD: Reducing side-effects of compute offload to memory
Intel: Smart memory store/load; disaggregated-memory filtering

US10353601 US10067708 US10552152

Memory reliability & prediction

Samsung: SSD-based RAID
Apple: Dynamic address-based data reliability
Raytheon: Optimal bit apportionment vs soft errors (×3)

US10884850 US10423510 US10909045

Method: Google Patents “Cited by” and family-citation data, mined June 2026. A citation marks later work that builds on or relates to the patent: prior art acknowledged by the applicant or examiner, not endorsement.

02 · Impact · papers

Cited in the literature

RaftLib alone: 42 citations. The body of work is cited across OSDI, EuroSys, SC, USENIX ATC, HPCA, MICRO, CGO, and HPDC. See the full record on Google Scholar.

Stream processing & RaftLib

EuroSys 2020: PaSh: light-touch data-parallel shell processing (Vasilakis)
OSDI 2022: Practically Correct, Just-in-Time Shell Script Parallelization (Kallas)
SC 2022: TD-NUCA: Runtime-Driven Management of NUCA Caches (Caheny)

Source: RaftLib (IJHPCA) · 42 citations

Near-memory & sparse acceleration

HPCA 2021: FAFNIR: Accelerating Sparse Gathering by Near-Memory Reduction (Asgari)
IEEE Access 2021: DAMOV: Benchmark Suite for Data-Movement Bottlenecks (Oliveira)
MICRO 2023: A Tensor Marshaling Unit for Sparse Tensor Algebra (Siracusa)

Source: SPiDRE · Dark Bandwidth · NUCD

Hardware queues & messaging

HPCA 2025: Push Multicast: Speculative Coherent Interconnect (Huang)
CC 2024: BLQ: Locality-Aware Blocking-Less Queuing (Wu)
HotOS 2023: NextGen-Malloc: Giving the Allocator Its Own Room (Li)

Source: Virtual-Link · SPAMeR

Performance modeling of streaming

USENIX ATC 2019: EdgeWise: A Better Stream Processing Engine for the Edge (Fu)
HPDC 2023: Streaming Task Graph Scheduling for Dataflow Architectures (De Matteis)
Parallel Computing 2021: Reducing queuing impact in irregular dataflow (Timcheck)

Source: Analytic streaming models (MASCOTS) · 20 citations

Method: Semantic Scholar forward citations across 26 publications, mined June 2026; curated to recognizable venues, self-citations excluded. Canonical counts live on Google Scholar.

03 · Writing

Notes on life, systems, and performance

Stream processing, memory systems, parallelism, and the occasional war story.

All posts → RSS

Latest: June 27, 2026

The 80% Problem: The Last 20% Is Where the Engineer Used to Live

AI Is the Ultimate Leaky Abstraction

Why More Cores Stopped Saving Us