System Performance Architect · Google

I make computing faster — and more efficient.

From open source to hyperscale silicon — and the teams that ship them.

Latest essay: Managing Parallel, Part 4: The Machine Underneath

01 — About

Soldier · Scientist · System Architect

Performance, memory, and scalable computing at hyperscale.

Jonathan Beard turns architecture, memory, interconnect, power, and real workload behavior into data-driven platform decisions — and builds the teams that make those calls stick.

A hardware-architecture and systems-performance leader with 18+ years spanning the U.S. Army, academia, and hyperscale silicon. System Performance Architect at Google, U.S. Army veteran, and the primary author of RaftLib.

Now — Google · 2022–present

  • Owns system-wide Perf/TCO for Google's internal SoC program (gSoC) across silicon generations and product lines.
  • Leads the New Platform Introduction Performance team assessing the forward-looking Google SoC roadmap — translating architecture specs into real-world workload impact, deployment risk, and fleet adoption decisions.
  • Builds multivariate CapEx/OpEx, perf-per-dollar, and perf-per-watt models, from cores to racks.
  • Drives ML-guided SoC tuning (core registers, memory-controller settings, mesh QoS) validated against internal customer workloads, including AI/ML inference; technical lead for Google-specific accelerator IP; leads orgs of up to 40 engineers.

RaftLib · Open source · 2013–present

A C++ stream-processing DSL and parallel runtime I authored and still maintain — wire compute kernels into a graph and RaftLib handles the queues, scheduling, and parallelism. It grew out of my PhD on online modeling and auto-tuning of parallel stream systems.
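To make the "kernels wired into a graph" idea concrete, here is a minimal sketch in plain Python. It is not RaftLib's C++ API: the names `source`, `transform`, `sink`, and `run_pipeline` are hypothetical, and each kernel is simply given its own thread and a bounded FIFO queue. In the sketch that wiring is done by hand; it is the part a stream runtime takes over.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream


def source(out_q, items):
    """Producer kernel: push each item downstream, then signal end of stream."""
    for item in items:
        out_q.put(item)  # blocks while the queue is full (backpressure)
    out_q.put(_DONE)


def transform(in_q, out_q, fn):
    """Worker kernel: apply fn to every item and forward the result."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)  # propagate end of stream
            return
        out_q.put(fn(item))


def sink(in_q, results):
    """Consumer kernel: collect everything that arrives."""
    while True:
        item = in_q.get()
        if item is _DONE:
            return
        results.append(item)


def run_pipeline(items, fn, capacity=4):
    """Wire source -> transform -> sink with bounded queues, one thread each."""
    q1 = queue.Queue(maxsize=capacity)
    q2 = queue.Queue(maxsize=capacity)
    results = []
    threads = [
        threading.Thread(target=source, args=(q1, items)),
        threading.Thread(target=transform, args=(q1, q2, fn)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), lambda x: x * x))  # prints [0, 1, 4, 9, 16]
```

The queue capacity is fixed here. Choosing it, and deciding where each kernel runs, is the kind of decision the text above says RaftLib makes for you.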

~1,000 stars · 127 forks · ~46 daily clones · Hacker News front page ×2 · C++Now '16/'17 talks · Apache-2.0

Recognition: Wikipedia: CSP (named with Erlang & Go) · Wikipedia: RaftLib · C++ Reactive Programming (Packt, 2018) · Awesome C++ · Awesome Parallel Computing

Also in the wild: Raft language · ipc (SHM library) · LeigNet sim framework · constexpr HighwayHash

Research & PhD thesis

Online Modeling and Tuning of Parallel Stream Processing Systems (2015) — queueing theory, machine learning, and control theory to auto-tune dataflow systems.

Work spans IJHPCA, IPDPS, ICPP, PACT, ICS, Euro-Par, MEMSYS, and HPEC.

At Arm: mentored and funded academic teams at Barcelona Supercomputing Center, UT Austin, and Georgia Tech — 12+ students, 5 professors — converting research into patents, papers, and architecture proposals.

Arm · 2015–2022

Senior Research Engineer → Principal System Architect. CHI memory-copy enhancements, the LS64 architecture + gather-hint instruction (early Armv8.7-A/Armv9.2-A accelerator extensions), Project-38 (DoD/DOE), and Arm's representative to Sandia's DOE data-movement project.

Concurrently: technical advisor to FastData.io — GPU-accelerated streaming data processing (2016–2020).

GenZ-CXL subcommittee · st64 · 30+ patent filings

Off the clock

Austin, Texas. Nine years a soldier before the doctorate — a bioinformatics master's (Johns Hopkins) earned along the way, biology + international studies before that.

Builds restomod robots — currently resurrecting an Omnibot 2000, the greatest vintage bot of all time, with a SLAM stack that's headed for open source. Rebuilds cars too: a 1979 Porsche 911SC, stripped to the shell and going back together from the ground up around a custom ADAS of his own design. Photography, too. Blog posts on all of the above are coming.

Omnibot 2000 · SLAM → open source · '79 911SC + ADAS · Photography · Home lab

Washington University

PhD research, Stream-Based Supercomputing Lab. The thesis: teach parallel stream systems to tune themselves while running — online queueing models spot where backpressure will bite, then the runtime resizes buffers and re-places kernels mid-flight. No profiling runs, no restarts. It shipped as RaftLib's autotuner.
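As a rough illustration of how a queueing model can flag where backpressure will bite, the sketch below uses the textbook M/M/1/K formula for the probability that an arriving item finds a buffer full. This is an assumption chosen for simplicity, not the thesis's actual model, and the function names and rates are hypothetical.

```python
import math


def blocking_probability(arrival_rate, service_rate, capacity):
    """P(arrival finds the system full) for an M/M/1/K queue with K = capacity."""
    rho = arrival_rate / service_rate
    if math.isclose(rho, 1.0):
        return 1.0 / (capacity + 1)
    if rho < 1.0:
        return (1.0 - rho) * rho**capacity / (1.0 - rho ** (capacity + 1))
    # rho > 1: same formula rewritten in terms of 1/rho to avoid float overflow
    inv = 1.0 / rho
    return (1.0 - inv) / (1.0 - inv ** (capacity + 1))


def smallest_capacity(arrival_rate, service_rate, target, max_capacity=4096):
    """Smallest K whose blocking probability is at most target, else None."""
    for capacity in range(1, max_capacity + 1):
        if blocking_probability(arrival_rate, service_rate, capacity) <= target:
            return capacity
    return None  # no buffer size in range helps; the consumer is too slow


if __name__ == "__main__":
    # Producer at 1 item/s, consumer at 2 items/s, at most 10% blocking allowed.
    print(smallest_capacity(1.0, 2.0, 0.1))  # prints 3
```

When the consumer is slower than the producer, no buffer size meets a tight target and the function returns `None`. That is the case where resizing alone is not enough and the work has to be re-placed or parallelized instead.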

Community & speaking

  • Supercomputing (SC) TPC 2016–2022 — Architecture & Networks Co-Chair (SC19), Speed Mentorship Chair (SC21).
  • TPC service: ISCA, MICRO, HPCA, ICPP (Architecture Track Co-Chair 2023), ISC, IA³, DOE P3HPC.
  • General Chair, Arm Research Summit 2019; Co-General Chair, GoingArm 2017/2018; MEMSYS organizing committee 2017–2022.
  • Co-author, U.S. DOE Extreme Heterogeneity workshop report (2018).
CppCast (Ep. 50) · C++Now ×3 · MEMSYS · SC

Tools

C/C++ · Python · Go · CUDA · gem5 · SST · QEMU · Verilog/VHDL · Linux/eBPF/perf · PMUs/LTTng · LLVM/GCC · FPGA

What I work on

AI accelerators · CPU/GPU µarch · HBM/DDR · CXL & disaggregation · Interconnects · Perf modeling & benchmarking · Workload characterization · Cache/coherence protocols · HW/SW co-design · TCO

Foundations

  • Ph.D., Computer Science — Washington University in St. Louis, 2015
  • M.S., Bioinformatics — Johns Hopkins University, 2010
  • B.S. Biological Sciences / B.A. International Studies — LSU, 2005

02 — Impact · patents

Where the ideas were cited

Forward citations from Google Patents: 120+ later filings cite the 29 granted patents — at Apple, Intel, NVIDIA, Microsoft, Samsung, IBM, AMD, Google, and a wave of AI-silicon startups. Each chip links to the cited patent behind the cluster. Highlights here — the full impact map lives with the publications.

Accelerator integration & offload

  • Intel — Configuring/reconfiguring chains of accelerators
  • Samsung — Host/accelerator work-sharing via shared memory
  • NVIDIA — Unified virtual memory in heterogeneous systems

Memory fabrics & disaggregation

  • Apple — Scalable system-on-a-chip (M-series fabric)
  • Apple — Address hashing across multiple memory controllers
  • Apple — Soft memory folding / compacted pipe addressing

Coherence at scale

  • Microsoft — Snoop filter w/ disaggregated vector table
  • Microsoft — Adaptive coherency tracking (×4 patents)
  • IBM — Coordination namespace / global virtual address space

Hardware queues & message passing

  • Google — Optimizing hardware FIFO instructions
  • Xilinx (AMD) — Producer→consumer active cache transfers
  • Samsung — SoC data sync between processors

Context switching & migration

  • Apple — Thread-channel deactivation; memory-backed register preemption; multi-stage thread scheduling (×3)
  • Intel — NVM cloning w/ hardware copy-on-write
  • VMware — Cross-privilege-domain communication in CPU cores

Near-memory & sparse data movement

  • AMD — Near-memory data-dependent gather & packing
  • AMD — Reducing side-effects of compute offload to memory
  • Intel — Smart memory store/load; disaggregated-memory filtering

Memory reliability & prediction

  • Samsung — SSD-based RAID
  • Apple — Dynamic address-based data reliability
  • Raytheon — Optimal bit apportionment vs soft errors (×3)

Method: Google Patents “Cited by” and family-citation data, mined June 2026. A citation marks later work that builds on or relates to the patent — prior art acknowledged by the applicant or examiner, not endorsement.

02 — Impact · papers

Cited in the literature

RaftLib alone: 42 citations. The body of work is cited across OSDI, EuroSys, SC, USENIX ATC, HPCA, MICRO, CGO, and HPDC. See the full record on Google Scholar.

Stream processing & RaftLib

  • EuroSys 2020 — PaSh: light-touch data-parallel shell processing (Vasilakis)
  • OSDI 2022 — Practically Correct, Just-in-Time Shell Script Parallelization (Kallas)
  • SC 2022 — TD-NUCA: Runtime-Driven Management of NUCA Caches (Caheny)
Source: RaftLib (IJHPCA) — 42 citations

Near-memory & sparse acceleration

  • HPCA 2021 — FAFNIR: Accelerating Sparse Gathering by Near-Memory Reduction (Asgari)
  • IEEE Access 2021 — DAMOV: Benchmark Suite for Data-Movement Bottlenecks (Oliveira)
  • MICRO 2023 — A Tensor Marshaling Unit for Sparse Tensor Algebra (Siracusa)
Source: SPiDRE · Dark Bandwidth · NUCD

Hardware queues & messaging

  • HPCA 2025 — Push Multicast: Speculative Coherent Interconnect (Huang)
  • CC 2024 — BLQ: Locality-Aware Blocking-Less Queuing (Wu)
  • HotOS 2023 — NextGen-Malloc: Giving the Allocator Its Own Room (Li)
Source: Virtual-Link · SPAMeR

Performance modeling of streaming

  • USENIX ATC 2019 — EdgeWise: A Better Stream Processing Engine for the Edge (Fu)
  • HPDC 2023 — Streaming Task Graph Scheduling for Dataflow Architectures (De Matteis)
  • Parallel Computing 2021 — Reducing queuing impact in irregular dataflow (Timcheck)
Source: Analytic streaming models (MASCOTS) — 20 citations

Method: Semantic Scholar forward citations across 26 publications, mined June 2026; curated to recognizable venues, self-citations excluded. Canonical counts live on Google Scholar.

03 — Writing

Notes on life, systems, and performance

Stream processing, memory systems, parallelism — and the occasional war story.