System Performance Architect · Google

I make computing faster and more efficient.

From open source to hyperscale silicon, and the teams that ship them.

Latest essay · The 80% Problem: The Last 20% Is Where the Engineer Used to Live

01 · About

Soldier · Scientist · System Architect

Performance, memory, and scalable computing at hyperscale.

Jonathan Beard turns architecture, memory, interconnect, power, and real workload behavior into data-driven platform decisions, and builds the teams that make those calls stick.

A hardware-architecture and systems-performance leader with 18+ years spanning the U.S. Army, academia, and hyperscale silicon. System Performance Architect at Google, U.S. Army veteran, and the primary author of RaftLib.

Now: Google · 2022–present

  • Owns system-wide Perf/TCO for Google's internal SoC program (gSoC) across silicon generations and product lines.
  • Leads the New Platform Introduction Performance team assessing the forward-looking Google SoC roadmap, translating architecture specs into real-world workload impact, deployment risk, and fleet adoption decisions.
  • Builds multivariate CapEx/OpEx, perf-per-dollar, and perf-per-watt models, from cores to racks.
  • Drives ML-guided SoC tuning (core registers, memory-controller settings, mesh QoS) validated against internal customer workloads, including AI/ML inference; technical lead for Google-specific accelerator IP; leads orgs of up to 40 engineers.

RaftLib · Open source · 2013–present

A C++ stream-processing DSL and parallel runtime I authored and still maintain. Wire compute kernels into a graph and RaftLib handles the queues, scheduling, and parallelism. It grew out of my PhD on online modeling and auto-tuning of parallel stream systems.
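To make the "kernels wired into a graph" idea concrete, here is a minimal hand-rolled sketch in Python. It is not RaftLib's API; the names `source`, `transform`, `sink`, and `run_pipeline` are hypothetical. It shows by hand the queue and thread plumbing that a stream runtime is described above as taking care of.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def source(out_q, items):
    """Producer kernel: push every item downstream, then signal completion."""
    for item in items:
        out_q.put(item)  # blocks when the bounded queue is full (backpressure)
    out_q.put(_DONE)


def transform(in_q, out_q, fn):
    """Middle kernel: apply fn to each item and forward the result."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(fn(item))


def sink(in_q, results):
    """Consumer kernel: collect items until the stream ends."""
    while True:
        item = in_q.get()
        if item is _DONE:
            return
        results.append(item)


def run_pipeline(items, fn, queue_size=4):
    """Wire source -> transform -> sink with bounded queues, one thread each."""
    q1 = queue.Queue(maxsize=queue_size)
    q2 = queue.Queue(maxsize=queue_size)
    results = []
    threads = [
        threading.Thread(target=source, args=(q1, items)),
        threading.Thread(target=transform, args=(q1, q2, fn)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), lambda x: x * x))
```

Every choice made manually here (queue sizes, one thread per kernel, shutdown signalling) is the kind of decision a stream-processing runtime can make on the programmer's behalf.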

~1,000 stars · 127 forks · ~46 daily clones · Hacker News front page ×2 · C++Now '16/'17 talks · Apache-2.0

Recognition: Wikipedia: CSP, named with Erlang & Go · Wikipedia: RaftLib · C++ Reactive Programming (Packt, 2018) · Awesome C++ · Awesome Parallel Computing

Also in the wild: Raft language · ipc: SHM library · LeigNet sim framework · constexpr HighwayHash

Research & PhD thesis

Online Modeling and Tuning of Parallel Stream Processing Systems (2015): queueing theory, machine learning, and control theory to auto-tune dataflow systems.

Work spans IJHPCA, IPDPS, ICPP, PACT, ICS, Euro-Par, MEMSYS, and HPEC.

At Arm: mentored and funded academic teams at Barcelona Supercomputing Center, UT Austin, and Georgia Tech (12+ students, 5 professors), converting research into patents, papers, and architecture proposals.

Arm · 2015–2022

Senior Research Engineer → Principal System Architect. CHI memory-copy enhancements, the LS64 architecture + gather-hint instruction (early AArch64 8.7/9.2 accelerator extensions), Project-38 (DoD/DoE), and Arm rep to Sandia's DOE data-movement project.

Concurrently: technical advisor to FastData.io, GPU-accelerated streaming data processing (2016–2020).

GenZ-CXL subcommittee · st64 · 30+ patent filings

Off the clock

Austin, Texas. Nine years a soldier before the doctorate. A bioinformatics master's (Johns Hopkins) earned along the way, biology + international studies before that.

Builds restomod robots: currently resurrecting an Omnibot 2000, the greatest vintage bot of all time, with a SLAM stack that's headed for open source. Rebuilds cars too: a 1979 Porsche 911SC, stripped to the shell and going back together from the ground up around a custom ADAS of his own design. Photography, too. Blog posts on all of the above are coming.

Omnibot 2000 · SLAM → open source · '79 911SC + ADAS · Photography · Home lab

Washington University

PhD research, Stream-Based Supercomputing Lab. The thesis: teach parallel stream systems to tune themselves while running. Online queueing models spot where backpressure will bite, then the runtime resizes buffers and re-places kernels mid-flight. No profiling runs, no restarts. It shipped as RaftLib's autotuner.
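As a rough illustration of how a queueing model can flag backpressure and suggest a buffer size, here is a minimal sketch that assumes a textbook M/M/1/K queue. This is a stand-in chosen for brevity, not the models used in the thesis or in RaftLib; the function names and the example rates are hypothetical.

```python
def blocking_probability(arrival_rate, service_rate, capacity):
    """Probability that an arriving item finds an M/M/1/K stage full.

    capacity is K: the total number of items the stage can hold
    (queued plus in service).
    """
    rho = arrival_rate / service_rate
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (capacity + 1)
    return (1.0 - rho) * rho**capacity / (1.0 - rho ** (capacity + 1))


def smallest_buffer(arrival_rate, service_rate, target, max_capacity=1 << 16):
    """Smallest capacity whose blocking probability is at or below target.

    Returns None when the consumer cannot keep up (rho >= 1). In this
    simplified sketch that case is treated as one that resizing alone
    should not be asked to fix, and also when no capacity up to
    max_capacity meets the target.
    """
    if arrival_rate >= service_rate:
        return None
    for capacity in range(1, max_capacity + 1):
        if blocking_probability(arrival_rate, service_rate, capacity) <= target:
            return capacity
    return None


if __name__ == "__main__":
    # Hypothetical rates: producer at 8 items/s, consumer at 10 items/s.
    print(smallest_buffer(8.0, 10.0, target=0.01))
```

An online version would re-estimate the arrival and service rates from live measurements and call something like `smallest_buffer` again whenever they drift.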

Community & speaking

  • Supercomputing (SC) TPC 2016–2022: Architecture & Networks Co-Chair (SC19), Speed Mentorship Chair (SC21).
  • TPC service: ISCA, MICRO, HPCA, ICPP (Architecture Track Co-Chair 2023), ISC, IA³, DOE P3HPC.
  • General Chair, Arm Research Summit 2019; Co-General Chair, GoingArm 2017/2018; MEMSYS organizing committee 2017–2022.
  • Co-author, U.S. DOE Extreme Heterogeneity workshop report (2018).
CppCast Ep. 50 · C++Now ×3 · MEMSYS · SC

Tools

C/C++ · Python · Go · CUDA · gem5 · SST · QEMU · Verilog/VHDL · Linux/eBPF/perf · PMUs/LTTng · LLVM/GCC · FPGA

What I work on

AI accelerators · CPU/GPU µarch · HBM/DDR · CXL & disaggregation · Interconnects · Perf modeling & benchmarking · Workload characterization · Cache/coherence protocols · HW/SW co-design · TCO

Foundations

  • Ph.D., Computer Science: Washington University in St. Louis, 2015
  • M.S., Bioinformatics: Johns Hopkins University, 2010
  • B.S., Biological Sciences / B.A., International Studies: LSU, 2005

02 · Impact · patents

Where the ideas were cited

Forward citations from Google Patents: 120+ later filings cite the 29 granted patents, at Apple, Intel, NVIDIA, Microsoft, Samsung, IBM, AMD, Google, and a wave of AI-silicon startups. Each chip links to the cited patent behind the cluster. Highlights here. The full impact map lives with the publications.

Accelerator integration & offload

  • Intel: Configuring/reconfiguring chains of accelerators
  • Samsung: Host/accelerator work-sharing via shared memory
  • NVIDIA: Unified virtual memory in heterogeneous systems

Memory fabrics & disaggregation

  • Apple: Scalable system-on-a-chip (M-series fabric)
  • Apple: Address hashing across multiple memory controllers
  • Apple: Soft memory folding / compacted pipe addressing

Coherence at scale

  • Microsoft: Snoop filter w/ disaggregated vector table
  • Microsoft: Adaptive coherency tracking (×4 patents)
  • IBM: Coordination namespace / global virtual address space

Hardware queues & message passing

  • Google: Optimizing hardware FIFO instructions
  • Xilinx (AMD): Producer→consumer active cache transfers
  • Samsung: SoC data sync between processors

Context switching & migration

  • Apple: Thread-channel deactivation; memory-backed register preemption; multi-stage thread scheduling (×3)
  • Intel: NVM cloning w/ hardware copy-on-write
  • VMware: Cross-privilege-domain communication in CPU cores

Near-memory & sparse data movement

  • AMD: Near-memory data-dependent gather & packing
  • AMD: Reducing side-effects of compute offload to memory
  • Intel: Smart memory store/load; disaggregated-memory filtering

Memory reliability & prediction

  • Samsung: SSD-based RAID
  • Apple: Dynamic address-based data reliability
  • Raytheon: Optimal bit apportionment vs soft errors (×3)

Method: Google Patents “Cited by” and family-citation data, mined June 2026. A citation marks later work that builds on or relates to the patent: prior art acknowledged by the applicant or examiner, not endorsement.

02 · Impact · papers

Cited in the literature

RaftLib alone: 42 citations. The body of work is cited across OSDI, EuroSys, SC, USENIX ATC, HPCA, MICRO, CGO, and HPDC. See the full record on Google Scholar.

Stream processing & RaftLib

  • EuroSys 2020: PaSh: light-touch data-parallel shell processing (Vasilakis)
  • OSDI 2022: Practically Correct, Just-in-Time Shell Script Parallelization (Kallas)
  • SC 2022: TD-NUCA: Runtime-Driven Management of NUCA Caches (Caheny)
Source: RaftLib (IJHPCA) · 42 citations

Near-memory & sparse acceleration

  • HPCA 2021: FAFNIR: Accelerating Sparse Gathering by Near-Memory Reduction (Asgari)
  • IEEE Access 2021: DAMOV: Benchmark Suite for Data-Movement Bottlenecks (Oliveira)
  • MICRO 2023: A Tensor Marshaling Unit for Sparse Tensor Algebra (Siracusa)
Source: SPiDRE · Dark Bandwidth · NUCD

Hardware queues & messaging

  • HPCA 2025: Push Multicast: Speculative Coherent Interconnect (Huang)
  • CC 2024: BLQ: Locality-Aware Blocking-Less Queuing (Wu)
  • HotOS 2023: NextGen-Malloc: Giving the Allocator Its Own Room (Li)
Source: Virtual-Link · SPAMeR

Performance modeling of streaming

  • USENIX ATC 2019: EdgeWise: A Better Stream Processing Engine for the Edge (Fu)
  • HPDC 2023: Streaming Task Graph Scheduling for Dataflow Architectures (De Matteis)
  • Parallel Computing 2021: Reducing queuing impact in irregular dataflow (Timcheck)
Source: Analytic streaming models (MASCOTS) · 20 citations

Method: Semantic Scholar forward citations across 26 publications, mined June 2026; curated to recognizable venues, self-citations excluded. Canonical counts live on Google Scholar.

03 · Writing

Notes on life, systems, and performance

Stream processing, memory systems, parallelism, and the occasional war story.

June 19, 2026

Why More Cores Stopped Saving Us

Adding cores quietly stopped working, and Amdahl told us why forty years early: scaling stalls on the one dependency you can't parallelize away, not the resource you keep buying.
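The ceiling described here is Amdahl's law: if a fraction p of the work can be parallelized, the speedup on n cores is at most 1 / ((1 - p) + p / n). A minimal sketch, with a hypothetical 95% parallel workload chosen only for illustration:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative assumption: 95% of the work parallelizes perfectly.
    for cores in (1, 8, 64, 512):
        print(f"{cores:>4} cores -> {amdahl_speedup(0.95, cores):5.2f}x")
```

With 5% of the work serial, the bound can never reach 20x no matter how many cores are added, because the serial term 1 - p stays in the denominator.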