Full list of papers and patents available at Google Scholar.

Open Source Software (things I lead)

  • RaftLib - a runtime for heterogeneous data-flow/stream processing using a C++ DSL; it was also my thesis. It’s Apache 2.0, so it’s free to use for pretty much everything.

Interviews

Talks (just talks, not papers)

Organized Workshops and Conferences

In the News

Research Publications


  1. The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload

    Introduces the NUCD architecture, a lightweight accelerator offload mechanism tightly coupled to a host core for fine-grain tasks; demonstrates performance gains over conventional driver-based offload.

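    Asri, M., Dunham, C., Rusitoru, R., Gerstlauer, A., & Beard, J. (2020, March). The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload. 28th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.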

    @inproceedings{adrgb20,
      title = {The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload},
      author = {Asri, Mochamad and Dunham, Curtis and Rusitoru, Roxana and Gerstlauer, Andreas and Beard, Jonathan},
      booktitle = {28th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing},
      series = {PDP2020},
      year = {2020},
      month = mar
    }
    

  2. Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data

    Presents multi-spectral reuse distance to infer spatial locality from temporal traces across multiple granularities; uses Earth Mover’s Distance to quantify shifts and guide page-sizing decisions.

    Cabrera, A. M., Chamberlain, R. D., & Beard, J. C. (2019, September). Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data. The IEEE High Performance Extreme Computing Conference 2019.

    @inproceedings{ccb19,
      title = {Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data},
      author = {Cabrera, Anthony M. and Chamberlain, Roger D. and Beard, Jonathan C.},
      booktitle = {The IEEE High Performance Extreme Computing Conference 2019},
      series = {HPEC2019},
      year = {2019},
      month = sep,
      slides = {../slides/HPEC-2019-anthony.pdf}
    }
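    The core quantities can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the paper's implementation. Reuse distance is taken as the number of distinct blocks touched between two accesses to the same block, computed with a simple LRU stack. Each "spectrum" is a different block size, chosen by the hypothetical `shift` parameter. Two histograms are compared with the closed-form 1-D Earth Mover's Distance, which is the summed absolute difference of their CDFs. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reuse distance of each access: the number of distinct blocks touched since the
// previous access to the same block, or -1 for a first-time (cold) access.
// Addresses map to blocks by dropping `shift` low-order bits, so calling this with
// several shifts yields one reuse-distance profile per granularity.
std::vector<long> reuse_distances(const std::vector<std::uint64_t>& trace, unsigned shift) {
    std::vector<std::uint64_t> lru;  // LRU stack: most recently used block at the back
    std::vector<long> out;
    out.reserve(trace.size());
    for (std::uint64_t addr : trace) {
        const std::uint64_t block = addr >> shift;
        long dist = -1;
        for (std::size_t i = lru.size(); i-- > 0;) {
            if (lru[i] == block) {
                // Every block above position i was touched more recently.
                dist = static_cast<long>(lru.size() - 1 - i);
                lru.erase(lru.begin() + static_cast<std::ptrdiff_t>(i));
                break;
            }
        }
        lru.push_back(block);
        out.push_back(dist);
    }
    return out;
}

// Normalized histogram with `nbins` unit-width bins; cold accesses and distances
// that do not fit are lumped into the last bin.
std::vector<double> histogram(const std::vector<long>& dists, std::size_t nbins) {
    assert(!dists.empty() && nbins > 0);
    std::vector<double> h(nbins, 0.0);
    for (long d : dists) {
        const bool overflow = d < 0 || static_cast<std::size_t>(d) >= nbins - 1;
        h[overflow ? nbins - 1 : static_cast<std::size_t>(d)] += 1.0;
    }
    for (double& v : h) v /= static_cast<double>(dists.size());
    return h;
}

// 1-D Earth Mover's Distance between two histograms on the same unit-spaced bins:
// the sum of absolute differences between their cumulative distributions.
double emd_1d(const std::vector<double>& p, const std::vector<double>& q) {
    assert(p.size() == q.size());
    double cp = 0.0, cq = 0.0, total = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        cp += p[i];
        cq += q[i];
        total += std::fabs(cp - cq);
    }
    return total;
}
```

    Consider the trace 0, 64, 0, 64. With 64-byte blocks, the repeated accesses have a reuse distance of 1. At 128-byte granularity, the same trace collapses onto one block with distance 0. The EMD between the two 3-bin histograms is 1.0. A shift like this across granularities is the kind of spatial signal the summary describes.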
    

  3. SPiDRE: Accelerating Sparse Memory Access Patterns

    Explores SPiDRE hardware to accelerate sparse and irregular memory access via near-memory data reorganization, improving bandwidth utilization and performance.

    Barredo, A., Beard, J. C., & Moretó, M. (2019, September). SPiDRE: Accelerating Sparse Memory Access Patterns. 28th International Conference on Parallel Architectures and Compilation Techniques (PACT).

    @inproceedings{bbm19,
      title = {SPiDRE: Accelerating Sparse Memory Access Patterns},
      author = {Barredo, Adri\'an and Beard, Jonathan C. and Moret\'o, Miquel},
      booktitle = {28th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
      series = {PACT2019},
      year = {2019},
      month = sep
    }
    

  4. This Architecture Tastes Like Microarchitecture

    Revisits early RISC ideas and argues for ISAs that specify less microarchitectural detail, contrasting multiple ISA design philosophies and their implications for hardware/software co-design.

    Dunham, C., & Beard, J. C. (2018). This Architecture Tastes Like Microarchitecture. The 2nd Workshop on Pioneering Processor Paradigms.

    @inproceedings{db18a,
      title = {This Architecture Tastes Like Microarchitecture},
      author = {Dunham, Curtis and Beard, Jonathan C.},
      booktitle = {The 2nd Workshop on Pioneering Processor Paradigms},
      series = {WP3},
      year = {2018}
    }
    

  5. The Sparse Data Reduction Engine (SPiDRE): Chopping Sparse Data One Byte at a Time

    Introduces SPiDRE, a near-memory programmable data-reduction/rearrangement engine for sparse workloads, including a programmer interface and evaluation on representative applications.

    Beard, J. C. (2017, October). The Sparse Data Reduction Engine (SPiDRE): Chopping Sparse Data One Byte at a Time. Proceedings of the Second International Symposium on Memory Systems.

    @inproceedings{b17a,
      title = {The Sparse Data Reduction Engine (SPiDRE): Chopping Sparse Data One Byte at a Time},
      author = {Beard, Jonathan C},
      booktitle = {Proceedings of the Second International Symposium on Memory Systems},
      year = {2017},
      month = oct,
      organization = {ACM},
      slides = {../slides/memsys2017_SPiDRE_Beard.pdf}
    }
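    For reference, here is the kind of work a near-memory rearrangement engine takes off the host: a plain gather that packs indexed elements into a dense buffer. This is only a software analogue. `gather` is a hypothetical helper and does not reflect SPiDRE's programmer interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogue of a near-memory gather: copy src[idx[k]] into a dense,
// contiguous output so that downstream code streams over packed data instead
// of chasing scattered indices.
template <typename T>
std::vector<T> gather(const std::vector<T>& src, const std::vector<std::size_t>& idx) {
    std::vector<T> dense;
    dense.reserve(idx.size());
    for (std::size_t i : idx) {
        assert(i < src.size());  // indices must be in bounds
        dense.push_back(src[i]);
    }
    return dense;
}
```

    When this loop runs on the memory side, only the packed result has to cross the interconnect. Run on the host, it pulls in every cache line that holds one of the indexed elements.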
    

  6. Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore

    Defines ‘dark bandwidth’ as wasted data movement driven by memory interfaces and virtual-memory abstractions, and argues for system-level approaches that reduce data movement in post-Moore systems.

    Beard, J. C., & Randall, J. (2017). Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore. Proc. High Performance Computing Post-Moore (HCPM’17).

    @inproceedings{br17a,
      title = {Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore},
      author = {Beard, Jonathan C and Randall, Joshua},
      booktitle = {Proc. High Performance Computing Post-Moore (HCPM'17)},
      series = {Lecture Notes in Computer Science},
      year = {2017},
      month = jun,
      slides = {../slides/beard_hcpm2017.pdf}
    }
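    One narrow, easily measured slice of wasted data movement is the gap between the bytes a program requests and the whole cache lines the memory system moves to serve them. The sketch below computes that gap for a list of accesses. The 64-byte line size, the no-overlap assumption and the `wasted_fraction` name are illustrative choices, not the paper's definition of dark bandwidth.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Fraction of transferred bytes that the program never requested, assuming every
// touched cache line is moved exactly once and in full, and that the accesses
// (address, size in bytes) do not overlap.
double wasted_fraction(const std::vector<std::pair<std::uint64_t, std::size_t>>& accesses,
                       std::size_t line_bytes = 64) {
    assert(line_bytes > 0);
    std::set<std::uint64_t> lines;  // distinct cache lines touched
    std::size_t used = 0;           // bytes actually requested
    for (const auto& acc : accesses) {
        const std::uint64_t addr = acc.first;
        const std::size_t size = acc.second;
        assert(size > 0);
        used += size;
        const std::uint64_t first = addr / line_bytes;
        const std::uint64_t last = (addr + size - 1) / line_bytes;
        for (std::uint64_t line = first; line <= last; ++line) lines.insert(line);
    }
    const double moved = static_cast<double>(lines.size() * line_bytes);
    return moved == 0.0 ? 0.0 : 1.0 - static_cast<double>(used) / moved;
}
```

    Take four 8-byte loads at a 64-byte stride. The memory system moves 256 bytes to deliver 32, so 87.5% of the transferred bytes go unused.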
    

  7. RaftLib: A C++ template library for high performance stream parallel processing

    Journal version of RaftLib detailing a C++ template library that enables streaming graph optimizations, dynamic queue tuning, automatic parallelization, and low-overhead monitoring for legacy C/C++ code.

    Beard, J. C., Li, P., & Chamberlain, R. D. (2016). RaftLib: A C++ template library for high performance stream parallel processing. International Journal of High Performance Computing Applications. https://doi.org/10.1177/1094342016672542

    @article{blc16,
      author = {Beard, Jonathan C and Li, Peng and Chamberlain, Roger D},
      title = {RaftLib: A {C++} template library for high performance stream parallel processing},
      year = {2016},
      doi = {10.1177/1094342016672542},
      journal = {International Journal of High Performance Computing Applications}
    }
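    RaftLib's programming model is a graph of compute kernels that run concurrently and communicate only through bounded FIFO queues. The sketch below illustrates that model with standard C++ threads. It deliberately does not use RaftLib's API, and `BoundedQueue` and `square_sum_pipeline` are hypothetical names. The queue `capacity` stands in for the buffer size that, per the summary, RaftLib tunes dynamically.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>

// Bounded FIFO linking two kernels; its capacity is the kind of knob a
// streaming runtime can size automatically.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) { assert(cap > 0); }
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });  // back-pressure
        q_.push(std::move(v));
        not_empty_.notify_one();
    }
    void close() {  // producer signals end of stream
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
    std::optional<T> pop() {  // std::nullopt once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }
private:
    std::size_t cap_;
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    bool closed_ = false;
};

// Two-kernel pipeline: a producer emits 0..n-1 and a consumer sums their
// squares. Both run concurrently, connected only by the queue.
long long square_sum_pipeline(int n, std::size_t capacity) {
    BoundedQueue<int> q(capacity);
    long long sum = 0;  // written only by the consumer, read after join
    std::thread producer([&] {
        for (int i = 0; i < n; ++i) q.push(i);
        q.close();
    });
    std::thread consumer([&] {
        while (auto v = q.pop()) sum += static_cast<long long>(*v) * *v;
    });
    producer.join();
    consumer.join();
    return sum;
}
```

    With a capacity of 1, the two kernels roughly alternate. Larger capacities let the producer run ahead, and that trade-off is what a queue-sizing optimizer has to balance.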
    

  8. Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines

    Extends SVM-based model selection to online settings, determining when M/M/1 and M/D/1 models apply to stream-processing queues; validated across multiple hardware and software platforms.

    Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015). Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines. Proceedings of Euro-Par 2015 Parallel Processing, 82-93.

    @inproceedings{bec15b,
      title = {Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines},
      author = {Beard, Jonathan C. and Epstein, Cooper and Chamberlain, Roger D.},
      booktitle = {Proceedings of Euro-Par 2015 Parallel Processing},
      year = {2015},
      month = aug,
      pages = {82-93},
      publisher = {Springer}
    }
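    To see why the choice between these models matters, compare their textbook mean queue lengths at the same utilization ρ = λ/μ. M/M/1 gives L_q = ρ²/(1 − ρ). M/D/1 gives L_q = ρ²/(2(1 − ρ)), which is Pollaczek-Khinchine with zero service-time variance. The functions below are a hypothetical sketch of those formulas only; the paper's SVM classifier is not reproduced.

```cpp
#include <cassert>

// Utilization rho = lambda / mu; both models require rho < 1 to be stable.
double utilization(double lambda, double mu) {
    assert(lambda >= 0.0 && mu > 0.0);
    const double rho = lambda / mu;
    assert(rho < 1.0);
    return rho;
}

// Mean number of items waiting (excluding the one in service), M/M/1:
// Poisson arrivals, exponentially distributed service times.
double mm1_queue_length(double lambda, double mu) {
    const double rho = utilization(lambda, mu);
    return rho * rho / (1.0 - rho);
}

// Same quantity for M/D/1 (deterministic service times): exactly half the
// M/M/1 value at the same utilization.
double md1_queue_length(double lambda, double mu) {
    const double rho = utilization(lambda, mu);
    return rho * rho / (2.0 * (1.0 - rho));
}
```

    At λ = 0.8 and μ = 1.0, the two models predict 3.2 and 1.6 waiting items respectively. Applying the wrong model would therefore misestimate the queue by a factor of two.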
    

  9. Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Introduces runtime estimation of kernel service rates for streaming systems to enable continuous retuning and avoid reliance on steady-state assumptions.

    Beard, J. C., & Chamberlain, R. D. (2015). Run Time Approximation of Non-blocking Service Rates for Streaming Systems. Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 792-797. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.64

    @inproceedings{bc15f,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      booktitle = {Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications},
      title = {Run Time Approximation of Non-blocking Service Rates for Streaming Systems},
      year = {2015},
      pages = {792-797},
      month = aug,
      publisher = {IEEE},
      doi = {10.1109/HPCC-CSS-ICESS.2015.64},
      slides = {../slides/hpcc2015_public.pdf}
    }
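    The intuition is that a kernel stalled on an empty input queue or a full output queue shows a throughput set by its neighbors, not by the kernel itself. The sketch below assumes a hypothetical monitoring record per sampling interval: items completed, elapsed time, and whether the kernel blocked. It pools only the non-blocked intervals. This simplification is for illustration only and is not the estimator from the paper.

```cpp
#include <cassert>
#include <vector>

// One monitoring interval for a kernel (hypothetical record layout).
struct Sample {
    long items;      // items the kernel completed during the interval
    double seconds;  // interval length
    bool blocked;    // true if the kernel stalled on an empty input or full output
};

// Approximate the non-blocking service rate (items/second) by pooling only the
// intervals in which the kernel never blocked. Returns 0 if none qualify.
double nonblocking_service_rate(const std::vector<Sample>& samples) {
    long items = 0;
    double seconds = 0.0;
    for (const Sample& s : samples) {
        assert(s.items >= 0 && s.seconds >= 0.0);
        if (s.blocked) continue;  // rate here is limited by neighbors, not this kernel
        items += s.items;
        seconds += s.seconds;
    }
    return seconds > 0.0 ? static_cast<double>(items) / seconds : 0.0;
}
```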
    

  10. Online Modeling and Tuning of Parallel Stream Processing Systems

    Thesis on online modeling and tuning for parallel stream processing: presents RaftLib and techniques for runtime modeling/optimization to reduce tuning cost and improve portability across heterogeneous systems.

    Beard, J. C. (2015). Online Modeling and Tuning of Parallel Stream Processing Systems [PhD thesis]. Department of Computer Science and Engineering, Washington University in St. Louis.

    @phdthesis{beardthesis,
      author = {Beard, Jonathan C.},
      title = {Online Modeling and Tuning of Parallel Stream Processing Systems},
      school = {Department of Computer Science and Engineering, Washington University
      in St. Louis},
      month = aug,
      year = {2015},
      link = {https://www.jonathanbeard.io/pdf/beard-thesis.pdf}
    }
    

  11. Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Presents an online method to estimate non-blocking service rates of stream kernels, enabling dynamic queueing/network-flow optimization under changing workloads; implemented in RaftLib and validated on benchmarks.

    Beard, J. C., & Chamberlain, R. D. (2015). Run Time Approximation of Non-blocking Service Rates for Streaming Systems. arXiv preprint arXiv:1504.00591v2.

    @article{bc15b,
      title = {Run Time Approximation of Non-blocking Service Rates for Streaming Systems},
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      journal = {arXiv preprint arXiv:1504.00591v2},
      year = {2015},
      month = apr,
      link = {https://arxiv.org/pdf/1504.00591v2}
    }
    

  12. Deadlock-free Buffer Configuration for Stream Computing

    Shows that output-buffer sizing in stream graphs can induce deadlock; provides necessary and sufficient conditions for deadlock-free configurations plus algorithms to detect and fix unsafe buffer settings.

    Li, P., Beard, J. C., & Buhler, J. (2015). Deadlock-free Buffer Configuration for Stream Computing. Proceedings of Programming Models and Applications on Multicores and Manycores, 164-169.

    @inproceedings{lbb15,
      author = {Li, Peng and Beard, Jonathan C. and Buhler, Jeremy},
      title = {Deadlock-free Buffer Configuration for Stream Computing},
      publisher = {ACM},
      address = {New York, NY, USA},
      year = {2015},
      month = feb,
      series = {PMAM 2015},
      booktitle = {Proceedings of Programming Models and Applications on Multicores and Manycores},
      pages = {164-169}
    }
    

  13. RaftLib: A C++ template library for high performance stream parallel processing

    Introduces RaftLib, a C++ template library for stream/data-flow processing that brings streaming optimizations to legacy C/C++ code, including dynamic queue optimization, automatic parallelization, and low-overhead monitoring.

    Beard, J. C., Li, P., & Chamberlain, R. D. (2015). RaftLib: A C++ template library for high performance stream parallel processing. Proceedings of Programming Models and Applications on Multicores and Manycores, 96-105.

    @inproceedings{blc15,
      author = {Beard, Jonathan C. and Li, Peng and Chamberlain, Roger D.},
      title = {RaftLib: A {C++} template library for high performance stream parallel processing},
      publisher = {ACM},
      address = {New York, NY, USA},
      year = {2015},
      month = feb,
      series = {PMAM 2015},
      booktitle = {Proceedings of Programming Models and Applications on Multicores and Manycores},
      pages = {96-105}
    }
    

  14. Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines

    Uses SVMs to automate reliability classification of queueing models for streaming systems, reducing reliance on expert tuning; demonstrated on microbenchmarks spanning diverse queueing conditions.

    Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015). Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines. Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, 325-328.

    @inproceedings{bec15,
      author = {Beard, Jonathan C. and Epstein, Cooper and Chamberlain, Roger D.},
      title = {Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines},
      month = jan,
      year = {2015},
      booktitle = {Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering},
      series = {ICPE 2015},
      publisher = {ACM},
      address = {New York, NY, USA},
      pages = {325-328}
    }
    

  15. Use of a Levy Distribution for Modeling Best Case Execution Time Variation

    Proposes a truncated Levy distribution to characterize best-case execution-time variation in multicore systems and a parameterization method based on measurable system parameters; evaluated under Linux CFS.

    Beard, J. C., & Chamberlain, R. D. (2014). Use of a Levy Distribution for Modeling Best Case Execution Time Variation. In A. Horvath & K. Wolter (Eds.), Computer Performance Engineering (Vol. 8721, pp. 74-88). Springer International Publishing.

    @incollection{bc14a,
      year = {2014},
      month = sep,
      isbn = {978-3-319-10884-1},
      booktitle = {Computer Performance Engineering},
      volume = {8721},
      series = {Lecture Notes in Computer Science},
      editor = {Horvath, A. and Wolter, K.},
      title = {Use of a {Levy} Distribution for Modeling Best Case Execution Time Variation},
      publisher = {Springer International Publishing},
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      pages = {74-88},
      slides = {../slides/EPEW2014.pdf}
    }
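    For reference, a Levy distribution with location μ and scale c has CDF F(x) = erfc(√(c / (2(x − μ)))) for x > μ. Truncating it at an upper bound b rescales by F(b). The sketch below evaluates that truncated CDF with `std::erfc`. How the paper derives μ, c and the truncation point from measurable system parameters is not reproduced here, so all three are left to the caller. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// CDF of a Levy distribution (location mu, scale c > 0) truncated to [mu, b].
// Untruncated form: F(t) = erfc(sqrt(c / (2 (t - mu)))) for t > mu.
double truncated_levy_cdf(double x, double mu, double c, double b) {
    assert(c > 0.0 && b > mu);
    auto levy_cdf = [&](double t) {  // only called with t > mu
        return std::erfc(std::sqrt(c / (2.0 * (t - mu))));
    };
    if (x <= mu) return 0.0;
    if (x >= b) return 1.0;
    return levy_cdf(x) / levy_cdf(b);  // renormalize so the CDF reaches 1 at b
}
```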
    

  16. Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications

    Describes a simple, practical approach to modeling throughput and buffering for streaming applications deployed on heterogeneous processors, emphasizing usability amid hidden architectural complexity.

    Beard, J. C., & Chamberlain, R. D. (2013). Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications. Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 345-349.

    @inproceedings{bc13b,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications},
      booktitle = {Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems},
      month = aug,
      year = {2013},
      pages = {345-349}
    }
    

  17. Use of Simple Analytic Performance Models of Streaming Data Applications Deployed on Diverse Architectures

    Presents a lightweight analytic model (hybrid max-flow and decomposed queueing) to estimate throughput and buffering needs for streaming applications on heterogeneous hardware, validated with real and synthetic benchmarks.

    Beard, J. C., & Chamberlain, R. D. (2013). Use of Simple Analytic Performance Models of Streaming Data Applications Deployed on Diverse Architectures. Proc. of Int’l Symp. on Performance Analysis of Systems and Software, 138-139.

    @inproceedings{bc13a,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Use of Simple Analytic Performance Models of Streaming Data
      Applications Deployed on Diverse Architectures},
      booktitle = {Proc. of Int’l Symp. on Performance Analysis of Systems
      and Software},
      month = apr,
      year = {2013},
      pages = {138-139}
    }
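    The max-flow half of such a hybrid model can be sketched by encoding the application as a flow network whose edge capacities are sustainable rates in items/s. The maximum source-to-sink flow then bounds throughput. Below is a textbook Edmonds-Karp implementation. Encoding kernels as capacitated edges is an illustrative assumption, and the decomposed-queueing half of the model is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <queue>
#include <vector>

// Edmonds-Karp max-flow on a dense capacity matrix (rates in items/s).
// cap[u][v] is the capacity of edge u -> v; taken by value as the residual graph.
double max_flow(std::vector<std::vector<double>> cap, int s, int t) {
    const int n = static_cast<int>(cap.size());
    assert(s >= 0 && s < n && t >= 0 && t < n && s != t);
    double flow = 0.0;
    while (true) {
        // Breadth-first search for a shortest augmenting path in the residual graph.
        std::vector<int> parent(n, -1);
        parent[s] = s;
        std::queue<int> bfs;
        bfs.push(s);
        while (!bfs.empty() && parent[t] == -1) {
            const int u = bfs.front();
            bfs.pop();
            for (int v = 0; v < n; ++v) {
                if (parent[v] == -1 && cap[u][v] > 0.0) {
                    parent[v] = u;
                    bfs.push(v);
                }
            }
        }
        if (parent[t] == -1) break;  // no augmenting path remains
        // Bottleneck residual capacity along the path.
        double push = std::numeric_limits<double>::infinity();
        for (int v = t; v != s; v = parent[v]) push = std::min(push, cap[parent[v]][v]);
        // Forward residuals shrink, backward residuals grow.
        for (int v = t; v != s; v = parent[v]) {
            cap[parent[v]][v] -= push;
            cap[v][parent[v]] += push;
        }
        flow += push;
    }
    return flow;
}
```

    Suppose a source feeds two parallel branches at 100 and 50 items/s, and those branches drain to the sink at 40 and 80 items/s. The bound is then 40 + 50 = 90 items/s. For a linear pipeline, it reduces to the rate of the slowest stage.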
    

  18. Crossing Boundaries in TimeTrial: Monitoring Communications Across Architecturally Diverse Computing Platforms

    Introduces TimeTrial, a low-impact performance monitor for streaming applications spanning heterogeneous platforms (multicore CPUs and FPGAs). It measures cross-platform communication queues using a mix of direct measurement and modeling, validated on microbenchmarks and a Monte Carlo Laplace application.

    Lancaster, J. M., Wingbermuehle, J. G., Beard, J. C., & Chamberlain, R. D. (2011). Crossing Boundaries in TimeTrial: Monitoring Communications Across Architecturally Diverse Computing Platforms. Proc. of Ninth IEEE/IFIP Int’l Conf. on Embedded and Ubiquitous Computing, 280-287.

    @inproceedings{lancaster11b,
      author = {Lancaster, Joseph M. and Wingbermuehle, Joseph G. and Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Crossing Boundaries in {TimeTrial}: Monitoring Communications Across
      Architecturally Diverse Computing Platforms},
      booktitle = {Proc. of Ninth IEEE/IFIP Int’l Conf. on Embedded and Ubiquitous
      Computing},
      month = oct,
      year = {2011},
      pages = {280-287}
    }