Publications

  1. RaftLib: A C++ template library for high performance stream parallel processing

    Stream processing is a compute paradigm that has been around for decades, yet until recently has failed to garner the same attention as mainstream languages and libraries (e.g., C++, OpenMP, MPI). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism to process huge volumes of streaming data. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of extant legacy code. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations, while providing a platform for the exploration of integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling programmers to utilize the robust C++ standard library, and other legacy code, along with RaftLib’s parallelization framework. RaftLib supports several online optimization techniques: dynamic queue optimization, automatic parallelization, and real-time low overhead performance monitoring.

    Beard, J. C., Li, P., & Chamberlain, R. D. (2016). RaftLib: A C++ template library for high performance stream parallel processing. International Journal of High Performance Computing Applications. https://doi.org/10.1177/1094342016672542
    @article{blc16,
      author = {Beard, Jonathan C and Li, Peng and Chamberlain, Roger D},
      title = {RaftLib: A C++ template library for high performance stream parallel processing},
      year = {2016},
      doi = {10.1177/1094342016672542},
      eprint = {http://hpc.sagepub.com/content/early/2016/10/18/1094342016672542.full.pdf+html},
      journal = {International Journal of High Performance Computing Applications}
    }
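
    The kernel-and-queue structure described above can be illustrated with a minimal sketch: independent compute kernels connected by FIFO streams, each runnable in parallel. This is plain Python for illustration only, not RaftLib’s C++ API; the `source`, `square`, and `sink` kernels are invented stand-ins for user code.

```python
# Minimal stream-processing sketch: three kernels joined by FIFO queues,
# each running in its own thread. A None sentinel marks end-of-stream.
import queue
import threading

def source(out_q, n):
    """Produce n integers, then an end-of-stream sentinel."""
    for i in range(n):
        out_q.put(i)
    out_q.put(None)

def square(in_q, out_q):
    """Transform kernel: square each item, forward the sentinel."""
    while (item := in_q.get()) is not None:
        out_q.put(item * item)
    out_q.put(None)

def sink(in_q, results):
    """Consume items into a list until the sentinel arrives."""
    while (item := in_q.get()) is not None:
        results.append(item)

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [
    threading.Thread(target=source, args=(q1, 5)),
    threading.Thread(target=square, args=(q1, q2)),
    threading.Thread(target=sink, args=(q2, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9, 16]
```

    The queues are the only points of coupling between kernels, which is what makes optimizations like the dynamic queue resizing mentioned above possible without touching kernel code.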
    

  2. Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines

    When do you trust a performance model? More specifically, when can a particular model be used for a specific application? Once a stochastic model is selected, its parameters must be determined. This involves instrumentation, data collection, and finally interpretation, all of which are very time consuming. Even when done correctly, the results hold only for the conditions under which the system was characterized. For modern, dynamic stream processing systems, this is far too slow if a model-based approach to performance tuning is to be considered. This work demonstrates the use of a Support Vector Machine (SVM) to determine whether a stochastic queueing model is usable for a particular queueing station within a streaming application. When combined with methods for online service rate approximation, our SVM approach can select models while the application is executing (online). The method is tested on a variety of hardware and software platforms. The technique is shown to be highly effective for determining the applicability of M/M/1 and M/D/1 queueing models to stream processing applications.

    Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015). Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines. In Proceedings of Euro-Par 2015 Parallel Processing (pp. 82-93). Springer.
    @inproceedings{bec15b,
      title = {Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines},
      author = {Beard, Jonathan C. and Epstein, Cooper and Chamberlain, Roger D.},
      booktitle = {Proceedings of Euro-Par 2015 Parallel Processing},
      year = {2015},
      month = aug,
      pages = {82-93},
      publisher = {Springer}
    }
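
    The two candidate models named above have standard closed forms, and the gap between them is worth making concrete: at the same utilization, M/D/1 predicts exactly half the M/M/1 mean queueing delay, so picking the wrong model materially skews any tuning decision. A sketch using the textbook formulas (the numeric rates are illustrative, not from the paper):

```python
def mm1_wq(lam, mu):
    """Mean waiting time in queue for M/M/1 (Poisson arrivals,
    exponential service): Wq = rho / (mu - lam), with rho = lam / mu."""
    assert lam < mu, "queue must be stable (lam < mu)"
    rho = lam / mu
    return rho / (mu - lam)

def md1_wq(lam, mu):
    """Mean waiting time in queue for M/D/1 (deterministic service),
    from the Pollaczek-Khinchine formula: Wq = rho / (2 mu (1 - rho)),
    i.e. exactly half the M/M/1 wait at the same utilization."""
    assert lam < mu, "queue must be stable (lam < mu)"
    rho = lam / mu
    return rho / (2 * mu * (1 - rho))

# At 80% utilization (lam = 8 items/s, mu = 10 items/s):
print(mm1_wq(8.0, 10.0))  # 0.4 s in queue
print(md1_wq(8.0, 10.0))  # ~0.2 s in queue
```

    A classifier that decides which (if either) of these formulas applies to a live queueing station is what the paper’s SVM provides.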
    

  3. Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Stream processing is a compute paradigm that promises safe and efficient parallelism. Its realization requires optimization of multiple compute kernels and communications links. Most techniques to optimize these use queueing network models or network flow models, which require estimates of the execution rate of each compute kernel. What we want to know is how fast each kernel can process input data. This is known as the “service rate” of the kernel within the queueing literature. Current approaches to divining service rates are static. Modern workloads, however, are often dynamic. It is therefore desirable to continuously re-estimate kernel service rates and re-tune an application during run time in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need to rely on steady state predictions for what are likely non-steady state phenomena.

    Beard, J. C., & Chamberlain, R. D. (2015). Run Time Approximation of Non-blocking Service Rates for Streaming Systems. In Proceedings of the 17th IEEE International Conference on High Performance and Communications (pp. 792-797). IEEE. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.64
    @inproceedings{bc15f,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      booktitle = {Proceedings of the 17th IEEE International Conference on High Performance and Communications},
      title = {Run Time Approximation of Non-blocking Service Rates for Streaming Systems},
      year = {2015},
      pages = {792-797},
      month = aug,
      publisher = {IEEE},
      doi = {10.1109/HPCC-CSS-ICESS.2015.64},
      slides = {../slides/hpcc2015_public.pdf}
    }
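
    The core idea — measure a kernel’s rate only while it is neither starved of input nor blocked on output — can be sketched simply. This is a simplified illustration, not the paper’s actual algorithm; the `ServiceRateEstimator` class and its smoothing factor are invented for the example:

```python
# Sketch of online non-blocking service rate estimation: discard
# monitoring windows in which the kernel blocked (those measure the
# channel, not the kernel) and smooth the rest with an EWMA.
class ServiceRateEstimator:
    def __init__(self, alpha=0.2):
        self.alpha = alpha  # EWMA smoothing factor (assumed value)
        self.rate = None    # estimated service rate, items per second

    def observe(self, items, seconds, blocked):
        """Record one monitoring window of `items` processed in
        `seconds`. Blocked windows are rejected outright."""
        if blocked or seconds <= 0:
            return
        sample = items / seconds
        if self.rate is None:
            self.rate = sample
        else:
            self.rate = self.alpha * sample + (1 - self.alpha) * self.rate

est = ServiceRateEstimator()
est.observe(items=100, seconds=0.1, blocked=False)  # 1000 items/s
est.observe(items=50,  seconds=0.1, blocked=True)   # ignored: blocked
est.observe(items=120, seconds=0.1, blocked=False)  # 1200 items/s
print(round(est.rate))  # 1040
```

    The paper’s contribution is making such estimates reliable under the many conditions (e.g., persistent blocking) where naive windowed sampling fails.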
    

  4. Online Modeling and Tuning of Parallel Stream Processing Systems

    Writing performant computer programs is hard. Code for high performance applications is profiled, tweaked, and re-factored for months, specifically for the hardware on which it will run. Consumer application code doesn’t receive the endless massaging that high performance code does, even though heterogeneous processor environments are beginning to resemble those in more performance oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) that is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high performance code and brings the benefit of performance tuning to consumer applications where it would otherwise be cost prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning.

    Beard, J. C. (2015, August). Online Modeling and Tuning of Parallel Stream Processing Systems (PhD thesis). Department of Computer Science and Engineering, Washington University in St. Louis.
    @phdthesis{beardthesis,
      author = {Beard, Jonathan C.},
      title = {Online Modeling and Tuning of Parallel Stream Processing Systems},
      school = {Department of Computer Science and Engineering, Washington University
      in St. Louis},
      month = aug,
      year = {2015},
      link = {http://www.jonathanbeard.io//pdf/beard-thesis.pdf}
    }
    

  5. Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Stream processing is a compute paradigm that promises safe and efficient parallelism. Modern big-data problems are often well suited for stream processing’s throughput-oriented nature. Realization of efficient stream processing requires monitoring and optimization of multiple communications links. Most techniques to optimize these links use queueing network models or network flow models, which require some idea of the actual execution rate of each independent compute kernel within the system. What we want to know is how fast each kernel can process data, independent of other communicating kernels. This is known as the “service rate” of the kernel within the queueing literature. Current approaches to divining service rates are static. Modern workloads, however, are often dynamic. Shared cloud systems also present applications with highly dynamic execution environments (multiple users, hardware migration, etc.). It is therefore desirable to continuously re-tune an application during run time (online) in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need for reliance on steady state predictions for what are probably non-steady state phenomena. First, some of the difficulties associated with online service rate determination are examined. Second, the algorithm to approximate the online non-blocking service rate is described. Lastly, the algorithm is implemented within the open source RaftLib framework for validation using a simple microbenchmark as well as two full streaming applications.

    Beard, J. C., & Chamberlain, R. D. (2015). Run Time Approximation of Non-blocking Service Rates for Streaming Systems. arXiv preprint arXiv:1504.00591v2.
    @article{bc15b,
      title = {Run Time Approximation of Non-blocking Service Rates for Streaming Systems},
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      journal = {arXiv preprint arXiv:1504.00591v2},
      year = {2015},
      month = apr,
      link = {http://arxiv.org/pdf/1504.00591v2}
    }
    

  6. Deadlock-free Buffer Configuration for Stream Computing

    Stream computing is a popular paradigm for parallel and distributed computing, which features computing nodes connected by first-in first-out (FIFO) data channels. To increase the efficiency of communication links and boost application throughput, output buffers are often sized at some multiple of the minimum required. However, the connection between the configuration of output buffers and application deadlocks has not been studied. In this paper, we show that a bad configuration of output buffers can lead to application deadlock. We prove a necessary and sufficient condition for deadlock-free buffer configurations. We also propose an efficient method, based on all-pair shortest path algorithms, to detect unsafe buffer configurations, and provide a method to adjust an unsafe buffer configuration to a safe one.

    Li, P., Beard, J. C., & Buhler, J. (2015). Deadlock-free Buffer Configuration for Stream Computing. In Proceedings of Programming Models and Applications on Multicores and Manycores (pp. 164-169). New York, NY, USA: ACM.
    @inproceedings{lbb15,
      author = {Li, Peng and Beard, Jonathan C. and Buhler, Jeremy},
      title = {Deadlock-free Buffer Configuration for Stream Computing},
      publisher = {ACM},
      address = {New York, NY, USA},
      year = {2015},
      month = feb,
      series = {PMAM 2015},
      booktitle = {Proceedings of Programming Models and Applications on Multicores and Manycores},
      pages = {164-169}
    }
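
    The detection method is built on all-pair shortest paths. A sketch of Floyd-Warshall, the standard building block, follows; the paper’s actual safety condition on buffer sizes is not reproduced here, and the edge weights in the toy graph are purely illustrative:

```python
# Floyd-Warshall all-pair shortest paths over a small directed graph.
# In the paper's setting, such distances over the stream graph feed
# the check for unsafe output-buffer configurations.
INF = float("inf")

def floyd_warshall(n, edges):
    """edges: dict {(u, v): weight}. Returns an n x n distance matrix."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Toy stream graph: 0 -> 1 -> 2 with a feedback edge 2 -> 0.
d = floyd_warshall(3, {(0, 1): 4, (1, 2): 1, (2, 0): 2})
print(d[0][2])  # 5  (path 0 -> 1 -> 2)
print(d[2][1])  # 6  (path 2 -> 0 -> 1)
```

    Running in O(n³) over the (typically small) kernel graph, this keeps the safety check cheap enough to apply whenever buffer sizes change.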
    

  7. RaftLib: A C++ template library for high performance stream parallel processing

    Stream processing or data-flow programming is a compute paradigm that has been around for decades in many forms, yet has failed to garner the same attention as other mainstream languages and libraries (e.g., C++ or OpenMP). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of legacy code that exists in the wild. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations while providing a platform for the exploration of integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling end users to utilize the robust C++ standard library along with RaftLib’s pipeline parallel framework. RaftLib supports dynamic queue optimization, automatic parallelization, and real-time low overhead performance monitoring.

    Beard, J. C., Li, P., & Chamberlain, R. D. (2015). RaftLib: A C++ template library for high performance stream parallel processing. In Proceedings of Programming Models and Applications on Multicores and Manycores (pp. 96-105). New York, NY, USA: ACM.
    @inproceedings{blc15,
      author = {Beard, Jonathan C. and Li, Peng and Chamberlain, Roger D.},
      title = {RaftLib: A {C++} template library for high performance stream parallel processing},
      publisher = {ACM},
      address = {New York, NY, USA},
      year = {2015},
      month = feb,
      series = {PMAM 2015},
      booktitle = {Proceedings of Programming Models and Applications on Multicores and Manycores},
      pages = {96-105}
    }
    

  8. Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines

    When do you trust a model? More specifically, when can a model be used for a specific application? This question often takes years of experience and specialized knowledge to answer correctly. Once this knowledge is acquired, it must be applied to each application. This involves instrumentation, data collection, and finally interpretation. We propose the use of a trained Support Vector Machine (SVM) to give an automated system the ability to make an educated guess as to model applicability. We demonstrate a proof-of-concept that trains an SVM to correctly determine whether a particular queueing model is suitable for a specific queue within a streaming system. The SVM is demonstrated using a micro-benchmark to simulate a wide variety of queueing conditions.

    Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015). Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines. In Proceedings of the 6th ACM/SPEC international conference on Performance engineering (pp. 325-328). New York, NY, USA: ACM.
    @inproceedings{bec15,
      author = {Beard, Jonathan C. and Epstein, Cooper and Chamberlain, Roger D.},
      title = {Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines},
      month = jan,
      year = {2015},
      booktitle = {Proceedings of the 6th ACM/SPEC international conference on Performance engineering},
      series = {ICPE 2015},
      publisher = {ACM},
      address = {New York, NY, USA},
      pages = {325-328}
    }
    

  9. Use of a Levy Distribution for Modeling Best Case Execution Time Variation

    Minor variations in execution time can lead to outsized effects on the behavior of an application as a whole. There are many sources of such variation within modern multi-core computer systems. For an otherwise deterministic application, we would expect the execution time variation to be non-existent (effectively zero). Unfortunately, this expectation is in error. For instance, variance in the realized execution time tends to increase as the number of processes per compute core increases. Recognizing that characterizing the exact variation or the maximal variation might be a futile task, we take a different approach, focusing instead on the best case variation. We propose a modified (truncated) Levy distribution to characterize this variation. Using empirical sampling, we also derive a model to parametrize this distribution that doesn’t require expensive distribution fitting, relying only on known parameters of the system. The distributional assumptions and parametrization model are evaluated on multi-core systems with the common Linux completely fair scheduler.

    Beard, J. C., & Chamberlain, R. D. (2014). Use of a Levy Distribution for Modeling Best Case Execution Time Variation. In A. Horvath & K. Wolter (Eds.), Computer Performance Engineering (Vol. 8721, pp. 74-88). Springer International Publishing.
    @incollection{bc14a,
      year = {2014},
      month = sep,
      isbn = {978-3-319-10884-1},
      booktitle = {Computer Performance Engineering},
      volume = {8721},
      series = {Lecture Notes in Computer Science},
      editor = {Horvath, A. and Wolter, K.},
      title = {Use of a {Levy} Distribution for Modeling Best Case Execution Time Variation},
      publisher = {Springer International Publishing},
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      pages = {74-88},
      slides = {../slides/EPEW2014.pdf}
    }
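
    Sampling from a truncated Levy distribution can be sketched with the standard inverse-CDF construction for the Levy law plus rejection for the truncation. The parameter values below are illustrative, not the paper’s fitted values, and the truncation-by-rejection scheme is a generic technique rather than the paper’s parametrization model:

```python
# Draw from a Levy(mu, c) distribution truncated at a cutoff, using
# the classic construction X = mu + c / z^2 with z = Phi^-1(1 - U/2).
import random
from statistics import NormalDist

_PHI_INV = NormalDist().inv_cdf  # standard normal inverse CDF (3.8+)

def levy_sample(mu, c, rng):
    """One draw from Levy(mu, c) via inverse transform."""
    u = rng.random()
    z = _PHI_INV(1 - u / 2)
    return mu + c / (z * z)

def truncated_levy_sample(mu, c, cutoff, rng):
    """Reject draws above the truncation point, mirroring the paper's
    use of a finite-tail (truncated) Levy model."""
    while True:
        x = levy_sample(mu, c, rng)
        if x <= cutoff:
            return x

rng = random.Random(42)
xs = [truncated_levy_sample(mu=1.0, c=0.05, cutoff=3.0, rng=rng)
      for _ in range(1000)]
print(min(xs) >= 1.0 and max(xs) <= 3.0)  # True
```

    The heavy right tail of the untruncated Levy law is why a best-case (left-edge) characterization is tractable while a worst-case one is not.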
    

  10. Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications

    Current state-of-the-art systems contain various types of multicore processors, including General Purpose Graphics Processing Units (GPGPUs) and occasionally Digital Signal Processors (DSPs) or Field-Programmable Gate Arrays (FPGAs). With heterogeneity come multiple abstraction layers that hide underlying complexity. While necessary to ease programmability of these systems, this hidden complexity makes quantitative performance modeling a difficult task. This paper outlines a computationally simple approach to modeling the overall throughput and buffering needs of a streaming application deployed on heterogeneous hardware.

    Beard, J. C., & Chamberlain, R. D. (2013). Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications. In Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (pp. 345-349).
    @inproceedings{bc13b,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications},
      booktitle = {Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems},
      month = aug,
      year = {2013},
      pages = {345-349}
    }
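
    A flavor of what “computationally simple” means here: a textbook first-order model (not necessarily the paper’s exact formulation) bounds a linear pipeline’s steady-state throughput by its slowest stage, and sizes buffers from the rate mismatch during a burst:

```python
def pipeline_throughput(service_rates):
    """Upper bound on steady-state items/s for a linear pipeline:
    the bottleneck (minimum) stage rate."""
    return min(service_rates)

def buffer_for_burst(upstream_rate, downstream_rate, burst_seconds):
    """Items that accumulate when a faster producer bursts into a
    slower consumer for burst_seconds (back-of-envelope sizing)."""
    return max(0.0, (upstream_rate - downstream_rate) * burst_seconds)

# Three stages at 1200, 800, and 1000 items/s: the 800 items/s stage
# caps the whole pipeline, and a half-second burst from the first
# stage needs roughly 200 slots of buffering in front of it.
print(pipeline_throughput([1200.0, 800.0, 1000.0]))  # 800.0
print(buffer_for_burst(1200.0, 800.0, 0.5))          # 200.0
```

    Models of this cost class can be re-evaluated continuously at run time, which is the point of favoring them over expensive simulation.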
    

  11. Use of Simple Analytic Performance Models of Streaming Data Applications Deployed on Diverse Architectures

    Modern hardware is often heterogeneous. With heterogeneity come multiple abstraction layers that hide underlying complex systems. This complexity makes quantitative performance modeling a difficult task. Designers of high-performance streaming applications for heterogeneous systems must contend with unpredictable and often non-generalizable models to predict the performance of a particular application and hardware mapping. This paper outlines a computationally simple approach that can be used to model the overall throughput and buffering needs of a streaming application on heterogeneous hardware. The model presented is based upon a hybrid maximum flow and decomposed discrete queueing model. The utility of the model is assessed using a set of real and synthetic benchmarks, with model predictions compared to measured application performance.

    Beard, J. C., & Chamberlain, R. D. (2013). Use of Simple Analytic Performance Models of Streaming Data Applications Deployed on Diverse Architectures. In Proc. of Int’l Symp. on Performance Analysis of Systems and Software (pp. 138-139).
    @inproceedings{bc13a,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Use of Simple Analytic Performance Models of Streaming Data
      Applications Deployed on Diverse Architectures},
      booktitle = {Proc. of Int’l Symp. on Performance Analysis of Systems
      and Software},
      month = apr,
      year = {2013},
      pages = {138-139}
    }
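
    The maximum-flow half of the hybrid model can be sketched with a compact Edmonds-Karp implementation. The queueing decomposition is omitted, and the toy graph’s capacities (standing in for per-channel throughput limits) are invented for the example:

```python
# Edmonds-Karp max flow: repeatedly augment along BFS-shortest paths
# in the residual graph until the sink is unreachable.
from collections import deque

def max_flow(n, cap, s, t):
    """cap: dict {(u, v): capacity}. Returns the max s -> t flow."""
    residual = {}
    adj = [[] for _ in range(n)]
    for (u, v), c in cap.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge for residual graph
        adj[u].append(v)
        adj[v].append(u)
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:  # BFS for an augmenting path
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t  # walk parents back to s, collect path edges
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)  # bottleneck capacity
        for (u, v) in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push

# Source 0 feeds two parallel kernels (1, 2) merging into sink 3.
print(max_flow(4, {(0, 1): 5, (0, 2): 3, (1, 3): 4, (2, 3): 4}, 0, 3))  # 7
```

    In a throughput model of this kind, the value of the maximum flow upper-bounds the data rate the kernel graph can sustain end to end.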
    

  12. Crossing Boundaries in TimeTrial: Monitoring Communications Across Architecturally Diverse Computing Platforms

    TimeTrial is a low-impact performance monitor that supports streaming data applications deployed on a variety of architecturally diverse computational platforms, including multicore processors and field-programmable gate arrays. Communication between resources in architecturally diverse systems is frequently a limitation to overall application performance. Understanding these bottlenecks is crucial to understanding overall application performance. Direct measurement of inter-resource communications channel occupancy is not readily achievable without significantly impacting performance of the application itself. Here, we present TimeTrial’s approach to monitoring those queues that cross platform boundaries. Since the approach includes a combination of direct measurement and modeling, we also describe circumstances under which the model can be shown to be inappropriate. Examples with several micro-benchmark applications (for which the true measurement is known) and an application that uses Monte Carlo techniques to solve Laplace’s equation are used for illustrative purposes.

    Lancaster, J. M., Wingbermuehle, J. G., Beard, J. C., & Chamberlain, R. D. (2011). Crossing Boundaries in TimeTrial: Monitoring Communications Across Architecturally Diverse Computing Platforms. In Proc. of Ninth IEEE/IFIP Int’l Conf. on Embedded and Ubiquitous Computing (pp. 280-287).
    @inproceedings{lancaster11b,
      author = {Lancaster, Joseph M. and Wingbermuehle, Joseph G. and Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Crossing Boundaries in {TimeTrial}: Monitoring Communications Across
      Architecturally Diverse Computing Platforms},
      booktitle = {Proc. of Ninth IEEE/IFIP Int’l Conf. on Embedded and Ubiquitous
      Computing},
      month = oct,
      year = {2011},
      pages = {280-287}
    }
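
    The general idea of inferring occupancy that cannot be measured directly can be sketched in a toy form. This is not TimeTrial’s actual method (which combines measurement with modeling across platform boundaries); the counter-sampling scheme below is a generic illustration:

```python
# Reconstruct a queue occupancy trace from periodically sampled
# cumulative enqueue/dequeue counters: occupancy at each sample
# instant is the running difference of the two counters.
def occupancy_trace(enq_samples, deq_samples):
    """enq_samples/deq_samples: cumulative counts at each sample time."""
    return [e - d for e, d in zip(enq_samples, deq_samples)]

# Four sample instants: the queue fills to 4, then 5, then drains.
print(occupancy_trace([0, 10, 25, 40], [0, 6, 20, 40]))  # [0, 4, 5, 0]
```

    Sampling counters rather than timestamping every enqueue/dequeue is what keeps the monitoring impact low, at the cost of only seeing occupancy at sample boundaries.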