Open Source Software (things I lead)

  • RaftLib - a runtime for heterogeneous data-flow/streaming processing using a C++ DSL; it was also the subject of my thesis. It’s Apache 2.0 licensed, so it’s free to use for pretty much everything.

Interviews

Talks (just talks, not papers)

Organized Workshops and Conferences

In the News

Research Publications

  1. The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload

    Heterogeneous architectures have arisen as a well-suited approach for the post-Moore era. Among them, architectures that integrate programmable accelerators in or near memory are gaining popularity due to the potential advantages of reduced data movement. Such near-memory accelerators benefit from launching a large number of fine-grain tasks to hide memory latency while exploiting bandwidth gains. This requires low-overhead and portable mechanisms for interfacing with accelerators. If not managed carefully, the hard and soft costs of host and accelerator interactions, such as programming and device driver overheads for actuation, context transfer, and synchronization, can severely limit acceleration benefits. We present the non-uniform compute device (NUCD) system architecture as a novel lightweight and generic accelerator offload mechanism that is tightly coupled with a general-purpose processor core. Unlike conventional offload mechanisms that rely primarily on device drivers and software queues, the NUCD system architecture extends a host core's microarchitecture to enable low-latency, out-of-order task offload to heterogeneous devices. Results demonstrate that the NUCD system architecture can achieve an average performance improvement of 21%-128% over a conventional driver-based offload mechanism. This in turn enables whole new forms of fine-grain task offloading that would otherwise not see any performance benefit.

    Asri, M., Dunham, C., Rusitoru, R., Gerstlauer, A., & Beard, J. (2020). The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload.
    @inproceedings{adrgb20,
      title = {The Non-Uniform Compute Device (NUCD) Architecture for Lightweight Accelerator Offload},
      author = {Asri, Mochamad and Dunham, Curtis and Rusitoru, Roxana and Gerstlauer, Andreas and Beard, Jonathan},
      booktitle = {28th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing},
      series = {PDP2020},
      year = {2020},
      month = mar
    }
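    The break-even arithmetic behind this result can be sketched in a few lines. This is an illustrative Python model, not code from the paper; the task time, acceleration factor, and overhead values below are assumptions chosen for demonstration.

```python
def offload_speedup(task_time, accel_factor, offload_overhead):
    """Effective speedup of offloading one task: the accelerator runs the
    task accel_factor times faster, but each launch pays a fixed overhead."""
    host_time = task_time
    device_time = task_time / accel_factor + offload_overhead
    return host_time / device_time

# Fine-grain 1 microsecond task on a hypothetical 4x-faster accelerator:
# a driver-style ~5 us launch overhead swamps the gain (speedup < 1),
# while a tightly coupled ~50 ns launch preserves it (speedup > 1).
print(offload_speedup(1e-6, 4.0, 5e-6))
print(offload_speedup(1e-6, 4.0, 50e-9))
```

    The smaller the task, the more the fixed launch cost dominates, which is why lightweight offload mechanisms matter most for fine-grain tasks.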
    

  2. Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data

    The problem of efficiently feeding processing elements and finding ways to reduce data movement is pervasive in computing. Efficient modeling of both temporal and spatial locality of memory references is invaluable in identifying superfluous data movement in a given application. To this end, we present a new way to infer both spatial and temporal locality using reuse distance analysis. This is accomplished by performing reuse distance analysis at different data block granularities: specifically, 64B, 4KiB, and 2MiB sizes. This process of simultaneously observing reuse distance with multiple granularities is called multi-spectral reuse distance. This approach allows for a qualitative analysis of spatial locality, through observing the shifting of mass in an application’s reuse signature at different granularities. Furthermore, the shift of mass is empirically measured by calculating the Earth Mover’s Distance between reuse signatures of an application. From the characterization, it is possible to determine how spatially dense the memory references of an application are based on the degree to which the mass has shifted (or not shifted) and how close (or far) the Earth Mover’s Distance is to zero as the data block granularity is increased. It is also possible to determine an appropriate page size from this information, and whether or not a given page is being fully utilized. From the applications profiled, it is observed that not all applications will benefit from having a larger page size. Additionally, larger data block granularities subsuming smaller ones suggest that larger pages will allow for more spatial locality exploitation, but examining the memory footprint will show whether those larger pages are fully utilized or not.

    Cabrera, A. M., Chamberlain, R. D., & Beard, J. C. (2019, September). Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data.
    @inproceedings{ccb19,
      title = {Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data},
      author = {Cabrera, Anthony M. and Chamberlain, Roger D. and Beard, Jonathan C.},
      booktitle = {The IEEE High Performance Extreme Computing Conference 2019},
      series = {HPEC2019},
      year = {2019},
      month = sep,
      slides = {../slides/HPEC-2019-anthony.pdf}
    }
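    The core idea can be sketched in Python. This is an illustrative toy, not the paper's profiler: the trace, bin count, and granularities below are made up, and the 1-D Earth Mover's Distance is computed as the L1 distance between cumulative histograms.

```python
def reuse_distances(addresses, block_size):
    """Reuse distance per access: number of distinct blocks touched since
    the previous access to the same block (inf on first touch).
    Quadratic-time, which is fine for a small demo trace."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(addresses):
        blk = addr // block_size
        if blk in last_seen:
            between = {a // block_size for a in addresses[last_seen[blk] + 1 : i]}
            distances.append(len(between))
        else:
            distances.append(float("inf"))
        last_seen[blk] = i
    return distances

def signature(distances, bins):
    """Normalized histogram (a 'reuse signature') over finite distances."""
    finite = [d for d in distances if d != float("inf")]
    hist = [0.0] * bins
    for d in finite:
        hist[min(d, bins - 1)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms of equal mass:
    the L1 distance between their cumulative distributions."""
    cum, total = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a - b
        total += abs(cum)
    return total

# A dense strided trace: consecutive 8-byte accesses within 4 KiB pages.
trace = [page * 4096 + off * 8 for page in (0, 1, 0, 1) for off in range(4)]
sig_64b = signature(reuse_distances(trace, 64), bins=8)
sig_4k = signature(reuse_distances(trace, 4096), bins=8)
# Little mass shifts between granularities for this spatially dense trace.
print(emd_1d(sig_64b, sig_4k))
```

    A spatially sparse trace would shift mass between the 64B and 4KiB signatures, yielding a larger EMD, which is the signal the paper uses to reason about page-size suitability.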
    

  3. SPiDRE: Accelerating Sparse Memory Access Patterns

    Development in process technology has led to an exponential increase in processor speed and memory capacity. However, memory latencies have not improved as dramatically and represent a well-known problem in computer architecture. Cache memories provide more bandwidth with lower latencies than main memories but they are capacity limited. Locality-friendly applications benefit from a large and deep cache hierarchy. Nevertheless, this is a limited solution for applications suffering from sparse and irregular memory access patterns, such as data analytics. In order to accelerate them, we should maximize usable bandwidth, reduce latency and maximize moved data reuse. In this work we explore the Sparse Data Rearrange Engine (SPiDRE), a novel hardware approach to accelerate these applications through near-memory data reorganization.

    Barredo, A., Beard, J. C., & Moretó, M. (2019, September). SPiDRE: Accelerating Sparse Memory Access Patterns.
    @inproceedings{bbm19,
      title = {SPiDRE: Accelerating Sparse Memory Access Patterns},
      author = {Barredo, Adri\'an and Beard, Jonathan C. and Moret\'o, Miquel},
      booktitle = {28th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
      series = {PACT2019},
      year = {2019},
      month = sep
    }
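    As a rough illustration of why near-memory rearrangement helps, consider a Python toy model (this is not SPiDRE's actual interface; the stride, element size, and memory layout are invented for the example):

```python
def cache_lines_touched(byte_offsets, line_size=64):
    """Number of distinct cache lines covering the given byte offsets."""
    return len({off // line_size for off in byte_offsets})

def gather(memory, indices):
    """Toy model of a near-memory gather: pack sparse elements densely,
    so the core then reads a contiguous buffer instead of scattered lines."""
    return [memory[i] for i in indices]

# Sparse access pattern: one 8-byte element per 4 KiB page.
memory = list(range(1 << 15))           # word-addressed toy "memory"
indices = [i * 512 for i in range(64)]  # 4 KiB stride in 8-byte words

sparse_lines = cache_lines_touched([i * 8 for i in indices])
dense = gather(memory, indices)
dense_lines = cache_lines_touched([j * 8 for j in range(len(dense))])
# The core moves far fewer cache lines after rearrangement.
print(sparse_lines, dense_lines)
```

    The rearrangement work still happens, but near memory, where the scattered reads avoid crossing the bandwidth-constrained link to the core.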
    

  4. This Architecture Tastes Like Microarchitecture

    Instruction set architecture bridges the gap between actual implementations, or microarchitecture, and the software that runs on them. Traditionally, instruction sets were a direct reflection of the hardware resources and capabilities. The two drifted apart with the rise of CISC and its microcoded implementations. In the 1980s, the RISC movement reasserted the philosophy that the two should correspond, and that microcode was a less desirable approach. Nevertheless, time has shown that the natural tendency in industrial designs is to treat the instruction set as an abstraction. In this paper we review, with several decades of hindsight, an early RISC proposal in the form of the original MIPS architecture. While we find that the RISC movement left a legacy congruent with its philosophy, the specific techniques proposed in this seminal work were considerably more aggressive and did not succeed. In our investigation, we find that RISC’s impact on microarchitecture should be contrasted with its impact on ISA design, where a promising and underexplored approach is to specify, and therefore assume, less about how the machine works, not more. To that end, the authors review several competing ISA design proposals from others: some aligned with the idea that less detail about the machine is actually more, and others, such as transport triggered architecture, that take machine detail to the extreme.

    Dunham, C., & Beard, J. C. (2018). This Architecture Tastes Like Microarchitecture. The 2nd Workshop on Pioneering Processor Paradigms.
    @inproceedings{db18a,
      title = {This Architecture Tastes Like Microarchitecture},
      author = {Dunham, Curtis and Beard, Jonathan C},
      booktitle = {The 2nd Workshop on Pioneering Processor Paradigms},
      series = {WP3},
      year = {2018}
    }
    

  5. The Sparse Data Reduction Engine (SPiDRE): Chopping Sparse Data One Byte at a Time

    Sparse data and irregular data access patterns are hugely important to many applications, such as molecular dynamics and data analytics. Accelerating applications with these characteristics requires maximizing usable bandwidth at all levels of the memory hierarchy, reducing latency, maximizing reuse of moved data, and minimizing the amount of data moved in the first place. Many specialized data structures have evolved to meet these requisites for specific applications; however, there are no general solutions for improving the performance of sparse applications. The structure of the memory hierarchy itself conspires against general hardware for accelerating sparse applications, being designed for efficient bulk transport of data rather than one byte at a time. This paper presents a general solution for a programmable data rearrangement/reduction engine near memory to deliver bulk byte-addressable data access. The key technology presented in this paper is the Sparse Data Reduction Engine (SPiDRE), which builds on previous similar efforts to provide a practical near-memory reorganization engine. In addition to the primary contribution, this paper describes a programmer interface that enables all combinations of rearrangement, an analysis of the methodology on a small series of applications, and finally a discussion of future work.

    Beard, J. C. (2017, October). The Sparse Data Reduction Engine (SPiDRE): Chopping Sparse Data One Byte at a Time. Proceedings of the Second International Symposium on Memory Systems.
    @inproceedings{b17a,
      title = {The Sparse Data Reduction Engine (SPiDRE): Chopping Sparse Data One Byte at a Time},
      author = {Beard, Jonathan C},
      booktitle = {Proceedings of the Second International Symposium on Memory Systems},
      year = {2017},
      month = oct,
      organization = {ACM},
      slides = {../slides/memsys2017_SPiDRE_Beard.pdf}
    }
    

  6. Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore

    Most of computing research has focused on the computing technologies themselves versus how full systems make use of them (e.g., memory fabric, interconnect, software, and compute elements combined). Technologists have largely failed to look at the compute system as a whole, instead optimizing subsystems mostly in isolation. The result, for example, is that systems are built where applications can only ask for a fixed multiple of data (e.g., 64 bytes from DRAM), even if what is required is far less. This is efficient from a hardware interface perspective; however, it results in consuming valuable bandwidth that is never utilized by the core; this hidden bandwidth is effectively dark to the system. The causes of dark bandwidth are systemic, built into the very core of our virtual memory abstractions and memory interfaces. Continued focus on newer, revolutionary memory technologies to improve surface performance characteristics without a systems focus on reducing data movement will simply push the problem off onto future systems. This paper examines the problem of dark bandwidth and offers a holistic approach to reduce overall data movement within future compute systems.

    Beard, J. C., & Randall, J. (2017). Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore. Proc. High Performance Computing Post-Moore (HCPM’17).
    @article{br17a,
      title = {Eliminating Dark Bandwidth: a data-centric view of scalable, efficient performance, post-Moore},
      author = {Beard, Jonathan C and Randall, Joshua},
      booktitle = {Proc. High Performance Computing Post-Moore (HCPM'17)},
      series = {Lecture Notes in Computer Science},
      year = {2017},
      month = jun,
      slides = {../slides/beard_hcpm2017.pdf}
    }
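    The dark bandwidth arithmetic itself is simple. A short sketch, where the 64-byte line size matches the example in the abstract and the 8-byte useful payload is an assumed pointer-chasing-style access:

```python
def dark_bandwidth_fraction(useful_bytes_per_access, line_size=64):
    """Fraction of moved bytes never used by the core for one access:
    the interface always moves a full line regardless of need."""
    moved = line_size
    return (moved - useful_bytes_per_access) / moved

# One 8-byte pointer fetched per 64-byte line: 87.5% of the traffic is dark.
print(dark_bandwidth_fraction(8))
```

    Equivalently, the effective bandwidth seen by the application is peak bandwidth scaled by the useful fraction, which for this access pattern is just one eighth of what the memory system delivers.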
    

  7. RaftLib: A C++ template library for high performance stream parallel processing

    Stream processing is a compute paradigm that has been around for decades, yet until recently has failed to garner the same attention as other mainstream languages and libraries (e.g. C++, OpenMP, MPI). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism to process huge volumes of streaming data. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of extant legacy code. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations, while providing a platform for the exploration of integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling programmers to utilize the robust C++ standard library, and other legacy code, along with RaftLib’s parallelization framework. RaftLib supports several online optimization techniques: dynamic queue optimization, automatic parallelization, and real-time low overhead performance monitoring.

    Beard, J. C., Li, P., & Chamberlain, R. D. (2016). RaftLib: A C++ template library for high performance stream parallel processing. International Journal of High Performance Computing Applications. https://doi.org/10.1177/1094342016672542
    @article{blc16,
      author = {Beard, Jonathan C and Li, Peng and Chamberlain, Roger D},
      title = {RaftLib: A C++ template library for high performance stream parallel processing},
      year = {2016},
      doi = {10.1177/1094342016672542},
      eprint = {http://hpc.sagepub.com/content/early/2016/10/18/1094342016672542.full.pdf+html},
      journal = {International Journal of High Performance Computing Applications}
    }
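    RaftLib itself is C++; as a language-neutral sketch of the model it implements (compute kernels joined by bounded FIFO queues, each runnable on its own thread), here is a minimal Python analogue. This is not RaftLib's API; the kernel function, queue sizes, and pipeline below are invented for illustration.

```python
import queue
import threading

def kernel(fn, inq, outq):
    """A compute kernel: read items from the input FIFO, apply fn, write
    results downstream. A None sentinel shuts the stage down."""
    while True:
        item = inq.get()
        if item is None:
            if outq is not None:
                outq.put(None)  # propagate shutdown downstream
            return
        result = fn(item)
        if outq is not None:
            outq.put(result)

# Pipeline: source -> square -> sink, one thread per kernel.
a = queue.Queue(maxsize=4)  # bounded FIFOs, as in stream processing
b = queue.Queue(maxsize=4)
results = []
stages = [
    threading.Thread(target=kernel, args=(lambda x: x * x, a, b)),
    threading.Thread(target=kernel, args=(results.append, b, None)),
]
for t in stages:
    t.start()
for x in range(5):  # the "source" feeds the first queue
    a.put(x)
a.put(None)
for t in stages:
    t.join()
print(results)  # [0, 1, 4, 9, 16]
```

    The bounded queues are what make the model tunable: their sizes, and the number of threads per kernel, are exactly the knobs the online optimization techniques above adjust.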
    

  8. Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines

    When do you trust a performance model? More specifically, when can a particular model be used for a specific application? Once a stochastic model is selected, its parameters must be determined. This involves instrumentation, data collection, and finally interpretation, all of which are very time consuming. Even when done correctly, the results hold for only the conditions under which the system was characterized. For modern, dynamic stream processing systems, this is far too slow if a model-based approach to performance tuning is to be considered. This work demonstrates the use of a Support Vector Machine (SVM) to determine if a stochastic queueing model is usable or not for a particular queueing station within a streaming application. When combined with methods for online service rate approximation, our SVM approach can select models while the application is executing (online). The method is tested on a variety of hardware and software platforms. The technique is shown to be highly effective for determining the applicability of M/M/1 and M/D/1 queueing models to stream processing applications.

    Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015). Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines. Proceedings of Euro-Par 2015 Parallel Processing, 82-93.
    @inproceedings{bec15b,
      title = {Online Automated Reliability Classification of Queueing Models for Streaming Processing using Support Vector Machines},
      author = {Beard, Jonathan C. and Epstein, Cooper and Chamberlain, Roger D.},
      booktitle = {Proceedings of Euro-Par 2015 Parallel Processing},
      year = {2015},
      month = aug,
      pages = {82-93},
      publisher = {Springer}
    }
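    The queueing models in question have closed-form steady-state metrics, which is what makes them attractive when they do apply. For instance, the standard M/M/1 formulas (these are textbook results, not code from the paper; the arrival and service rates are example values):

```python
def mm1_metrics(lam, mu):
    """Analytic steady-state metrics for an M/M/1 queue
    (Poisson arrivals at rate lam, exponential service at rate mu)."""
    if lam >= mu:
        raise ValueError("queue is unstable when arrival rate >= service rate")
    rho = lam / mu         # utilization
    L = rho / (1.0 - rho)  # mean number in system
    W = 1.0 / (mu - lam)   # mean time in system; Little's law gives L = lam * W
    return rho, L, W

rho, L, W = mm1_metrics(lam=8.0, mu=10.0)
print(rho, L, W)  # 0.8 4.0 0.5
```

    The SVM's job in the paper is deciding whether a real queueing station behaves closely enough to these idealized assumptions for such formulas to be trusted.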
    

  9. Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Stream processing is a compute paradigm that promises safe and efficient parallelism. Its realization requires optimization of multiple compute kernels and communications links. Most techniques to optimize these use queueing network models or network flow models, which require estimates of the execution rate of each compute kernel. What we want to know is how fast each kernel can process input data. This is known as the “service rate” of the kernel within the queueing literature. Current approaches to divining service rates are static. Modern workloads, however, are often dynamic. It is therefore desirable to continuously re-estimate kernel service rates and re-tune an application during run time in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need to rely on steady state predictions for what are likely non-steady state phenomena.

    Beard, J. C., & Chamberlain, R. D. (2015). Run Time Approximation of Non-blocking Service Rates for Streaming Systems. Proceedings of the 17th IEEE International Conference on High Performance and Communications, 792-797. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.64
    @inproceedings{bc15f,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      booktitle = {Proceedings of the 17th IEEE International Conference on High Performance and Communications},
      title = {Run Time Approximation of Non-blocking Service Rates for Streaming Systems},
      year = {2015},
      pages = {792-797},
      month = aug,
      publisher = {IEEE},
      doi = {10.1109/HPCC-CSS-ICESS.2015.64},
      slides = {../slides/hpcc2015_public.pdf}
    }
    

  10. Online Modeling and Tuning of Parallel Stream Processing Systems

    Writing performant computer programs is hard. Code for high performance applications is profiled, tweaked, and refactored for months, specifically for the hardware on which it is to run. Consumer application code doesn’t get the endless massaging that benefits high performance code, even though heterogeneous processor environments are beginning to resemble those in more performance oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) that is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high performance code and brings the benefit of performance tuning to consumer applications where it would otherwise be cost prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning.

    Beard, J. C. (2015). Online Modeling and Tuning of Parallel Stream Processing Systems [PhD thesis]. Department of Computer Science and Engineering, Washington University in St. Louis.
    @phdthesis{beardthesis,
      author = {Beard, Jonathan C.},
      title = {Online Modeling and Tuning of Parallel Stream Processing Systems},
      school = {Department of Computer Science and Engineering, Washington University
      in St. Louis},
      month = aug,
      year = {2015},
      link = {http://www.jonathanbeard.io//pdf/beard-thesis.pdf}
    }
    

  11. Run Time Approximation of Non-blocking Service Rates for Streaming Systems

    Stream processing is a compute paradigm that promises safe and efficient parallelism. Modern big-data problems are often well suited for stream processing’s throughput-oriented nature. Realization of efficient stream processing requires monitoring and optimization of multiple communications links. Most techniques to optimize these links use queueing network models or network flow models, which require some idea of the actual execution rate of each independent compute kernel within the system. What we want to know is how fast each kernel can process data independently of other communicating kernels. This is known as the “service rate” of the kernel within the queueing literature. Current approaches to divining service rates are static. Modern workloads, however, are often dynamic. Shared cloud systems also present applications with highly dynamic execution environments (multiple users, hardware migration, etc.). It is therefore desirable to continuously re-tune an application during run time (online) in response to changing conditions. Our approach enables online service rate monitoring under most conditions, obviating the need to rely on steady state predictions for what are probably non-steady state phenomena. First, some of the difficulties associated with online service rate determination are examined. Second, the algorithm to approximate the online non-blocking service rate is described. Lastly, the algorithm is implemented within the open source RaftLib framework for validation using a simple microbenchmark as well as two full streaming applications.

    Beard, J. C., & Chamberlain, R. D. (2015). Run Time Approximation of Non-blocking Service Rates for Streaming Systems. ArXiv Preprint ArXiv:1504.00591v2.
    @article{bc15b,
      title = {Run Time Approximation of Non-blocking Service Rates for Streaming Systems},
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      journal = {arXiv preprint arXiv:1504.00591v2},
      year = {2015},
      month = apr,
      link = {http://arxiv.org/pdf/1504.00591v2}
    }
    

  12. Deadlock-free Buffer Configuration for Stream Computing

    Stream computing is a popular paradigm for parallel and distributed computing, which features computing nodes connected by first-in first-out (FIFO) data channels. To increase the efficiency of communication links and boost application throughput, output buffers are often sized some multiple greater than required. However, the connection between the configuration of output buffers and application deadlocks has not been studied. In this paper, we show that a bad configuration of output buffers can lead to application deadlock. We prove a necessary and sufficient condition for deadlock-free buffer configurations. We also propose an efficient method based on all-pairs shortest path algorithms to detect unsafe buffer configurations, and provide a method to adjust an unsafe buffer configuration to a safe one.

    Li, P., Beard, J. C., & Buhler, J. (2015). Deadlock-free Buffer Configuration for Stream Computing. Proceedings of Programming Models and Applications on Multicores and Manycores, 164-169.
    @inproceedings{lbb15,
      author = {Li, Peng and Beard, Jonathan C. and Buhler, Jeremy},
      title = {Deadlock-free Buffer Configuration for Stream Computing},
      publisher = {ACM},
      address = {New York, NY, USA},
      year = {2015},
      month = feb,
      series = {PMAM 2015},
      booktitle = {Proceedings of Programming Models and Applications on Multicores and Manycores},
      pages = {164-169}
    }
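    The detection method is built on all-pairs shortest paths. As an illustration of that primitive only (the deadlock-freedom condition itself is in the paper; the graph and edge weights below are invented), a minimal Floyd-Warshall sketch:

```python
def floyd_warshall(n, edges):
    """All-pairs shortest paths over n nodes; edges is a list of (u, v, w)
    directed, weighted edges. Returns the full distance matrix."""
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# A small stream graph; edge weights stand in for per-channel buffer slack.
d = floyd_warshall(3, [(0, 1, 2), (1, 2, 3), (0, 2, 10)])
print(d[0][2])  # 5: the tighter path bounds the usable slack between 0 and 2
```

    In the paper's setting, comparing such path bounds against the configured buffer sizes is what flags an unsafe configuration.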
    

  13. RaftLib: A C++ template library for high performance stream parallel processing

    Stream processing or data-flow programming is a compute paradigm that has been around for decades in many forms yet has failed to garner the same attention as other mainstream languages and libraries (e.g., C++ or OpenMP). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of legacy code that exists in the wild. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations while providing a platform for the exploration of integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling end users to utilize the robust C++ standard library along with RaftLib’s pipeline parallel framework. RaftLib supports dynamic queue optimization, automatic parallelization, and real-time low overhead performance monitoring.

    Beard, J. C., Li, P., & Chamberlain, R. D. (2015). RaftLib: A C++ template library for high performance stream parallel processing. Proceedings of Programming Models and Applications on Multicores and Manycores, 96-105.
    @inproceedings{blc15,
      author = {Beard, Jonathan C. and Li, Peng and Chamberlain, Roger D.},
      title = {RaftLib: A {C++} template library for high performance stream parallel processing},
      publisher = {ACM},
      address = {New York, NY, USA},
      year = {2015},
      month = feb,
      series = {PMAM 2015},
      booktitle = {Proceedings of Programming Models and Applications on Multicores and Manycores},
      pages = {96-105}
    }
    

  14. Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines

    When do you trust a model? More specifically, when can a model be used for a specific application? This question often takes years of experience and specialized knowledge to answer correctly. Once this knowledge is acquired it must be applied to each application. This involves instrumentation, data collection and finally interpretation. We propose the use of a trained Support Vector Machine (SVM) to give an automated system the ability to make an educated guess as to model applicability. We demonstrate a proof-of-concept that trains an SVM to correctly determine if a particular queueing model is suitable for a specific queue within a streaming system. The SVM is demonstrated using a micro-benchmark to simulate a wide variety of queueing conditions.

    Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015). Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines. Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, 325-328.
    @inproceedings{bec15,
      author = {Beard, Jonathan C. and Epstein, Cooper and Chamberlain, Roger D.},
      title = {Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines},
      month = jan,
      year = {2015},
      booktitle = {Proceedings of the 6th ACM/SPEC international conference on Performance engineering},
      series = {ICPE 2015},
      publisher = {ACM},
      address = {New York, NY, USA},
      pages = {325-328}
    }
    

  15. Use of a Levy Distribution for Modeling Best Case Execution Time Variation

    Minor variations in execution time can lead to outsized effects on the behavior of an application as a whole. There are many sources of such variation within modern multi-core computer systems. For an otherwise deterministic application, we would expect the execution time variation to be non-existent (effectively zero). Unfortunately, this expectation is in error. For instance, variance in the realized execution time tends to increase as the number of processes per compute core increases. Recognizing that characterizing the exact variation or the maximal variation might be a futile task, we take a different approach, focusing instead on the best case variation. We propose a modified (truncated) Levy distribution to characterize this variation. Using empirical sampling we also derive a model to parametrize this distribution that doesn’t require expensive distribution fitting, relying only on known parameters of the system. The distributional assumptions and parametrization model are evaluated on multi-core systems with the common Linux completely fair scheduler.

    Beard, J. C., & Chamberlain, R. D. (2014). Use of a Levy Distribution for Modeling Best Case Execution Time Variation. In A. Horvath & K. Wolter (Eds.), Computer Performance Engineering (Vol. 8721, pp. 74-88). Springer International Publishing.
    @incollection{bc14a,
      year = {2014},
      month = sep,
      isbn = {978-3-319-10884-1},
      booktitle = {Computer Performance Engineering},
      volume = {8721},
      series = {Lecture Notes in Computer Science},
      editor = {Horvath, A. and Wolter, K.},
      title = {Use of a {Levy} Distribution for Modeling Best Case Execution Time Variation},
      publisher = {Springer International Publishing},
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      pages = {74-88},
      slides = {../slides/EPEW2014.pdf}
    }
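    Sampling from a truncated Lévy distribution is straightforward via the standard identity that c/Z² with Z ~ N(0, 1) follows a Lévy(0, c) law. The sketch below is illustrative; the location, scale, and truncation values are invented, not the paper's fitted parameters.

```python
import random

def truncated_levy_sample(location, scale, cap, rng=random):
    """Draw from a Levy(location, scale) distribution via scale / Z^2 with
    Z ~ N(0, 1), truncated at `cap` to bound the heavy right tail."""
    z = rng.gauss(0.0, 1.0)
    while z == 0.0:  # guard against the (measure-zero) degenerate draw
        z = rng.gauss(0.0, 1.0)
    x = location + scale / (z * z)
    return min(x, cap)

random.seed(1)
# Model best-case execution time as deterministic work (the location)
# plus heavy-tailed system "jitter", truncated at a plausible worst case.
samples = [truncated_levy_sample(1.0e-3, 5.0e-6, cap=2.0e-3)
           for _ in range(1000)]
print(min(samples), max(samples))
```

    The heavy tail means occasional draws far above the mode, which is exactly the behavior that makes mean-based execution time summaries misleading.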
    

  16. Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications

    Current state of the art systems contain various types of multicore processors, including General Purpose Graphics Processing Units (GPGPUs) and occasionally Digital Signal Processors (DSPs) or Field-Programmable Gate Arrays (FPGAs). With heterogeneity comes multiple abstraction layers that hide underlying complexity. While necessary to ease programmability of these systems, this hidden complexity makes quantitative performance modeling a difficult task. This paper outlines a computationally simple approach to modeling the overall throughput and buffering needs of a streaming application deployed on heterogeneous hardware.

    Beard, J. C., & Chamberlain, R. D. (2013). Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications. Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 345-349.
    @inproceedings{bc13b,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications},
      booktitle = {Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems},
      month = aug,
      year = {2013},
      pages = {345-349}
    }
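    The simplest version of such a throughput model is bottleneck analysis over the pipeline's service rates. This is a sketch of the general idea, not the paper's exact model; the rates below are made-up example values.

```python
def pipeline_throughput(service_rates):
    """Steady-state throughput of a linear streaming pipeline is bounded
    by its slowest stage."""
    return min(service_rates)

def utilizations(service_rates):
    """Per-stage utilization once the bottleneck stage saturates; stages
    far below 1.0 have headroom and mostly wait on their queues."""
    t = pipeline_throughput(service_rates)
    return [t / r for r in service_rates]

rates = [120.0, 80.0, 200.0]  # items/s for each kernel in the pipeline
print(pipeline_throughput(rates), utilizations(rates))
```

    Buffering needs follow from the same numbers: queues upstream of the bottleneck fill, while those downstream stay near empty.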
    

  17. Use of Simple Analytic Performance Models of Streaming Data Applications Deployed on Diverse Architectures

    Modern hardware is often heterogeneous. With heterogeneity comes multiple abstraction layers that hide underlying complex systems. This complexity makes quantitative performance modeling a difficult task. Designers of high-performance streaming applications for heterogeneous systems must contend with unpredictable and often non-generalizable models to predict performance of a particular application and hardware mapping. This paper outlines a computationally simple approach that can be used to model the overall throughput and buffering needs of a streaming application on heterogeneous hardware. The model presented is based upon a hybrid maximum flow and decomposed discrete queueing model. The utility of the model is assessed using a set of real and synthetic benchmarks with model predictions compared to measured application performance.

    Beard, J. C., & Chamberlain, R. D. (2013). Use of Simple Analytic Performance Models of Streaming Data Applications Deployed on Diverse Architectures. Proc. of Int’l Symp. on Performance Analysis of Systems and Software, 138-139.
    @inproceedings{bc13a,
      author = {Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Use of Simple Analytic Performance Models of Streaming Data
      Applications Deployed on Diverse Architectures},
      booktitle = {Proc. of Int’l Symp. on Performance Analysis of Systems
      and Software},
      month = apr,
      year = {2013},
      pages = {138-139}
    }
    

  18. Crossing Boundaries in TimeTrial: Monitoring Communications Across Architecturally Diverse Computing Platforms

    TimeTrial is a low-impact performance monitor that supports streaming data applications deployed on a variety of architecturally diverse computational platforms, including multicore processors and field-programmable gate arrays. Communication between resources in architecturally diverse systems is frequently a limitation to overall application performance. Understanding these bottlenecks is crucial to understanding overall application performance. Direct measurement of inter-resource communications channel occupancy is not readily achievable without significantly impacting performance of the application itself. Here, we present TimeTrial’s approach to monitoring those queues that cross platform boundaries. Since the approach includes a combination of direct measurement and modeling, we also describe circumstances under which the model can be shown to be inappropriate. Examples with several micro-benchmark applications (for which the true measurement is known) and an application that uses Monte Carlo techniques to solve Laplace’s equation are used for illustrative purposes.

    Lancaster, J. M., Wingbermuehle, J. G., Beard, J. C., & Chamberlain, R. D. (2011). Crossing Boundaries in TimeTrial: Monitoring Communications Across Architecturally Diverse Computing Platforms. Proc. of Ninth IEEE/IFIP Int’l Conf. on Embedded and Ubiquitous Computing, 280-287.
    @inproceedings{lancaster11b,
      author = {Lancaster, Joseph M. and Wingbermuehle, Joseph G. and Beard, Jonathan C. and Chamberlain, Roger D.},
      title = {Crossing Boundaries in {TimeTrial}: Monitoring Communications Across
      Architecturally Diverse Computing Platforms},
      booktitle = {Proc. of Ninth IEEE/IFIP Int’l Conf. on Embedded and Ubiquitous
      Computing},
      month = oct,
      year = {2011},
      pages = {280-287}
    }