Research Statement

Gordon Moore conjectured that the number of transistors on a computer chip would double every 18 months. This became known as “Moore’s Law,” which until recently has been assumed to imply that computing power will roughly double every two years (Schaller, 1997). Increasing processor frequency, and subsequently pipeline depth, led the charge to increase computing performance until the start of the 21st century. Diminishing returns due to power limitations have brought that trend to an end. The next (and current) driver of performance is the continuing miniaturization of feature sizes, which enables more compute cores to be placed on a chip. Increasing the number of cores on a chip means higher performance for parallel applications, but it also means more power is needed per unit area. With high-density compute cores comes the need to dissipate heat efficiently; this problem is known as the “Power Wall” (Esmaeilzadeh et al., 2011). Yet another wall, termed the “Memory Wall,” came about because memory access times (i.e., latency and throughput) have stagnated while compute power has doubled roughly every two years (Wulf & McKee, 1995). While general-purpose multi-core processors are excellent at multi-purpose computation, increasingly efficient special-purpose processors are used for targeted computational tasks; examples include General Purpose Graphics Processors (GPGPUs) (Owens et al., 2007), the Xeon Phi (Jeffers & Reinders, 2013), customized embedded processors (Shalf et al., 2011), and Field Programmable Gate Arrays (FPGAs) (Villasenor & Mangione-Smith, 1997). Systems that include multiple types of processors are generally termed “heterogeneous.” Most modern compute systems are heterogeneous; each typically contains some sort of multi-core processor and graphics engine (often in the form of a system on chip).
Almost all of these systems are capable of some sort of heterogeneous parallel processing (i.e., executing cooperative code on multiple types of resources concurrently). Effectively programming high-performance applications for heterogeneous architectures is currently the domain of expert programmers. I want to change that. With the right programming paradigm and combination of techniques, safe and performant parallel programs can be authored by novices.

Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of compute kernels connected via communication links, or “streams.” Streaming languages and libraries include RaftLib (Beard et al., 2015) (my own), StreamIt (Thies et al., 2002), S-Net (Grelck et al., 2006), and others. Stream processing is increasingly used by multi-disciplinary fields with names such as computational-“x” and “x”-informatics (e.g., biology, astrophysics) where the focus is on safe and fast parallelization of a specific application. With non-expert programmers in mind, intuitive linguistic semantics are quite important to parallel systems. Linguistically, stream processing systems can take many forms. Traditionally, programmers assemble building blocks of “kernel” compute units. With RaftLib I took a mixed approach, attempting to balance C++-style semantics with the more traditional building-block approach. Unfortunately, there are currently no agreed-upon quantitative metrics for judging which programming semantics work best. One avenue of research I will explore is how best to quantitatively evaluate linguistic choices when specifying parallel programming semantics.
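The kernel-and-stream structure described above can be sketched in a few lines of C++. This is a generic illustration of the data-flow model, not RaftLib’s actual API; the `Stream` class and kernel bodies below are my own stand-ins. The point is that each kernel is ordinary sequential code, while the stream carries all inter-kernel communication, so the two kernels can safely run on separate threads.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A minimal stream connecting two compute kernels (illustrative only).
template <typename T>
class Stream {
public:
    void push(T v) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(std::move(v));
        cv_.notify_all();
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        cv_.notify_all();
    }
    // Returns std::nullopt once the stream is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Two kernels: a source emitting the squares 1..n*n, and a sink
// summing them. Each kernel is sequential; the stream between them
// is the only shared state.
long run_pipeline(int n) {
    Stream<int> s;
    long sum = 0;
    std::thread source([&] {
        for (int i = 1; i <= n; ++i) s.push(i * i);
        s.close();
    });
    std::thread sink([&] {
        while (auto v = s.pop()) sum += *v;
    });
    source.join();
    sink.join();
    return sum;
}
```

A real streaming run-time would add bounded buffers, multiple fan-in/fan-out ports, and automatic kernel-to-resource mapping; the sketch only shows the programming model’s shape.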

Many “big-data” applications involve real-time or latency-sensitive processing, necessitating the use of many parallel kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical work. Purely theoretical parallel programming models can fail when the assumptions implicit within the model are mismatched with reality (i.e., the model is incorrectly applied). Full empirical optimization solves this problem by extensively searching the range of configurations under native conditions; this, however, is expensive in both time and energy. My current and past research focuses on the middle ground: how to quickly apply theory, and how to know when to fall back to tried-and-true empirical evaluation.

The future of computation is in parallelism. Lowering the cost of entry to parallel computation requires a cross-layer approach, from hardware through programming languages. I want to make it simple to program effective, high-performance applications for heterogeneous parallel architectures. My approach is to automate as much as possible, using algorithms and probabilistic methods to optimize applications for novices. What follows is a bit about my motivation and a brief discussion of my research direction.

Motivation, Past and Future Research

As a biology student I loved writing small programs to calculate things and make my life easier. As a bioinformatics major and director of medical informatics for the US Army in Europe, I gained an appreciation for what “big-data” really means. Sequential execution took days for some data sets, whereas parallel code was often too expensive (i.e., increased time to author, debug, and optimize). Parallel programs are also fraught with pitfalls; naive implementations are sometimes slower than sequential ones. Massively parallel execution is clearly desirable, yet it is out of reach for most non-expert programmers. I want to put parallel tools in the hands of non-expert programmers: tools capable of producing parallel executables that are safe, fast to author, and intuitive to understand. Computing should enable discovery, not hinder it.

As computer scientists and engineers, we are expected to design and author performant parallel code. We have spent years studying classic problems like race conditions and thread-access-anomaly detection. Computer engineers and operations researchers have found solutions to many stochastic communications problems that exist in distributed systems. Researchers in other fields have likely studied none of these; if they have, it is typically at a cursory level, not at the depth necessary to efficiently solve the performance problems that frequently arise in these systems. Speed on today’s processors and application accelerators requires parallel execution to fully utilize the hardware. Multi-core and heterogeneous architectures are available on even basic desktop systems and will be part of high performance clusters for the foreseeable future. Language semantics, parallel execution theory, and performance modeling must all come together harmoniously for efficient parallel execution. My research and interests lie at the confluence of these pieces.

Most modern systems (both commodity and special purpose) are non-uniform in processor type and memory access (also known as heterogeneous systems). Extracting performance from heterogeneous systems requires a deep understanding of the interactions between hardware and software. To utilize and exploit these complex systems, automated tools and linguistic constructs must be developed to make parallel development efficient. Languages that allow sequential logic to be executed in parallel could be more effective than requiring programmers to wrangle deterministic behavior from non-deterministic constructs (e.g., Pthreads, MPI). Stream processing is a modality that enables this. My prior work with the Auto-Pipe (Franklin et al., 2006) system and current work on the C++ streaming template library RaftLib are but two examples of stream processing systems that enable parallel execution from otherwise sequentially constructed code. My work with RaftLib laid the groundwork for the study of stream parallel code mixed with legacy C++ code. As a scientist I want to quantify what makes a language or programming modality “good.” Currently, decisions about language/library semantics, and indeed adoption, are largely aesthetically based (including those within RaftLib). I want to explore approaches such as crowd-sourcing to quantify the “best” semantic constructs based upon some criteria (e.g., intuitiveness). Another avenue of research is complete language construction. New linguistic designs have the advantage of a clean slate. How should parallel execution be specified (if at all)? My current work with the Raft language suggests that there is room for significant innovation when semantically specifying parallel execution on heterogeneous systems (e.g., should paradigm-specific modalities such as CUDA or VHDL be allowed?).

Optimizing the execution of a parallel streaming run-time is extremely difficult. Communication between threads must be managed and optimized, as it is typically the largest source of overhead. Resource mapping is also critical; a compute kernel that exhibits heavy instruction-level parallelism should be assigned (or mapped) to a resource that can exploit this characteristic most effectively. If a novice or lay programmer is to construct performant parallel programs, not only must the program execute in parallel, but the resource mapping and the communications between those resources must be automatically managed. Managing communications links within a parallel streaming system is analogous to optimizing a finite-capacity queueing network. Optimizing a queueing network is often most efficiently done using a combination of stochastic queueing and other performance models. Performance models of complex systems are often computationally intensive (Dongarra et al., 2011), hard to validate, and error prone. As systems progress from the relatively simple (a single processor) to the extremely complex (k non-uniform processors), they become extremely hard to model and therefore to optimize. Essentially, a modeling wall is reached. For massively parallel systems, these decisions cannot be made fast enough with a human in the loop; deciding what to model, and how, must be done by algorithm. My current and previous work involves making fast, “good-enough” predictions for parallel streaming systems. Deciding how good and how fast those decisions must be is a complex task. Another related avenue (publications in preparation) is control-theoretic online optimization of stream processing systems.

When optimizing a hardware/software system, performance models are often used. When considering hundreds or thousands of application partitions and thousands of potential tuning parameters, designers are faced with a combinatorially intractable problem (Garey & Johnson, 1979). Modeling each of these parameters as accurately as possible is time consuming, possibly taking several weeks to model and optimize even a simple application (Padmanabhan et al., 2011). To reduce evaluation time, higher levels of abstraction are often sought. Abstractions reduce the number of modeled parameters, but how good are models that don’t even claim to capture all contributing variables? Where can the model be trusted to provide an accurate reflection of reality? Answering this is generally termed model validation and is often done through rigorous empirical measurement. Preliminary work using flow models as abstractions for a queueing network took this approach (Beard & Chamberlain, 2013). Computer systems have a disadvantage compared to many other modeled systems: simply observing the behavior can change the nature of the system. In some cases, as with supercomputers, simply running the machine to observe it is prohibitively expensive. If modeling the behavior of stochastic systems is hard under stationary conditions, what about dynamic, non-stationary ones?

Parallel processing systems, and stream processing systems in particular, must adapt to changing environments. Cloud computing is largely supplanting dedicated high performance compute clusters. The workloads that parallel systems must manage are also not static; they change from one execution to the next, and indeed within a single execution. In prior work (Beard & Chamberlain, 2014) I demonstrated that a modified Levy distribution could be used to model the “best-case” execution time variation for an application running on a multi-core processor. This demonstrated that multiple types of hardware (‘x86’, ‘ARM’, ‘PowerPC’) and schedulers exhibit a quantifiable and regular noise pattern that can be used to accept or reject performance models. In later work (Beard et al., 2015) I demonstrated that a Support Vector Machine (SVM) could distinguish similar patterns, which can be used to reliably classify stochastic queueing models into two categories: “use” and “don’t use.” Essentially, the SVM takes the place of someone trained to utilize stochastic models and performs the same selection work in microseconds instead of minutes. This enables fast online model selection, making dynamic optimization and re-optimization possible. Work currently in preparation will demonstrate this selection process at work within a meta-heuristic search, selecting an “accurate enough” model with the shortest evaluation time for each stage of an optimization process. Future work could utilize more advanced machine learning methods that learn as the application executes.
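At run time, applying a trained SVM is cheap: for a linear SVM, classification is a single dot product against the separating hyperplane, which is why selection takes microseconds rather than minutes. The sketch below shows that decision function; the feature choice, weights, and bias are illustrative placeholders of my own, not the trained parameters from the cited paper.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Decision function of a pre-trained linear SVM that classifies a
// queueing model as "use" (true) or "don't use" (false) from a small
// feature vector, e.g. {utilization, service-time CoV, arrival CoV}.
// Weights and bias are illustrative placeholders only.
constexpr std::size_t kFeatures = 3;

bool model_is_reliable(const std::array<double, kFeatures>& x) {
    constexpr std::array<double, kFeatures> w = { -4.0, -0.8, -0.6 };
    constexpr double bias = 3.0;
    // score = w . x + bias; the sign gives the side of the hyperplane.
    const double score =
        std::inner_product(w.begin(), w.end(), x.begin(), bias);
    return score > 0.0;  // positive side => model deemed reliable
}
```

With placeholder weights of this shape, a lightly loaded, low-variance system lands on the “use” side, while a highly utilized, bursty one does not; a real deployment would load weights produced by offline training.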

Modeling behavior is usually faster than directly observing behavior. At a large scale, however, even modeling becomes computationally intractable. To make a model more tractable (i.e., less computationally intensive), simpler or more abstract models are used. How do we find simpler models that work? Humans typically use intuition and experience to discover ways to simplify existing models or find new approximations. My current research is in finding simpler approximations for complex performance models. The techniques being developed are designed for online dynamic modeling, making the necessary information available quickly, while it is still relevant. These techniques can be combined with the aforementioned SVM to provide a means to properly use newly “created” models.

Integrated adaptive techniques, such as pattern matching for dynamic optimization, have been mentioned as a means of optimizing computing systems that are often too complex to model correctly. To make such a concept truly practical, minimalist instrumentation techniques must be used to mitigate the impact on the system being monitored. A real risk is that the monitoring used to inform dynamic change utilizes more compute power and bandwidth than the computation itself. To minimize the overall effects of instrumentation, I have developed a series of techniques to find areas where there is evidence that assumptions made during modeling could fail, and areas where there are probable unknowns that were not modeled at all. As an example, if a tandem queue is operating at low utilization, then most basic queueing models can be used to give a very stable result; systems operating outside of this low-utilization range require more information to make a good estimate of buffer size. To quote Rumsfeld, “…There are known unknowns. That is to say there are things that we now know we don’t know. But there are also unknown unknowns” (Rumsfeld, 2002). My current and future work will enable parallel systems to prepare for the known unknowns and ready themselves dynamically for the unknown unknowns. Low-overhead monitoring techniques have been developed that enable monitoring of non-blocking service rates, which inform everything from throughput to queue occupancy (manuscript in preparation). Current work involves getting as much information as possible out of an individual compute kernel while still viewing it as a “black box.” Future work will involve perturbing the system to discover as much about its stability as possible without actually “instrumenting” the internal workings of the compute kernel.
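The low-utilization example above can be made concrete with the simplest textbook model. A hedged sketch, assuming an M/M/1 queue (Poisson arrivals at rate λ, exponential service at rate μ, utilization ρ = λ/μ): the stationary probability of finding k or more items in the system is ρ^k, so a buffer sized for a target overflow probability ε follows in closed form. The function name and parameterization are mine, for illustration; real streams rarely satisfy M/M/1 assumptions at high utilization, which is exactly when more measurement is needed.

```cpp
#include <cmath>
#include <stdexcept>

// Smallest buffer size k such that the M/M/1 stationary tail
// probability P(N >= k) = rho^k falls at or below epsilon:
//   k = ceil( ln(epsilon) / ln(rho) ).
// Valid only for a stable queue (0 < rho < 1) and 0 < epsilon < 1.
int buffer_size_mm1(double lambda, double mu, double epsilon) {
    const double rho = lambda / mu;
    if (rho <= 0.0 || rho >= 1.0 || epsilon <= 0.0 || epsilon >= 1.0)
        throw std::invalid_argument("need 0 < rho < 1 and 0 < epsilon < 1");
    return static_cast<int>(std::ceil(std::log(epsilon) / std::log(rho)));
}
```

At ρ = 0.5 a buffer of 10 slots already drives the overflow probability below 0.1%, while at ρ = 0.9 the same target needs several times that; the estimate degrades gracefully until the exponential assumptions themselves break down.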

Models are inherently an imperfect reflection of reality. Complex systems (such as parallel systems on heterogeneous and distributed hardware) have many hidden interactions that may not be immediately apparent to even expert programmers. Long-running and shared systems may have environments that fluctuate widely. Ultimately, the optimization of a program is completed with imperfect models and the application is executed. At this point an optimization system and its associated models have laid the best possible plan for execution; indeed, they have made the best decisions with the information they had (in the Candide sense (Voltaire, 1991)). Unfortunately, “no plan survives contact with the enemy” (Moltke, 1893), the enemy in this case being the environment in which the application of interest is executing (e.g., processors might have internal changes that make one particular piece of code execute slower, or caching effects due to the nature of a particular workload might reduce expected latency between threads). To this end, the parallel systems I envision must be adaptively optimized. Work in preparation uses meta-data from the optimization process to bootstrap online adaptation driven by the environment. This has the potential to produce not only more robust and fault-tolerant systems but also more performant ones.

As high performance computing moves into the cloud era, one thing is clear: security and data compression must be integral to the next generation of parallel systems. Current streaming systems typically don’t implement compression of their streams; however, experience with operating systems suggests that it can be done efficiently. My current research into dynamic stream optimization can be applied to online decisions about whether or not to compress data within a queue, potentially leading to increased cache hit rates and higher overall memory sub-system performance. Security must also be considered. When computing on a cloud system, the data transmitted through a TCP stack could potentially travel over non-private networks. For computation on any type of confidential information, securing the computation itself is essential to gaining widespread acceptance of stream processing (and, in the case of healthcare or financial data, to meeting regulatory standards). Which algorithms are best for securing the streams within a streaming application? Can some mathematical operations be performed on the data while it is compressed and/or encrypted (e.g., homomorphic encryption (Micciancio, 2010)), and is this efficient for stream processing applications? Security and compression within stream processing systems are two topics I hope to explore in the near future.

The era of “big-data” is here to stay. Indeed, what is considered “big-data” today will undoubtedly pale in comparison to what will face us in the coming decades. By and large, the data we have drives innovation, either directly or indirectly. How innovators process that data will directly affect the rate of innovation in society. I want to be at the forefront of providing tools for simple, safe, and performant parallel execution for today’s and tomorrow’s “big-data” research.

References

  1. Beard, J. C., Li, P., & Chamberlain, R. D. (2015, February). RaftLib: A C++ template library for high performance stream parallel processing. Proceedings of Programming Models and Applications on Multicores and Manycores.
  2. Beard, J. C., Epstein, C., & Chamberlain, R. D. (2015, January). Automated Reliability Classification of Queueing Models for Streaming Computation using Support Vector Machines. Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering.
  3. Beard, J. C., & Chamberlain, R. D. (2014). Use of a Levy Distribution for Modeling Best Case Execution Time Variation. In A. Horváth & K. Wolter (Eds.), Computer Performance Engineering (Vol. 8721, pp. 74–88). Springer International Publishing. https://doi.org/10.1007/978-3-319-10885-8_6
  4. Beard, J. C., & Chamberlain, R. D. (2013). Analysis of a Simple Approach to Modeling Performance for Streaming Data Applications. Proc. of IEEE Int’l Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 345–349.

  5. Jeffers, J., & Reinders, J. (2013). Intel Xeon Phi Coprocessor High-Performance Programming. Elsevier Science. http://books.google.com/books?id=KJORYTHOxbEC
  6. Padmanabhan, S., Chen, Y., & Chamberlain, R. D. (2011). Optimal Design-space Exploration of Streaming Applications. Proc. IEEE Int’l Conf. Application-Specific Systems, Architectures and Processors, 227–230. https://doi.org/10.1109/ASAP.2011.6043274
  7. Shalf, J., Quinlan, D., & Janssen, C. (2011). Rethinking Hardware-Software Codesign for Exascale Systems. Computer, 44(11), 22–30.
  8. Esmaeilzadeh, H., Blem, E., St Amant, R., Sankaralingam, K., & Burger, D. (2011). Dark silicon and the end of multicore scaling. Proc. of 38th Int’l Symp. on Computer Architecture, 365–376.
  9. Dongarra, J., Beckman, P., Moore, T., Aerts, P., Aloisio, G., Andre, J. C., Barkai, D., Berthou, J. Y., Boku, T., Braunschweig, B., & others. (2011). The International Exascale Software Project Roadmap. International Journal of High Performance Computing Applications, 25(1), 3–60.
  10. Micciancio, D. (2010). A First Glimpse of Cryptography’s Holy Grail. Commun. ACM, 53(3), 96–96. https://doi.org/10.1145/1666420.1666445
  11. Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., & Purcell, T. J. (2007). A Survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1), 80–113.
  12. Franklin, M. A., Tyson, E. J., Buckley, J., Crowley, P., & Maschmeyer, J. (2006, April). Auto-Pipe and the X language: A pipeline design tool and description language. Proc. of Int’l Parallel and Distributed Processing Symp.
  13. Grelck, C., Scholz, S.-B., & Shafarenko, A. (2006). S-Net: A Typed Stream Processing Language. Proc. of 18th Int’l Symp. on Implementation and Application of Functional Languages, 81–97.
  14. Rumsfeld, D. (2002). Secretary Rumsfeld Press Conference at NATO Headquarters, Brussels, Belgium. United States Department of Defense.
  15. Thies, W., Karczmarek, M., & Amarasinghe, S. (2002). StreamIt: A Language for Streaming Applications. In R. Horspool (Ed.), Proc. of Int’l Conf. on Compiler Construction (Vol. 2304, pp. 49–84).
  16. Villasenor, J., & Mangione-Smith, W. H. (1997). Configurable computing. Scientific American, 276(6), 54–59.
  17. Schaller, R. R. (1997). Moore’s law: past, present and future. IEEE Spectrum, 34(6), 52–59.
  18. Wulf, W. A., & McKee, S. A. (1995). Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1), 20–24.
  19. Voltaire, F.-M. A. (1991). Candide. 1759. Trans, by H. Morley. London: George Routledge.
  20. Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability (Vol. 174). Freeman San Francisco, CA.
  21. Moltke, H. (1893). Militärische werke (Vol. 3). ES Mittler und sohn.