Overview

The next two topics are about my research as a PhD student at Georgia Tech and as a Master's student at CMU. The primary focus at GT was on preparing systems software for the invasion by heterogeneous many-cores! My internships at Intel Labs, HP Labs, and IBM Research were instrumental in this work, as were the efforts at Georgia Tech under my advisor, Prof. Karsten Schwan. Hybrid Virtual Machines (HyVM) is an umbrella research effort at Georgia Tech, and the components described under it are the portions I have worked on or contributed to.

HyVM: Hybrid Virtual Machines

A challenge posed by future computer architectures is the efficient exploitation of their many, and often diverse, computational cores, with examples ranging from graphics processors, to IBM's Cell processor, to I/O-centric accelerators sharing chip space with general purpose computational cores like the Cell's PowerPC cores. This challenge is exacerbated by the diverse facilities for data movement and data sharing across cores resident on such platforms, which range from cache-level methods for data sharing, to non-coherent shared memory with DMA support, to PCI-based connectivity for I/O. The key technical challenge addressed by our work is the potentially serious mismatch of computational vs. memory or I/O bandwidths present on future platforms. Toward that end, we are pursuing the course of study described in the projects below.

GViM: GPU-accelerated Virtual Machines

The use of virtualization to abstract underlying hardware can aid in sharing hardware resources and in efficiently managing their use by high performance applications. Unfortunately, virtualization also prevents efficient access to accelerators, such as Graphics Processing Units (GPUs), that have become critical components in the design and architecture of HPC systems. Supporting General Purpose computing on GPUs (GPGPU) with accelerators from different vendors presents significant challenges due to proprietary programming models, heterogeneity, and the need to share accelerator resources between different Virtual Machines (VMs).

To address this problem, this paper presents GViM, a system designed for virtualizing and managing the resources of a general purpose system accelerated by graphics processors. Using the NVIDIA GPU as an example, we discuss how such accelerators can be virtualized without additional hardware support and describe the basic extensions needed for resource management. Our evaluation with a Xen-based implementation of GViM demonstrates efficiency and flexibility in system usage, coupled with only small performance penalties for the virtualized vs. non-virtualized solutions.
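
At the core of this approach is API interposition in the guest: the application links against a wrapper library that marshals each GPU runtime call and forwards it to a backend in the management domain, which holds the real driver. The C sketch below illustrates only that split-driver shape; the names (gpu_request, ring_submit, wrapped_mem_alloc) are illustrative stand-ins rather than GViM's actual interfaces, and the guest/host transport is mocked in-process.

    /* Sketch of API interposition in a split-driver style: the guest-side
     * wrapper marshals a GPU runtime call into a request record and ships
     * it to a backend in the management domain.  All names here are
     * illustrative, not GViM's actual interfaces. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum gpu_op { OP_MEM_ALLOC, OP_MEM_FREE, OP_LAUNCH };

    struct gpu_request {
        enum gpu_op op;      /* which runtime call is being forwarded      */
        uint64_t    arg;     /* size for allocations, handle for frees     */
        uint64_t    result;  /* opaque handle filled in by the backend     */
    };

    /* Stand-in for the guest/host transport (a shared channel in a real
     * split-driver design).  Here the "backend" just runs in-process. */
    static int ring_submit(struct gpu_request *req)
    {
        switch (req->op) {
        case OP_MEM_ALLOC:
            req->result = (uint64_t)(uintptr_t)malloc(req->arg); /* fake device memory */
            return req->result ? 0 : -1;
        case OP_MEM_FREE:
            free((void *)(uintptr_t)req->arg);
            return 0;
        default:
            return -1;
        }
    }

    /* Frontend stub the application links against instead of the vendor
     * runtime: same call shape, but the work happens on the other side. */
    static int wrapped_mem_alloc(uint64_t *dev_ptr, uint64_t bytes)
    {
        struct gpu_request req = { .op = OP_MEM_ALLOC, .arg = bytes };
        if (ring_submit(&req) != 0)
            return -1;
        *dev_ptr = req.result;   /* opaque handle, only meaningful to the backend */
        return 0;
    }

    int main(void)
    {
        uint64_t handle = 0;
        if (wrapped_mem_alloc(&handle, 1 << 20) == 0)
            printf("allocated 1 MiB behind handle %llx\n", (unsigned long long)handle);
        struct gpu_request rel = { .op = OP_MEM_FREE, .arg = handle };
        ring_submit(&rel);
        return 0;
    }

Keeping the wrapper's call signatures identical to the runtime the application expects is what allows unmodified guest applications to use the accelerator without additional hardware support.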

Pegasus: Coordinated Scheduling in Virtualized Accelerator-based Platforms

Heterogeneous multi-cores -- platforms composed of both general purpose and accelerator cores -- are becoming increasingly common. While applications wish to freely utilize all cores present on such platforms, operating systems continue to view accelerators as specialized devices. The Pegasus system, developed on top of GViM, uses an alternative approach that offers a uniform usage model for all cores on heterogeneous chip multiprocessors. Operating at the hypervisor level, its novel scheduling methods fairly and efficiently share accelerators across multiple virtual machines, thereby making accelerators first-class schedulable entities of choice for many-core applications. Using the NVIDIA GPU coupled with IA-based general purpose host cores, a Xen-based implementation of Pegasus demonstrates improved performance for applications by better managing combined platform resources. With moderate virtualization penalties, performance improvements range from 18% to 140% over base GPU driver scheduling when accelerators are shared.
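
To make the scheduling idea concrete, the sketch below shows one simple way an accelerator scheduler could arbitrate among per-VM request queues: each VM carries a credit weight, and the backlogged VM with the most remaining credit is dispatched next. This is a minimal illustration under assumed data structures (vm_queue, pick_next); Pegasus's actual policies are richer and coordinate the accelerator schedule with the hypervisor's scheduling of the general purpose cores.

    /* Minimal sketch of sharing an accelerator across VMs: each VM has a
     * queue of pending GPU requests and a credit weight, and the scheduler
     * repeatedly picks the backlogged VM with the most remaining credit.
     * The structures and the credit policy are illustrative only. */
    #include <stdio.h>

    #define NVMS 3

    struct vm_queue {
        const char *name;
        int pending;   /* queued GPU requests from this VM        */
        int credits;   /* share of accelerator time still owed    */
    };

    /* Pick the backlogged VM with the most credits; refill on exhaustion. */
    static int pick_next(struct vm_queue *vms, int n, const int *weights)
    {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (vms[i].pending > 0 &&
                (best < 0 || vms[i].credits > vms[best].credits))
                best = i;
        if (best >= 0 && vms[best].credits <= 0)
            for (int i = 0; i < n; i++)    /* all credits spent: refill */
                vms[i].credits += weights[i];
        return best;
    }

    int main(void)
    {
        struct vm_queue vms[NVMS] = {
            { "vm0", 4, 2 }, { "vm1", 2, 1 }, { "vm2", 3, 1 },
        };
        const int weights[NVMS] = { 2, 1, 1 };

        int next;
        while ((next = pick_next(vms, NVMS, weights)) >= 0) {
            printf("dispatch one request from %s\n", vms[next].name);
            vms[next].pending--;   /* hand the request to the GPU backend */
            vms[next].credits--;   /* charge for the accelerator time     */
        }
        return 0;
    }
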

Shadowfax: Dynamically Composed GPGPU Assemblies

GPGPUs have proven to be advantageous for increasing application scalability in both the HPC and enterprise domains. This has resulted in a growing array of programming languages and a widening range of physical compute capabilities in current hardware. Yet application scalability and portability remain limited, both by the degree of customization applications require and by the physical limits on the number and composition of devices a single compute node can contain. This research defines the notion of a GPGPU assembly for CUDA applications resident in Xen virtual machines on high-performance clusters, presenting to applications a set of GPGPUs as locally-available devices that best match their needs, easing programmability and portability. We characterize workloads to best match them with available GPGPUs. Techniques such as API interposition, function marshalling and batching, as well as dynamic binary instrumentation (future work) enable global scheduling policies, admission control, and dynamic retargeting of execution streams. This paper presents the initial idea of GPU assemblies.
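
The function marshalling and batching mentioned above can be pictured as follows: interposed calls are appended to a batch buffer and flushed to whichever node hosts the GPU in one message, either when the buffer fills or when a synchronous call needs its result. The sketch below shows only that shape; the record layout, flush threshold, and function names are assumptions for illustration, not Shadowfax's actual protocol.

    /* Sketch of call batching: interposed GPU API calls are appended to a
     * buffer and flushed to a (possibly remote) GPGPU in one message,
     * amortizing per-call transport cost.  Layout and threshold are
     * illustrative only. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BATCH_CAP 8

    struct call_record {
        uint32_t op;        /* interposed API call identifier   */
        uint64_t args[4];   /* marshalled scalar arguments      */
    };

    struct batch {
        struct call_record calls[BATCH_CAP];
        int count;
    };

    /* Stand-in for sending the batch to whichever node hosts the GPU. */
    static void flush_batch(struct batch *b)
    {
        if (b->count == 0)
            return;
        printf("flushing %d marshalled calls in one message\n", b->count);
        b->count = 0;
    }

    /* Interposition point: record the call; flush only when the batch is
     * full or the caller needs the result immediately (synchronous). */
    static void enqueue_call(struct batch *b, uint32_t op,
                             const uint64_t args[4], int synchronous)
    {
        b->calls[b->count].op = op;
        memcpy(b->calls[b->count].args, args, 4 * sizeof(uint64_t));
        b->count++;
        if (b->count == BATCH_CAP || synchronous)
            flush_batch(b);
    }

    int main(void)
    {
        struct batch b = { .count = 0 };
        uint64_t a[4] = { 0 };
        for (int i = 0; i < 10; i++)
            enqueue_call(&b, 1 /* e.g. an async copy */, a, 0);
        enqueue_call(&b, 2 /* e.g. a blocking sync */, a, 1);
        return 0;
    }
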

Since this project started as an extension to the GViM system I developed, I am primarily playing a mentor role: helping a) define the project requirements, b) define important interfaces and their associated interactions, c) pick suitable protocols for distributed communication, and d) provide input on various resource management policies. The primary student leads for this project are Alexander Merritt and Naila Farooqui, with help from Abhishek Verma and Manohar Karlapalem. The instructors for the project are Ada Gavrilovska and Karsten Schwan. The larger goal for the project is to develop and deploy software on the Keeneland Delivery System. Additional people involved in the larger effort are Magda Slawinska, Jeffrey Vetter, Sudhakar Yalamanchili, and the Ocelot team.

Montage: Scheduling and Resource Management in Heterogeneous Many-core Systems

Details to follow.

Other people involved in the effort are Niraj Tolia, Vanish Talwar and Partha Ranganathan from HP Labs; Rob Knauerhase, Paul Brett and Scott Hahn from Intel Labs.

Cellule: Lightweight Virtualization of Accelerators

Initial steps in this research have focused on the efficient use of accelerators, using IBM's Cell BE processor as the key platform addressed by this work. Here, experiences with running the Linux operating system on the Power core of the Cell processor have shown that this core is less efficient than general purpose cores in hosting a full-fledged operating system. In part, this is because the Power core was principally designed to be a "service processor" responsible for coordinating the Cell's SPEs. The first challenge faced by our research, then, has been to make efficient use of this service processor in order to exploit Cell as a remote accelerator utilized by one or more general purpose machines, with hardware configurations like those in the Roadrunner project. More generally, this research is investigating the opportunities presented by combining the concepts of virtualization and accelerators to simplify the Cell execution model and to enable its effective utilization by applications running on general purpose machines.

The first technical outcome of the proposed research is the "Cellule" execution environment. This is a virtualized Cell BE-based system that hosts a small, high performance execution environment, called the Special Execution Environment (SEE), on the hypervisor to run SPE applications. To realize this environment, we have ported IBM's research hypervisor (rHype) to work on the Cell board, created wrappers for libSPE, the standard interface used by Cell applications, and facilitated the creation of the SEE to run the application. The SEE can be compared to a real-time OS environment that has exactly the elements necessary for any libSPE-based application to run. Initial experimental results have shown that the Cellule environment offers performance at least as good as, or better than, Linux, which has encouraged our endeavors in this direction.
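
The wrappers mentioned above keep the call shape that libSPE applications expect (create a context, load an SPE program, run it and wait) while forwarding the work to the SEE. The C sketch below illustrates only that forwarding shape; the names (spe_job, see_submit, wrapped_create) are hypothetical and the SEE transport is mocked, whereas the real wrappers sit behind libSPE's actual entry points and communicate with the Cell-hosted environment.

    /* Illustrative sketch of the wrapper idea: the host-side library keeps
     * the create/run/wait shape applications expect but forwards each step
     * to the SEE on the Cell board.  All names here are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    struct spe_job {
        const char *program;   /* SPE program image to hand to the SEE    */
        uint64_t    arg;       /* argument passed to the SPE entry point  */
        int         finished;  /* set once the SEE reports completion     */
    };

    /* Stand-in transport to the SEE; a real implementation would issue a
     * hypercall or message to the Cell-hosted execution environment. */
    static void see_submit(struct spe_job *job)
    {
        printf("SEE: loading %s, arg=%llu\n", job->program,
               (unsigned long long)job->arg);
        job->finished = 1;     /* pretend the SPE program ran to completion */
    }

    /* Wrapper entry points mirroring the create/run/wait call shape that
     * libSPE-style applications are written against. */
    static struct spe_job wrapped_create(const char *program, uint64_t arg)
    {
        struct spe_job job = { program, arg, 0 };
        return job;
    }

    static int wrapped_run_and_wait(struct spe_job *job)
    {
        see_submit(job);
        return job->finished ? 0 : -1;
    }

    int main(void)
    {
        struct spe_job job = wrapped_create("spe_kernel.elf", 42);
        return wrapped_run_and_wait(&job);
    }
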

Other people involved in the effort are Jimi Xenidis and Dilma Da Silva from IBM Austin and IBM TJ Watson Research Center. The details of the work can be found in the paper and poster here.

Area Driven Pervasive Computing Applications

The past few years have witnessed exponential growth in the number of handhelds, accompanied by a tremendous increase in applications and services developed around the constraints and flexibilities of mobile devices. A common characteristic shared by these applications and services is that the user must initiate some action in order to use their results; hardly any of them proactively initiate communication with the user. The applications considered in this thesis diverge from these traditional applications and services because the environment itself is made smart: an application specifies the areas in which it wants to track users and takes suitable action when a user is detected in such an area. While much research has focused on developing service architectures for location-aware systems, less attention has been paid to the fundamental and challenging problem of letting an application define the physical areas in which it wants mobile users tracked, especially in in-building environments. The goal is to determine with high probability when a user is in an area of interest to the application. In this paper we present an infrastructure for area-driven applications that enables them to specify area-based user tracking requirements and achieve the intended purpose. We make use of the existing infrastructure of wireless access points to determine the area where a user is, and we also study the improvement in area granularity that can be achieved by using history information.
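
As a simplified illustration of the access-point-based area check, the sketch below treats an application-defined area as the set of access points that cover it and declares a user inside the area when enough of the access points heard by the handheld belong to that set. The data structures, identifiers, and threshold are assumptions for illustration; the actual infrastructure, and its use of history information to refine granularity, is more involved.

    /* Minimal sketch, under assumed data structures, of an area check:
     * an application registers an area as the set of access points that
     * cover it, and a user is considered "in" the area when enough of the
     * access points currently heard belong to that set. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_APS 8

    struct area {
        const char *name;
        const char *aps[MAX_APS];   /* access points covering the area */
        int         nap;
    };

    /* Fraction of heard APs that belong to the area; 1.0 = perfect match. */
    static double area_match(const struct area *a,
                             const char *heard[], int nheard)
    {
        int hits = 0;
        for (int i = 0; i < nheard; i++)
            for (int j = 0; j < a->nap; j++)
                if (strcmp(heard[i], a->aps[j]) == 0) {
                    hits++;
                    break;
                }
        return nheard ? (double)hits / nheard : 0.0;
    }

    int main(void)
    {
        struct area lab = { "lab-2nd-floor",
                            { "ap-21", "ap-22", "ap-23" }, 3 };
        const char *heard[] = { "ap-22", "ap-23", "ap-40" };  /* current scan */

        double score = area_match(&lab, heard, 3);
        if (score >= 0.5)              /* threshold chosen for illustration */
            printf("user detected in %s (match %.2f), notify application\n",
                   lab.name, score);
        return 0;
    }
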

Here is the final presentation.