WebRTC is an industry and standards effort to bring real-time communication capabilities to all browsers and to make these capabilities accessible to software developers via standard HTML5 and JavaScript APIs. WebRTC fills a critical gap in web technologies by allowing the browser (a) to access native devices (e.g., microphone, webcam) through a JavaScript API and (b) to share the captured streams using browser-to-browser real-time communication. WebRTC also provides data sharing.
We are investigating issues in WebRTC behavior and performance. To this end we have developed a benchmark suite, WebRTCBench. The goal of WebRTCBench is to provide a quantitative comparison of WebRTC implementations across browsers and devices (i.e., hardware platforms). WebRTC accomplishes three main tasks: acquiring audio and video, communicating audio and video, and communicating arbitrary data. These tasks map one-to-one to three main JavaScript APIs: MediaStream (i.e., getUserMedia), RTCPeerConnection, and RTCDataChannel. Hence, a quantitative assessment of WebRTC implementations across browsers and devices is performed by collecting performance measurements of MediaStream, RTCPeerConnection, and RTCDataChannel. Because a MediaStream contains one or more media stream tracks (e.g., webcam and microphone), WebRTCBench allows the definition of MediaStreams composed of video, audio, data, and any combination thereof. Likewise, a single peer connection with a media server and multiple peer connections between browsers are supported in a WebRTC triangle.
The current version of the benchmark can be used via http://core7.uci.edu

Supported by the Intel Corp.

Compiler and performance optimization using similarity analysis

Maintaining and improving program performance in the multi-core era requires a large engineering effort (e.g., a large number of time consuming trials & tests). It involves finding, as efficiently as possible, a combination of attributes that characterize (i.e., expose similarity in) algorithmic optimizations, compiler optimizations, execution environment settings and hardware configurations to reach certain performance goals. This problem currently is not solved in its entirety because the number of attributes involved is very large. Thus programs are optimized based on a very limited number of such attributes at a time.
This project investigates how to construct empirical performance models that provide program performance prediction across system configurations, where the term system includes the development environment, e.g., compilers, libraries and their settings, and execution environment, e.g., operating system, run-time environment, hardware and their settings. Predictions are required to be sufficiently accurate to reduce engineering effort (e.g., by replacing trials & tests with predictions).
Specifically, the following issues are investigated: (i) the definition of two types of signatures, feature-aware and feature-agnostic respectively, to characterize programs and/or systems; (ii) techniques to expose and analyze the structure of similarity induced by a given type of signature on programs and systems; and (iii) techniques that leverage (learn from) such a structure of similarity and suggest ways for optimizing both serial and parallel programs.
Feature-aware program signatures are constructed from a subset of hardware performance counters. The counters are chosen for the specific performance task at hand, e.g., system evaluation or selection of compiler heuristics. Feature-agnostic program signatures are constructed from a collection of completion times for some combinations of (program, system). For this type of signature, the characterization of program and system is combined. Techniques that leverage this type of performance modeling can be applied to characterizing and comparing hardware configurations, compilers, run-time environments, and even all of the above combined.
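As an illustration only, the following minimal sketch shows one way a feature-agnostic signature could be used for prediction. It is not the project's actual method. The program names, the completion times, the choice of cosine similarity, and the nearest-neighbor scaling rule are all assumptions made for this example.

```python
import math

def normalize(sig):
    """Scale a completion-time signature to unit length so only its shape matters."""
    norm = math.sqrt(sum(t * t for t in sig))
    return [t / norm for t in sig]

def similarity(sig_x, sig_y):
    """Cosine similarity between two completion-time signatures (assumed metric)."""
    return sum(x * y for x, y in zip(normalize(sig_x), normalize(sig_y)))

def predict(target_sig, known_sigs, new_system_times):
    """Predict the target's completion time on a new system.

    target_sig:       target program's times on the already-measured systems
    known_sigs:       dict name -> times on the same already-measured systems
    new_system_times: dict name -> measured time on the new system
    Returns (most similar program, predicted time for the target).
    """
    best = max(known_sigs, key=lambda name: similarity(target_sig, known_sigs[name]))
    # Assumed rule: scale the neighbor's time by the ratio of total measured times.
    scale = sum(target_sig) / sum(known_sigs[best])
    return best, scale * new_system_times[best]

# Hypothetical data: completion times (seconds) on three measured systems.
known = {"prog_a": [10.0, 20.0, 40.0], "prog_c": [30.0, 12.0, 9.0]}
on_new_system = {"prog_a": 80.0, "prog_c": 6.0}
neighbor, estimate = predict([8.0, 16.0, 32.0], known, on_new_system)
```

Here the target program scales across systems in the same way as prog_a, so prog_a is selected as its neighbor and its time on the new system is rescaled to estimate the target's time, replacing one trial with a prediction.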

Supported by the National Science Foundation

Improving single core performance via compiler-assisted out-of-order commit

The growth in uniprocessor (single core) performance resulting from improvements in semiconductor technology has recently slowed down significantly. Sequential applications or sequential portions of parallel applications require further advances to improve their performance. Today's out-of-order (OOO) processors complete instructions in program order, which is a major performance bottleneck because any long-latency instruction, such as a memory access, delays the completion of all subsequent instructions. This project aims to achieve higher single core performance by defining a new, compiler-assisted mechanism for out-of-order instruction completion. It investigates how compile-time program knowledge can be passed to the hardware and used to simplify the architectural checks required for such out-of-order completion. The architecture of a standard processor will be fully preserved and legacy software can execute without modification.
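The bottleneck described above can be illustrated with a toy model. This is a minimal sketch, not the project's mechanism: the finish times are hypothetical, and the out-of-order case is idealized by assuming every instruction is safe to complete as soon as it finishes executing.

```python
def in_order_commit_times(finish_times):
    """An instruction commits only once it and all earlier instructions have finished."""
    commits = []
    latest = 0
    for t in finish_times:
        latest = max(latest, t)
        commits.append(latest)
    return commits

def out_of_order_commit_times(finish_times):
    """Idealized assumption: each instruction commits as soon as it finishes."""
    return list(finish_times)

def total_commit_delay(finish_times, commit_times):
    """Cycles instructions spend finished but not yet committed."""
    return sum(c - f for f, c in zip(finish_times, commit_times))

# Hypothetical finish cycles; the second instruction is a long-latency memory access.
finish = [1, 100, 3, 4, 5]
in_order = in_order_commit_times(finish)
delay_in_order = total_commit_delay(finish, in_order)
delay_ooo = total_commit_delay(finish, out_of_order_commit_times(finish))
```

In this model the three instructions after the memory access finish early but cannot commit until cycle 100, so they hold their resources for the whole latency of the access.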

Supported by the National Science Foundation

Cache-Aware Synchronization and Scheduling of Data-Parallel Programs for Multi-Core Processors

Multi-core (parallel) processors have become ubiquitous. The use of such systems is key to science, engineering, finance, and other major areas of the economy. However, increased application performance on such systems can only be achieved with advances in mapping applications to multi-core machines. This task is made more difficult by the presence of complex memory organizations, which are perhaps the key bottleneck to efficient execution and which have not been addressed effectively. This research involves making the mapping of the program to the machine aware of the complexities of the memory hierarchy in all phases of the compilation process. This will ensure a good fit between the application code and the actual machine and thereby enable much more effective utilization of the hardware (and thus efficient, fast execution) than was previously possible.
Multi-cores can benefit from a new cache-hierarchy-aware compilation and runtime system (i.e., one that covers compilation, scheduling, and static/dynamic processor mapping of parallel programs). These tasks have one thing in common: they all need accurate estimates of data element (iteration, task) computation and memory access times, which are currently beyond the (cache-oblivious) state of the art. This research thus develops new techniques for iteration space partitioning, scheduling, and synchronization that capture the variability due to cache, memory, and conditional statement behavior and their interaction.
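To show why per-iteration time estimates matter for partitioning, here is a minimal sketch that compares an equal-count split of a loop with a split balanced on estimated cost. The cost values and the greedy contiguous splitting rule are assumptions made for this example and are not the techniques developed in the project.

```python
def equal_count_partition(costs, workers):
    """Cache-oblivious baseline: give each worker the same number of iterations."""
    n = len(costs)
    bounds = [i * n // workers for i in range(workers + 1)]
    return [costs[bounds[i]:bounds[i + 1]] for i in range(workers)]

def cost_balanced_partition(costs, workers):
    """Greedy contiguous split: close a chunk once it reaches the per-worker cost target."""
    target = sum(costs) / workers
    chunks, current, acc = [], [], 0.0
    for c in costs:
        current.append(c)
        acc += c
        # Keep the last chunk open so that no iteration is left unassigned.
        if acc >= target and len(chunks) < workers - 1:
            chunks.append(current)
            current, acc = [], 0.0
    chunks.append(current)
    while len(chunks) < workers:
        chunks.append([])
    return chunks

def makespan(chunks):
    """Completion time of the slowest worker."""
    return max(sum(chunk) for chunk in chunks)

# Hypothetical per-iteration time estimates; two iterations are slow (e.g., cache misses).
costs = [1, 1, 1, 1, 8, 8, 1, 1]
baseline = makespan(equal_count_partition(costs, 2))
balanced = makespan(cost_balanced_partition(costs, 2))
```

With these estimates the equal-count split puts both slow iterations on the same worker, while the cost-balanced split separates them and shortens the slowest worker's completion time.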

Supported by the National Science Foundation

Acceleration of neural simulations

We are collaborating with scientists who study and model how the human brain performs certain tasks, e.g., vision. Computer simulation of such models is extremely compute-bound. We are looking at parallel, application-specific, or custom architectures to accelerate such computations. Preliminary experience with FPGA-based, Cell, GPU, and parallel architectures is very encouraging.

Reducing Power Consumption in Processors and Systems

Power dissipation is a major issue in designing new processors and systems. In particular, CMOS technology scaling has significantly increased leakage power dissipation, so that it accounts for an increasingly large share of processor power dissipation. One of the main issues is how to achieve power savings without loss of performance.
Much of our work in this area has focused on cache power dissipation. We addressed issues in L1 I- and D-cache dynamic as well as static power consumption. This included way caching to save static and dynamic power in high-associativity caches (as an alternative to way prediction), a cached load-store queue as a low-cost alternative to an L0 cache, and the use of branch prediction information to save power in instruction caches. We also addressed L2 power consumption, in particular leakage power in L2 peripheral circuits. The results of this research are applicable to both embedded and high-performance processors.
Another aspect of this research is low-power instruction queue design for out-of-order processors. CAM-based instruction queues are not scalable and consume a significant amount of power because of wide issue and a CAM search on every cycle. One approach we proposed used a banked queue, dividing the CAM into smaller banks with faster search; a pointer table indicates which bank an instruction belongs to. A more complex approach disposed of the CAM-based queue altogether and used instruction dependence pointers and a RAM-based queue for "direct" wakeup. It solved the problem of how to achieve fast branch misprediction recovery when using dependence pointers.
We have investigated the problem of power consumption in the register file. A content-aware register file used knowledge of instruction operand and effective address width to reduce the number of bits read from the RF and to speed up TLB access using an "L0 TLB". This type of register file was also shown to enable a new type of clustered processor with improved performance and reduced power.
Leakage in peripheral circuits of SRAM-based units is a major contributor to overall power dissipation as well as temperature increases. We have developed a number of circuit techniques using sleep transistors to reduce this leakage as well as architectural techniques to control the application of leakage reduction techniques.
Finally, we studied power consumption in the main memory (DRAM) of embedded systems. For certain types of embedded systems and applications, this is as important a component of overall power as the processor itself. We proposed ways to reduce power consumption in DRAMs by utilizing buffering, delayed writes, and prefetching techniques.

Supported by the National Science Foundation and DARPA

Past Projects

Speeding up Mobile Code Execution on Resource-Constrained Embedded Processors
Supported by the National Science Foundation

Compiler-Controlled Continuous Power-Performance Management
Supported by DARPA

Adaptive Memory Reconfiguration & Management
Supported by DARPA