7.1. Static Compilation Framework for Energy-aware compiler driven I/O Optimizations

Base 1/4

The prevailing part of applications in modern high performance computing systems are data-intensive scientific computations. Thus, a considerable amount of energy is consumed by the I/O (Input/Output) subsystem performing I/O operations on terabytes of data. For better performance, each data file is split into stripes placed on distinct disks of a large array of disks. We have a complex systems consisting of many disks. The energy consumed by each disk depends on its rotational speed: the disk power consumption is quadratically proportional to the disk rotational speed. Thus, if we can perform some I/O operations at lower rate without degrading the performance of the overall computation, then we have opportunity for energy savings.

High performance computing applications are dominated by input-independent behaviour : The sequence of operations (also of I/O operations) does not depend on the values of input data. A compiler can effectively determine at compile time I/O system parameters:

disk speeds
data layouts
disk data prefetching

Code analysis 2/4

Consider the following code fragment:

for \( i=0 \) to \(N-1\) do
- for \( j=0 \) to \( M-1 \) do
  - \( \ldots V_1[i][j] \ldots \)

The two-dimensional array \( V_1 \) is assumed to be striped over many disks.

Let \(T_d\) denote the time (number of cycles) needed to read a single block \(D_i\) of data.
Let \(T_c\) denote the time (number of cycles) needed to process a single block \(D_i\) of data.

If we do not apply any prefetching, then we have to wait \( T_d \) cycles for reading each block of data before processing it for the next \( T_c\) cycles. This is illustrated on the following timing diagram:

Since I/O operations can be performed concurrently with CPU operations, the time efficiency of the algorithm can be increased by appropriate inserting data-prefetching instructions at compilation time. Our example code would be transformed by the compiler as follows:

for \(i=0\) to \(N-1\) do
- \(PF(\& V_1[i][0])\); /* prolog */
- for \(jj=0\) to \(M-1-d\) step \(b_1\) do
  - /* steady-state */
  - \(PF(\& V_1[i][jj+d])\);
  - for \(j=jj\) to \(jj+b_1\) do
    - \(\ldots V_1[i][j] \ldots\)
- for \(j=M-d\) to \(M-1\) do /* epilog */
  - \(\ldots V_1[i][j] \ldots\)

Here, the prefetch distance \(d\) (the time needed to cover the I/O latency, measured in iterations) can be calculated as follows: \[ d=\left\lceil\frac{T_d}{s+T_{pf}}\right\rceil, \] where

\(T_d\) is estimated I/O latency to prefetch one block (measured in cycles),
\(T_{pf}\) is the overhead of executing a prefetch instruction (measured in cycles),
\(s \) is the number of cycles in the shortest path through the loop body.

Here, \(d\) iterations of \(j\) loop are required to hide I/O latency and \(b_1\) is the strip size used for stripe mining. Execution of this optimised code is illustrated by the following timing diagram:

So far, we have optimised the time efficiency. However, in our example, \(d\) is much lower than \(b_1\). By reducing the rotational speed of the disks we can save disk energy and increase prefetch distance without significant increase of the computation time:

Here, the lower height of the red blocks indicates the reduced power consumption of disks. We have reduced the rotational speed twice and, thus, the power consumption have been reduced four times.

Seung Woo Son and Mahmut Kandemir Proposal

Seung Woo Son and Mahmut Kandemir proposed general integrated static compilation framework that can be applied to an array-based, loop-intensive program \( \cal{P} \) that consists of \( s \) loop nests. The original code of \( \cal{P}\) is transformed with the consideration of system parameters such as:

memory capacity
disk block size
I/O latency and available rotation speed level of disks

Besides transforming of the original program, the layout of the processed arrays on groups of disks is determined.

Questions 3/4

What is the dependency between the rotational speed of disks and their power consumption?
Why could we reduce the speed of I/O devices without significant increase of overall computation time?
List some examples of algorithms with input-independent behaviour.

Bibliography 4/4

Wu-chun Feng (Editor): The Green Computing Book Tackling Energy Efficiency at Large Scale.

CRC Press 2014, Print ISBN: 978-1-4398-1987-6, eBook ISBN: 978-1-4398-1988-3.
Chapter 2. Compiler-Driven Energy Efficiency. Mahmut Kandemir, Shekhar Srikantaiah.

Seung Woo Son and Mahmut Kandemir: Energy-aware data prefetching for multi-speed disks.

In Proceedings of the 3rd Conference on Computing Frontiers, pages 105–114, Ischia, Italy, May 2006.

Projekt Cloud Computing – nowe technologie w ofercie dydaktycznej Politechniki Wrocławskiej (UDA.POKL.04.03.00-00-135/12)jest realizowany w ramach Programu Operacyjnego Kapitał Ludzki, Priorytet IV. Szkolnictwo wyższe i nauka, Działanie 4.3. Wzmocnienie potencjału dydaktycznego uczelni w obszarach kluczowych w kontekście celów Strategii Europa 2020, współfinansowanego ze środków Europejskiego Funduszu Społecznego i budżetu Państwa