of the network, including the fabric topology. For instance, the network bandwidth and the way contention
is managed may change the way the distance within the network is expressed or measured.
Modeling the data-movement requirements of an application in terms of network traffic and I/O can be
supported through performance-analysis tools such as Scalasca [37]. It can also be done by tracing data
exchange at the runtime level with a system like OVIS [73, 85], by monitoring the messages transferred
between MPI processes, for instance. Moreover, compilers, by analyzing the code and the way arrays are
accessed, can in some cases determine the application's data-movement behavior.
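To make concrete what the message-level tracing mentioned above produces, the sketch below aggregates a trace of (source, destination, bytes) records into a rank-by-rank communication matrix. The trace format is invented for illustration; real tools such as Scalasca or OVIS expose richer records, but the principle of summarizing traffic per rank pair is the same.

```python
# Aggregate a hypothetical message trace into a communication matrix.
# Each record is (source_rank, destination_rank, bytes).

def communication_matrix(trace, num_ranks):
    """Return matrix[src][dst] = total bytes sent from src to dst."""
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for src, dst, nbytes in trace:
        matrix[src][dst] += nbytes
    return matrix

# Four ranks; ranks 0 and 1 exchange most of the traffic.
trace = [(0, 1, 1024), (1, 0, 1024), (0, 1, 4096), (2, 3, 512)]
matrix = communication_matrix(trace, 4)
```

Such a matrix is precisely the kind of input that topology-aware mapping algorithms consume.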
Resource managers or job schedulers, such as SLURM [95], OAR [16], LSF [97], or PBS [41], have the role
of allocating resources for executing applications. They feature technical differences, but they basically offer
the same set of services: reserving nodes, confining applications, executing applications in batch mode, etc.
However, none of them is able to match an application's communication requirements with the
topology of the machine and the constraints incurred by already-mapped applications.
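A minimal sketch of the missing capability, under the assumption that the scheduler knows each free node's position in a (rack, switch) hierarchy, could look like the following; the node names and two-level hierarchy are hypothetical, not a feature of any of the schedulers cited above.

```python
# Topology-aware node selection sketch: prefer groups of free nodes
# that share a switch, keeping the allocation as compact as possible.

from collections import defaultdict

def pick_nodes(free_nodes, k):
    """free_nodes: dict node_id -> (rack, switch).
    Greedily fill the request from the largest same-switch groups."""
    by_switch = defaultdict(list)
    for node, location in free_nodes.items():
        by_switch[location].append(node)
    # Largest groups first minimizes the number of switches spanned.
    groups = sorted(by_switch.values(), key=len, reverse=True)
    chosen = []
    for group in groups:
        for node in group:
            if len(chosen) == k:
                return chosen
            chosen.append(node)
    return chosen  # fewer than k free nodes were available

free = {"n1": (0, 0), "n2": (0, 0), "n3": (0, 1), "n4": (1, 0)}
compact = pick_nodes(free, 2)
```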
Parallel file systems such as Lustre [14], GPFS [83], PVFS [17], and PanFS [90] and I/O libraries such as
ROMIO [86], HDF5 [40], and Parallel netCDF [62] are responsible for organizing data on external storage
(e.g., disks) and moving data between application memory and external storage over system network(s). The
“language” used for communicating between applications and libraries and underlying parallel file systems is
the POSIX file interface [48]. This interface was designed to hide underlying topological information, and so
parallel file system developers have provided proprietary enhancements in some cases to expose information
such as the server holding particular data objects, or the method of distribution of data across storage
devices. Additionally, work above the parallel file system is beginning to uncover methods of orchestrating
I/O between applications [30]. This type of high-level coordination can assist in managing shared resources
such as network links and I/O gateways, and is complementary to an understanding of the storage data
layout itself.
7.3 Discussion
To address the locality problem at system scale, several challenges must be solved. First, scalability is a very
important cross-cutting issue, since the targets are very large-scale, high-performance computers. On the one
hand, application scalability will mostly depend on the way data is accessed and locality is managed; on
the other hand, the proposed solutions and mechanisms have to run at the same scale as the application,
and their inner decision time must therefore be very short.
Second, it is important to tackle the problem for the whole system: taking into account the whole
ecosystem of the application (e.g., storage, resource manager) and the whole architecture (i.e., from cores to
network). It is important to investigate novel approaches to control data locality system-wide, by integrating
cross-layer I/O stack mechanisms with cross-node topology-aware mechanisms.
Third, most of the time each layer of the software stack is optimized independently to address the locality
problem, with the result that the outcomes sometimes conflict. It is therefore important to observe
how the different approaches interact with each other and to propose integrated solutions that provide global
optimizations across different layers. An example of such an approach is mapping independent application
data accesses to a set of storage resources in a balanced manner, which requires an ability to interrogate
the system regarding what resources are available, some “distance” metric in terms of application processes,
and coordination across those processes (perhaps supported by a system service) to perform an appropriate
mapping. Ultimately, the validation of the models and solutions to the concerns and challenges above will
be a key challenge.
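The balanced mapping described above can be sketched as follows, assuming a hypothetical distance() function that the system would have to expose; the sketch greedily trades proximity against server load and is an illustration, not a prescribed algorithm.

```python
# Balanced mapping of processes' data accesses to storage servers:
# each process picks the server minimizing distance plus a load
# penalty. distance() and the server identifiers are hypothetical.

def map_to_storage(processes, servers, distance, alpha=1.0):
    """Greedy assignment; alpha weights load against distance."""
    load = {s: 0 for s in servers}
    mapping = {}
    for p in processes:
        best = min(servers, key=lambda s: distance(p, s) + alpha * load[s])
        mapping[p] = best
        load[best] += 1
    return mapping

# Four processes on a line, two servers at positions 0 and 4.
mapping = map_to_storage([0, 1, 2, 3], [0, 4], lambda p, s: abs(p - s))
```

The load term is what keeps nearby processes from all piling onto the same server, which a pure distance metric would cause.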
7.4 Research Plan
To solve the problems described above, we propose co-design between application communication models,
specific network structures, and algorithms for allocation and task mapping.
For instance, we currently lack tools and APIs to connect the application with the mapping of its components.
While process topologies in MPI offer a facility to communicate the application's communication
pattern to the MPI library, there is no such option for the reverse direction. Therefore, applications have
no way of customizing themselves to the underlying system architecture, even when such potential exists.
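What an application could do with such reverse information can be sketched as follows: given its own communication matrix and a queried core-distance matrix (both toy inputs here, since no standard API exposes the latter), it could reorder its processes greedily so that heavily communicating pairs land on nearby cores.

```python
# Greedy topology mapping sketch: place the most communication-heavy
# processes first, each on the free core that minimizes weighted
# distance to its already-placed peers.

def greedy_mapping(comm, dist):
    """comm[i][j]: bytes between processes i and j.
    dist[a][b]: distance between cores a and b.
    Returns a dict process -> core."""
    n = len(comm)
    placement = {}
    free_cores = list(range(n))
    order = sorted(range(n), key=lambda i: -sum(comm[i]))
    for p in order:
        def cost(core):
            return sum(comm[p][q] * dist[core][c]
                       for q, c in placement.items())
        core = min(free_cores, key=cost)
        placement[p] = core
        free_cores.remove(core)
    return placement

# Processes 0 and 1 communicate heavily; cores 0 and 1 are adjacent.
comm = [[0, 100, 1], [100, 0, 1], [1, 1, 0]]
dist = [[0, 1, 10], [1, 0, 10], [10, 10, 0]]
placement = greedy_mapping(comm, dist)
```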
Programming Abstractions for Data Locality
40
Another issue is that the process topology can only be exploited at the MPI level. At lower levels, such as task
mapping for threaded/hybrid programming, mechanisms for querying the topology are available (e.g., hwloc),
but there are no automatic frameworks supporting data mapping yet.
Moreover, process and data mapping is based on the notion of affinity. How to measure this affinity is
still an open question. Different metrics exist (e.g., size of the data, number of messages), but they
do not account for other factors that may play a role depending on the structure of the exchange
(e.g., dilation, congestion, number of hops). We currently lack the ability to confidently express what is
to be optimized within the algorithmic process. The effect of over-decomposition on application
communication and mapping is also an open question.
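These structural metrics can be made precise. The sketch below computes dilation (bytes times hops traveled) and congestion (traffic on the busiest link) for a mapping onto a toy line topology; a real implementation would need the machine's actual routing information rather than this assumed shortest-path line.

```python
# Dilation and congestion on a line topology: node k is linked to
# node k+1, and each message follows the unique shortest path.

def line_dilation(comm, placement):
    """Total bytes * hops over all process pairs."""
    n = len(comm)
    return sum(comm[i][j] * abs(placement[i] - placement[j])
               for i in range(n) for j in range(n))

def line_congestion(comm, placement, num_nodes):
    """Traffic on the most loaded link (link k joins nodes k and k+1)."""
    links = [0] * (num_nodes - 1)
    for i in range(len(comm)):
        for j in range(len(comm)):
            a, b = sorted((placement[i], placement[j]))
            for k in range(a, b):
                links[k] += comm[i][j]
    return max(links)

# Two processes exchanging 10 bytes each way, mapped two hops apart.
comm = [[0, 10], [10, 0]]
placement = {0: 0, 1: 2}
```

A mapping algorithm can then choose which of the two quantities to minimize, which is exactly the open question raised above.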
We also need to develop hierarchical algorithms that deal with different levels of models and abstraction
in order to tackle the problem at full system scale, in parallel. A global view of the system will also be
needed to account for the whole environment (storage, shared resources, other running applications, network
usage, etc.).
Storage services must be adapted to expose a locality model as well. One approach would be to follow the
models used to express topology in networks; similarly, techniques for balancing network traffic
across network links might be adapted to assist in mapping data accesses to storage resources. That said,
because data is persistent and may be used far in the future in ways the storage service cannot predict, the
need for applications to explicitly control data layout on storage resources, based on knowledge of future
requirements, is also apparent. These issues motivate an investigation of alternative storage service models,
beyond current parallel file system models.