BIBLIOGRAPHY
[28] Mehmet Deveci, Sivasankaran Rajamanickam, Vitus J. Leung, Kevin Pedretti, Stephen L. Olivier,
David P. Bunde, Ümit V. Çatalyürek, and Karen Devine. Exploiting geometric partitioning in task
mapping for parallel computers. In IPDPS, Phoenix, AZ, USA, May 2014.
[29] Simplice Donfack, Laura Grigori, William D. Gropp, and Vivek Kale. Hybrid Static/dynamic Scheduling
for Already Optimized Dense Matrix Factorization. In IPDPS, pages 496–507. IEEE Computer Society,
2012.
[30] Matthieu Dorier, Gabriel Antoniu, Robert Ross, Dries Kimpe, and Shadi Ibrahim. CALCioM: Miti-
gating I/O interference in HPC systems through cross-application coordination. In Proceedings of the
International Parallel and Distributed Processing Symposium, May 2014.
[31] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. Kokkos: Enabling manycore perfor-
mance portability through polymorphic memory access patterns. Journal of Parallel and Distributed
Computing, 2014.
[32] Sebastian Erdweg, Tillmann Rendel, Christian Kästner, and Klaus Ostermann. SugarJ: Library-based
syntactic language extensibility. SIGPLAN Not., 46(10):391–406, October 2011.
[33] F. Pellegrini. Scotch and LibScotch 5.1 User's Guide. ScAlApplix project, INRIA Bordeaux – Sud-Ouest,
ENSEIRB & LaBRI, UMR CNRS 5800, August 2008. http://www.labri.fr/perso/pelegrin/scotch/.
[34] Oliver Fuhrer, Mauro Bianco, Isabelle Bey, and Christoph Schär. Grid tools: Towards a library for
hardware-oblivious implementation of stencil-based codes. http://www.pasc-ch.org/projects/projects/grid-tools/.
[35] Karl Fürlinger, Colin Glass, Jose Gracia, Andreas Knüpfer, Jie Tao, Denis Hünich, Kamran Idrees,
Matthias Maiterth, Yousri Mhedheb, and Huan Zhou. DASH: Data structures and algorithms with
support for hierarchical locality. In Euro-Par Workshops, 2014.
[36] Michael Garland, Manjunath Kudlur, and Yili Zheng. Designing a unified programming model for
heterogeneous machines. In Supercomputing 2012, November 2012.
[37] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The
Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience,
22(6):702–719, April 2010.
[38] Brice Goglin. Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc).
In Proceedings of 2014 International Conference on High Performance Computing & Simulation (HPCS
2014), Bologna, Italy, July 2014.
[39] Brice Goglin, Joshua Hursey, and Jeffrey M. Squyres. netloc: Towards a Comprehensive View of
the HPC System Topology. In Proceedings of the Fifth International Workshop on Parallel Software
Tools and Tool Infrastructures (PSTI 2014), held in conjunction with ICPP-2014, Minneapolis, MN,
September 2014.
[40] HDF5. http://www.hdfgroup.org/HDF5/.
[41] Robert L. Henderson. Job scheduling under the Portable Batch System. In Job Scheduling Strategies
for Parallel Processing, pages 279–294. Springer, 1995.
[42] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GROMACS 4: Algorithms
for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J. Chem. Theory Comput.,
4(3):435–447, 2008.
[43] T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski, R. Thakur, and J. L. Traeff. The Scalable
Process Topology Interface of MPI 2.2. Concurrency and Computation: Practice and Experience,
23(4):293–310, August 2010.
[44] Torsten Hoefler, Emmanuel Jeannot, and Guillaume Mercier. Chapter 5: An overview of process
mapping techniques and algorithms in high-performance computing. In Emmanuel Jeannot and Julius
Žilinskas, editors, High Performance Computing on Complex Environments, pages 65–84. Wiley, 2014.
To be published.
[45] Torsten Hoefler and Marc Snir. Generic Topology Mapping Strategies for Large-Scale Parallel Architec-
tures. In David K. Lowenthal, Bronis R. de Supinski, and Sally A. McKee, editors, Proceedings of the
25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31 - June 04, 2011,
pages 75–84. ACM, 2011.
[46] Paul Hudak. Building domain-specific embedded languages. ACM Computing Surveys, 28, 1996.
[47] hwloc. Portable Hardware Locality. http://www.open-mpi.org/projects/hwloc/.
[48] IEEE. 2004 (ISO/IEC) [IEEE/ANSI Std 1003.1, 2004 Edition] Information Technology — Portable
Operating System Interface (POSIX®) — Part 1: System Application Program Interface (API) [C
Language]. IEEE, New York, NY, USA, 2004.
[49] Intel Corporation. Threading Building Blocks. https://www.threadingbuildingblocks.org/.
[50] E. Jeannot and G. Mercier. Near-optimal Placement of MPI Processes on Hierarchical NUMA
Architectures. In Pasqua D'Ambra, Mario Rosario Guarracino, and Domenico Talia, editors, Euro-Par
2010 - Parallel Processing, 16th International Euro-Par Conference, volume 6272 of Lecture Notes in
Computer Science, pages 199–210, Ischia, Italy, September 2010. Springer.
[51] Emmanuel Jeannot, Guillaume Mercier, and François Tessier. Process Placement in Multicore Clusters:
Algorithmic Issues and Practical Techniques. IEEE Transactions on Parallel and Distributed Systems,
25(4):993–1002, April 2014.
[52] Laxmikant V. Kale and Sanjeev Krishnan. Charm++: A portable concurrent object oriented system
based on C++. In Proceedings of the Eighth Annual Conference on Object-Oriented Programming
Systems, Languages, and Applications, OOPSLA '93, pages 91–108, New York, NY, USA, 1993. ACM.
[53] Amir Kamil and Katherine Yelick. Hierarchical computation in the SPMD programming model. In The
26th International Workshop on Languages and Compilers for Parallel Computing, September 2013.
[54] George Karypis, Kirk Schloegel, and Vipin Kumar. ParMETIS: Parallel graph partitioning and sparse
matrix ordering library, version 2, 2003.
[55] A. YarKhan, J. Kurzak, and J. Dongarra. QUARK Users' Guide: QUeueing And Runtime for Kernels.
Technical Report ICL-UT-11-02, University of Tennessee Innovative Computing Laboratory, 2011.
[56] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In
Proceedings of the 35th International Symposium on Computer Architecture (ISCA '08), pages 77–88,
June 2008.
[57] Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan C. Catanzaro, Paul Ivanov, and Ahmed Fasih.
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel
Computing, 38(3):157–174, 2012.
[58] Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty
Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler,
Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Ster-
ling, R. Stanley Williams, and Katherine Yelick. ExaScale computing study: Technology challenges in
achieving exascale systems. Technical report, DARPA, May 2008.
[59] Peter M. Kogge and John Shalf. Exascale computing trends: Adjusting to the "new normal" for
computer architecture. Computing in Science and Engineering, 15(6):16–26, 2013.
[60] William Kramer. Is petascale completely done? What should we do now? Joint-Lab on Petascale
Computing Workshop, https://wiki.ncsa.illinois.edu/display/jointlab/Joint-lab+workshop+Nov.+25-27+2013,
November 2013.
[61] Xavier Lapillonne and Oliver Fuhrer. Using compiler directives to port large scientific applications to
gpus: An example from atmospheric science. Parallel Processing Letters, 24(1), 2014.
[62] Jianwei Li, Wei-keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Rob Latham,
Andrew Siegel, Brad Gallagher, and Michael Zingale. Parallel netCDF: A high-performance scientific
I/O interface. In Proceedings of SC2003, November 2003.
[63] Hatem Ltaief and Rio Yokota. Data-Driven Execution of Fast Multipole Methods. CoRR, abs/1203.0889,
2012.
[64] Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert. Generating
device-specific GPU code for local operators in medical imaging. In Proceedings of the 26th IEEE
International Parallel and Distributed Processing Symposium (IPDPS), pages 569–581. IEEE, May 2012.
[65] Richard Membarth, Frank Hannig, Jürgen Teich, and Harald Köstler. Towards domain-specific
computing for stencil codes in HPC. In Proceedings of the 2nd International Workshop on Domain-Specific
Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 1133–
1138, November 2012.
[66] Q. Meng and M. Berzins. Scalable Large-Scale Fluid-Structure Interaction Solvers in the Uintah Frame-
work via Hybrid Task-Based Parallelism Algorithms. Concurrency and Computation: Practice and
Experience, 26(7):1388–1407, 2014.
[67] Jun Nakashima, Sho Nakatani, and Kenjiro Taura. Design and Implementation of a Customizable Work
Stealing Scheduler. In International Workshop on Runtime and Operating Systems for Supercomputers,
June 2013.
[68] netloc. Portable Network Locality. http://www.open-mpi.org/projects/netloc/.
[69] OCR Developers. Open Community Runtime. https://01.org/open-community-runtime.
[70] Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, Michael Spiegel, and Jan F. Prins. OpenMP
task scheduling strategies for multicore NUMA systems. International Journal of High Performance
Computing Applications, 26(2):110–124, May 2012.
[71] OpenMP ARB. OpenMP Application Program Interface. http://openmp.org/wp/openmp-specifications/.
[72] Carlos Osuna, Oliver Fuhrer, Tobias Gysi, and Mauro Bianco. STELLA: A domain-specific language
for stencil methods on structured grids. Poster Presentation at the Platform for Advanced Scientific
Computing (PASC) Conference, Zurich, Switzerland.
[73] OVIS. Main Page - OVISWiki. https://ovis.ca.sandia.gov.
[74] Szilárd Páll, Mark James Abraham, Carsten Kutzner, Berk Hess, and Erik Lindahl. Tackling exascale
software challenges in molecular dynamics simulations with GROMACS. Lecture Notes in Computer
Science, page in press, 2014.
[75] Miquel Pericàs, Kenjiro Taura, and Satoshi Matsuoka. Scalable Analysis of Multicore Data Reuse and
Sharing. In Proceedings of ICS'14, June 2014.
[76] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, and T. Hoefler. Efficient Task
Placement and Routing in Dragonfly Networks. In Proceedings of the 23rd ACM International Symposium
on High-Performance Parallel and Distributed Computing (HPDC'14). ACM, June 2014.
[77] Sander Pronk, Szilárd Páll, Roland Schulz, Per Larsson, Pär Bjelkmar, Rossen Apostolov, Michael R.
Shirts, Jeremy C. Smith, Peter M. Kasson, David van der Spoel, Berk Hess, and Erik Lindahl. GROMACS
4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics,
29(7):845–854, 2013.
[78] Sabela Ramos and Torsten Hoefler. Modeling communication in cache-coherent SMP systems: A case
study with Xeon Phi. In Proceedings of the 22nd International Symposium on High-Performance Parallel
and Distributed Computing, HPDC '13, pages 97–108, New York, NY, USA, 2013. ACM.
[79] Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicolas Loriant, David A. Ham, Carlo
Bertolli, and Paul H.J. Kelly. PyOP2: A high-level framework for performance-portable simulations
on unstructured meshes. In Proceedings of the 2nd International Workshop on Domain-Specific Lan-
guages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 1116–1123,
November 2012.
[80] E. Rodrigues, F. Madruga, P. Navaux, and J. Panetta. Multicore Aware Process Mapping and its
Impact on Communication Overhead of Parallel Applications. In Proceedings of the IEEE Symposium
on Computers and Communications, pages 811–817, July 2009.
[81] Tiark Rompf and Martin Odersky. Lightweight modular staging: a pragmatic approach to runtime code
generation and compiled DSLs. Commun. ACM, 55(6):121–130, 2012.
[82] Francis P. Russell, Michael R. Mellor, Paul H. J. Kelly, and Olav Beckmann. DESOLA: An active
linear algebra library using delayed evaluation and runtime code generation. Sci. Comput. Program.,
76(4):227–242, 2011.
[83] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing clusters. In
First USENIX Conference on File and Storage Technologies (FAST’02), Monterey, CA, January 28-30
2002.
[84] John Shalf, Sudip S. Dosanjh, and John Morrison. Exascale computing technology challenges. In
International Meeting on High Performance Computing for Computational Science, volume 6449 of
Lecture Notes in Computer Science, pages 1–25. Springer, 2010.
[85] M. Showerman, J. Enos, J. Fullop, P. Cassella, N. Naksinehaboon, N. Taerat, T. Tucker, J. Brandt,
A. Gentile, and B. Allan. Large scale system monitoring and analysis on Blue Waters using OVIS. In
Proceedings of the 2014 Cray User's Group, CUG 2014, May 2014.
[86] Rajeev Thakur, William Gropp, and Ewing Lusk. On implementing MPI-IO portably and with high
performance. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems,
pages 23–32, May 1999.
[87] Didem Unat, Cy Chan, Weiqun Zhang, John Bell, and John Shalf. Tiling as a durable abstraction for
parallelism and data locality. Workshop on Domain-Specific Languages and High-Level Frameworks for
High Performance Computing, November 18, 2013.
[88] Field G. Van Zee, Ernie Chan, Robert A. van de Geijn, Enrique S. Quintana-Ort´ı, and Gregorio
Quintana-Ort´ı. The libflame Library for Dense Matrix Computations. IEEE Des. Test, 11(6):56–63,
November 2009.
[89] Todd L. Veldhuizen and Dennis Gannon. Active libraries: Rethinking the roles of compilers and libraries.
CoRR, math.NA/9810022, 1998.
[90] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable
performance of the Panasas parallel file system. In Proceedings of the 6th USENIX Conference on File
and Storage Technologies (FAST), pages 17–33, 2008.
[91] Martin Wimmer, Daniel Cederman, Jesper Larsson Träff, and Philippas Tsigas. Work-stealing with
configurable scheduling strategies. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, PPoPP '13, pages 315–316, New York, NY, USA, 2013. ACM.
[92] Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar. Hierarchical place trees: A portable abstrac-
tion for task parallelism and data movement. In Proceedings of the 22nd International Workshop on
Languages and Compilers for Parallel Computing, October 2009.
[93] Katherine Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy,
Paul Hilfinger, Susan Graham, David Gay, Phillip Colella, and Alexander Aiken. Titanium: A high-
performance Java dialect. In Workshop on Java for High-Performance Network Computing, February
1998.
[94] Qing Yi and Daniel J. Quinlan. Applying loop optimizations to object-oriented abstractions through
general classification of array semantics. In Rudolf Eigenmann, Zhiyuan Li, and Samuel P. Midkiff,
editors, LCPC, volume 3602 of Lecture Notes in Computer Science, pages 253–267. Springer, 2004.
[95] Andy B Yoo, Morris A Jette, and Mark Grondona. Slurm: Simple linux utility for resource management.
In Job Scheduling Strategies for Parallel Processing, pages 44–60. Springer, 2003.
[96] Yili Zheng, Amir Kamil, Michael Driscoll, Hongzhang Shan, and Katherine Yelick. UPC++: A PGAS
extension for C++. In The 28th IEEE International Parallel and Distributed Processing Symposium
(IPDPS14), May 2014.
[97] Songnian Zhou. LSF: Load sharing in large heterogeneous distributed systems. In I Workshop on
Cluster Computing, 1992.