Training Roadmap for HPC Users
Using the Training Roadmap
High Performance Computing (HPC) uses the most powerful computer systems available to advance the frontiers of science and engineering. Many, perhaps most, HPC users have completed or are pursuing formal education in a scientific or engineering discipline and are (or will become) self-taught programmers. The purpose of this roadmap is to orient such users to the world of HPC. Our goal is to outline the key concepts and skills that come into play when using HPC systems, to provide concise descriptions of important terms, and to point to more detailed resources and references for further study.
In many cases, useful explanations, tutorials, software engineering tools, or numerical libraries are available online (often for free), but users are simply unaware of them. They do not know what they do not know. If they knew the right keywords to type into a search engine, they could find resources to help solve their problem relatively quickly. Instead, they spend hours repeating the mistakes of others or reinventing the wheel.
The above flowchart outlines some basic concepts and skills that are relevant to the HPC community. Not all skills are necessary for all users, but it is helpful for all users to understand the overall landscape. Many new users are first exposed to HPC because of a specific need or application, and they may not appreciate the additional challenges that arise when applications are scaled to state-of-the-art systems. By familiarizing themselves with the roadmap early on, they will learn which topics they have yet to explore. Although the details of many topics may not be immediately relevant to their work, we hope that when they eventually encounter the challenges, questions, error messages, and bugs that are common to the HPC experience, they will remember that the roadmap exists and return to it for pointers to helpful resources.
The initial entry point in the roadmap is a collection of basic HPC concepts. These include a basic understanding of computer architectures, including the architecture of supercomputers. Additionally, users should understand the principles of parallelization, efficiency, and scalability, as well as the numerical issues that arise when dealing with floating-point operations. Not all HPC users will need to program. Some may have, or may develop, expertise in a particular application domain and may need to run applications on HPC machines without necessarily modifying source code. These users can still benefit from an understanding of the visualization tools that are available; of the tools and practices for manipulating, moving, and storing large datasets; of workflow management; and of the principles of verification and validation.
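One of the numerical issues mentioned above can be seen in a few lines: floating-point addition is not associative, so the order in which values are combined changes the result. The sketch below is only an illustration, not a recommended summation method. The helper names `naive_sum` and `chunked_sum` are hypothetical, and the "workers" are simulated by slicing a list rather than by running anything in parallel.

```python
import math


def naive_sum(values):
    """Add values left to right, as a simple serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Mimic a parallel reduction: each hypothetical worker sums one
    contiguous chunk, then the partial sums are combined."""
    chunk_size = math.ceil(len(values) / num_chunks)
    partials = [
        naive_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return naive_sum(partials)


# Regrouping the same three numbers changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Large and small magnitudes mixed together lose the small terms.
values = [1.0e20, 1.0, -1.0e20, 1.0]
print(naive_sum(values))       # 1.0 (serial order)
print(chunked_sum(values, 2))  # 0.0 (two simulated workers)
print(math.fsum(values))       # 2.0 (correctly rounded sum)
```

The same data yields three different answers depending only on the order of operations, which is one reason a code may produce slightly different results when the number of processes changes.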
Some users are accustomed to graphical user interfaces and will benefit from developing some familiarity with UNIX-style operating systems and with command-line interfaces and tools. Most HPC users will eventually need to write some code. Before delving into parallel software with all of its complexities, users should learn some basic programming skills in C or Fortran. They should also understand some best practices from the field of software engineering, particularly with respect to code structure, documentation, and testing.
Once users have developed basic programming skills, they can learn about more advanced tools involved in the software life cycle, including debuggers, code profilers, and third-party libraries. Some new HPC users may be experienced programmers who are familiar with such tools but will need to learn the additional complexities of debugging, profiling, and optimizing parallel codes. Before writing parallel programs, users should understand principles of shared memory and distributed memory systems so that they can make informed decisions about which avenues to pursue. Finally, the roadmap includes references for a variety of parallel programming languages and technologies.
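The difference between the shared memory and distributed memory models can be sketched in miniature. The example below is a toy illustration under stated assumptions: both functions use Python threads on a single machine, so the second one only imitates the message-passing style and is not a real distributed program. The function names `shared_memory_sum` and `message_passing_sum` are hypothetical, and the example says nothing about performance.

```python
import queue
import threading


def shared_memory_sum(data, num_workers):
    """Shared-memory style: every worker updates one shared total,
    so access to it must be synchronized with a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)   # private partial result
        with lock:             # protect the shared variable
            total += partial

    threads = [
        threading.Thread(target=worker, args=(data[i::num_workers],))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(data, num_workers):
    """Distributed-memory style: each worker owns its chunk and shares
    its result only by sending an explicit message."""
    inbox = queue.Queue()

    def worker(rank, chunk):
        inbox.put((rank, sum(chunk)))  # "send" the partial sum

    threads = [
        threading.Thread(target=worker, args=(rank, data[rank::num_workers]))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total = 0
    for _ in range(num_workers):
        _rank, partial = inbox.get()   # "receive" one message per worker
        total += partial
    return total


data = list(range(1, 101))
print(shared_memory_sum(data, 4))    # 5050
print(message_passing_sum(data, 4))  # 5050
```

In the first version the programmer must guard shared state; in the second, no state is shared and all coordination happens through messages. That trade-off is the kind of decision the paragraph above refers to.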
The nodes in the flowchart are hyperlinks to lists and descriptions of other resources. The resources include professionally developed online tutorials (such as MPI How-Tos), self-study courses, and videos of keynote addresses, live tutorials, and lectures from NCSA's past virtual summer school courses. Some of the links are simply pointers to the home pages of projects for tools like TAU, PerfSuite, or ATLAS. These are resources that new users could easily find on their own if they knew that such tools existed and knew what terms to type into a search engine. Our hope is to increase awareness of such tools by providing this broad overview of the HPC space before the need for them arises.