Collaborative Research: Integrated HPC Systems Usage and Performance of Resources Monitoring and Modeling (SUPReMM)
Abani Patra, Principal Investigator
Today's high-performance computing systems are a complex combination of software, processors, memory, networks, and storage systems, characterized by frequent disruptive technological advances. In this environment, system managers, users, and sponsors find it difficult, if not impossible, to know whether optimal performance of the infrastructure is being realized, or even whether all subcomponents are functioning properly. Users of such systems are often engaged in science at the extreme, where system uncertainties can significantly delay or even confound scientific investigations. Critically, for systems based on open source software, which include a large fraction of XSEDE resources, the data and information necessary to use and manage these complex systems are not available. HPC centers and their users are, to some extent, flying blind, without a clear understanding of system behavior. Anomalous behavior has to be diagnosed and remedied with incomplete and sparse data. It is difficult for users to assess how effectively they are using the available resources to generate knowledge in their sciences. NSF lacks a comprehensive knowledge base to evaluate the effectiveness of its investments in HPC systems.

This award will address this problem through the creation of a comprehensive set of tools for developing the needed knowledge bases. This will be accomplished by building on and combining work on HPC systems monitoring and reporting currently underway at the University at Buffalo under the Technology Audit Service (TAS) of the XSEDE project and at the University of Texas/Texas Advanced Computing Center (TACC) as part of the Ranger Technology Insertion effort, with many elements of existing monitoring and analysis tools.
The PIs will provide the knowledge bases required to understand the current operations of XSEDE, to enhance and increase the productivity of all of the stakeholders of XSEDE (service providers, users and sponsors), and ultimately to provide open source tools to greatly increase the operational efficiency and productivity of HPC systems in general.