Center for Multiscale Theory and Simulation

NSF CENTER for CHEMICAL INNOVATION

Cyberinfrastructure

CMTS Cyberinfrastructure and workspace

Molecular simulation studies routinely and repeatedly run community programs such as NAMD and LAMMPS on diverse large-scale systems operated by NSF (XSEDE, Blue Waters) and by other agencies and institutions. Interleaved with the simulations themselves are custom codes (written in C/C++, Fortran, and other languages) and scripts (written in Python, Tcl, shell, and others) for analyzing and organizing the data. Multiscale modeling adds a level of complexity by requiring that two or more resolutions be considered, often via separate programs.

CMTS is addressing these issues by deploying and operating an integrated cyberinfrastructure. This infrastructure will combine a structured approach to scripting computational procedures with a high-performance, high-productivity workspace for computation, data management, tracking, transport, sharing, and dissemination. The approach encapsulates the protocols of multiscale modeling in reusable software procedures, which improves reproducibility, captures common workflow patterns for reuse, and makes research methods easier to share.

CMTS uses the Swift parallel scripting language to specify, execute, track, and share simulation procedures and data. Swift imparts a high level of abstraction to simulation scripts, succinctly expressing science logic and separating it from the concerns of execution platform, location, and parallelism. Users specify the inputs, outputs, and dependencies of their application programs. Swift then automatically performs resource selection, parallel scheduling, data transfer, error retry, and provenance tracking.
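The idea of declaring inputs, outputs, and dependencies and leaving scheduling to the runtime can be illustrated with a minimal Python sketch. This is an analogy only: it is not Swift syntax and not how Swift is implemented. The names `run_workflow` and `run_with_retry`, and the toy "simulation" tasks, are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(func, args, retries=2):
    """Call func(*args); on failure, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise


def run_workflow(tasks, max_workers=4):
    """Run tasks in dependency order, in parallel where dependencies allow.

    `tasks` maps a task name to (func, [names of tasks it depends on]).
    Each func receives the results of its dependencies as positional arguments.
    """
    results = {}
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A task is ready once every one of its dependencies has a result.
            ready = [
                name
                for name, (_, deps) in pending.items()
                if all(dep in results for dep in deps)
            ]
            if not ready:
                raise ValueError("cyclic or missing dependency")
            # Submit all ready tasks at once so independent tasks overlap.
            futures = {
                name: pool.submit(
                    run_with_retry,
                    pending[name][0],
                    [results[dep] for dep in pending[name][1]],
                )
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                del pending[name]
    return results


# Hypothetical toy workflow: two independent "simulations" feed one analysis.
tasks = {
    "sim_a": (lambda: 1.0, []),
    "sim_b": (lambda: 3.0, []),
    "average": (lambda a, b: (a + b) / 2, ["sim_a", "sim_b"]),
}
results = run_workflow(tasks)
print(results["average"])
```

The user states only what each step needs and produces. The order of execution and the overlap of `sim_a` and `sim_b` follow from the declared dependencies rather than from explicit scheduling code.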

The multiscale simulation workspace shown below enables users to manage application codes, scripts, data, results, metadata, and provenance of simulations. The workspace enables collaborators to perform runs, test hypotheses, explore multiple branches of inquiry, and track all conditions under which simulations were executed.

Script development and execution. CMTS library procedures and applications are inserted into a workspace catalog and enabled for execution on campus and national cyber resources.

Dataset storage. A scalable project storage server will serve as the core data storage server of the workspace, providing backed-up archival storage for CMTS. Storage resources located at collaborator sites can be tracked as well. Online catalog services will track and share all CMTS datasets, applications, scripts, and code.

Data transport. Molecular dynamics simulations yield very large files that are slow and failure-prone to move, which limits a researcher’s ability to use multiple computing resources. CMTS users and scripts use Globus Online Transfer to move data between storage and computation sites. This service automates data scheduling, transfer optimization, and recovery, freeing users from the complexities of wide-area data transport.
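To make "recovery" concrete, the following Python sketch copies a file, verifies it by checksum, and retries on failure. It is a local stand-in under stated assumptions: it does not use or represent the Globus Online API, and the function names `sha256_of` and `transfer_with_retry` and the file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_retry(source, destination, retries=3, delay_seconds=0.0):
    """Copy source to destination, verifying by checksum and retrying on failure.

    Returns the number of attempts used. Raises RuntimeError if every attempt
    either fails or produces a copy whose checksum does not match.
    """
    expected = sha256_of(source)
    for attempt in range(1, retries + 1):
        try:
            shutil.copyfile(source, destination)
            if sha256_of(destination) == expected:
                return attempt
        except OSError:
            pass  # treat I/O errors as transient and try again
        time.sleep(delay_seconds)
    raise RuntimeError(f"transfer failed after {retries} attempts")


# Usage on a small temporary file (hypothetical file names).
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "trajectory.dat"
    destination = Path(workdir) / "trajectory_copy.dat"
    source.write_bytes(b"frame data " * 1000)
    attempts_used = transfer_with_retry(source, destination)
    copy_matches = destination.read_bytes() == source.read_bytes()
```

A managed transfer service takes over this kind of verification and retry logic, along with scheduling and tuning, so that individual scripts do not have to carry it.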

Metadata. The workspace enables annotation of both data and scripts. With web interfaces and scriptable commands, users and their scripts define and locate subsets of data by querying metadata from computations and experiments, thus enabling discovery, validation, and sharing.
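As a rough illustration of locating subsets of data by metadata, the sketch below filters an in-memory catalog by key/value criteria. The catalog layout, the `find_datasets` function, and the sample entries are all hypothetical; the actual workspace catalog and its query interface are not described here.

```python
def find_datasets(catalog, **criteria):
    """Return sorted names of datasets whose metadata matches every criterion."""
    return sorted(
        name
        for name, metadata in catalog.items()
        if all(metadata.get(key) == value for key, value in criteria.items())
    )


# Hypothetical annotations; real entries would come from the workspace catalog.
catalog = {
    "run-001": {"program": "NAMD", "resolution": "atomistic", "temperature_k": 300},
    "run-002": {"program": "LAMMPS", "resolution": "coarse-grained", "temperature_k": 300},
    "run-003": {"program": "LAMMPS", "resolution": "coarse-grained", "temperature_k": 310},
}

matches = find_datasets(catalog, program="LAMMPS", temperature_k=300)
print(matches)  # prints ['run-002']
```

Because the query is an ordinary function call, the same lookup can be issued interactively or from within an analysis script.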

Provenance. A provenance storage and query system will be integrated into the workspace, enabling reproducibility and allowing researchers to trace simulation data back to the tool versions, parameters, and settings that produced them. We will enhance provenance query mechanisms and privacy controls based on CMTS needs.
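One way to picture such a trace is a record that ties a fingerprint of the output data to the tool, version, and parameters that produced it. The sketch below is an assumption-laden illustration, not the CMTS provenance system: the record fields, the functions `make_provenance_record` and `trace_output`, and the sample values are hypothetical.

```python
import hashlib


def make_provenance_record(tool, version, parameters, output_bytes):
    """Describe how an output was produced, keyed by a hash of its content."""
    return {
        "tool": tool,
        "version": version,
        "parameters": dict(parameters),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }


def trace_output(records, output_bytes):
    """Return every record whose output hash matches the given data."""
    digest = hashlib.sha256(output_bytes).hexdigest()
    return [record for record in records if record["output_sha256"] == digest]


# Hypothetical run: the tool name, version string, and parameters are placeholders.
output = b"energy: -1234.5\n"
records = [
    make_provenance_record(
        "example-md", "1.0", {"timestep_fs": 2.0, "temperature_k": 300}, output
    )
]

origin = trace_output(records, output)
print(origin[0]["tool"], origin[0]["version"], origin[0]["parameters"])
```

Hashing the output content rather than relying on its file name means the trace still works after the data has been moved or renamed.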

Gateways. The workspace web and interactive login servers and user/group authentication databases will be provided by Globus Online, based on Open Science Grid/Connect.

For further information, please contact Michael Wilde.