A tight lower bound for required I/O when computing a matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of phases needed to perform C:=AB, where each phase is a series of operations involving S reads and writes to and from fast memory, and S is the size of fast memory. A lower bound on the number of phases was then determined by obtaining an upper bound on the number of scalar multiplications performed per phase. This paper follows the same high level approach, but improves the lower bound by considering C:=AB+C instead of C:=AB, and obtains the maximum number of scalar fused multiply-adds (FMAs) per phase instead of scalar additions. Key to obtaining the new result is the decoupling of the per-phase I/O from the size of fast memory. The new lower bound is 2mnk/S - 2S where S is the size of fast memory. The constant for the leading term is an improvement of a factor 4/2. A theoretical algorithm that attains the lower bound is given, and how the state-of-the-art Goto's algorithm also in some sense meets the lower bound is discussed.

Given a shape regular tetrahedron and a curved surface which is defined implicitly by a nonlinear level set function and divides the tetrahedron into two sub-domains, a general-purpose, robust, and arbitrarily high order numerical algorithm is proposed in this paper for computing both volume integrals in the sub-domains and surface integrals on their common boundary. The algorithm uses a direct approach which decomposes 3D volume integrals or 2D surface integrals into multiple 1D integrals and computes the 1D integrals with Gaussian quadratures. It only requires finding roots of univariate nonlinear functions in given intervals and evaluating the integrand, the level set function, and the gradient of the level set function at given points. It can achieve arbitrarily high order by increasing the orders of Gaussian quadratures, and does not need extra a priori knowledge about the integrand and the level set function. The code for the algorithm is freely available in the open source finite element toolbox Parallel Hierarchical Grid (PHG) and can serve as a basic building block for implementing 3D high order numerical algorithms involving implicit interfaces or boundaries.

We present a new software package, offering h-adaptive and p-adaptive linear multistep methods for first order initial value problems in ODEs. The implementation is based on a new parametric, grid-independent representation of multistep methods. Parameters are supplied for over 60 methods. For nonstiff problems, all maximal order methods (p=k for explicit and p=k+1 for implicit methods) are supported. For stiff computation, implicit methods of order p=k are included.
A collection of step-size controllers based on digital filters is provided, generating smooth step-size sequences offering improved computational stability. Controllers may be selected to match method and problem classes. A new system for automatic order control ($p$-adaptivity) is also provided for designated families of multistep methods, offering simultaneous $h$- and $p$-adaptivity.
Implemented as a \textsc{Matlab} toolbox, the software covers high order computations with linear multistep methods within a unified framework. Computational experiments show that the new software is competitive and offers qualitative improvements. The toolbox is available for downloading. While not yet supporting special features required in industrial computation, it offers a platform for developing a new generation of state-of-the-art multistep solvers, as well as for true *ceteris paribus* evaluation of algorithmic components. This also enables **method comparisons** within a single implementation.

?Code Generation for Generally Mapped Finite Elements? includes performance results for the finite element methods discussed in that manuscript. The authors provided a Zenodo archive with the Firedrake components and dependencies used, as well as the scripts that generated the results. The software was installed on two similar platforms; then, new results were gathered and compared to the original results. After completing this process, the results have been deemed replicable by the reviewer.

There are several AD tools available, which all implement different strategies for the reverse mode of AD. The major strategies are primal value taping (implemented e.g. by ADOL-c) and Jacobi taping (implemented e.g. by adept and dco/c++). Especially for Jacobi taping, recent advances by using expression templates make this approach very attractive for large scale software. The current implementations are either closed source or miss essential features and flexibility. Therefore, we present the new AD tool CoDiPack (Code Differentiation Package) in this paper. It is specifically designed for a minimal memory consumption and optimal runtime, such that it can be used for the differentiation of large scale software. An essential part of the design of CoDiPack is the modular layout and the recursive data structures, which do not only allow the efficient implementation of the Jacobi taping approach, but will also enable other approaches like the primal value taping or new research ideas. We will also present the performance value of CoDiPack on a generic PDE example and on the SU2 code.

A paper by Karati and Sarkar at Asiacrypt'17 has pointed out the potential for Kummer lines in genus one, by observing that its SIMD-friendly arithmetic is competitive with the status quo. A more recent preprint explores the connection with (twisted) Edwards curves. In this paper we extend this work and significantly simplify their treatment. We show that their Kummer line is the \(x\)-line of a Montgomery curve translated by a point of order two, and exhibit a natural isomorphism to a twisted Edwards curve. Moreover, we show that the Kummer line presented by Gaudry and Lubicz can be obtained via the action of a point of order two on the \(y\)-line of an Edwards curve. The maps connecting these curves and lines are all very simple. As a result, a cryptographic implementation can use the arithmetic that is optimal for its instruction set at negligible cost.

Iterated-integral signatures and log signatures are sequences calculated from a path that characterise its shape. They originate from the work of K. T. Chen and have become important through Terry Lyons? theory of differential equations driven by rough paths, which is an important developing area of stochastic analysis. They have applications in statistics and machine learning, where there can be a need to calculate finite parts of them quickly for many paths. We introduce the signature and the most basic information (displacement and signed areas) which it contains. We present algorithms for efficiently calculating these signatures. For log signatures this requires consideration of the structure of free Lie algebras. We benchmark the performance of the algorithms. The methods are implemented in C++ and released as a Python extension package which also supports differentiation. In combination with a machine learning library (Tensorflow, PyTorch or Theano), this allows end-to-end learning of neural networks involving signatures.

Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process---from mathematical equations down to C++ code---is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.

Conventional GPU implementations of Strassen?s algorithm (Strassen) rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, ?squarish? matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for GEMM and Strassen. Overall, our 1-level Strassen can achieve up to 1.11× speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19× speedup with a crossover point at 7,680.

QNSTOP consists of serial and parallel (OpenMP) Fortran 2003 codes for the quasi-Newton stochastic optimization method of Castle and Trosset for stochastic search problems. A complete description of QNSTOP for both local search with stochastic objective and global search with ``noisy'' deterministic objective is given here for the first time. For stochastic search problems, some convergence theory exists for particular algorithmic choices and parameter values. Both the parallel driver subroutine, which offers several parallel decomposition strategies, and the serial driver subroutine can be used for local stochastic search or global deterministic search, based on an input switch. Some performance data for computational systems biology problems is given.

Operator products occur naturally in a range of regularized boundary integral equation formulations. However, while a Galerkin discretisation only depends on the domain space and the test (or dual) space of the operator, products require a notion of the range. In the boundary element software package Bempp we have implemented a complete operator algebra that depends on knowledge of the domain, range and test space. The aim was to develop a way of working with Galerkin operators in boundary element software that is as close to working with the strong form on paper as possible while hiding the complexities of Galerkin discretisations. In this paper, we demonstrate the implementation of this operator algebra and show, using various Laplace and Helmholtz example problems, how it significantly simplifies the definition and solution of a wide range of typical boundary integral equation problems.

We present a computational procedure to evaluate the integral ?_{x}^{y} s^{p-1} e^{-?s} ds, for 0 ? x < y ?+?, ? = ±1, p > 0, which generalizes the lower (x=0) and upper (y=+?) incomplete gamma functions. To allow for large values of x, y, and p while avoiding under/overflow issues in the standard double precision floating point arithmetic, we use an explicit normalization that is much more efficient than the classical ratio with the complete gamma function. The generalized incomplete gamma function is estimated with continued fractions, integrations by parts, or, when x ? y, with Romberg numerical integration algorithm. We show that the accuracy reached by our algorithm improves a recent state-of-the-art method by two orders of magnitude, and is essentially optimal considering the limitations imposed by the floating point arithmetic. Moreover, the admissible parameter range of our algorithm (0 ? p,x,y ? 10^{15}) is much larger than competing algorithms and its robustness is assessed through massive usage in an image processing application.

In this paper, we present an algorithm for the singular value decomposition (SVD) of a bidiagonal matrix by means of the eigenpairs of an associated symmetric tridiagonal matrix. The algorithm is particularly suited for the computation of a subset of singular values and corresponding vectors. We focus on a sequential version of the algorithm, and discuss special cases and implementation details. We use a large set of bidiagonal matrices to assess the accuracy of the implementation in single and double precision, as well as to identify potential shortcomings. We show that the algorithm can be up to three orders of magnitude faster than existing algorithms, which are limited to the computation of a full SVD. We also show time comparisons of an implementation that uses the strategy discussed in the paper as a building block for the computation of the SVD of general matrices.

In a recent paper, Lucambio Pérez and Prudente extended the Wolfe conditions for the vector-valued optimization. Here, we propose a line search algorithm for finding a step size satisfying the strong Wolfe conditions in the vector optimization setting. Well definedness and finite termination results are provided. We discuss practical aspects related to the algorithm and present some numerical experiments illustrating its applicability. Codes supporting this paper are written in Fortran 90 and are freely available for download.

Many classical finite elements such as the Argyris and Bell elements have long been absent from high-level PDE software. Building on recent theoretical work, we describe how to implement very general finite element transformations in FInAT and hence into the Firedrake finite element system. Numerical results evaluate the new elements, comparing them to existing methods for classical problems. For a second order model problem, we find that new elements give smooth solutions at a mild increase in cost over standard Lagrange elements. For fourth-order problems, however, the newly-enabled methods significantly outperform interior penalty formulations. We also give some advanced use cases, solving the nonlinear Cahn-Hilliard equation and some biharmonic eigenvalue problems (including Chladni plates) using $C^1$ discretizations.

We are interested in obtaining error bounds for the classical FFT algorithm in floating-point arithmetic, for the 2-norm as well as for the infinity norm. For that purpose we also give some results on the relative error of the complex multiplication by a root of unity, and on the largest value that can take the real or imaginary part of one term of the FFT of a vector $x$, assuming that all terms of $x$ have real and imaginary parts less than some value $b$.

We present FloatX (Float eXtended), a \CC framework to investigate the effect of leveraging customized floating-point formats in numerical applications. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as back end to preserve efficiency, and incurs no storage overhead. The paper discusses in detail the design principles, programming interface and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX's usage and benefits via several case studies from well-known numerical dense linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related with image processing and text recognition.

A scientific framework for simulations of large-scale networks, such as is required for the analysis of critical infrastructure interaction and interdependencies, is needed for applications on exascale computers. Such a framework must be able to manage heterogeneous physics and unstructured topology, and must be reusable. To this end we have developed DMNetwork, a class in PETSc that provides data and topology management and migration for network problems, along with multiphysics solvers to exploit the problem structure. It eases the application development cycle by providing the necessary infrastructure through simple abstrac- tions to define and query the network. This paper presents the design of the DMNetwork, illustrates its user interface, and demonstrates its ability to solve large network problems through the numerical simulation of a water pipe network with more than 2 billion variables on extreme-scale computers using up to 30,000 processor cores.