ACM DL

ACM Transactions on

Mathematical Software (TOMS)

Menu
Latest Articles

Computing the Braid Monodromy of Completely Reducible n-gonal Curves

Braid monodromy is an important tool for computing invariants of curves and surfaces. In this paper, the rectangular braid diagram (RBD) method is... (more)

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures

Matrices coming from elliptic Partial Differential Equations have been shown to have a low-rank... (more)

Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

Hierarchical matrices are space- and time-efficient representations of dense matrices that exploit the low-rank structure of matrix blocks at different levels of granularity. The hierarchically low-rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity... (more)

randUTV: A Blocked Randomized Algorithm for Computing a Rank-Revealing UTV Factorization

A randomized algorithm for computing a so-called UTV factorization efficiently is presented. Given a matrix A, the algorithm “randUTV” computes a factorization A = UTV*, where U and V have orthonormal columns, and T is triangular (either upper or lower, whichever is preferred). The algorithm randUTV is developed primarily... (more)

Mathematics and Speed for Interval Arithmetic: A Complement to IEEE 1788

After a short introduction, the article begins with an axiomatic definition of rounded arithmetic. The concepts of rounding and of rounded arithmetic operations are defined in an axiomatic manner fully independent of special data formats and encodings. Basic properties of floating-point and interval arithmetic can directly be derived from this... (more)

Polar Affine Arithmetic: Optimal Affine Approximation and Operation Development for Computation in Polar Form Under Uncertainty

Uncertainties practically arise from numerous factors, such as ambiguous information, inaccurate model, and environment disturbance. Interval arithmetic has emerged to solve problems with uncertain parameters, especially in the computational process where only the upper and lower bounds of parameters can be ascertained. In rectangular coordinate... (more)

A Unified 2D/3D Large-Scale Software Environment for Nonlinear Inverse Problems

Large-scale parameter estimation problems are among some of the most computationally demanding problems in numerical analysis. An academic... (more)

Extended BACOLI: Solving One-Dimensional Multiscale Parabolic PDE Systems With Error Control

BACOLI is a Fortran software package for solving one-dimensional parabolic partial differential equations (PDEs) with separated boundary conditions by B-spline adaptive collocation methods. A distinguishing feature of BACOLI is its ability to estimate and control error and correspondingly adapt meshes in both space and time. Many models of... (more)

Verified Newton–Raphson Iteration for Multiplicative Inverses Modulo Powers of Any Base

We identify two faults in a published algorithm for fast computation of multiplicative inverses... (more)

Spin Summations: A High-Performance Perspective

In addition to tensor contractions, one of the most pronounced computational bottlenecks in the nonorthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms—including CCSD(T) and CCSDT(Q)—are spin summations. At a first sight, spin summations are operations similar to tensor... (more)

On Quality of Implementation of Fortran 2008 Complex Intrinsic Functions on Branch Cuts

Branch cuts in complex functions have important uses in fracture mechanics, jet flow, and aerofoil analysis. This article introduces tests for... (more)

Algorithm 991: The 2D Tree Sliding Window Discrete Fourier Transform

We present a new algorithm for the 2D sliding window discrete Fourier transform. Our algorithm avoids repeating calculations in overlapping windows by storing them in a tree data-structure based on the ideas of the Cooley-Tukey fast Fourier transform. For an N0 × N1 array and n0 × n1 windows, our algorithm takes O(N0 N1 n0 n1)... (more)

NEWS

TOMS Replicated Computational Results (RCR) Initiative.

ACM TOMS has introduced a new initiative to optionally review the computational results of a TOMS submission. This new effort is intended to assist in improving the quality of scientific publication for TOMS and for the computational science community as a whole. Manuscripts that successfully complete the RCR Review process receive the RCR designation when published. If you are interested in participating in this initiative, either as an author or reviewer, please contact the TOMS Editor-in-Chief. Details of the TOMS RCR Initiative are available here.

Welcome Video for Associate Editors

ACM has produced a short video explaining the role of new Associate Editors. Even if you aren't new to the job; it's still worth a view!

About TOMS

The purpose of the ACM Transactions on Mathematical Software (TOMS) is to communicate important research results addressing the development, evaluation and use of mathematical software...

READ MORE
Forthcoming Articles
Client-side computational optimization

Mobile platforms have matured to a point where they can provide the infrastructure needed by sophisticated optimization codes. This opens both the possibility to envisage new market interest for distributed application codes and the opportunity to intensify research on algorithms that require limited computational power, such as that which is oered by mobile platforms. This paper reports on some exploratory experiences in this area, with reference to requests and opportunities for real world applications and to opening computational validation using matheuristics codes. A specic use case requiring to address the feasibility version of the generalized assignment problem is discussed.

Adjoint Code Design Patterns

Adjoint methods have become fundamental ingredients of the scientific computing toolbox over the past decades. Large-scale parameter sensitivity analysis, uncertainty quantification and nonlinear optimization would otherwise turn out computationally infeasible. The symbolic derivation of adjoint mathematical models for relevant problems in science and engineering and their implementation in consistency with the implementation of the underlying primal model frequently proves highly challenging. Hence an increased interest in algorithmic adjoints can be observed. The algorithmic derivation of adjoint numerical simulation programs shifts some of the problems faced from functional and numerical analysis to computer science. It becomes a highly complex software engineering task requiring expertise in software analysis, transformation and optimization. Despite rather mature software tool support for algorithmic differentiation substantial user intervention is typically required when targeting nontrivial numerical programs. A large number of patterns shared by numerous application codes results in repeated duplication of development effort. The adjoint code design patterns introduced in this paper aim to reduce this problem through improved formalization from the software engineering perspective.

Algorithm 9xx: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra

SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. An overview of the GraphBLAS specification is given, followed by a description of the key features of its implementation in the SuiteSparse:GraphBLAS package.

Algorithm xxx: BBCPOP: A Sparse Doubly Nonnegative Relaxation of Polynomial Optimization Problems with Binary, Box and Complementarity Constraints

BBCPOP is a MATLAB implementation of a hierarchy of sparse doubly nonnegative (DNN) relaxations of a class of polynomial optimization problems (POPs) with binary, box and complementarity (BBC) constraints. Given a POP in the class and a relaxation order, BBCPOP constructs a simple conic optimization problem (COP), which serves as a DNN relaxation of the POP, then solves the COP by applying the bisection and projection method. The COP is expressed with a linear objective function and constraints described as a single hyperplane and two cones, which are the Cartesian product of positive semidefinite cones and a polyhedral cone induced from the BBC constraints. BBCPOP aims to compute a tight lower bound for the optimal value of a large-scale POP in the class that is beyond the comfort zone of existing software packages. The robustness, reliability and efficiency of BBCPOP are demonstrated in comparison to the software SDP package SDPNAL+ on randomly generated sparse POPs of degree 2 and 3 with up to a few thousands variables, and ones of degree 4, 5, 6. and 8 with up to a few hundred variables. Comparison with other BBC POPs that arise from combinatorial optimization such as quadratic assignment problems are reported.

Remark on "Algorithm 680: evaluation of the complex error function": Cause and Remedy for the Loss of Accuracy Near the Real Axis

In this remark, we point out a cause of loss-of-accuracy in calculating the Faddeyeva function, w(z) using Algorithm 680, near the real axis, and provide a simple remedy to recover the accuracy of the code as one of the important references for accuracy comparison.

Enclosing Chebyshev Expansions in Linear Time

We consider the problem of computing rigorous enclosures for polynomials represented in Chebyshev basis. Our aim is to compare and develop algorithms with a linear complexity in terms of the polynomial degree. A first category of methods rely on a direct interval evaluation of the given Chebyshev expansion in which Chebyshev polynomials are bounded e.g., with a divide and conquer strategy. Our main category of methods which are based on the Clenshaw recurrence includes interval Clenshaw with defect correction (ICDC), and the spectral transformation of Clenshaw recurrence rewritten as a discrete dynamical system. An extension of the barycentric representation to interval arithmetic is also considered which has a log-linear complexity as it takes advantage of a verified discrete cosine transform. We compare different methods and provide illustrative numerical experiments. In particular, our eigenvalue-based methods are interesting for bounding the range of high-degree interval polynomials. Some of the methods rigorously compute narrow enclosures for high degree Chebyshev expansions at thousands of points in a few seconds on an ordinary machine. We also illustrate how to employ our methods as an automatic a posteriori forward error analysis tool to monitor the accuracy of Chebfun feval command.

A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems

This paper presents a high performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al., the SVD solver relies on the polar decomposition using the QR Dynamically-Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying to the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. The QDWH-SVD framework achieves up to 3/8-fold on the Haswell/KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well and ill-conditioned matrices.

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]. One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains. The major challenge with sparse tiling is implementation  not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art introducing: a more efficient sparse tiling algorithm; support for distributed-memory parallelism; implementation in a publicly-available library, SLOPE; and a study of the performance impact in Seigen, an elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

Algorithm xxx: An Efficient Parallel Anisotropic Unstructured Delaunay Mesh Generator for Two-Dimensional Aerospace Computational Fluid Dynamics Analysis

A bottom-up approach to parallel anisotropic mesh generation is presented by building a mesh generator from principles. Applications focusing on high-lift design or dynamic stall, or numerical methods and modeling test cases still focus on two-dimensional domains. This automated parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by geometric boundaries and the density functions. This approach yields computational savings in mesh generation and flow solution through well-shaped anisotropic triangles, instead of isotropic triangles. A 79% parallel weak scaling efficiency on 1024 distributed memory nodes, and a 72% parallel efficiency over the fastest sequential isotropic mesh generator on 512 distributed memory nodes is shown through numerical experiments.

Algorithm xxx: Fast Algorithms for the Computation of the Minimum Distance of a Random Linear Code

The minimum distance of a code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is a crucial process to many problems in this area. In this paper, we present and evaluate a family of algorithms and implementations to compute the minimum distance of a random linear code over F2 that are faster than current implementations, both commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that render high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable performance improvements compared with state-of-the-art implementations widely used nowadays.

Algorithm xxx: pySDC - Prototyping spectral deferred corrections

In this paper we present the Python framework pySDC for solving collocation problems with spectral deferred correction methods (SDC) and their time-parallel variant PFASST, the parallel full approximation scheme in space and time. pySDC features many implementations of SDC and PFASST, from simple implicit time-stepping to high-order implicit-explicit or multi-implicit splitting and multi-level spectral deferred corrections. It comes with many different, pre-implemented examples and has seven tutorials to help new users with their first steps. Time-parallelism is implemented either in an emulated way for debugging and prototyping as well as using MPI for benchmarking. The code is fully documented and tested using continuous integration, including most results of previous publications. Here, we describe the structure of the code by taking two different perspectives: the user's and the developer's perspective. While the first sheds light on the front-end, the examples and the tutorials, the second is used to describe the underlying implementation and the data structures. We show three different examples to highlight various aspects of the implementation, the capabilities and the usage of pySDC. Also, couplings to the FEniCS framework and PETSc, the latter including spatial parallelism with MPI, are described.

Algorithm xxx: The Robust LMI Parser: A Toolbox to Construct LMI Conditions for Uncertain Systems

The ROLMIP (Robust LMI Parser) is a toolbox specialized in control theory for uncertain linear systems, built to work under MATLAB jointly with YALMIP, to ease the programming of sufficient Linear Matrix Inequality (LMI) conditions that, if feasible, assure the validity of parameter-dependent LMIs in the entire set of uncertainty considered. This paper presents the new version of the ROLMIP toolbox, that was completely remodeled to provide a high-level user friendly interface, to cope with distinct uncertain domains (hypercube and multi-simplex), and to treat time-varying parameters in discrete- and continuous-time. Through simple commands, the user is able to define matrix polynomials, as well as to described the desired parameter-dependent LMIs in an easy way, considerably reducing the programming time to end up with implementable LMI conditions and, therefore, helping the popularization of the state-of-the-art robust control methods for uncertain systems based on LMIs among graduate students, researchers and engineers in control systems.

Fast matrix-free evaluation of discontinuous Galerkin finite element operators

We present an algorithmic framework for matrix-free evaluation of discontinuous Galerkin finite element operators based on sum factorization on quadrilateral and hexahedral meshes. We identify a set of kernels for fast quadrature on cells and faces targeting a wide class of weak forms originating from linear and nonlinear partial differential equations. Different algorithms and data structures for the implementation of operator evaluation are compared in an in-depth performance analysis. The sum factorization kernels are optimized by vectorization over several cells and faces and an even-odd decomposition of the one-dimensional compute kernels. In isolation our implementation then reaches up to 60% of arithmetic peak on Intel Haswell and Broadwell processors and up to 50% of arithmetic peak on Intel Knights Landing. The full operator evaluation reaches only about half that throughput due to memory bandwidth limitations from loading the input and output vectors, MPI ghost exchange, as well as handling variable coefficients and the geometry. Our performance analysis shows that the results are often within 10% of the available memory bandwidth for the proposed implementation, with the exception of the Cartesian mesh case where the cost of gather operations and MPI communication are more substantial.

Faithfully Rounded Floating-point Computations

We present a pair arithmetic for the four basic operations and square root. It can be regarded as a simplified, more efficient double-double arithmetic. We prove rigorous error bounds for the computed result depending on the relative rounding error unit u according to base ², the size of the arithmetic expression, and possibly a condition measure. Under precisely specified assumptions, the result is proved to be faithfully rounded for up to 1/sqrt(²u)-2 operations. The assumptions are weak enough to apply to many algorithms. For example, our findings cover a number of previously published algorithms to compute faithfully rounded results, among them Horner's scheme, products, sums and dot products, or Euclidean norm. Beyond that, several other problems can be analyzed such as polynomial interpolation, orientation problems, Householder transformations, or the smallest singular value of Hilbert matrices of large size.

Algorithm xxx: Computation of Multi-Degree B-Splines

Multi-degree splines are smooth piecewise-polynomial functions where the pieces can have different degrees. We describe a simple algorithmic construction of a set of basis functions for the space of multi-degree splines, with similar properties to standard B-splines. These basis functions are called multi-degree B-splines (or MDB-splines). The construction relies on an extraction operator that represents all MDB-splines as linear combinations of local B-splines of different degrees. This enables the use of existing efficient algorithms for B-spline evaluations and refinements in the context of multi-degree splines. A Matlab implementation is provided to illustrate the computation and use of MDB-splines.

ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems

Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures.

The Implementation of the Colored Abstract Simplicial Complex and its Application to Mesh Generation

We introduce CASC: a new, modern, and header-only C++ library which provides a data structure to represent arbitrary dimension abstract simplicial complexes (ASC) with user-defined classes stored directly on the simplices at each dimension. This is accomplished by using the latest C++ language features including variadic template parameters introduced in C++11 and automatic function return type deduction from C++14. Effectively CASC decouples the representation of the topology from the interactions of user data. We present the innovations and design principles of the data structure and related algorithms. This includes a meta-data aware decimation algorithm which is general for collapsing simplices of any dimension. We also present an example application of this library to represent an orientable surface mesh.

Algorithm XXX: Efficient Computation with Kronecker Products

An algorithm for multiplying a chain of Kronecker products by a matrix is described. The algorithm does not require that the Kronecker chain actually be computed and the main computational work is a series of matrix multiplications. Use of the algorithm can lead to substantial savings in both memory usage and computational speed. Although similar algorithms have been described before, this paper makes two novel contributions. First, it shows how shuffling of data can be (largely) avoided. Second, it provides a simple method to determine the optimal ordering of the workflow. A \matlab~implementation is provided in an appendix.

High Performance Implementation of Elliptic Curve Cryptography using Vector Instructions

Elliptic curve cryptosystems are considered as an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used to instantiate the Edwards Digital Signature Algorithm (EdDSA), resulting in new signature schemes called Ed25519 and Ed448. Nowadays, EdDSA is increasingly used for securing Internet communications. In this work, we focus on the secure and efficient software implementation of elliptic curve-based algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. The set of our contributions leads to a high-performance software library for AVX2-ready processors. For instance, our library computes digital signatures 17-22% (for Ed25519) and 16-18% (for Ed448) faster than previous optimized implementations. In addition, our library improves in about 16-20% the execution time of the Diffie-Hellman key exchange protocol using Curve25519 and Curve448.

All ACM Journals | See Full Journal Index

Search TOMS
enter search term and/or author name