ACM Transactions on

Mathematical Software (TOMS)

Latest Articles

Computing the Braid Monodromy of Completely Reducible n-gonal Curves

Braid monodromy is an important tool for computing invariants of curves and surfaces. In this paper, the rectangular braid diagram (RBD) method is... (more)

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures

Matrices coming from elliptic Partial Differential Equations have been shown to have a low-rank... (more)

Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

Hierarchical matrices are space- and time-efficient representations of dense matrices that exploit the low-rank structure of matrix blocks at different levels of granularity. The hierarchically low-rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity... (more)

Polar Affine Arithmetic: Optimal Affine Approximation and Operation Development for Computation in Polar Form Under Uncertainty

Uncertainties practically arise from numerous factors, such as ambiguous information, inaccurate model, and environment disturbance. Interval arithmetic has emerged to solve problems with uncertain parameters, especially in the computational process where only the upper and lower bounds of parameters can be ascertained. In rectangular coordinate... (more)

Verified Newton–Raphson Iteration for Multiplicative Inverses Modulo Powers of Any Base

We identify two faults in a published algorithm for fast computation of multiplicative inverses... (more)

Algorithm 991: The 2D Tree Sliding Window Discrete Fourier Transform

We present a new algorithm for the 2D sliding window discrete Fourier transform. Our algorithm avoids repeating calculations in overlapping windows by storing them in a tree data-structure based on the ideas of the Cooley-Tukey fast Fourier transform. For an N0 × N1 array and n0 × n1 windows, our algorithm takes O(N0 N1 n0 n1)... (more)


TOMS Replicated Computational Results (RCR) Initiative.

ACM TOMS has introduced a new initiative to optionally review the computational results of a TOMS submission. This new effort is intended to assist in improving the quality of scientific publication for TOMS and for the computational science community as a whole. Manuscripts that successfully complete the RCR Review process receive the RCR designation when published. If you are interested in participating in this initiative, either as an author or reviewer, please contact the TOMS Editor-in-Chief. Details of the TOMS RCR Initiative are available here.

Welcome Video for Associate Editors

ACM has produced a short video explaining the role of new Associate Editors. Even if you aren't new to the job; it's still worth a view!

About TOMS

The purpose of the ACM Transactions on Mathematical Software (TOMS) is to communicate important research results addressing the development, evaluation and use of mathematical software...

Forthcoming Articles

The Peano software---parallel, automaton-based, dynamically adaptive grid traversals

Client-side computational optimization

Mobile platforms have matured to a point where they can provide the infrastructure needed by sophisticated optimization codes. This opens both the possibility to envisage new market interest for distributed application codes and the opportunity to intensify research on algorithms that require limited computational power, such as that which is oered by mobile platforms. This paper reports on some exploratory experiences in this area, with reference to requests and opportunities for real world applications and to opening computational validation using matheuristics codes. A specic use case requiring to address the feasibility version of the generalized assignment problem is discussed.

BBCPOP: A Sparse Doubly Nonnegative Relaxation of Polynomial Optimization Problems with Binary, Box and Complementarity Constraints

BBCPOP is a MATLAB implementation of a hierarchy of sparse doubly nonnegative (DNN) relaxations of a class of polynomial optimization problems (POPs) with binary, box and complementarity (BBC) constraints. Given a POP in the class and a relaxation order, BBCPOP constructs a simple conic optimization problem (COP), which serves as a DNN relaxation of the POP, then solves the COP by applying the bisection and projection method. The COP is expressed with a linear objective function and constraints described as a single hyperplane and two cones, which are the Cartesian product of positive semidefinite cones and a polyhedral cone induced from the BBC constraints. BBCPOP aims to compute a tight lower bound for the optimal value of a large-scale POP in the class that is beyond the comfort zone of existing software packages. The robustness, reliability and efficiency of BBCPOP are demonstrated in comparison to the software SDP package SDPNAL+ on randomly generated sparse POPs of degree 2 and 3 with up to a few thousands variables, and ones of degree 4, 5, 6. and 8 with up to a few hundred variables. Comparison with other BBC POPs that arise from combinatorial optimization such as quadratic assignment problems are reported.

Remark on "Algorithm 680: evaluation of the complex error function": Cause and Remedy for the Loss of Accuracy Near the Real Axis

In this remark, we point out a cause of loss-of-accuracy in calculating the Faddeyeva function, w(z) using Algorithm 680, near the real axis, and provide a simple remedy to recover the accuracy of the code as one of the important references for accuracy comparison.

Enclosing Chebyshev Expansions in Linear Time

We consider the problem of computing rigorous enclosures for polynomials represented in Chebyshev basis. Our aim is to compare and develop algorithms with a linear complexity in terms of the polynomial degree. A first category of methods rely on a direct interval evaluation of the given Chebyshev expansion in which Chebyshev polynomials are bounded e.g., with a divide and conquer strategy. Our main category of methods which are based on the Clenshaw recurrence includes interval Clenshaw with defect correction (ICDC), and the spectral transformation of Clenshaw recurrence rewritten as a discrete dynamical system. An extension of the barycentric representation to interval arithmetic is also considered which has a log-linear complexity as it takes advantage of a verified discrete cosine transform. We compare different methods and provide illustrative numerical experiments. In particular, our eigenvalue-based methods are interesting for bounding the range of high-degree interval polynomials. Some of the methods rigorously compute narrow enclosures for high degree Chebyshev expansions at thousands of points in a few seconds on an ordinary machine. We also illustrate how to employ our methods as an automatic a posteriori forward error analysis tool to monitor the accuracy of Chebfun feval command.

A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems

This paper presents a high performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al., the SVD solver relies on the polar decomposition using the QR Dynamically-Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying to the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. The QDWH-SVD framework achieves up to 3/8-fold on the Haswell/KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well and ill-conditioned matrices.

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]. One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains. The major challenge with sparse tiling is implementation  not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art introducing: a more efficient sparse tiling algorithm; support for distributed-memory parallelism; implementation in a publicly-available library, SLOPE; and a study of the performance impact in Seigen, an elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

Algorithm xxx: An Efficient Parallel Anisotropic Unstructured Delaunay Mesh Generator for Two-Dimensional Aerospace Computational Fluid Dynamics Analysis

A bottom-up approach to parallel anisotropic mesh generation is presented by building a mesh generator from principles. Applications focusing on high-lift design or dynamic stall, or numerical methods and modeling test cases still focus on two-dimensional domains. This automated parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by geometric boundaries and the density functions. This approach yields computational savings in mesh generation and flow solution through well-shaped anisotropic triangles, instead of isotropic triangles. A 79% parallel weak scaling efficiency on 1024 distributed memory nodes, and a 72% parallel efficiency over the fastest sequential isotropic mesh generator on 512 distributed memory nodes is shown through numerical experiments.

Algorithm xxx: Fast Algorithms for the Computation of the Minimum Distance of a Random Linear Code

The minimum distance of a code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is a crucial process to many problems in this area. In this paper, we present and evaluate a family of algorithms and implementations to compute the minimum distance of a random linear code over F2 that are faster than current implementations, both commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that render high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable performance improvements compared with state-of-the-art implementations widely used nowadays.

Algorithm xxx: pySDC - Prototyping spectral deferred corrections

In this paper we present the Python framework pySDC for solving collocation problems with spectral deferred correction methods (SDC) and their time-parallel variant PFASST, the parallel full approximation scheme in space and time. pySDC features many implementations of SDC and PFASST, from simple implicit time-stepping to high-order implicit-explicit or multi-implicit splitting and multi-level spectral deferred corrections. It comes with many different, pre-implemented examples and has seven tutorials to help new users with their first steps. Time-parallelism is implemented either in an emulated way for debugging and prototyping as well as using MPI for benchmarking. The code is fully documented and tested using continuous integration, including most results of previous publications. Here, we describe the structure of the code by taking two different perspectives: the user's and the developer's perspective. While the first sheds light on the front-end, the examples and the tutorials, the second is used to describe the underlying implementation and the data structures. We show three different examples to highlight various aspects of the implementation, the capabilities and the usage of pySDC. Also, couplings to the FEniCS framework and PETSc, the latter including spatial parallelism with MPI, are described.

Faithfully Rounded Floating-point Computations

We present a pair arithmetic for the four basic operations and square root. It can be regarded as a simplified, more efficient double-double arithmetic. We prove rigorous error bounds for the computed result depending on the relative rounding error unit u according to base ², the size of the arithmetic expression, and possibly a condition measure. Under precisely specified assumptions, the result is proved to be faithfully rounded for up to 1/sqrt(²u)-2 operations. The assumptions are weak enough to apply to many algorithms. For example, our findings cover a number of previously published algorithms to compute faithfully rounded results, among them Horner's scheme, products, sums and dot products, or Euclidean norm. Beyond that, several other problems can be analyzed such as polynomial interpolation, orientation problems, Householder transformations, or the smallest singular value of Hilbert matrices of large size.

A note on using performance and data profiles for training algorithms

It is shown how to use the performance and data profile benchmarking tools to improve algorithms performance. An illustration for the BFO derivative-free optimizer suggests that the obtained gains are potentially significant.

ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems

Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures.

Algorithm XXX: Efficient Computation with Kronecker Products

An algorithm for multiplying a chain of Kronecker products by a matrix is described. The algorithm does not require that the Kronecker chain actually be computed and the main computational work is a series of matrix multiplications. Use of the algorithm can lead to substantial savings in both memory usage and computational speed. Although similar algorithms have been described before, this paper makes two novel contributions. First, it shows how shuffling of data can be (largely) avoided. Second, it provides a simple method to determine the optimal ordering of the workflow. A \matlab~implementation is provided in an appendix.

High Performance Implementation of Elliptic Curve Cryptography using Vector Instructions

Elliptic curve cryptosystems are considered as an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used to instantiate the Edwards Digital Signature Algorithm (EdDSA), resulting in new signature schemes called Ed25519 and Ed448. Nowadays, EdDSA is increasingly used for securing Internet communications. In this work, we focus on the secure and efficient software implementation of elliptic curve-based algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. The set of our contributions leads to a high-performance software library for AVX2-ready processors. For instance, our library computes digital signatures 17-22% (for Ed25519) and 16-18% (for Ed448) faster than previous optimized implementations. In addition, our library improves in about 16-20% the execution time of the Diffie-Hellman key exchange protocol using Curve25519 and Curve448.

All ACM Journals | See Full Journal Index

Search TOMS
enter search term and/or author name