Latest Articles

## A Computational Architecture for Coupling Heterogeneous Numerical Models and Computing Coupled Derivatives

One of the challenges in computational modeling is coupling models to solve multidisciplinary... (more)

## Practical Polytope Volume Approximation

We experimentally study the fundamental problem of computing the volume of a convex polytope given as an intersection of linear halfspaces. We implement and evaluate randomized polynomial-time algorithms for accurately approximating the polytope’s volume in high dimensions (e.g., few hundreds) based onhit-and-run random walks. To carry out... (more)

## BootCMatch: A Software Package for Bootstrap AMG Based on Graph Weighted Matching

This article has two main objectives: one is to describe some extensions of an adaptive Algebraic Multigrid (AMG) method of the form previously proposed by the first and third authors, and a second one is to present a new software framework, named BootCMatch, which implements all the components needed to build and apply the described adaptive AMG... (more)

## Solution of Dense Linear Systems via Roundoff-Error-Free Factorization Algorithms: Theoretical Connections and Computational Comparisons

Exact solving of systems of linear equations (SLEs) is a fundamental subroutine within number theory, formal verification of mathematical proofs, and... (more)

## Interval Enclosures of Upper Bounds of Roundoff Errors Using Semidefinite Programming

A long-standing problem related to floating-point implementation of numerical programs is to provide efficient yet precise analysis of output errors.... (more)

## BLASFEO: Basic Linear Algebra Subroutines for Embedded Optimization

Basic Linear Algebra Subroutines for Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and small-scale high-performance computing, in general. A key difference with respect to existing high-performance implementations of... (more)

## ROPTLIB: An Object-Oriented C++ Library for Optimization on Riemannian Manifolds

Riemannian optimization is the task of finding an optimum of a real-valued function defined on a Riemannian manifold. Riemannian optimization has been a topic of much interest over the past few years due to many applications including computer vision, signal processing, and numerical linear algebra. The substantial background required to... (more)

## Validated and Numerically Efficient Chebyshev Spectral Methods for Linear Ordinary Differential Equations

In this work, we develop a validated numeric method for the solution of linear ordinary differential... (more)

## Secure and Fast Encryption (SAFE) with Classical Random Number Generators

Pseudo-random number generators (PRNGs) play an important role in both areas of computer simulation and computer security. Currently, there appears to... (more)

## Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture

Sparse matrix vector multiplication (SpMV) is an important computational kernel in traditional high-performance computing and emerging data-intensive... (more)

## Algorithm 989: perm_mateda: A Matlab Toolbox of Estimation of Distribution Algorithms for Permutation-based Combinatorial Optimization Problems

Permutation problems are combinatorial optimization problems whose solutions are naturally codified as permutations. Due to their complexity, motivated principally by the factorial cardinality of the search space of solutions, they have been a recurrent topic for the artificial intelligence and operations research community. Recently, among the... (more)

## Algorithm 990: Efficient Atlasing and Search of Configuration Spaces of Point-Sets Constrained by Distance Intervals

For configurations of point-sets that are pairwise constrained by distance intervals, the EASAL software implements a suite of algorithms that characterize the structure and geometric properties of the configuration space. The algorithms generate, describe, and explore these configuration spaces using generic rigidity properties, classical results... (more)

##### NEWS

TOMS Replicated Computational Results (RCR) Initiative.

ACM TOMS has introduced a new initiative to optionally review the computational results of a TOMS submission. This new effort is intended to assist in improving the quality of scientific publication for TOMS and for the computational science community as a whole. Manuscripts that successfully complete the RCR Review process receive the RCR designation when published. If you are interested in participating in this initiative, either as an author or reviewer, please contact the TOMS Editor-in-Chief. Details of the TOMS RCR Initiative are available here.

##### Forthcoming Articles
Client-side computational optimization

Mobile platforms have matured to a point where they can provide the infrastructure needed by sophisticated optimization codes. This opens both the possibility to envisage new market interest for distributed application codes and the opportunity to intensify research on algorithms that require limited computational power, such as that which is oered by mobile platforms. This paper reports on some exploratory experiences in this area, with reference to requests and opportunities for real world applications and to opening computational validation using matheuristics codes. A specic use case requiring to address the feasibility version of the generalized assignment problem is discussed.

BBCPOP: A Sparse Doubly Nonnegative Relaxation of Polynomial Optimization Problems with Binary, Box and Complementarity Constraints

BBCPOP is a MATLAB implementation of a hierarchy of sparse doubly nonnegative (DNN) relaxations of a class of polynomial optimization problems (POPs) with binary, box and complementarity (BBC) constraints. Given a POP in the class and a relaxation order, BBCPOP constructs a simple conic optimization problem (COP), which serves as a DNN relaxation of the POP, then solves the COP by applying the bisection and projection method. The COP is expressed with a linear objective function and constraints described as a single hyperplane and two cones, which are the Cartesian product of positive semidefinite cones and a polyhedral cone induced from the BBC constraints. BBCPOP aims to compute a tight lower bound for the optimal value of a large-scale POP in the class that is beyond the comfort zone of existing software packages. The robustness, reliability and efficiency of BBCPOP are demonstrated in comparison to the software SDP package SDPNAL+ on randomly generated sparse POPs of degree 2 and 3 with up to a few thousands variables, and ones of degree 4, 5, 6. and 8 with up to a few hundred variables. Comparison with other BBC POPs that arise from combinatorial optimization such as quadratic assignment problems are reported.

Extended BACOLI: Solving one-dimensional multi-scale parabolic PDE systems with error control

BACOLI is a Fortran software package for solving one-dimensional parabolic partial differential equations with separated boundary conditions by B-spline adaptive collocation methods. A distinguishing feature of BACOLI is its ability to estimate and control error as well as correspondingly adapt meshes in both space and time. However, many models of scientific interest can be formulated as multi-scale parabolic PDE systems, i.e., models that couple a system of parabolic partial differential equations describing dynamics on a global scale with a system of spatially uncoupled partial differential equations describing dynamics on a local scale. This article describes the Fortran software eBACOLI, the extension of \BACOLI\ to solve such multi-scale models. The performance of the extended code is demonstrated to be statistically equivalent to the original for purely parabolic PDE systems. Results from \eBACOLI\ are given for various multi-scale models from the extended problem class considered.Temporal work-precision behaviour for a difficult multi-scale example shows that it is generally advisable to choose the lowest-degree B-splines that yield a convergent solution.

Remark on "Algorithm 680: evaluation of the complex error function": Cause and Remedy for the Loss of Accuracy Near the Real Axis

In this remark, we point out a cause of loss-of-accuracy in calculating the Faddeyeva function, w(z) using Algorithm 680, near the real axis, and provide a simple remedy to recover the accuracy of the code as one of the important references for accuracy comparison.

Verified Newton-Raphson Iteration for Multiplicative Inverses Modulo Powers of Any Base

We identify two faults in a published algorithm for efficient computation of multiplicative inverses modulo prime powers. We patch the algorithm and present machine assisted proofs of correctness of the repair. Our formal proofs also reveal that being prime is an unnecessary demand for the power base, thus attributing a wider scope of applications to the repaired algorithm.

Polar Affine Arithmetic: Optimal Approximation and Operation Development for computation in polar form under uncertainty

Interval arithmetic has emerged to solve problems with uncertain parameters which are represented by upper and lower bounds. In rectangular coordinate systems, the basic interval operations and improved interval algorithms have been developed and adopted in the numerical analysis. On the other hand, in polar coordinate systems, interval arithmetic still suffers from significant issues of complex computation and overestimation. This paper defines a polar affine quantity and develops a polar affine arithmetic (PAA) that extends affine arithmetic to the polar coordinate systems, which performs much better in many aspects than the corresponding polar interval arithmetic (PIA). Basic arithmetic operations are developed based on the complex affine arithmetic. The Chebyshev approximation theory and the min-range approximation theory are used to identify the best affine approximation of quantities. PAA can accurately keep track the interdependency among multiple variables throughout the calculation procedure, which prominently reduces the solution conservativeness. Numerical case studies in MATLAB programs show that, compared with benchmark results from the Monte Carlo (MC) method, the proposed PAA ensures the completeness of the exact solution, while presenting a much more compact solution region than PIA. PAA has a great potential in research fields including numerical analysis, computer graphics, and engineering optimization.

A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems

This paper presents a high performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al., the SVD solver relies on the polar decomposition using the QR Dynamically-Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying to the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. The QDWH-SVD framework achieves up to 3/8-fold on the Haswell/KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well and ill-conditioned matrices.

Spin Summations: A High-Performance Perspective

Besides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate formsincluding CCSD(T) and CCSDT(Q)are spin summations. At a first sight, spin summations are operations similar to tensor transpositions; a closer look instead reveals additional challenges to high- performance calculations, including temporal locality as well as scattered memory accesses. This publication explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm, achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4× and 5.5× over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2× over an out-of-place variant.

Computing the Braid Monodromy of Completely Reducible $n$-gonal Curves

Braid monodromy is an important tool for computing invariants of curves and surfaces. In this paper, the rectangular braid diagram technique is proposed to compute the braid monodromy of a completely reducible $n$-gonal curve, i.e. the curves in the form $(y-y_1(x))...(y-y_n(x))=0$ where $n \in \mathbb{Z}^{+}$ and $y_i \in \mathbb{C}[x]$. Also, an algorithm is implemented to compute the Alexander polynomial of these curve complements using Burau representations of braid groups. Examples for each computation are provided.

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]. One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains. The major challenge with sparse tiling is implementation  not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art introducing: a more efficient sparse tiling algorithm; support for distributed-memory parallelism; implementation in a publicly-available library, SLOPE; and a study of the performance impact in Seigen, an elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

Algorithm xxx: An Efficient Parallel Anisotropic Unstructured Delaunay Mesh Generator for Two-Dimensional Aerospace Computational Fluid Dynamics Analysis

A bottom-up approach to parallel anisotropic mesh generation is presented by building a mesh generator from principles. Applications focusing on high-lift design or dynamic stall, or numerical methods and modeling test cases still focus on two-dimensional domains. This automated parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by geometric boundaries and the density functions. This approach yields computational savings in mesh generation and flow solution through well-shaped anisotropic triangles, instead of isotropic triangles. A 79% parallel weak scaling efficiency on 1024 distributed memory nodes, and a 72% parallel efficiency over the fastest sequential isotropic mesh generator on 512 distributed memory nodes is shown through numerical experiments.

Algorithm xxx: Fast Algorithms for the Computation of the Minimum Distance of a Random Linear Code

The minimum distance of a code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is a crucial process to many problems in this area. In this paper, we present and evaluate a family of algorithms and implementations to compute the minimum distance of a random linear code over F2 that are faster than current implementations, both commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that render high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable performance improvements compared with state-of-the-art implementations widely used nowadays.

An OpenGL and C++ based function library for curve and surface modeling in a large class of extended Chebyshev spaces

Applying original and existing theoretical results, we propose a platform-independent multi-threaded function library that provides data structures to generate, differentiate and render both the ordinary basis and the non-negative normalized B-basis of an arbitrary extended Chebyshev (EC) space that comprises the constants and can be identified with the solution space of a user-defined constant-coefficient homogeneous linear differential equation. Using the obtained non-negative normalized B-bases, our library can also generate, (partially) differentiate, modify and visualize a large family of so-called B-curves and tensor product B-surfaces. Moreover, the library also implements methods that can be used to perform general order elevation, to subdivide B-curves and B-surfaces by means of general de Casteljau-like B-algorithms, and to generate general basis transformations for the control point based exact description of arbitrary integral curves and surfaces that are described in traditional parametric form by means of the ordinary bases of the underlying EC spaces. Independently of the algebraic, exponential, trigonometric or mixed type of the applied EC space, the proposed library is numerically stable and efficient up to a reasonable dimension number and may be useful for academics and engineers in the fields of Approximation Theory, Computer Aided Geometric Design, Computer Graphics, Isogeometric and Numerical Analysis.

The Peano software---parallel, automaton-based, dynamically adaptive grid traversals

We discuss the design decisions, design alternatives and rationale behind the third generation of Peano, a framework for dynamically adaptive Cartesian meshes derived from spacetrees. Peano ties the mesh traversal to the mesh storage and supports only one element-wise traversal order resulting from space-filling curves. The user is not free to choose a traversal order herself. The traversal can exploit regular grid subregions and shared memory as well as distributed memory systems with almost no modifications to a serial application code. Relying on a formalism of the software design at hands of two interacting automata---one automaton for the multiscale grid traversal and one for the application-specific algorithmic steps---we discuss the chosen callback-based programming paradigm, supported application types and the two data storage schemes realised, before we detail high-performance computing aspects and lessons learned. Special emphasis is put on observations regarding the used programming and algorithmic concepts. This transforms our report from a one way to implement things'' code description into a generic alternatives and rationale summary for some design decisions to be made for any tree-based adaptive mesh refinement software.

Algorithm xxx: pySDC - Prototyping spectral deferred corrections

In this paper we present the Python framework pySDC for solving collocation problems with spectral deferred correction methods (SDC) and their time-parallel variant PFASST, the parallel full approximation scheme in space and time. pySDC features many implementations of SDC and PFASST, from simple implicit time-stepping to high-order implicit-explicit or multi-implicit splitting and multi-level spectral deferred corrections. It comes with many different, pre-implemented examples and has seven tutorials to help new users with their first steps. Time-parallelism is implemented either in an emulated way for debugging and prototyping as well as using MPI for benchmarking. The code is fully documented and tested using continuous integration, including most results of previous publications. Here, we describe the structure of the code by taking two different perspectives: the user's and the developer's perspective. While the first sheds light on the front-end, the examples and the tutorials, the second is used to describe the underlying implementation and the data structures. We show three different examples to highlight various aspects of the implementation, the capabilities and the usage of pySDC. Also, couplings to the FEniCS framework and PETSc, the latter including spatial parallelism with MPI, are described.

Faithfully Rounded Floating-point Computations

We present a pair arithmetic for the four basic operations and square root. It can be regarded as a simplified, more efficient double-double arithmetic. We prove rigorous error bounds for the computed result depending on the relative rounding error unit u according to base ², the size of the arithmetic expression, and possibly a condition measure. Under precisely specified assumptions, the result is proved to be faithfully rounded for up to 1/sqrt(²u)-2 operations. The assumptions are weak enough to apply to many algorithms. For example, our findings cover a number of previously published algorithms to compute faithfully rounded results, among them Horner's scheme, products, sums and dot products, or Euclidean norm. Beyond that, several other problems can be analyzed such as polynomial interpolation, orientation problems, Householder transformations, or the smallest singular value of Hilbert matrices of large size.

A note on using performance and data profiles for training algorithms

It is shown how to use the performance and data profile benchmarking tools to improve algorithms performance. An illustration for the BFO derivative-free optimizer suggests that the obtained gains are potentially significant.

ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems

Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using the modern software engineering concepts which favor a simple integration in application codes and a straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures.

On quality of implementation of Fortran 2008 complex intrinsic functions on branch cuts

Branch cuts in complex functions in combination with signed zero and signed infinity have important uses in fracture mechanics, jet flow and aerofoil analysis. We present benchmarks for validating Fortran 2008 complex functions - LOG, SQRT, ASIN, ACOS, ATAN, ASINH, ACOSH and ATANH - on branch cuts with arguments of all 3 IEEE floating point binary formats: binary32, binary64 and binary128. Results are reported with 8 Fortran 2008 compilers: GCC, Flang, Cray, Oracle, PGI, Intel, NAG and IBM. Multiple test failures were revealed, e.g. wrong signs of results or unexpected overflow, underflow, or NaN. We conclude that the quality of implementation of these Fortran 2008 intrinsics in many compilers is not yet sufficient to remove the need for special code for branch cuts. The test results are complemented by conformal maps of the branch cuts and detailed derivations of the values of these functions on branch cuts, to be used as a reference. The benchmarks are freely available from cmplx.sf.net. This work will be of interest to engineers who use complex functions, as well as to compiler and maths library developers.

Algorithm XXX: Efficient Computation with Kronecker Products

An algorithm for multiplying a chain of Kronecker products by a matrix is described. The algorithm does not require that the Kronecker chain actually be computed and the main computational work is a series of matrix multiplications. Use of the algorithm can lead to substantial savings in both memory usage and computational speed. Although similar algorithms have been described before, this paper makes two novel contributions. First, it shows how shuffling of data can be (largely) avoided. Second, it provides a simple method to determine the optimal ordering of the workflow. A \matlab~implementation is provided in an appendix.

A Unified 2D/3D Large Scale Software Environment for Nonlinear Inverse Problems

Large scale parameter estimation problems are some of the most computationally demanding problems. An academic researcher's domain-specific knowledge often precludes that of software design, which results in software frameworks for inversion that are technically correct, but not scalable to realistically-sized problems. On the other hand, the computational demands of the problem for realistic problems result in industrial codebases that are geared solely for performance, rather than comprehensibility or flexibility. We propose a new software design that bridges the gap between these two seemingly disparate worlds. A hierarchical and modular design allows a user to delve into as much detail as she desires, while using high performance primitives at the lower levels. Our code has the added benefit of actually reflecting the underlying mathematics of the problem, which lowers the cognitive load on user using it and reduces the initial startup period before a researcher can be fully productive. We also introduce a new preconditioner for the Helmholtz equation that is suitable for fault-tolerant distributed systems. Numerical experiments on a variety of 2D and 3D test problems demonstrate the effectiveness of this approach on scaling algorithms from small to large scale problems with minimal code changes.

High Performance Implementation of Elliptic Curve Cryptography using Vector Instructions

Elliptic curve cryptosystems are considered as an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used to instantiate the Edwards Digital Signature Algorithm (EdDSA), resulting in new signature schemes called Ed25519 and Ed448. Nowadays, EdDSA is increasingly used for securing Internet communications. In this work, we focus on the secure and efficient software implementation of elliptic curve-based algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. The set of our contributions leads to a high-performance software library for AVX2-ready processors. For instance, our library computes digital signatures 17-22% (for Ed25519) and 16-18% (for Ed448) faster than previous optimized implementations. In addition, our library improves in about 16-20% the execution time of the Diffie-Hellman key exchange protocol using Curve25519 and Curve448.