Latest Articles

## A Computational Architecture for Coupling Heterogeneous Numerical Models and Computing Coupled Derivatives

One of the challenges in computational modeling is coupling models to solve multidisciplinary... (more)

## Practical Polytope Volume Approximation

We experimentally study the fundamental problem of computing the volume of a convex polytope given as an intersection of linear halfspaces. We implement and evaluate randomized polynomial-time algorithms for accurately approximating the polytope’s volume in high dimensions (e.g., few hundreds) based on hit-and-run random walks. To carry out... (more)
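As a rough illustration of the hit-and-run random walk mentioned in this abstract, the sketch below performs one step inside a polytope written as `A x <= b`. It is a textbook version and not the authors' implementation; the function name `hit_and_run_step` and the unit-cube test data are hypothetical, and the polytope is assumed to be bounded with the starting point strictly inside it.

```python
import random


def hit_and_run_step(A, b, x, rng):
    """One hit-and-run step inside the polytope {x : A x <= b} (assumed bounded)."""
    # Pick a random direction; normalization is unnecessary for this purpose.
    d = [rng.gauss(0.0, 1.0) for _ in x]
    t_lo, t_hi = float("-inf"), float("inf")
    for row, bi in zip(A, b):
        ad = sum(r * di for r, di in zip(row, d))
        slack = bi - sum(r * xi for r, xi in zip(row, x))
        # Each halfspace a.(x + t d) <= b_i bounds t from one side.
        if ad > 0:
            t_hi = min(t_hi, slack / ad)
        elif ad < 0:
            t_lo = max(t_lo, slack / ad)
    # Sample uniformly on the chord through x in direction d.
    t = rng.uniform(t_lo, t_hi)
    return [xi + t * di for xi, di in zip(x, d)]


# Hypothetical example: the unit cube [0, 1]^3 as six halfspaces.
A = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
b = [1, 0, 1, 0, 1, 0]
rng = random.Random(0)
point = [0.5, 0.5, 0.5]
samples = []
for _ in range(1000):
    point = hit_and_run_step(A, b, point, rng)
    samples.append(point)
```

Volume algorithms of this kind use many such samples inside a multi-phase estimation scheme; the sketch covers only the sampling step.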

## BootCMatch: A Software Package for Bootstrap AMG Based on Graph Weighted Matching

This article has two main objectives: one is to describe some extensions of an adaptive Algebraic Multigrid (AMG) method of the form previously proposed by the first and third authors, and a second one is to present a new software framework, named BootCMatch, which implements all the components needed to build and apply the described adaptive AMG... (more)

## Solution of Dense Linear Systems via Roundoff-Error-Free Factorization Algorithms: Theoretical Connections and Computational Comparisons

Exact solving of systems of linear equations (SLEs) is a fundamental subroutine within number theory, formal verification of mathematical proofs, and... (more)

## Interval Enclosures of Upper Bounds of Roundoff Errors Using Semidefinite Programming

A long-standing problem related to floating-point implementation of numerical programs is to provide efficient yet precise analysis of output errors.... (more)

## BLASFEO: Basic Linear Algebra Subroutines for Embedded Optimization

Basic Linear Algebra Subroutines for Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and small-scale high-performance computing, in general. A key difference with respect to existing high-performance implementations of... (more)

## ROPTLIB: An Object-Oriented C++ Library for Optimization on Riemannian Manifolds

Riemannian optimization is the task of finding an optimum of a real-valued function defined on a Riemannian manifold. Riemannian optimization has been a topic of much interest over the past few years due to many applications including computer vision, signal processing, and numerical linear algebra. The substantial background required to... (more)

## Validated and Numerically Efficient Chebyshev Spectral Methods for Linear Ordinary Differential Equations

In this work, we develop a validated numeric method for the solution of linear ordinary differential... (more)

## Secure and Fast Encryption (SAFE) with Classical Random Number Generators

Pseudo-random number generators (PRNGs) play an important role in both areas of computer simulation and computer security. Currently, there appears to... (more)

## Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture

Sparse matrix vector multiplication (SpMV) is an important computational kernel in traditional high-performance computing and emerging data-intensive... (more)

## Algorithm 989: perm_mateda: A Matlab Toolbox of Estimation of Distribution Algorithms for Permutation-based Combinatorial Optimization Problems

Permutation problems are combinatorial optimization problems whose solutions are naturally codified as permutations. Due to their complexity, motivated principally by the factorial cardinality of the search space of solutions, they have been a recurrent topic for the artificial intelligence and operations research community. Recently, among the... (more)

## Algorithm 990: Efficient Atlasing and Search of Configuration Spaces of Point-Sets Constrained by Distance Intervals

For configurations of point-sets that are pairwise constrained by distance intervals, the EASAL software implements a suite of algorithms that characterize the structure and geometric properties of the configuration space. The algorithms generate, describe, and explore these configuration spaces using generic rigidity properties, classical results... (more)

##### NEWS

TOMS Replicated Computational Results (RCR) Initiative.

ACM TOMS has introduced a new initiative to optionally review the computational results of a TOMS submission. This new effort is intended to assist in improving the quality of scientific publication for TOMS and for the computational science community as a whole. Manuscripts that successfully complete the RCR Review process receive the RCR designation when published. If you are interested in participating in this initiative, either as an author or reviewer, please contact the TOMS Editor-in-Chief. Details of the TOMS RCR Initiative are available here.

##### Forthcoming Articles

Mathematics and Speed for Interval Arithmetic - A Complement to IEEE 1788

Abstract. The paper begins with an axiomatic definition of rounded arithmetic. The concepts of rounding and of rounded arithmetic operations are defined in an axiomatic manner fully independent of special data formats and encodings. Basic properties of floating-point and interval arithmetic can be derived directly from this abstract model. Interval operations are defined as set operations for elements of the set IR of closed and connected sets of real numbers. As such they form an algebraically closed subset of the powerset of the real numbers. This property leads to explicit formulas for the arithmetic operations of floating-point intervals of IF, which are executable on the computer. Arithmetic for intervals of IF forms an exception-free calculus, i.e., arithmetic operations for intervals of IF always lead to intervals of IF again. Later sections are concerned with programming support and hardware for interval arithmetic. Section 9 illustrates that interval arithmetic as developed in this paper already has a long tradition. Products based on these ideas have been available since 1980. Implementing what the paper advocates would have a profound effect on mathematical software. Modern processor architecture comes quite close to what is requested in this paper.
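To make the idea of executable interval operations concrete, here is a minimal sketch of interval addition and multiplication on floating-point bounds. It is not the paper's formulation: the `Interval` class and its helpers are hypothetical, only bounded intervals are handled, and directed rounding is imitated by widening each bound by one unit in the last place with `math.nextafter` (Python 3.9 or later), which is more conservative than true rounding downward and upward.

```python
import math
from dataclasses import dataclass


def _down(x: float) -> float:
    """Next float below x; a conservative stand-in for rounding downward."""
    return math.nextafter(x, -math.inf)


def _up(x: float) -> float:
    """Next float above x; a conservative stand-in for rounding upward."""
    return math.nextafter(x, math.inf)


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        # [a, b] + [c, d] = [a + c, b + d], with bounds rounded outward.
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __mul__(self, other: "Interval") -> "Interval":
        # The product range is spanned by the four endpoint products.
        products = (
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        )
        return Interval(_down(min(products)), _up(max(products)))

    def __contains__(self, x: float) -> bool:
        return self.lo <= x <= self.hi


s = Interval(0.1, 0.1) + Interval(0.2, 0.2)
p = Interval(-1.0, 2.0) * Interval(3.0, 4.0)
```

Because every bound is pushed outward, the computed interval always encloses the exact result, at the cost of being slightly wider than necessary.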

Polar Affine Arithmetic: Optimal Approximation and Operation Development for computation in polar form under uncertainty

Interval arithmetic has emerged to solve problems with uncertain parameters which are represented by upper and lower bounds. In rectangular coordinate systems, the basic interval operations and improved interval algorithms have been developed and adopted in numerical analysis. On the other hand, in polar coordinate systems, interval arithmetic still suffers from significant issues of complex computation and overestimation. This paper defines a polar affine quantity and develops a polar affine arithmetic (PAA) that extends affine arithmetic to polar coordinate systems and performs much better in many respects than the corresponding polar interval arithmetic (PIA). Basic arithmetic operations are developed based on the complex affine arithmetic. The Chebyshev approximation theory and the min-range approximation theory are used to identify the best affine approximation of quantities. PAA can accurately keep track of the interdependency among multiple variables throughout the calculation procedure, which markedly reduces the conservativeness of the solution. Numerical case studies in MATLAB programs show that, compared with benchmark results from the Monte Carlo (MC) method, the proposed PAA ensures the completeness of the exact solution, while presenting a much more compact solution region than PIA. PAA has great potential in research fields including numerical analysis, computer graphics, and engineering optimization.

randUTV: A blocked randomized algorithm for computing a rank-revealing UTV factorization

A randomized algorithm for efficiently computing a so-called UTV factorization is presented. Given a matrix $A$, the algorithm "randUTV" computes a factorization $A = UTV^{*}$, where $U$ and $V$ have orthonormal columns, and $T$ is triangular (either upper or lower, whichever is preferred). The algorithm randUTV is developed primarily to be a fast and easily parallelized alternative to algorithms for computing the Singular Value Decomposition (SVD). randUTV provides accuracy very close to that of the SVD for problems such as low-rank approximation, solving ill-conditioned linear systems, determining bases for various subspaces associated with the matrix, etc. Moreover, randUTV produces highly accurate approximations to the singular values of $A$. Unlike the SVD, the proposed randomized algorithm builds a UTV factorization in an incremental, single-stage, and non-iterative way, making it possible to halt the factorization process once a specified tolerance has been met. Numerical experiments comparing the accuracy and speed of randUTV to the SVD are presented. These experiments demonstrate that in comparison to column pivoted QR, which is another factorization that is often used as a relatively economic alternative to the SVD, randUTV compares favorably in terms of speed while providing far higher accuracy.

Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

We present high performance GPU implementations of matvec and compression operations for the H2-variant of hierarchical matrices. H2 matrices, an algebraic generalization of FMM, are space and time efficient representations of dense matrices that exploit the low rank structure of matrix blocks at different levels of granularity and employ both a hierarchical block partitioning and hierarchical bases for the block representations. These two operations are at the core of algebraic operations for hierarchical matrices, the matvec being a ubiquitous operation in numerical algorithms and compression representing a key building block for algebraic operations which require periodic recompression during execution. The difficulties in developing efficient GPU algorithms come primarily from the irregular tree data structures that underlie the hierarchical representations, and the key to performance is to expose fine grained parallelism by recasting the computations on flattened trees and marshaling the irregularly laid out data in ways that allow batched linear algebra operations to be performed. Our numerical results on covariance matrices from 2D and 3D problems from spatial statistics show the high efficiency our routines achieve: over 550 GB/s for the bandwidth-limited matrix-vector operation and over 850 GFLOPS/s for the compression operation on the P100 Pascal GPU.

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures

Matrices coming from elliptic Partial Differential Equations have been shown to have a low-rank property which can be efficiently exploited in multifrontal solvers to provide a substantial reduction of their complexity. Among the possible low-rank formats, the Block Low-Rank format (BLR) is easy to use in a general purpose multifrontal solver and its potential compared to standard (full-rank) solvers has been demonstrated. Recently, new variants have been introduced and it was proved that they can further reduce the complexity but their performance has never been analyzed. In this paper, we present a multithreaded BLR factorization, and analyze its efficiency and scalability in shared-memory multicore environments. We identify the challenges posed by the use of BLR approximations in multifrontal solvers and put forward several algorithmic variants of the BLR factorization that overcome these challenges by improving its efficiency and scalability. We illustrate the performance analysis of the BLR multifrontal factorization with numerical experiments on a large set of problems coming from a variety of real-life applications.

Computing the Braid Monodromy of Completely Reducible $n$-gonal Curves

Braid monodromy is an important tool for computing invariants of curves and surfaces. In this paper, the rectangular braid diagram technique is proposed to compute the braid monodromy of a completely reducible $n$-gonal curve, i.e., a curve of the form $(y-y_1(x))\cdots(y-y_n(x))=0$ where $n \in \mathbb{Z}^{+}$ and $y_i \in \mathbb{C}[x]$. Also, an algorithm is implemented to compute the Alexander polynomial of the complements of these curves using Burau representations of braid groups. Examples for each computation are provided.

Algorithm XXX: The 2D Tree Sliding Window Discrete Fourier Transform

An OpenGL and C++ based function library for curve and surface modeling in a large class of extended Chebyshev spaces

Applying original and existing theoretical results, we propose a platform-independent multi-threaded function library that provides data structures to generate, differentiate and render both the ordinary basis and the non-negative normalized B-basis of an arbitrary extended Chebyshev (EC) space that comprises the constants and can be identified with the solution space of a user-defined constant-coefficient homogeneous linear differential equation. Using the obtained non-negative normalized B-bases, our library can also generate, (partially) differentiate, modify and visualize a large family of so-called B-curves and tensor product B-surfaces. Moreover, the library also implements methods that can be used to perform general order elevation, to subdivide B-curves and B-surfaces by means of general de Casteljau-like B-algorithms, and to generate general basis transformations for the control point based exact description of arbitrary integral curves and surfaces that are described in traditional parametric form by means of the ordinary bases of the underlying EC spaces. Independently of the algebraic, exponential, trigonometric or mixed type of the applied EC space, the proposed library is numerically stable and efficient up to a reasonable dimension number and may be useful for academics and engineers in the fields of Approximation Theory, Computer Aided Geometric Design, Computer Graphics, Isogeometric and Numerical Analysis.

Faithfully Rounded Floating-point Computations

We present a pair arithmetic for the four basic operations and square root. It can be regarded as a simplified, more efficient double-double arithmetic. We prove rigorous error bounds for the computed result depending on the relative rounding error unit $u$ according to base $\beta$, the size of the arithmetic expression, and possibly a condition measure. Under precisely specified assumptions, the result is proved to be faithfully rounded for up to $1/\sqrt{\beta u} - 2$ operations. The assumptions are weak enough to apply to many algorithms. For example, our findings cover a number of previously published algorithms to compute faithfully rounded results, among them Horner's scheme, products, sums and dot products, or Euclidean norm. Beyond that, several other problems can be analyzed such as polynomial interpolation, orientation problems, Householder transformations, or the smallest singular value of Hilbert matrices of large size.
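For readers unfamiliar with pair (double-double style) arithmetic, the sketch below shows the classical TwoSum error-free transformation and one plausible way to add two pairs without renormalizing the result. TwoSum is standard; `pair_add` is a hypothetical illustration, and whether it matches the article's exact definition of pair addition is an assumption. The code assumes IEEE 754 binary64 floats with round-to-nearest and no overflow.

```python
from fractions import Fraction


def two_sum(a: float, b: float):
    """Classical TwoSum: returns (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


def pair_add(x, y):
    """Add two (high, low) pairs; illustrative only, with no renormalization step."""
    high, err = two_sum(x[0], y[0])
    low = err + (x[1] + y[1])
    return high, low


# The rounding error of 0.1 + 0.2 is recovered exactly by TwoSum.
a, b = 0.1, 0.2
s, e = two_sum(a, b)
assert Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)

# A tiny addend that plain float addition would discard survives in the low part.
total = pair_add((1.0, 0.0), (1e-20, 0.0))
```

Skipping renormalization is what makes such an arithmetic cheaper than full double-double; the price is that accuracy guarantees hold only for a bounded number of operations, which is the kind of bound the abstract describes.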

Algorithm XXX: Efficient Computation with Kronecker Products

An algorithm for multiplying a chain of Kronecker products by a matrix is described. The algorithm does not require that the Kronecker chain actually be computed and the main computational work is a series of matrix multiplications. Use of the algorithm can lead to substantial savings in both memory usage and computational speed. Although similar algorithms have been described before, this paper makes two novel contributions. First, it shows how shuffling of data can be (largely) avoided. Second, it provides a simple method to determine the optimal ordering of the workflow. A MATLAB implementation is provided in an appendix.
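The underlying idea can be seen in the two-factor case, where $(A \otimes B)\,x$ equals the row-major flattening of $A X B^{T}$ with $X$ the reshaped vector $x$. The sketch below uses this textbook identity in plain Python; it is not the article's algorithm, which handles longer chains and chooses the ordering of the work, and the helper names are hypothetical.

```python
def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(M):
    return [list(col) for col in zip(*M)]


def kron(A, B):
    """Explicit Kronecker product, built here only as a reference for checking."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def kron_matvec(A, B, x):
    """Compute (A kron B) x without forming the Kronecker product."""
    n, q = len(A[0]), len(B[0])
    # Reshape x (length n * q) into an n-by-q matrix in row-major order.
    X = [x[i * q:(i + 1) * q] for i in range(n)]
    Y = matmul(matmul(A, X), transpose(B))  # A X B^T
    return [y for row in Y for y in row]


A = [[1, 2], [3, 4]]
B = [[0, 1, 2], [3, 4, 5]]
x = [1, 2, 3, 4, 5, 6]

fast = kron_matvec(A, B, x)
reference = [row[0] for row in matmul(kron(A, B), [[v] for v in x])]
```

The explicit product stores every entry of $A \otimes B$, whereas the reshaped form only ever works with matrices the size of the factors and of $X$, which is where the memory and time savings come from.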

A Unified 2D/3D Large Scale Software Environment for Nonlinear Inverse Problems

Large scale parameter estimation problems are some of the most computationally demanding problems. An academic researcher's domain-specific knowledge often exceeds their knowledge of software design, which results in software frameworks for inversion that are technically correct, but not scalable to realistically sized problems. On the other hand, the computational demands of realistic problems result in industrial codebases that are geared solely for performance, rather than comprehensibility or flexibility. We propose a new software design that bridges the gap between these two seemingly disparate worlds. A hierarchical and modular design allows a user to delve into as much detail as she desires, while using high performance primitives at the lower levels. Our code has the added benefit of actually reflecting the underlying mathematics of the problem, which lowers the cognitive load on the user and reduces the initial startup period before a researcher can be fully productive. We also introduce a new preconditioner for the Helmholtz equation that is suitable for fault-tolerant distributed systems. Numerical experiments on a variety of 2D and 3D test problems demonstrate the effectiveness of this approach in scaling algorithms from small to large scale problems with minimal code changes.

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.) only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.