Adjoint methods have become fundamental ingredients of the scientific computing toolbox over the past decades. Large-scale parameter sensitivity analysis, uncertainty quantification and nonlinear optimization would otherwise turn out computationally infeasible. The symbolic derivation of adjoint mathematical models for relevant problems in science and engineering and their implementation in consistency with the implementation of the underlying primal model frequently proves highly challenging. Hence an increased interest in algorithmic adjoints can be observed. The algorithmic derivation of adjoint numerical simulation programs shifts some of the problems faced from functional and numerical analysis to computer science. It becomes a highly complex software engineering task requiring expertise in software analysis, transformation and optimization. Despite rather mature software tool support for algorithmic differentiation substantial user intervention is typically required when targeting nontrivial numerical programs. A large number of patterns shared by numerous application codes results in repeated duplication of development effort. The adjoint code design patterns introduced in this paper aim to reduce this problem through improved formalization from the software engineering perspective.

Partitioning graphs is a common and useful operation in many areas, from parallel computing to VLSI design to sparse matrix algorithms. In this paper, we introduce Mongoose, a multilevel hybrid graph par- titioning algorithm and library. Building on previous work in multilevel partitioning frameworks and com- binatoric approaches, we introduce novel stall-reducing and stall-free coarsening strategies, as well as an efficient hybrid algorithm leveraging 1) traditional combinatoric methods and 2) continuous quadratic pro- gramming formulations. We demonstrate how this new hybrid algorithm outperforms either strategy in isolation, and we also compare Mongoose to METIS and demonstrate its effectiveness on large and social networking (power law) graphs.

SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. GraphBLAS provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. An overview of the GraphBLAS specification is given, followed by a description of the key features of its implementation in the SuiteSparse:GraphBLAS package.

BBCPOP is a MATLAB implementation of a hierarchy of sparse doubly nonnegative (DNN) relaxations of a class of polynomial optimization problems (POPs) with binary, box and complementarity (BBC) constraints. Given a POP in the class and a relaxation order, BBCPOP constructs a simple conic optimization problem (COP), which serves as a DNN relaxation of the POP, then solves the COP by applying the bisection and projection method. The COP is expressed with a linear objective function and constraints described as a single hyperplane and two cones, which are the Cartesian product of positive semidefinite cones and a polyhedral cone induced from the BBC constraints. BBCPOP aims to compute a tight lower bound for the optimal value of a large-scale POP in the class that is beyond the comfort zone of existing software packages. The robustness, reliability and efficiency of BBCPOP are demonstrated in comparison to the software SDP package SDPNAL+ on randomly generated sparse POPs of degree 2 and 3 with up to a few thousands variables, and ones of degree 4, 5, 6. and 8 with up to a few hundred variables. Comparison with other BBC POPs that arise from combinatorial optimization such as quadratic assignment problems are reported.

We consider the problem of computing rigorous enclosures for polynomials represented in Chebyshev basis. Our aim is to compare and develop algorithms with a linear complexity in terms of the polynomial degree. A first category of methods rely on a direct interval evaluation of the given Chebyshev expansion in which Chebyshev polynomials are bounded e.g., with a divide and conquer strategy. Our main category of methods which are based on the Clenshaw recurrence includes interval Clenshaw with defect correction (ICDC), and the spectral transformation of Clenshaw recurrence rewritten as a discrete dynamical system. An extension of the barycentric representation to interval arithmetic is also considered which has a log-linear complexity as it takes advantage of a verified discrete cosine transform. We compare different methods and provide illustrative numerical experiments. In particular, our eigenvalue-based methods are interesting for bounding the range of high-degree interval polynomials. Some of the methods rigorously compute narrow enclosures for high degree Chebyshev expansions at thousands of points in a few seconds on an ordinary machine. We also illustrate how to employ our methods as an automatic a posteriori forward error analysis tool to monitor the accuracy of Chebfun feval command.

IPscatt is a free toolbox in MATLAB that solves the inverse medium problem in scattering in two and three dimensions. It uses the Helmholtz equation as a physical model of scattering and provides point sources or plane waves as incident fields as well as simulated near or far field data with noise. Further, it can deal with real-world data in two dimensions from Institute Fresnel. A rapid variational regularization scheme minimizes the defect, i.e. the difference between the predicted data from the model and the given data, as well as penalty terms promoting sparsity, total variation and physical bounds. These regularization scheme relies on a primal-dual algorithm and was introduced
in [F. Bürgel, K. S. Kazimierski, and A. Lechleiter 2017 *Journal of Computational Physics*, **339**:130]. This paper provides a survey on the mathematical concepts, connects them with the implementation and shows the practical usage of the toolbox via guides.

Solving tridiagonal systems is a fundamental computing core in a wide range of scientific applications, and its computation can be modeled with parallel algorithms. These parallel solvers are typically designed to compute problems whose data fit in a common shared-memory space where all the cores taking part in the computation have access. However, when the problem size is large, data cannot be entirely stored in the common shared-memory space, and a high number of high-latency communications are performed. One alternative is to partition the problem among different memory spaces. At this point, conventional parallel algorithms do not facilitate the partition of computation in independent tiles, since each reduction depends on equations which may be in different tiles. This paper proposes an algorithm based on a tree reduction, called the Tree Partitioning Reduction (TPR) method, which partitions the problem into independent slices, that can be partially computed in parallel within different common-shared memory spaces. The TPR method can be implemented for any parallel and distributed programming paradigm. Furthermore, in this work, TPR is efficiently implemented for CUDA GPUs to solve large problem sizes, providing highly competitive performance results with respect to existing packages, being, on average, 22.03x faster than CUSPARSE

A bottom-up approach to parallel anisotropic mesh generation is presented by building a mesh generator from principles. Applications focusing on high-lift design or dynamic stall, or numerical methods and modeling test cases still focus on two-dimensional domains. This automated parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by geometric boundaries and the density functions. This approach yields computational savings in mesh generation and flow solution through well-shaped anisotropic triangles, instead of isotropic triangles. A 79% parallel weak scaling efficiency on 1024 distributed memory nodes, and a 72% parallel efficiency over the fastest sequential isotropic mesh generator on 512 distributed memory nodes is shown through numerical experiments.

The maximum flow problem is one of the most common network flow problems. This problem involves finding the maximum possible amount of flow between two designated nodes on a network with arcs having flow capacities. The push-relabel algorithm is one of the fastest algorithms to solve this problem. We present a shared memory parallel push-relabel algorithm. Graph coloring is used to avoid collisions between threads for concurrent push and relabel operations. In addition, excess values of target nodes are updated using atomic instructions to prevent race conditions. The experiments show that our algorithm is competitive for wide graphs with low diameters. Results from two different data sets are included, computer vision problems and DIMACS challenge problems. These are compared with existing push-relabel and pseudoflow implementations. We show that high speedup rates are possible using our coloring based parallelization technique on sparse networks. However, we also observe that the pseudoflow algorithm runs faster than the push-relabel algorithm on dense and long networks.

The minimum distance of a code is an important concept in information theory. Hence, computing the minimum distance of a code with a minimum computational cost is a crucial process to many problems in this area. In this paper, we present and evaluate a family of algorithms and implementations to compute the minimum distance of a random linear code over F2 that are faster than current implementations, both commercial and public domain. In addition to the basic sequential implementations, we present parallel and vectorized implementations that render high performances on modern architectures. The attained performance results show the benefits of the developed optimized algorithms, which obtain remarkable performance improvements compared with state-of-the-art implementations widely used nowadays.

In this paper we present the Python framework pySDC for solving collocation problems with spectral deferred correction methods (SDC) and their time-parallel variant PFASST, the parallel full approximation scheme in space and time. pySDC features many implementations of SDC and PFASST, from simple implicit time-stepping to high-order implicit-explicit or multi-implicit splitting and multi-level spectral deferred corrections. It comes with many different, pre-implemented examples and has seven tutorials to help new users with their first steps. Time-parallelism is implemented either in an emulated way for debugging and prototyping as well as using MPI for benchmarking. The code is fully documented and tested using continuous integration, including most results of previous publications. Here, we describe the structure of the code by taking two different perspectives: the user's and the developer's perspective. While the first sheds light on the front-end, the examples and the tutorials, the second is used to describe the underlying implementation and the data structures. We show three different examples to highlight various aspects of the implementation, the capabilities and the usage of pySDC. Also, couplings to the FEniCS framework and PETSc, the latter including spatial parallelism with MPI, are described.

The ROLMIP (Robust LMI Parser) is a toolbox specialized in control theory for uncertain linear systems, built to work under MATLAB jointly with YALMIP, to ease the programming of sufficient Linear Matrix Inequality (LMI) conditions that, if feasible, assure the validity of parameter-dependent LMIs in the entire set of uncertainty considered. This paper presents the new version of the ROLMIP toolbox, that was completely remodeled to provide a high-level user friendly interface, to cope with distinct uncertain domains (hypercube and multi-simplex), and to treat time-varying parameters in discrete- and continuous-time. Through simple commands, the user is able to define matrix polynomials, as well as to described the desired parameter-dependent LMIs in an easy way, considerably reducing the programming time to end up with implementable LMI conditions and, therefore, helping the popularization of the state-of-the-art robust control methods for uncertain systems based on LMIs among graduate students, researchers and engineers in control systems.

We present an algorithmic framework for matrix-free evaluation of discontinuous Galerkin finite element operators based on sum factorization on quadrilateral and hexahedral meshes. We identify a set of kernels for fast quadrature on cells and faces targeting a wide class of weak forms originating from linear and nonlinear partial differential equations. Different algorithms and data structures for the implementation of operator evaluation are compared in an in-depth performance analysis. The sum factorization kernels are optimized by vectorization over several cells and faces and an even-odd decomposition of the one-dimensional compute kernels. In isolation our implementation then reaches up to 60% of arithmetic peak on Intel Haswell and Broadwell processors and up to 50% of arithmetic peak on Intel Knights Landing. The full operator evaluation reaches only about half that throughput due to memory bandwidth limitations from loading the input and output vectors, MPI ghost exchange, as well as handling variable coefficients and the geometry. Our performance analysis shows that the results are often within 10% of the available memory bandwidth for the proposed implementation, with the exception of the Cartesian mesh case where the cost of gather operations and MPI communication are more substantial.

Multi-degree splines are smooth piecewise-polynomial functions where the pieces can have different degrees. We describe a simple algorithmic construction of a set of basis functions for the space of multi-degree splines, with similar properties to standard B-splines. These basis functions are called multi-degree B-splines (or MDB-splines). The construction relies on an extraction operator that represents all MDB-splines as linear combinations of local B-splines of different degrees. This enables the use of existing efficient algorithms for B-spline evaluations and refinements in the context of multi-degree splines. A Matlab implementation is provided to illustrate the computation and use of MDB-splines.

We introduce CASC: a new, modern, and header-only C++ library which provides a data structure to represent arbitrary dimension abstract simplicial complexes (ASC) with user-defined classes stored directly on the simplices at each dimension. This is accomplished by using the latest C++ language features including variadic template parameters introduced in C++11 and automatic function return type deduction from C++14. Effectively CASC decouples the representation of the topology from the interactions of user data. We present the innovations and design principles of the data structure and related algorithms. This includes a meta-data aware decimation algorithm which is general for collapsing simplices of any dimension. We also present an example application of this library to represent an orientable surface mesh.

We present an efficient implementation of hypergeometric functions in arbitrary-precision interval arithmetic. The functions ${}_0F_1$, ${}_1F_1$, ${}_2F_1$ and ${}_2F_0$ (or the Kummer $U$-function) are supported for unrestricted complex parameters and argument, and by extension, we cover exponential and trigonometric integrals, error functions, Fresnel integrals, incomplete gamma and beta functions, Bessel functions, Airy functions, Legendre functions, Jacobi polynomials, complete elliptic integrals, and other special functions. The output can be used directly for interval computations or to generate provably correct floating-point approximations in any format. Performance is competitive with earlier arbitrary-precision software, and sometimes orders of magnitude faster. We also partially cover the generalized hypergeometric function~${}_pF_q$ and computation of high-order parameter derivatives.

Elliptic curve cryptosystems are considered as an efficient alternative to conventional systems such as DSA and RSA. Recently, Montgomery and Edwards elliptic curves have been used to implement cryptosystems. In particular, the elliptic curves Curve25519 and Curve448 were used to instantiate the Edwards Digital Signature Algorithm (EdDSA), resulting in new signature schemes called Ed25519 and Ed448. Nowadays, EdDSA is increasingly used for securing Internet communications. In this work, we focus on the secure and efficient software implementation of elliptic curve-based algorithms using SIMD parallel processing. We present software techniques that target the Intel AVX2 vector instruction set for accelerating prime field arithmetic and elliptic curve operations. The set of our contributions leads to a high-performance software library for AVX2-ready processors. For instance, our library computes digital signatures 17-22% (for Ed25519) and 16-18% (for Ed448) faster than previous optimized implementations. In addition, our library improves in about 16-20% the execution time of the Diffie-Hellman key exchange protocol using Curve25519 and Curve448.