Sparse matrix factorization involves a mix of regular and irregular computation, which is a particular challenge when trying to obtain high performance on the highly parallel general-purpose computing cores available on graphics processing units (GPUs). We present a sparse multifrontal QR factorization method that meets this challenge, and is up to eleven times faster than a highly optimized method on a multicore CPU. Our method factorizes many frontal matrices in parallel and keeps all the data transmitted between frontal matrices on the GPU. A novel bucket scheduler algorithm extends the communication-avoiding QR factorization for dense matrices, by exploiting more parallelism and by exploiting the staircase form present in the frontal matrices of a sparse multifrontal method.
We present simple, efficient, and relatively accurate approximations for the computation of the Faddeyeva function w(z). Over a wide and fine grid of the complex argument, z = x + iy, covering the whole real axis and y ∈ [10^-20, 10^20], numerical results from the present approximations showed a maximum relative error of less than 4.0×10^-5 for both the real and imaginary parts of w. The routines are simple and can be easily implemented in any computational framework. They maintain the claimed accuracy and provide safer results than other competitive techniques. In addition to the calculation of the Faddeyeva function, w, partial derivatives of the real and imaginary parts of the function can be easily calculated and returned as optional outputs.
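The approximations themselves are not reproduced in the abstract; purely as a reference point, the sketch below computes w(z) with SciPy's wofz and obtains the optional partial derivatives from the analytic identity w'(z) = -2zw(z) + 2i/√π together with the Cauchy-Riemann relations. The function and variable names are illustrative, not those of the published routines.

```python
import numpy as np
from scipy.special import wofz  # reference implementation of the Faddeyeva function

SQRT_PI = np.sqrt(np.pi)

def faddeyeva_with_partials(z):
    """Return w(z) and the partial derivatives of its real/imaginary parts.

    Uses the identity w'(z) = -2*z*w(z) + 2i/sqrt(pi); since w is analytic,
    d(Re w)/dx = Re w', d(Re w)/dy = -Im w', etc. (Cauchy-Riemann).
    """
    w = wofz(z)
    dw = -2.0 * z * w + 2.0j / SQRT_PI
    partials = {
        "dV_dx": dw.real, "dL_dx": dw.imag,    # V = Re w, L = Im w
        "dV_dy": -dw.imag, "dL_dy": dw.real,
    }
    return w, partials

if __name__ == "__main__":
    z = 1.5 + 0.3j
    w, p = faddeyeva_with_partials(z)
    # cross-check dV/dx against a central finite difference in x
    h = 1e-6
    fd = (wofz(z + h).real - wofz(z - h).real) / (2 * h)
    print(w, p["dV_dx"], fd)
```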
In order to solve a differential problem, the Laplace transform method, when applicable, replaces the problem with a simpler one; the solution is obtained by solving the new problem and then computing the inverse Laplace transform of its solution. In a numerical context, however, the solution of the transformed problem consists of a sequence of Laplace transform samples, so most software for the numerical inversion of Laplace transforms cannot be used, since it requires the transform to be passed as a function among the input parameters. To fill this gap, we present Talbot Suite DE, a C software collection specifically designed for these problems and based on Talbot's method for the inversion step. It contains both sequential and parallel implementations; the latter uses the OpenMP API. Talbot Suite DE derives from a previous implementation of Talbot's method, but it is more efficient at solving multi-point inversion problems. Aimed at non-expert users, we illustrate several usage examples and provide all of their source code: some are entirely in C, and others combine different programming languages, such as C/MATLAB or C/FORTRAN. In addition, for each example, we report performance results on accuracy and efficiency.
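Talbot Suite DE itself is a C/OpenMP collection and, as described above, accepts transform samples rather than a callable transform. Purely as a sketch of the inversion step it builds on, the following Python fragment implements one common fixed-contour variant of Talbot's method (the Abate-Valkó parametrization), assuming the transform is available as a callable F; the actual contour choices and interfaces of Talbot Suite DE differ.

```python
import numpy as np

def fixed_talbot(F, t, M=32):
    """Invert a Laplace transform F(s) at time t > 0 with a fixed-Talbot rule.

    Minimal sketch of one standard Talbot parametrization (Abate & Valko);
    not the Talbot Suite DE implementation, which works from transform samples.
    """
    r = 2.0 * M / (5.0 * t)                    # contour scaling parameter
    theta = np.pi * np.arange(1, M) / M        # quadrature angles in (0, pi)
    cot = 1.0 / np.tan(theta)
    s = r * theta * (cot + 1j)                 # points on the Talbot contour
    sigma = theta + (theta * cot - 1.0) * cot
    total = 0.5 * np.exp(r * t) * F(r)         # contribution of theta = 0
    total += np.sum(np.real(np.exp(t * s) * F(s) * (1.0 + 1j * sigma)))
    return (r / M) * total

if __name__ == "__main__":
    # F(s) = 1/(s + 1) has inverse f(t) = exp(-t)
    for t in (0.5, 1.0, 5.0):
        print(t, fixed_talbot(lambda s: 1.0 / (s + 1.0), t), np.exp(-t))
```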
Implicitly described domains are a well-established tool in the simulation of time-dependent problems, e.g. using level-set methods. In order to solve partial differential equations on such domains, a range of numerical methods has been developed, e.g. the Immersed Boundary method, Unfitted Finite Element or Unfitted Discontinuous Galerkin methods, and eXtended or Generalised Finite Element methods, just to name a few. Many of these methods involve integration over cut cells or their boundaries, as these are described by sub-domains of the original level-set mesh. We present a new algorithm to geometrically evaluate the integrals over domains described by a first-order, conforming level-set function. The integration is based on a polyhedral reconstruction of the implicit geometry, following the concepts of the Marching Cubes algorithm. The algorithm preserves various topological properties of the implicit geometry in its polyhedral reconstruction, making it suitable for Finite Element computations. Numerical experiments show second-order accuracy of the integration. An implementation of the algorithm is available as free software, which allows for easy incorporation into other projects. The software is in productive use within the DUNE framework [Bastian et al. 2008a].
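The published algorithm operates on three-dimensional polyhedral reconstructions; the following two-dimensional sketch only illustrates the underlying idea of geometric integration over a cut cell: a triangle is clipped against the zero level of a vertex-wise linear level-set function, and the integral (here, simply the area) is taken over the sub-polygon where the level set is negative. The 2D restriction and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def clip_triangle(verts, phi):
    """Polygon (list of 2D points) covering the part of a triangle where phi < 0.

    verts: (3, 2) vertex coordinates; phi: level-set values at the vertices.
    The zero level is located by linear interpolation along each cut edge,
    mirroring the vertex/edge case analysis of marching squares/cubes.
    """
    poly = []
    for i in range(3):
        j = (i + 1) % 3
        pi, pj = phi[i], phi[j]
        if pi < 0.0:
            poly.append(verts[i])
        if (pi < 0.0) != (pj < 0.0):            # edge is cut by the zero level
            lam = pi / (pi - pj)                 # linear interpolation weight
            poly.append(verts[i] + lam * (verts[j] - verts[i]))
    return np.array(poly)

def polygon_area(poly):
    """Shoelace formula; stands in for quadrature over the cut sub-domain."""
    if len(poly) < 3:
        return 0.0
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

if __name__ == "__main__":
    tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    phi = np.array([-0.5, 0.5, 0.5])             # cuts the triangle near vertex 0
    print(polygon_area(clip_triangle(tri, phi)))  # exact cut area is 0.125
```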
A toolbox called ADiGator is described for algorithmically differentiating mathematical functions in MATLAB. ADiGator performs source transformation via operator overloading using forward-mode algorithmic differentiation and produces a derivative file that can be evaluated to obtain the derivative of the original function at a numeric value of the input. A convenient by-product of the file generation is the sparsity pattern of the derivative function. Moreover, as both the input and output of the algorithm are source codes, the algorithm may be applied recursively to generate derivatives of any order. A key component of the algorithm is its ability to statically exploit derivative sparsity at the MATLAB operation level in order to improve run-time performance. The algorithm is applied to four different classes of example problems and is shown to produce run-time efficient derivative codes. Due to the static nature of the approach, the algorithm is well suited and intended for use with problems requiring many repeated derivative computations.
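ADiGator transforms MATLAB source code; as a language-agnostic illustration of the forward-mode propagation rules such tools rely on, the sketch below uses run-time operator overloading with dual numbers in Python. This is not ADiGator's static source transformation, only the derivative arithmetic behind it.

```python
import math

class Dual:
    """Minimal forward-mode AD value: carries f and df/dx together.

    Illustrates the propagation rules forward-mode tools apply per operation;
    ADiGator applies them statically and emits MATLAB derivative code instead.
    """
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)  # product rule
    __rmul__ = __mul__

    def sin(self):
        return Dual(math.sin(self.val), math.cos(self.val) * self.der)

if __name__ == "__main__":
    x = Dual(0.7, 1.0)                  # seed dx/dx = 1
    y = (3 * x * x).sin() + x           # y = sin(3*x^2) + x
    print(y.val, y.der)
    print(6 * 0.7 * math.cos(3 * 0.7 ** 2) + 1)   # analytic derivative for comparison
```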
A method to compute an explicit solution of triangular systems of first-order linear homogeneous initial-value ordinary differential equations with constant coefficients is described. It is suitable for the limited case of well-separated eigenvalues, or for multiple zero eigenvalues provided the entire column of a zero eigenvalue is zero. It requires only the computation of a constant matrix using a simple recurrence. Computing the solutions of the system from that matrix, for any value of the independent variable, requires exponentiating only the diagonal of a matrix; it is not necessary to compute the exponential of a general triangular matrix. One application is solving the equations of radioactive decay without fission or neutron absorption.
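The abstract does not reproduce the recurrence; the sketch below shows the standard construction for a lower-triangular system x' = Ax with distinct diagonal entries: a triangular coefficient matrix is built once by a simple recurrence, after which evaluating the solution at any t exponentiates only the diagonal, as described above. This is an assumed illustration of the general approach, not necessarily the paper's exact formulas.

```python
import numpy as np
from scipy.linalg import expm   # only used to verify the result below

def triangular_ode_coefficients(A, x0):
    """Coefficients for x' = A x, x(0) = x0, with A lower triangular and
    well-separated (distinct) diagonal entries.

    Returns C, lam, d such that x(t) = C @ (exp(lam * t) * d): only the
    diagonal of A is ever exponentiated.
    """
    n = A.shape[0]
    lam = np.diag(A)
    C = np.eye(n)
    for j in range(n):                       # column j: eigenvector for lam[j]
        for i in range(j + 1, n):
            C[i, j] = A[i, j:i] @ C[j:i, j] / (lam[j] - lam[i])
    d = np.linalg.solve(C, x0)               # forward substitution in practice
    return C, lam, d

if __name__ == "__main__":
    # simple radioactive decay chain 1 -> 2 -> 3 with decay constants 1.0, 0.3, 0.1
    A = np.array([[-1.0,  0.0,  0.0],
                  [ 1.0, -0.3,  0.0],
                  [ 0.0,  0.3, -0.1]])
    x0 = np.array([1.0, 0.0, 0.0])
    C, lam, d = triangular_ode_coefficients(A, x0)
    t = 2.0
    print(C @ (np.exp(lam * t) * d), expm(A * t) @ x0)   # should agree
```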
We present TTC, an open-source parallel compiler for multidimensional tensor transpositions. In order to generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures, and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.
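TTC emits tuned C++; the operation it generates code for can be stated in a few lines of numpy, which is useful as a correctness reference when validating generated kernels. The scaling parameters alpha and beta follow the common out-of-place transposition convention and are an assumption here, not taken from the abstract.

```python
import numpy as np

def reference_transpose(A, perm, alpha=1.0, beta=0.0, B=None):
    """Out-of-place tensor transposition B = alpha * permute(A, perm) + beta * B.

    Naive numpy reference for the operation TTC generates tuned C++ code for;
    useful for correctness checks, not performance.
    """
    T = alpha * np.transpose(A, perm)
    return T if B is None else T + beta * B

if __name__ == "__main__":
    A = np.random.rand(32, 48, 24, 16)           # 4-dimensional tensor
    perm = (2, 0, 3, 1)                           # permutation of the indices
    B = reference_transpose(A, perm)
    print(B.shape)                                # (24, 32, 16, 48)
    # spot-check one element against the index permutation
    print(np.isclose(B[5, 7, 3, 11], A[7, 11, 5, 3]))
```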
To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve near-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning free, and yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms that seamlessly replace most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimized libraries.
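ReLAPACK is written in C on top of BLAS; the recursive pattern it exploits can be sketched in a few lines, shown here for the Cholesky factorization in Python: split the matrix, factor the leading block recursively, apply a triangular solve and a symmetric update, and recurse on the trailing block. Only a base-case size needs choosing, in contrast to the block size of a blocked algorithm; this is a generic sketch, not ReLAPACK's code.

```python
import numpy as np
from scipy.linalg import solve_triangular

def rec_cholesky(A, base=64):
    """Lower-triangular Cholesky factor of an SPD matrix A, computed recursively.

    Sketch of the recursive scheme used by libraries such as ReLAPACK: the bulk
    of the work is cast as large TRSM/GEMM-like operations on both halves.
    """
    n = A.shape[0]
    if n <= base:
        return np.linalg.cholesky(A)                             # unblocked base case
    k = n // 2
    L = np.zeros_like(A)
    L11 = rec_cholesky(A[:k, :k], base)                          # recurse on A11
    L21 = solve_triangular(L11, A[k:, :k].T, lower=True).T       # triangular solve
    S = A[k:, k:] - L21 @ L21.T                                  # symmetric update
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, rec_cholesky(S, base)
    return L

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((500, 500))
    A = M @ M.T + 500 * np.eye(500)              # symmetric positive definite
    L = rec_cholesky(A)
    print(np.allclose(L @ L.T, A))
```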
A collection of MATLAB routines that compute first and second derivative approximations of an arbitrary function using high-order compact finite difference schemes is presented. For periodic boundary conditions, up to tenth-order accurate compact finite difference schemes are obtained for first and second derivative approximations, and up to sixth-order accurate compact finite difference schemes are obtained for third and fourth derivative approximations. Compact finite difference schemes for first and second derivative approximations are also extended to Dirichlet and Neumann boundary conditions, where they are up to sixth-order accurate. Moreover, high-order compact finite difference schemes for partial derivative approximations of any function of two variables are also given for periodic and Dirichlet boundary conditions. For each case, a MATLAB routine is provided to compute the differentiation matrix. Fourier analysis of the compact finite difference schemes is presented, and it is shown that compact finite difference schemes have better resolution characteristics than classical finite difference schemes.
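The collection itself is MATLAB; purely as an illustration of the structure such routines build, the following Python sketch applies the classical fourth-order compact (Padé) scheme for the first derivative on a periodic grid: a banded (here circulant) left-hand-side matrix acts on the unknown derivatives and a sparse right-hand side differences the data. The dense matrices and names are illustrative choices.

```python
import numpy as np

def compact_d1_periodic(f, h):
    """Fourth-order compact (Pade) first derivative on a periodic grid.

    Solves A f' = B f with the classical scheme
        (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1} = 3/(4h) * (f_{i+1} - f_{i-1}).
    Dense matrices are used for clarity; in practice one stores A and B in
    banded/sparse form, as differentiation matrices.
    """
    n = len(f)
    A = np.eye(n)
    B = np.zeros((n, n))
    for i in range(n):
        A[i, (i - 1) % n] = A[i, (i + 1) % n] = 0.25
        B[i, (i + 1) % n] = 3.0 / (4.0 * h)
        B[i, (i - 1) % n] = -3.0 / (4.0 * h)
    return np.linalg.solve(A, B @ f)

if __name__ == "__main__":
    n = 64
    x = 2 * np.pi * np.arange(n) / n
    h = x[1] - x[0]
    err = np.max(np.abs(compact_d1_periodic(np.sin(x), h) - np.cos(x)))
    print(err)      # small (fourth order in h), far below a 2nd-order central difference
```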
Sparse symmetric indefinite problems arise in a large number of important application areas; they are often solved through the use of an $LDL^T$ factorization via a sparse direct solver. Whilst for many problems, prescaling the system matrix $A$ is sufficient to maintain stability of the factorization, for a small but important fraction of problems numerical pivoting is required. Pivoting often incurs a significant overhead and consequently a number of techniques have been proposed to try to limit the need for pivoting. In particular, numerically-aware ordering algorithms may be used, that is, orderings that depend not only on the sparsity pattern of $A$ but also on the values of its (scaled) entries. Current approaches identify large entries of $A$ and symmetrically permute them onto the subdiagonal where they can be used as part of a $2\times2$ pivot. This is numerically effective, but the fill in the factor $L$ and hence the runtime of the factorization and subsequent triangular solves may be significantly increased over a standard ordering if no pivoting is required. We present a new algorithm that combines a matching-based approach with a numerically-aware nested dissection ordering. Numerical comparisons with current approaches for some tough symmetric indefinite problems are given.
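The proposed ordering combines matching with a numerically-aware nested dissection; as an illustration of the matching ingredient alone, the dense sketch below computes a permutation that places large entries of A on the diagonal by solving an assignment problem on -log|a_ij| (in the spirit of MC64). This is not the authors' algorithm, which additionally applies such a matching symmetrically and addresses the fill-in issue discussed above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def large_entry_permutation(A):
    """Column permutation putting large entries of A on the diagonal.

    Dense illustration of matching-based orderings: maximize the product of
    |a_{i, p(i)}| by solving an assignment problem on -log|a_ij|.
    """
    cost = -np.log(np.abs(A) + 1e-300)        # large entries -> small cost
    rows, cols = linear_sum_assignment(cost)
    return cols                                # p such that A[:, p] has a heavy diagonal

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 6)) * 0.01
    for i in range(6):
        A[i, (i + 2) % 6] = 10.0 + i          # big entries off the diagonal
    p = large_entry_permutation(A)
    print(p)                                   # each row matched to its big entry's column
    print(np.abs(np.diag(A[:, p])))
```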
We analyze several classical basic building blocks of double-word arithmetic (frequently called ``double-double arithmetic'' in the literature): the addition of a double-word number and a floating-point number, the addition of two double-word numbers, the multiplication of a double-word number by a floating-point number, the multiplication of two double-word numbers, the division of a double-word number by a floating-point number, and the division of two double-word numbers. For multiplication and division we get better relative error bounds than the ones previously published. For addition of two double-word numbers, we show that the previously published bound was wrong, and we provide a relative error bound. We introduce new algorithms for division. We also give examples that illustrate the tightness of our bounds.
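As a concrete instance of the building blocks analyzed, the sketch below implements the error-free transformations 2Sum and Fast2Sum in Python and uses them for the addition of a double-word number and a floating-point number, following the commonly published algorithm; the function names are ours, not those of the paper.

```python
def two_sum(a, b):
    """Error-free transformation: a + b = s + e exactly (Knuth/Moller 2Sum)."""
    s = a + b
    a_prime = s - b
    b_prime = s - a_prime
    return s, (a - a_prime) + (b - b_prime)

def fast_two_sum(a, b):
    """Error-free a + b = s + e, assuming |a| >= |b| (Dekker's Fast2Sum)."""
    s = a + b
    return s, b - (s - a)

def dw_plus_fp(xh, xl, y):
    """Add the floating-point number y to the double-word number (xh, xl).

    Sketch of the standard algorithm for this building block: 2Sum the high
    parts, fold in the low part, then renormalize with Fast2Sum.
    """
    sh, sl = two_sum(xh, y)
    v = xl + sl
    return fast_two_sum(sh, v)

if __name__ == "__main__":
    # (xh, xl) represents 1 + 2^-70 as an unevaluated sum of two doubles
    xh, xl = 1.0, 2.0 ** -70
    zh, zl = dw_plus_fp(xh, xl, 2.0 ** -60)
    print(zh, zl)
    print(zl == 2.0 ** -60 + 2.0 ** -70)   # the low word captures both small terms
```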
The detection of heterogeneity among objects assessed on a series of blocks is critical in numerous areas such as clinical research, cosmetic studies, and survey analysis. Cochran's Q test is the most widely used test for identifying heterogeneity in binary data. The purpose of this article is to propose an extremely fast implementation of the exact Cochran's Q test so that one can benefit from its accuracy at virtually no cost in computation time. After a short presentation of Cochran's Q test and the motivation for its exact version, we detail our approach and present its actual implementation. We then demonstrate the gain of this algorithm with performance evaluations and measurements. Comparisons against a well-established implementation show an increase in computational speed by a factor ranging from 100 up to 10^6 in the most favorable cases.
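For orientation, the statistic itself is Q = (k-1)(k Σ_j C_j² − T²) / (k T − Σ_i R_i²), where C_j are treatment (column) totals, R_i block (row) totals, and T the grand total of a 0/1 table with k treatments. The sketch below computes Q with the usual asymptotic chi-square p-value; the article's contribution, the fast exact test, instead evaluates the exact distribution of Q and is not reproduced here.

```python
import numpy as np
from scipy.stats import chi2

def cochran_q(X):
    """Cochran's Q statistic and its asymptotic chi-square p-value.

    X: (n_blocks, k_treatments) array of 0/1 outcomes. This is the common
    asymptotic version, i.e. the kind of implementation the exact test is
    compared against, not the exact algorithm of the article.
    """
    X = np.asarray(X)
    k = X.shape[1]
    C = X.sum(axis=0)                  # treatment (column) totals
    R = X.sum(axis=1)                  # block (row) totals
    T = X.sum()
    Q = (k - 1) * (k * np.sum(C ** 2) - T ** 2) / (k * T - np.sum(R ** 2))
    return Q, chi2.sf(Q, df=k - 1)

if __name__ == "__main__":
    X = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [1, 1, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]])
    print(cochran_q(X))
```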