Flops, Memory and Parallel Processing and Gaussian elimination (Lecture Notes for 1012013)
We will discuss three important features in the efficiency of Gaussian elimination: floating point operations (flops), memory access and parallel processing. The topics span thirty+ in the development of efficient implementation of Gaussian elimination and associated pivoting schemes.
FLOPS, LINPACK and the 1980’s
LINPACK is the first publicly available, high quality linear algebra package. It was developed with an NSF grant in the 1980’s by, among other researchers, Cleve Moler the creator of Matlab. At that time and state of computer architecture, the number of floating point operations in an algorithm was the most important measure in the efficiency of an algorithm. As we have seen earlier, for larger n, the flop counts for Gaussian elimination with no pivoting (genp), partial pivoting (gepp), rook pivoting (gerp) and complete pivoting (gecp) applied to an n by n matrix is:
The conclusion from the last column is, based on flop count, genp, gepp and gerp should take about the same amount of time and gecp should take about 50% longer. In practice genp can lead to large error growth, so we won’t pursue genp further here.
We can check the theory in the above table for the other routines by doing experiments. When comparing run times it is better to use code produced by a compiled computer language, rather than an interpreted language such as Matlab. When timing an interpreted language code, one is timing the work to understand the code as well as any computations. Since a compiled language (such a C, C++, Fortran, etc.) is translated to machine language before execution, the timing focuses on the computations, not the translation.
Fortunately, in the “Try Your Own Pivot Scheme” assignment, C code implementing gecp is available through the Matlab Mex utility (see item 26 at the course web page). By making a small change in the gecp source code (in the pivot search section of the code we can change the “for (j = k; j <= nminus1; ++j)” loop to “for (j = k; j <= k; ++j)”) we can modify the gecp code to implement gepp. I also have Fortran code, again accessible in Matlab via the Mex utility, that implements rook pivoting to solve Ax = b. All this code is in “LINPACK” style in that the flop count was a key focus in the code development.
We can try out the methods in Matlab (using a midrange, dual processor Acer laptop):
>> n=2000; A=randn(n,n); b=randn(n,1);
>> tic, x = gecp(A,b); toc % complete pivoting
Elapsed time is 20.461706 seconds.
>> tic,x=gerp_unblocked(A,b);toc % rook pivoting with similar style code
Elapsed time is 12.953847 seconds.
>> tic, x = geppinC(A,b); toc % partial pivoting
Elapsed time is 12.884968 seconds.
These runs support the conclusions of the above theory. The complete pivoting code takes about 50% (58% in the above runs) longer than the partial or rook pivoting code and the run times for the partial pivoting code and rook pivoting code are about the same!
MANAGING MEMORY ACCESS, LAPACK and the 1990’s
If we solve the same problem using the built in backslash in Matlab we get the following results:
>> tic,x=A\b;toc
Elapsed time is 0.900737 seconds.
The Matlab solution implements Gaussian elimination with partial pivoting, but is more than 13 times faster than any of the earlier runs. What is going on?
In the late 1980’s and early 1990’s the use of cache memory became increasingly popular. Cache memory is a relatively expensive but is fast compared to main RAM memory in a computer. It became apparent that it is important to keep calculations within the fast cache memory as much as possible. Otherwise, the time to access the slow main RAM memory outside the cache would dominate the calculation times.
To address this issue a new publicly available linear algebra package, called LAPACK was developed with NSF funding. This package required recoding most of the LINPACK algorithms, typically using block factorizations which allowed dividing the algorithm into pieces such that the calculations could be kept in cache memory as much as possible. The current code for Matlab backslash does Gaussian elimination with partial pivoting using such a block algorithm. As we see above this leads to a huge decrease in run times, more than a factor of 13.
The idea of block factorization (we will ignore pivoting initially for clarity) is based on the following block factorization of an n by n matrix A
Where, for a block size b, the matrices and are b by b, and are (nb) by b,
and are b by (nb) and and are (nb) by (nb). Let and We will call n by b matrices and “tall skinny matrices” since we will assume that b is much smaller than n. For example b might be 100 and n 2000 or larger.
As can be seen by multiplying the first block column of U times L, equation (0) implies that or that:
and are the LU factors of .
We can therefore calculate and by doing an LU factorization of the tall skinny matrix . This calculation can be done in cache memory, assuming the b is chosen small enough.
Next we multiply the first block row of L times U, which implies that . We therefore can calculate using:
.
This calculation can also be done in cache since and are “short (they only have b rows), fat (n columns)” matrices.
Finally, if in (0) we multiply the second block row of L by the second block row of U, it follows that or that . We break that up into two parts
and
andare the LU factors of .
The matrices in (4) are typically too large to fit in cache memory. For example if n = 2000 and b = 100, then and are 1900 by 1900 (more generally (nb) by (nb)). However, it is easy to separate the calculations in (3) into smaller pieces that do fit into cache memory. The structure of (3) is indicated by
The calculations for a submatrix of , for example the elements in rows 1001 to 1100 and columns 1001 to 1100, only require the same entries and the elements of rows 1001 to 1100 of and columns 1001 to 1100 of , for this example. The submatrix sizes can be selected so that these calculations fit into cache memory.
The calculations in (4) can be done by repeating the above procedure in a loop. The first pass through the loop we calculate the first block row and column of L and U applying the above procedure to the 2000 by 2000 matrix A (in our example). Next we apply the procedure to the 1900 by 1900 matrix , which produces the next block row and block column of L and U. Continuing we eventually factor the whole matrix.
The GEPP_BLOCK code in item 25 at the course web site implements these ideas in Matlab code. GEPP_BLOCK does a block factorization with partial pivoting. This requires careful book keeping to keep track of the permutations. This is included in the GEPP_BLOCK but we will not discuss the details here.
A block implementation of rook pivoting is also possible but due to the more complicated search in rook pivoting there will be more “cache misses”, where one needs to access memory outside of cache, than in partial pivoting. Finally, it appears that is not possible to do a block implementation of complete pivoting since complete pivoting requires a search of the entire unfinished portion of A, at every step of the factorization.
Here are some timings for the same 2000 by 2000 matrix used earlier, using compiled code (C and Fortran):
>> tic, x = gecp(A,b); toc % unblocked complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x=gerp(A,b); toc % blocked rook pivoting
Elapsed time is 1.367741 seconds.
>> tic, x=A\b; toc % blocked partial pivoting
Elapsed time is 0.900737 seconds.
In this experiment blocked rook pivoting requires about 50% more time than blocked partial pivoting and complete pivoting is more than 20 times slower than partial pivoting.
Note that the most time consuming part of blocked Gaussian elimination algorithm is the matrix multiplication in (3) or (5). Most computer manufacturers supply very efficient, machine language implementations of simpler matrix operations such as matrix multiplication. Such routines are called the Basic Linear Algebra Subroutines or the BLAS. The blocked Gaussian elimination algorithm facilitates structuring the code so that the code can call BLAS routines.
The above discussion represents the state of the art until very recently. For example, Matlab’s backslash for square linear systems calls an implementation of blocked Gaussian elimination with partial pivoting.
However, computers are evolving again:
PARALLEL PROCESSING, 2011 AND BEYOND
Computers with multicore processors and parallel processing addon computer cards are becoming increasingly common. For example, the NVIDIA Tessla cards contain hundreds or even thousands of processors. Currently, the more advanced parallel processing cards cost a few thousand dollars, but the prices will continue to fall. Soon massively parallel processing will not only be the domain of supercomputer; parallel processing will become ubiquitous. Our algorithms will need to adapt.
Parts of the block Gaussian elimination algorithm are easy to parallelize. For example, it is easy to divide up the calculations in (5) into independent pieces as described earlier by focusing on calculation of submatrices of . Each piece can then be sent to a separate processor. These processors can independently and simultaneous do their portion of the calculations. Since this procedure is relatively straightforward the calculations in (3) or (5) are sometimes called “embarrassingly parallel.” Similarly the calculations in (2) are easy to do in parallel.
If the calculations in (2) and (3) are done in parallel by hundreds or more processors, these calculations may be done quickly and the bottleneck in the computations may become the “tall skinny” LU factorization in step (1).
If this tall skinny LU factorization in (1) is done with partial pivoting, the problem is not “embarrassingly parallel”; it is difficult to do the pivoting in parallel efficiently. A key factor in the computation time is the communication between processors, that is moving data. This often dominates the arithmetic computations. Let us look at one part of the communications – the number of messages that need to be passed between processors  for partial pivoting of a tall skinny matrix. A separate, important issue is the volume of data passed between processors, but for simplicity we will not discuss that here. In some environments, the time to establish the new data connection, to initiate a new message, between processors can be a bottleneck.
To the left we have picture a potential division of a tall skinny matrix, which we call , to separate processors. In this case we have pictured m blocks in the matrix and assume that each block is a b by b matrix so that the complete tall skinny matrix is n by b with . Assume that each block to be assigned to a separate processor. Locating a pivot element in partial pivoting is a column by column operation. We can begin by finding the largest magnitude element in the first column. To do that we need to collect information from every processor. For example, each parallel processor could calculate the largest magnitude element of column one of its submatrix and then the m local maximum elements can be compared to determine the largest magnitude element. This would require sending m messages, one for each processor. The selection of pivot elements for a factorization of the entire n by b matrix would require the same number of messages for each column or a total of n= mb messages. There will be other information passed between processors for the elimination, but we will focus only on the pivot search. We conclude that

With this approach the pivot search in partial pivoting of the n by b matrix requires messages.
Recall that n can be large, thousands or more. There are other algorithms for implementing partial pivoting in parallel, but none of them minimizes the number of messages as well as an alternate pivoting scheme.
Tournament pivoting is a new (first published in 2011) pivoting scheme. Rather than working column by column as does partial pivoting, tournament pivoting selects pivot elements by having “competitions” between pairs of blocks of or pairs of sets of rows in a “tournament” which chooses an overall winner consisting of the best b rows of . These best rows are pivoted to the top of and Gaussian elimination with no pivoting is applied to the pivoted , so the that the winning rows of the tournament are the pivot rows.
The principle motivating the competitions between blocks of is that

Gaussian elimination with partial pivoting tends to move rows that are linearly independent to the top or the matrix.
We won’t try to prove this, but we can look at a specific example. Suppose the second row of a matrix is linearly dependent on the first row. When we do the elimination and add a multiple of the first row to the second row, the new row two will be entirely zero. When we select the pivot element for column two we search column two on or below the diagonal for the element furthest from zero. Therefore the current row two, which has zero on the diagonal, won’t remain at row two. So the linear dependent row two won’t remain at the top of the factorization.
With this in mind we can recast

the goal of pivoting is to select rows of the b by n matrixso that the first b rows of the pivoted matrix are linear independent.
We can now do this via a “tournament” involving blocks of (pictured here again):

The rows of and will participate in a competition by forming an LU factorization with partial pivoting to the 2b by b matrix . The rows moved to the top are the “winners” that is they are rows of B that more linearly independent, as determined by Gaussian elimination. Let W be the b by b matrix with the winning rows.

For k = 3 to m have the rows of (the “challengers”) compete with the rows of W (the current “champions”). The competition is done in a similar way to step 1: apply Gaussian elimination with partial pivoting to the 2b by b matrix . The b rows that are moved to the top are the new winners or the current “champions”.
The result will be the b most linearly independent rows (as determined by the tournament) of , as desired.
We should add that there are many other ways to carry out the “tournamemt” to choose b “winners.” The above tournament is a sequential style tournament where the current team of champions successively meet new challengers one after the other. A ‘real’ world example of this is on the Survivor TV shows where a player who is sent to an ‘exile’ island meets a new challenger every week. The winner goes on to the meet the next challenger. In the literature this is called a “flat tree” tournament.
Another style tournament is “tennis style.” A tennis style tournament is divided into rounds. On the first round all m teams play in m/2 “games.” The m/2 winning teams (or sets of rows in our case) are sent to the next round. At the second round there are m/ 4 games (or, in our case, LU factorizations on matrices of size 2b by b) and m/4 sets of rows advance. This is continued to quarterfinals, semifinals and to finals where the b “best” rows are selected. This is called a “binary tree” tournament in the literature. A binary tree tournament takes less time than a flat tree tournament (in a real life tournament or on a computer that can have simultaneous games, i.e. a parallel processing computer).
What is the communication cost of a tournament to determine the best, or most linear independent rows, of the n by b matrix . Recall that for simplicity we are only focusing on the number of messages. In a tournament we need to pass information between processors at the beginning of each game played. There are only m1 games played in a tournament of m teams (this is simple to see in the flat tournament case). Therefore
This is an order of magnitude less than required by the algorithm we described earlier for partial pivoting.
The paper “CALU: A Communication Optimal LU Factorization Algorithm,” by Laura Grigori, James Demmel, and Hua Xiang, SIAM J. Matrix Analysis and Applications, Vol. 32, pp. 13171350, 2011 proves that tournament pivoting is communication optimal in that the number messages required by the algorithm and the number of words in these messages achieve, within a modest factor, theoretical lower bounds on the amount of communication. The authors conclude:

“The reason to consider CALU is that it does an optimal amount of communication, and asymptotically less that Gaussian elimination with partial pivoting (GEPP), and so will be much faster on platforms where communication is expensive.”
Dostları ilə paylaş: 