<HTML>
<HEAD>
<TITLE>HPL Algorithm</TITLE>
</HEAD>

<BODY
BGCOLOR    = "WHITE"
BACKGROUND = "WHITE"
TEXT       = "#000000"
VLINK      = "#000099"
ALINK      = "#947153"
LINK       = "#0000ff">

<H2>HPL Algorithm</H2>

<STRONG>
This page provides a high-level description of the algorithm used in
this package. As indicated below, HPL in fact contains many possible
variants for various operations. Defaults could have been chosen, or
the variants could even be selected automatically at run time. Because
of the performance requirements, it was instead decided to leave the
choice to the user, so that an "optimal" set of parameters can easily
be determined experimentally for a given machine configuration. From a
numerical accuracy point of view, <STRONG>all</STRONG> possible
combinations are rigorously equivalent to each other even though the
results may differ slightly (bit-wise).
</STRONG><BR><BR>

<UL>
<LI><A HREF="algorithm.html#main">Main Algorithm</A>
<LI><A HREF="algorithm.html#pfact">Panel Factorization</A>
<LI><A HREF="algorithm.html#bcast">Panel Broadcast</A>
<LI><A HREF="algorithm.html#look_ahead">Look-ahead</A>
<LI><A HREF="algorithm.html#update">Update</A>
<LI><A HREF="algorithm.html#trsv">Backward Substitution</A>
<LI><A HREF="algorithm.html#check">Checking the Solution</A>
</UL>
<HR NOSHADE>

<H3><A NAME="main">Main Algorithm</A></H3>

This software package solves a linear system of order n: A x = b by
first computing the LU factorization with row partial pivoting of the
n-by-n+1 coefficient matrix [A b] = [[L,U] y]. Since the lower triangular
factor L is applied to b as the factorization progresses, the solution x
is obtained by solving the upper triangular system U x = y. The lower
triangular matrix L is left unpivoted and the array of pivots is not
returned.<BR><BR>
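
As a purely illustrative sketch (not HPL code, and with made-up 3-by-3
data), the small serial C program below carries out exactly this process:
an LU factorization with row partial pivoting of [A b], so that b is
transformed into y as the factorization progresses, followed by the
upper triangular solve U x = y.
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

#define N 3

int main(void)
{
   /* [A b] stored row by row; the last column is the right-hand side */
   double a[N][N + 1] = { { 2.0, 1.0, 1.0,  5.0 },
                          { 4.0, 3.0, 3.0, 13.0 },
                          { 8.0, 7.0, 9.0, 31.0 } };
   double x[N];
   int i, j, k;

   for (k = 0; k != N; ++k) {                /* right-looking elimination */
      int p = k;                             /* partial pivoting: search  */
      for (i = k + 1; i != N; ++i)           /* the pivot in column k     */
         if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
      for (j = 0; j != N + 1; ++j) {         /* swap rows k and p         */
         double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
      }
      for (i = k + 1; i != N; ++i) {         /* eliminate below the       */
         double lik = a[i][k] / a[k][k];     /* diagonal; b (the last     */
         for (j = k; j != N + 1; ++j)        /* column) is updated along  */
            a[i][j] -= lik * a[k][j];        /* with A, i.e. b becomes y  */
      }
   }
   for (i = N - 1; i >= 0; --i) {            /* backward substitution:    */
      x[i] = a[i][N];                        /* solve U x = y             */
      for (j = i + 1; j != N; ++j) x[i] -= a[i][j] * x[j];
      x[i] /= a[i][i];
   }
   printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* expect x = 1 2 1 */
   return 0;
}
</PRE>
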
<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
<TR>
<TD ALIGN=LEFT>
The data is distributed onto a two-dimensional P-by-Q grid of processes
according to the block-cyclic scheme to ensure "good" load balance
as well as the scalability of the algorithm. The n-by-n+1 coefficient
matrix is first logically partitioned into nb-by-nb blocks that are
cyclically "dealt" onto the P-by-Q process grid. This is done in both
dimensions of the matrix.</TD>
<TD ALIGN=CENTER><IMG SRC = "mat2.jpg" BORDER=0 HEIGHT=165 WIDTH=340></TD>
</TR>
</TABLE>
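
To make the block-cyclic mapping concrete, here is a tiny stand-alone C
sketch (illustrative only, with made-up values of P, Q and nb) that
computes, for one global entry A(i,j), the coordinates of the process
that owns it and the corresponding local indices.
<PRE>
#include &lt;stdio.h&gt;

int main(void)
{
   int P = 2, Q = 3, nb = 4;        /* process grid and block size   */
   int i = 10, j = 17;              /* global (row, column) indices  */

   int prow = ( i / nb ) % P;       /* process row owning row i      */
   int pcol = ( j / nb ) % Q;       /* process column owning col j   */
   int li   = ( i / ( P * nb ) ) * nb + i % nb;   /* local row index */
   int lj   = ( j / ( Q * nb ) ) * nb + j % nb;   /* local col index */

   printf( "A(%d,%d) lives on process (%d,%d) at local (%d,%d)\n",
           i, j, prow, pcol, li, lj );
   return 0;
}
</PRE>
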
<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
<TR>
<TD ALIGN=CENTER><IMG SRC = "main.jpg" BORDER=0 HEIGHT=165 WIDTH=165></TD>
<TD ALIGN=LEFT>
The right-looking variant has been chosen for the main loop of the LU
factorization. This means that at each iteration of the loop a panel of
nb columns is factorized, and the trailing submatrix is updated. Note
that this computation is thus logically partitioned with the same block
size nb that was used for the data distribution.</TD>
</TR>
</TABLE>
<HR NOSHADE>

<H3><A NAME="pfact">Panel Factorization</A></H3>

<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=10>
<TR>
<TD ALIGN=LEFT>
At a given iteration of the main loop, and because of the Cartesian
property of the distribution scheme, each panel factorization occurs in
one column of processes. This particular part of the computation lies
on the critical path of the overall algorithm. The user is offered a
choice of three matrix-multiply based recursive variants (Crout, left-
and right-looking). The software also allows the user to choose the
number of sub-panels the current panel should be divided into during
the recursion. Furthermore, one can select at run time the recursion
stopping criterion in terms of the number of columns left to factorize.
When this threshold is reached, the sub-panel is factorized using one
of the three matrix-vector based variants (Crout, left- or
right-looking). Finally, for each panel column, the pivot search, the
associated swap and the broadcast of the pivot row are combined into a
single communication step. A binary-exchange (leave-on-all) reduction
performs these three operations at once.</TD>
<TD ALIGN=CENTER><IMG SRC = "pfact.jpg" BORDER=0 HEIGHT=300 WIDTH=160></TD>
</TR>
</TABLE>
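
The pivot-search part of that combined step can be illustrated with a
generic MPI fragment: a leave-on-all (MAXLOC) reduction after which every
process of the column knows where the pivot lives. This is only a sketch
using MPI_Allreduce, not HPL's own binary-exchange routine; the function
and variable names below are made up.
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;
#include &lt;mpi.h&gt;

/* Return the global index of the row holding the largest |value| in the
 * current panel column.  my_val / my_row describe the best local
 * candidate of the calling process; col_comm spans one process column. */
static int pivot_row( double my_val, int my_row, MPI_Comm col_comm )
{
   struct { double val; int row; } in, out;

   in.val = fabs( my_val );
   in.row = my_row;
   MPI_Allreduce( &in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, col_comm );
   return out.row;    /* every process in the column learns the pivot row */
}

int main( int argc, char **argv )
{
   int rank, p;

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &rank );
   /* toy test: each process proposes its own rank as candidate value */
   p = pivot_row( (double)rank, rank, MPI_COMM_WORLD );
   if ( rank == 0 ) printf( "pivot row = %d\n", p );
   MPI_Finalize();
   return 0;
}
</PRE>
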
<HR NOSHADE>

<H3><A NAME="bcast">Panel Broadcast</A></H3>

Once the panel factorization has been computed, this panel of columns
is broadcast to the other process columns. There are many possible
broadcast algorithms and the software currently offers 6 variants to
choose from. These variants are described below, assuming for
convenience that process 0 is the source of the broadcast. "->" means
"sends to".
<UL>
<LI><STRONG>Increasing-ring</STRONG>: 0 -> 1; 1 -> 2; 2 -> 3 and so on.
This algorithm is the classic one; it has the drawback that process 1
has to send a message.
<CENTER>
<IMG SRC="1ring.jpg">
</CENTER>

<LI><STRONG>Increasing-ring (modified)</STRONG>: 0 -> 1; 0 -> 2; 2 -> 3
and so on. Process 0 sends two messages and process 1 only receives one
message. This algorithm is almost always better, if not the best.
<CENTER>
<IMG SRC="1rinM.jpg">
</CENTER>

<LI><STRONG>Increasing-2-ring</STRONG>: The Q processes are divided into
two parts: 0 -> 1 and 0 -> Q/2; then processes 1 and Q/2 act as sources
of two rings: 1 -> 2, Q/2 -> Q/2+1; 2 -> 3, Q/2+1 -> Q/2+2 and so on.
This algorithm has the advantage of reducing the time after which the
last process receives the panel, at the cost of process 0 sending 2
messages.
<CENTER>
<IMG SRC="2ring.jpg">
</CENTER>

<LI><STRONG>Increasing-2-ring (modified)</STRONG>: As one may expect,
first 0 -> 1, then the Q-1 remaining processes are divided into two
equal parts: 0 -> 2 and 0 -> Q/2; processes 2 and Q/2 then act as
sources of two rings: 2 -> 3, Q/2 -> Q/2+1; 3 -> 4, Q/2+1 -> Q/2+2 and
so on. This algorithm is probably the most serious competitor to the
increasing-ring (modified) variant.
<CENTER>
<IMG SRC="2rinM.jpg">
</CENTER>

<LI><STRONG>Long (bandwidth reducing)</STRONG>: as opposed to the
previous variants, this algorithm and the next one synchronize all
processes involved in the operation. The message is chopped into Q equal
pieces that are scattered across the Q processes.
<CENTER>
<IMG SRC="spread.jpg">
</CENTER>
The pieces are then rolled in Q-1 steps. The scatter phase uses a binary
tree and the rolling phase exclusively uses mutual message exchanges. In
odd steps 0 <-> 1, 2 <-> 3, 4 <-> 5 and so on; in even steps Q-1 <-> 0,
1 <-> 2, 3 <-> 4, 5 <-> 6 and so on.
<CENTER>
<IMG SRC="roll.jpg">
</CENTER>
More messages are exchanged; however, the total volume of communication
is independent of Q, making this algorithm particularly suitable for
large messages. It becomes competitive when the nodes are "very fast"
and the network is (comparatively) "very slow".<BR><BR>

<LI><STRONG>Long (bandwidth reducing modified)</STRONG>: same as above,
except that 0 -> 1 first, and then the Long variant is used on processes
0,2,3,4 .. Q-1.<BR><BR>
<CENTER>
<IMG SRC="spreadM.jpg">
<IMG SRC="rollM.jpg">
</CENTER>

</UL>

The ring variants are distinguished by a probe mechanism that activates
them. In other words, a process involved in the broadcast, other than
the source, asynchronously probes for the message to receive. When the
message is available the broadcast proceeds; otherwise the function
returns. This makes it possible to interleave the broadcast operation
with the update phase, which reduces the idle time spent by the
processes waiting for the factorized panel. This mechanism is necessary
to accommodate various computation/communication performance ratios.<BR><BR>
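
The sketch below shows what such a probe-activated variant could look
like for the increasing-ring (modified) pattern, written as a generic
MPI fragment. It is illustrative only (the routine name, the flat double
buffer and the tag are made up, and HPL's own broadcast interface is not
reproduced here): the routine is meant to be called repeatedly until it
reports completion, so that the caller can keep updating between calls.
<PRE>
#include &lt;mpi.h&gt;

/* Returns 1 once this process is done with its part of the broadcast,
 * 0 if the panel has not arrived yet (call again later).             */
int ring_modified_bcast( double *panel, int count, MPI_Comm row_comm )
{
   int rank, size, flag, next, tag = 0;

   MPI_Comm_rank( row_comm, &rank );
   MPI_Comm_size( row_comm, &size );

   if ( rank == 0 )                     /* source: 0 -> 1, then 0 -> 2 */
   {
      if ( size > 1 ) MPI_Send( panel, count, MPI_DOUBLE, 1, tag, row_comm );
      if ( size > 2 ) MPI_Send( panel, count, MPI_DOUBLE, 2, tag, row_comm );
      return 1;
   }
   /* non-source: probe only, so that the update can proceed while the
    * panel is still in flight                                          */
   MPI_Iprobe( MPI_ANY_SOURCE, tag, row_comm, &flag, MPI_STATUS_IGNORE );
   if ( !flag ) return 0;               /* message not there yet        */

   MPI_Recv( panel, count, MPI_DOUBLE, MPI_ANY_SOURCE, tag, row_comm,
             MPI_STATUS_IGNORE );
   next = rank + 1;                     /* forward: 2 -> 3, 3 -> 4, ... */
   if ( rank != 1 && next != size )
      MPI_Send( panel, count, MPI_DOUBLE, next, tag, row_comm );
   return 1;
}
</PRE>
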
<HR NOSHADE>

<H3><A NAME="look_ahead">Look-ahead</A></H3>

Once the panel has been broadcast, or rather while it is being broadcast,
the trailing submatrix is updated using the last panel in the look-ahead
pipe: as mentioned before, the panel factorization lies on the critical
path, which means that when the kth panel has been factorized and then
broadcast, the next most urgent task to complete is the factorization and
broadcast of the (k+1)th panel. This technique is often referred to as
"look-ahead" or "send-ahead" in the literature. This package allows
various depths of look-ahead to be selected. By convention, a depth of
zero corresponds to no look-ahead, in which case the trailing submatrix
is updated by the panel currently being broadcast. Look-ahead consumes
some extra memory to keep all the panels of columns currently in the
look-ahead pipe. A look-ahead of depth 1 (maybe 2) is likely to achieve
the best performance gain.<BR><BR>
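
The resulting ordering of operations is perhaps easiest to see on a
small schedule. The toy C program below simply prints the sequence of
steps for a look-ahead of depth 1 on a handful of panels; the routine
names it prints are descriptive placeholders, not HPL functions.
<PRE>
#include &lt;stdio.h&gt;

int main(void)
{
   int k, npanels = 4;

   printf("factor(0); broadcast(0);\n");
   for (k = 0; k + 1 != npanels; ++k) {
      /* the columns belonging to panel k+1 are updated first, so that
         its factorization and broadcast can start as early as possible */
      printf("update_panel(%d, with %d); factor(%d); broadcast(%d); ",
             k + 1, k, k + 1, k + 1);
      /* the rest of the trailing submatrix is updated afterwards */
      printf("update_rest(with %d);\n", k);
   }
   printf("update_rest(with %d);\n", npanels - 1);
   return 0;
}
</PRE>
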
<HR NOSHADE>

<H3><A NAME="update">Update</A></H3>

The update of the trailing submatrix by the last panel in the look-ahead
pipe consists of two phases. First, the pivots must be applied to form
the current row panel U. U must then be solved by the upper triangle of
the column panel. Finally, U needs to be broadcast to each process row
so that the local rank-nb update can take place. We choose to combine
the swapping and broadcast of U at the cost of replicating the solve.
Two algorithms are available for this communication operation.
<UL>
<LI><STRONG>Binary-exchange</STRONG>: this is a modified variant of the
binary-exchange (leave-on-all) reduction operation. Every process column
performs the same operation. The algorithm essentially works as follows.
It pretends to reduce the row panel U, but at the beginning the only
valid copy is owned by the current process row. The other process rows
contribute rows of A they own that should be copied into U, and replace
them with rows that were originally in the current process row. The
complete operation is performed in log(P) steps. For the sake of
simplicity, let us assume that P is a power of two. At step k, process
row p exchanges a message with process row p+2^k. There are essentially
two cases. First, one of those two process rows has received U in a
previous step: the exchange occurs, one process swaps its local rows of
A into U, and both processes copy the remote rows of A into U. Second,
neither of those process rows has received U: the exchange occurs, and
both processes simply add the remote rows to the list they have
accumulated so far. At each step, a message of the size of U is
exchanged by at least one pair of process rows. (A small sketch of this
pairing is given after the list.)<BR><BR>

227 |
accomplishing the same task. The row panel is first spread (using a tree) |
228 |
among the process rows with respect to the pivot array. This is a scatter |
229 |
(V variant for MPI users). Locally, every process row then swaps these |
230 |
rows with the the rows of A it owns and that belong to U. These buffers |
231 |
are then rolled (P-1 steps) to finish the broadcast of U. Every process |
232 |
row permutes U and proceed with the computational part of the update. A |
233 |
couple of notes: process rows are logarithmically sorted before |
234 |
spreading, so that processes receiving the largest number of rows are |
235 |
first in the tree. This makes the communication volume optimal for this |
236 |
phase. Finally, before rolling and after the local swap, an equilibration |
237 |
phase occurs during which the local pieces of U are uniformly spread |
238 |
across the process rows. A tree-based algorithm is used. This operation |
239 |
is necessary to keep the rolling phase optimal even when the pivot rows |
240 |
are not equally distributed in process rows. This algorithm has a |
241 |
complexity in terms of communication volume that solely depends on the |
242 |
size of U. In particular, the number of process rows only impacts the |
243 |
number of messages exchanged. It will thus outperforms the previous |
244 |
variant for large problems on large machine configurations.<BR><BR> |
245 |
|
246 |
</UL>
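
For the binary-exchange variant, the pairing of the process rows over
the log(P) steps can be sketched with a few lines of C (illustrative
only; P is a made-up power of two, and the partner of row p at a given
step is the row whose index differs from p in exactly one bit, i.e. p
and p+2^k are paired when bit k of p is zero).
<PRE>
#include &lt;stdio.h&gt;

int main(void)
{
   int P = 8;                  /* number of process rows (power of two) */
   int k = 0;                  /* step counter                          */
   int step, p;

   for (step = 1; step != P; step *= 2, ++k)      /* log2(P) steps      */
      for (p = 0; p != P; ++p)
         printf("step %d: row %d exchanges with row %d\n",
                k, p, p ^ step);
   return 0;
}
</PRE>
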
The user can select either of the two variants above. In addition, a mix
is possible as well: the binary-exchange algorithm is then used when U
contains at most a certain number of columns. Choosing at least the block
size nb as the threshold value is clearly recommended when look-ahead is
on.<BR><BR>
<HR NOSHADE>

<H3><A NAME="trsv">Backward Substitution</A></H3>

Once the factorization has ended, the backward substitution remains to
be done. For this, we chose a variant with a look-ahead of depth one.
The right-hand side is forwarded along the process rows in a
decreasing-ring fashion, so that we solve Q * nb entries at a time. At
each step, this shrinking piece of the right-hand side is updated. The
process just above the one owning the current diagonal block of the
matrix A updates first its last nb piece of x, forwards it to the
previous process column, and then broadcasts it within its process
column, in a decreasing-ring fashion as well. The solution is then
updated and sent to the previous process column. The solution of the
linear system is left replicated in every process row.<BR><BR>
<HR NOSHADE>

<H3><A NAME="check">Checking the Solution</A></H3>

To verify the result obtained, the input matrix and right-hand side are
regenerated. The normwise backward error (see the formula below) is then
computed. A solution is considered "numerically correct" when this
quantity is less than a threshold value of the order of 1.0. In the
expression below, eps is the relative (distributed-memory) machine
precision.
<UL>
<LI>|| A x - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )
</UL>
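
The quantity is straightforward to evaluate once the infinity norms are
available. The toy C program below does so for a made-up 2-by-2 system,
using DBL_EPSILON as a stand-in for the distributed-memory machine
precision computed by the package.
<PRE>
#include &lt;float.h&gt;
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

static double norm_inf( const double *v, int n )      /* max |v[i]| */
{
   double m = 0.0;
   int i;
   for (i = 0; i != n; ++i)
      if (fabs(v[i]) > m) m = fabs(v[i]);
   return m;
}

int main(void)
{
   /* A x = b with A = [2 1; 1 3], b = [3; 4]; take x = [1; 1] as the
      "computed" solution (it happens to be exact here)               */
   double A[2][2] = { { 2.0, 1.0 }, { 1.0, 3.0 } };
   double b[2]    = { 3.0, 4.0 };
   double x[2]    = { 1.0, 1.0 };
   double r[2], normA = 0.0, resid;
   int i, j, n = 2;

   for (i = 0; i != n; ++i) {                 /* residual r = A x - b */
      r[i] = -b[i];
      for (j = 0; j != n; ++j) r[i] += A[i][j] * x[j];
   }
   for (i = 0; i != n; ++i) {                 /* || A ||_oo           */
      double s = 0.0;
      for (j = 0; j != n; ++j) s += fabs(A[i][j]);
      if (s > normA) normA = s;
   }
   resid = norm_inf(r, n) /
      ( DBL_EPSILON * ( normA * norm_inf(x, n) + norm_inf(b, n) ) * n );
   printf("normwise backward error = %g (accept if of order 1)\n", resid);
   return 0;
}
</PRE>
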
<HR NOSHADE>
<CENTER>
<A HREF = "index.html"> [Home]</A>
<A HREF = "copyright.html"> [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html"> [Algorithm]</A>
<A HREF = "scalability.html"> [Scalability]</A>
<A HREF = "results.html"> [Performance Results]</A>
<A HREF = "documentation.html"> [Documentation]</A>
<A HREF = "software.html"> [Software]</A>
<A HREF = "faqs.html"> [FAQs]</A>
<A HREF = "tuning.html"> [Tuning]</A>
<A HREF = "errata.html"> [Errata-Bugs]</A>
<A HREF = "references.html"> [References]</A>
<A HREF = "links.html"> [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>