root / www / algorithm.html
Historique | Voir | Annoter | Télécharger (13,99 ko)
1 | 1 | equemene | <HTML>
|
---|---|---|---|
2 | 1 | equemene | <HEAD>
|
3 | 1 | equemene | <TITLE>HPL Algorithm</TITLE> |
4 | 1 | equemene | </HEAD>
|
5 | 1 | equemene | |
6 | 1 | equemene | <BODY
|
7 | 1 | equemene | BGCOLOR = "WHITE" |
8 | 1 | equemene | BACKGROUND = "WHITE" |
9 | 1 | equemene | TEXT = "#000000" |
10 | 1 | equemene | VLINK = "#000099" |
11 | 1 | equemene | ALINK = "#947153" |
12 | 1 | equemene | LINK = "#0000ff"> |
13 | 1 | equemene | |
14 | 1 | equemene | <H2>HPL Algorithm</H2> |
15 | 1 | equemene | |
16 | 1 | equemene | <STRONG>
|
17 | 1 | equemene | This page provides a high-level description of the algorithm used in |
18 | 1 | equemene | this package. As indicated below, HPL contains in fact many possible |
19 | 1 | equemene | variants for various operations. Defaults could have been chosen, or |
20 | 1 | equemene | even variants could be selected during the execution. Due to the |
21 | 1 | equemene | performance requirements, it was decided to leave the user with the |
22 | 1 | equemene | opportunity of choosing, so that an "optimal" set of parameters could |
23 | 1 | equemene | easily be experimentally determined for a given machine configuration. |
24 | 1 | equemene | From a numerical accuracy point of view, <STRONG>all</STRONG> possible |
25 | 1 | equemene | combinations are rigorously equivalent to each other even though the |
26 | 1 | equemene | result may slightly differ (bit-wise). |
27 | 1 | equemene | </STRONG><BR><BR> |
28 | 1 | equemene | |
29 | 1 | equemene | <UL>
|
30 | 1 | equemene | <LI><A HREF="algorithm.html#main">Main Algorithm</A> |
31 | 1 | equemene | <LI><A HREF="algorithm.html#pfact">Panel Factorization</A> |
32 | 1 | equemene | <LI><A HREF="algorithm.html#bcast">Panel Broadcast</A> |
33 | 1 | equemene | <LI><A HREF="algorithm.html#look_ahead">Look-ahead</A> |
34 | 1 | equemene | <LI><A HREF="algorithm.html#update">Update</A> |
35 | 1 | equemene | <LI><A HREF="algorithm.html#trsv">Backward Substitution</A> |
36 | 1 | equemene | <LI><A HREF="algorithm.html#check">Checking the Solution</A> |
37 | 1 | equemene | </UL>
|
38 | 1 | equemene | <HR NOSHADE |
39 | 1 | equemene | |
40 | 1 | equemene | <H3<A ="main">Main Algorithm</A></H3> |
41 | 1 | equemene | |
42 | 1 | equemene | This software package solves a linear system of order n: A x = b by |
43 | 1 | equemene | first computing the LU factorization with row partial pivoting of the |
44 | 1 | equemene | n-by-n+1 coefficient matrix [A b] = [[L,U] y]. Since the lower triangular |
45 | 1 | equemene | factor L is applied to b as the factorization progresses, the solution x |
46 | 1 | equemene | is obtained by solving the upper triangular system U x = y. The lower |
47 | 1 | equemene | triangular matrix L is left unpivoted and the array of pivots is not |
48 | 1 | equemene | returned.<BR><BR> |
49 | 1 | equemene | |
50 | 1 | equemene | <TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0> |
51 | 1 | equemene | <TR>
|
52 | 1 | equemene | <TD ALIGN=LEFT> |
53 | 1 | equemene | The data is distributed onto a two-dimensional P-by-Q grid of processes |
54 | 1 | equemene | according to the block-cyclic scheme to ensure "good" load balance |
55 | 1 | equemene | as well as the scalability of the algorithm. The n-by-n+1 coefficient |
56 | 1 | equemene | matrix is first logically partitioned into nb-by-nb blocks, that are |
57 | 1 | equemene | cyclically "dealt" onto the P-by-Q process grid. This is done in both |
58 | 1 | equemene | dimensions of the matrix.</TD>
|
59 | 1 | equemene | <TD ALIGN=CENTER><IMG SRC = "mat2.jpg" BORDER=0 HEIGHT=165 WIDTH=340></TD> |
60 | 1 | equemene | </TR>
|
61 | 1 | equemene | </TABLE>
|
62 | 1 | equemene | <TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0> |
63 | 1 | equemene | <TR>
|
64 | 1 | equemene | <TD ALIGN=CENTER><IMG SRC ="main.jpg" BORDER=0 HEIGHT=165 WIDTH=165></TD> |
65 | 1 | equemene | <TD ALIGN=LEFT> |
66 | 1 | equemene | The right-looking variant has been chosen for the main loop of the LU |
67 | 1 | equemene | factorization. This means that at each iteration of the loop a panel of |
68 | 1 | equemene | nb columns is factorized, and the trailing submatrix is updated. Note |
69 | 1 | equemene | that this computation is thus logically partitioned with the same block |
70 | 1 | equemene | size nb that was used for the data distribution.</TD>
|
71 | 1 | equemene | </TR>
|
72 | 1 | equemene | </TABLE>
|
73 | 1 | equemene | <HR NOSHADE |
74 | 1 | equemene | |
75 | 1 | equemene | <H3<A ="pfact">Panel Factorization</A></H3> |
76 | 1 | equemene | |
77 | 1 | equemene | <TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=10> |
78 | 1 | equemene | <TR>
|
79 | 1 | equemene | <TD ALIGN=LEFT> |
80 | 1 | equemene | At a given iteration of the main loop, and because of the cartesian |
81 | 1 | equemene | property of the distribution scheme, each panel factorization occurs in |
82 | 1 | equemene | one column of processes. This particular part of the computation lies |
83 | 1 | equemene | on the critical path of the overall algorithm. The user is offered the |
84 | 1 | equemene | choice of three (Crout, left- and right-looking) matrix-multiply based |
85 | 1 | equemene | recursive variants. The software also allows the user to choose in how |
86 | 1 | equemene | many sub-panels the current panel should be divided into during the |
87 | 1 | equemene | recursion. Furthermore, one can also select at run-time the recursion |
88 | 1 | equemene | stopping criterium in terms of the number of columns left to factorize. |
89 | 1 | equemene | When this threshold is reached, the sub-panel will then be factorized |
90 | 1 | equemene | using one of the three Crout, left- or right-looking matrix-vector based |
91 | 1 | equemene | variant. Finally, for each panel column the pivot search, the associated |
92 | 1 | equemene | swap and broadcast operation of the pivot row are combined into one |
93 | 1 | equemene | single communication step. A binary-exchange (leave-on-all) reduction |
94 | 1 | equemene | performs these three operations at once.</TD>
|
95 | 1 | equemene | <TD ALIGN=CENTER><IMG SRC = "pfact.jpg" BORDER=0 HEIGHT=300 WIDTH=160></TD> |
96 | 1 | equemene | </TR>
|
97 | 1 | equemene | </TABLE>
|
98 | 1 | equemene | <HR NOSHADE |
99 | 1 | equemene | |
100 | 1 | equemene | <H3<A ="bcast">Panel Broadcast</A></H3> |
101 | 1 | equemene | |
102 | 1 | equemene | Once the panel factorization has been computed, this panel of columns |
103 | 1 | equemene | is broadcast to the other process columns. There are many possible |
104 | 1 | equemene | broadcast algorithms and the software currently offers 6 variants to |
105 | 1 | equemene | choose from. These variants are described below assuming that process 0 |
106 | 1 | equemene | is the source of the broadcast for convenience. "->" means "sends to".
|
107 | 1 | equemene | <UL>
|
108 | 1 | equemene | <LI><STRONG>Increasing-ring</STRONG>: 0 -> 1; 1 -> 2; 2 -> 3 and so on. |
109 | 1 | equemene | This algorithm is the classic one; it has the caveat that process 1 has |
110 | 1 | equemene | to send a message. |
111 | 1 | equemene | <CENTER>
|
112 | 1 | equemene | <IMG SRC="1ring.jpg"> |
113 | 1 | equemene | </CENTER>
|
114 | 1 | equemene | |
115 | 1 | equemene | <LI><STRONG>Increasing-ring (modified)</STRONG>: 0 -> 1; 0 -> 2; 2 -> 3 |
116 | 1 | equemene | and so on. Process 0 sends two messages and process 1 only receives one |
117 | 1 | equemene | message. This algorithm is almost always better, if not the best. |
118 | 1 | equemene | <CENTER>
|
119 | 1 | equemene | <IMG SRC="1rinM.jpg"> |
120 | 1 | equemene | </CENTER>
|
121 | 1 | equemene | |
122 | 1 | equemene | <LI><STRONG>Increasing-2-ring</STRONG>: The Q processes are divided into |
123 | 1 | equemene | two parts: 0 -> 1 and 0 -> Q/2; Then processes 1 and Q/2 act as sources |
124 | 1 | equemene | of two rings: 1 -> 2, Q/2 -> Q/2+1; 2 -> 3, Q/2+1 -> to Q/2+2 and so on. |
125 | 1 | equemene | This algorithm has the advantage of reducing the time by which the last |
126 | 1 | equemene | process will receive the panel at the cost of process 0 sending 2 |
127 | 1 | equemene | messages. |
128 | 1 | equemene | <CENTER>
|
129 | 1 | equemene | <IMG SRC="2ring.jpg"> |
130 | 1 | equemene | </CENTER>
|
131 | 1 | equemene | |
132 | 1 | equemene | <LI><STRONG>Increasing-2-ring (modified)</STRONG>: As one may expect, |
133 | 1 | equemene | first 0 -> 1, then the Q-1 processes left are divided into two equal
|
134 | 1 | equemene | parts: 0 -> 2 and 0 -> Q/2; Processes 2 and Q/2 act then as sources of |
135 | 1 | equemene | two rings: 2 -> 3, Q/2 -> Q/2+1; 3 -> 4, Q/2+1 -> to Q/2+2 and so on. |
136 | 1 | equemene | This algorithm is probably the most serious competitor to the increasing |
137 | 1 | equemene | ring modified variant. |
138 | 1 | equemene | <CENTER>
|
139 | 1 | equemene | <IMG SRC="2rinM.jpg"> |
140 | 1 | equemene | </CENTER>
|
141 | 1 | equemene | |
142 | 1 | equemene | <LI><STRONG>Long (bandwidth reducing)</STRONG>: as opposed to the |
143 | 1 | equemene | previous variants, this algorithm and its follower synchronize all |
144 | 1 | equemene | processes involved in the operation. The message is chopped into Q equal |
145 | 1 | equemene | pieces that are scattered across the Q processes. |
146 | 1 | equemene | <CENTER>
|
147 | 1 | equemene | <IMG SRC="spread.jpg"> |
148 | 1 | equemene | </CENTER>
|
149 | 1 | equemene | The pieces are then rolled in Q-1 steps. The scatter phase uses a binary |
150 | 1 | equemene | tree and the rolling phase exclusively uses mutual message exchanges. In |
151 | 1 | equemene | odd steps 0 <-> 1, 2 <-> 3, 4 <-> 5 and so on; in even steps Q-1 <-> 0, |
152 | 1 | equemene | 1 <-> 2, 3 <-> 4, 5 <-> 6 and so on. |
153 | 1 | equemene | <CENTER>
|
154 | 1 | equemene | <IMG SRC="roll.jpg"> |
155 | 1 | equemene | </CENTER>
|
156 | 1 | equemene | More messages are exchanged, however the total volume of communication is |
157 | 1 | equemene | independent of Q, making this algorithm particularly suitable for large |
158 | 1 | equemene | messages. This algorithm becomes competitive when the nodes are "very |
159 | 1 | equemene | fast" and the network (comparatively) "very slow".<BR><BR> |
160 | 1 | equemene | |
161 | 1 | equemene | <LI><STRONG>Long (bandwidth reducing modified)</STRONG>: same as above, |
162 | 1 | equemene | except that 0 -> 1 first, and then the Long variant is used on processes
|
163 | 1 | equemene | 0,2,3,4 .. Q-1.<BR><BR> |
164 | 1 | equemene | <CENTER>
|
165 | 1 | equemene | <IMG SRC="spreadM.jpg"> |
166 | 1 | equemene | <IMG SRC="rollM.jpg"> |
167 | 1 | equemene | </CENTER>
|
168 | 1 | equemene | |
169 | 1 | equemene | </UL>
|
170 | 1 | equemene | |
171 | 1 | equemene | The rings variants are distinguished by a probe mechanism that activates |
172 | 1 | equemene | them. In other words, a process involved in the broadcast and different |
173 | 1 | equemene | from the source asynchronously probes for the message to receive. When |
174 | 1 | equemene | the message is available the broadcast proceeds, and otherwise the |
175 | 1 | equemene | function returns. This allows to interleave the broadcast operation with |
176 | 1 | equemene | the update phase. This contributes to reduce the idle time spent by those |
177 | 1 | equemene | processes waiting for the factorized panel. This mechanism is necessary |
178 | 1 | equemene | to accomodate for various computation/communication performance ratio.<BR><BR> |
179 | 1 | equemene | <HR NOSHADE |
180 | 1 | equemene | |
181 | 1 | equemene | <H3<A ="look_ahead">Look-ahead</A></H3> |
182 | 1 | equemene | |
183 | 1 | equemene | Once the panel has been broadcast or say during this broadcast operation, |
184 | 1 | equemene | the trailing submatrix is updated using the last panel in the look-ahead |
185 | 1 | equemene | pipe: as mentioned before, the panel factorization lies on the critical |
186 | 1 | equemene | path, which means that when the kth panel has been factorized and then |
187 | 1 | equemene | broadcast, the next most urgent task to complete is the factorization and |
188 | 1 | equemene | broadcast of the k+1 th panel. This technique is often refered to as |
189 | 1 | equemene | "look-ahead" or "send-ahead" in the literature. This package allows to |
190 | 1 | equemene | select various "depth" of look-ahead. By convention, a depth of zero |
191 | 1 | equemene | corresponds to no lookahead, in which case the trailing submatrix is |
192 | 1 | equemene | updated by the panel currently broadcast. Look-ahead consumes some extra |
193 | 1 | equemene | memory to essentially keep all the panels of columns currently in the |
194 | 1 | equemene | look-ahead pipe. A look-ahead of depth 1 (maybe 2) is likely to achieve |
195 | 1 | equemene | the best performance gain.<BR><BR> |
196 | 1 | equemene | <HR NOSHADE |
197 | 1 | equemene | |
198 | 1 | equemene | <H3<A ="update">Update</A></H3> |
199 | 1 | equemene | |
200 | 1 | equemene | The update of the trailing submatrix by the last panel in the look-ahead |
201 | 1 | equemene | pipe is made of two phases. First, the pivots must be applied to form the |
202 | 1 | equemene | current row panel U. U should then be solved by the upper triangle of the |
203 | 1 | equemene | column panel. U finally needs to be broadcast to each process row so that |
204 | 1 | equemene | the local rank-nb update can take place. We choose to combine the |
205 | 1 | equemene | swapping and broadcast of U at the cost of replicating the solve. Two |
206 | 1 | equemene | algorithms are available for this communication operation. |
207 | 1 | equemene | <UL>
|
208 | 1 | equemene | <LI><STRONG>Binary-exchange</STRONG>: this is a modified variant of the |
209 | 1 | equemene | binary-exchange (leave on all) reduction operation. Every process column |
210 | 1 | equemene | performs the same operation. The algorithm essentially works as follows. |
211 | 1 | equemene | It pretends reducing the row panel U, but at the beginning the only valid |
212 | 1 | equemene | copy is owned by the current process row. The other process rows will |
213 | 1 | equemene | contribute rows of A they own that should be copied in U and replace them |
214 | 1 | equemene | with rows that were originally in the current process row. The complete |
215 | 1 | equemene | operation is performed in log(P) steps. For the sake of simplicity, let |
216 | 1 | equemene | assume that P is a power of two. At step k, process row p exchanges a |
217 | 1 | equemene | message with process row p+2^k. There are essentially two cases. First, |
218 | 1 | equemene | one of those two process rows has received U in a previous step. The |
219 | 1 | equemene | exchange occurs. One process swaps its local rows of A into U. Both |
220 | 1 | equemene | processes copy in U remote rows of A. Second, none of those process rows |
221 | 1 | equemene | has received U, the exchange occurs, and both processes simply add those |
222 | 1 | equemene | remote rows to the list they have accumulated so far. At each step, a |
223 | 1 | equemene | message of the size of U is exchanged by at least one pair of process |
224 | 1 | equemene | rows.<BR><BR> |
225 | 1 | equemene | |
226 | 1 | equemene | <LI><STRONG>Long</STRONG>: this is a bandwidth reducing variant |
227 | 1 | equemene | accomplishing the same task. The row panel is first spread (using a tree) |
228 | 1 | equemene | among the process rows with respect to the pivot array. This is a scatter |
229 | 1 | equemene | (V variant for MPI users). Locally, every process row then swaps these |
230 | 1 | equemene | rows with the the rows of A it owns and that belong to U. These buffers |
231 | 1 | equemene | are then rolled (P-1 steps) to finish the broadcast of U. Every process |
232 | 1 | equemene | row permutes U and proceed with the computational part of the update. A |
233 | 1 | equemene | couple of notes: process rows are logarithmically sorted before |
234 | 1 | equemene | spreading, so that processes receiving the largest number of rows are |
235 | 1 | equemene | first in the tree. This makes the communication volume optimal for this |
236 | 1 | equemene | phase. Finally, before rolling and after the local swap, an equilibration |
237 | 1 | equemene | phase occurs during which the local pieces of U are uniformly spread |
238 | 1 | equemene | across the process rows. A tree-based algorithm is used. This operation |
239 | 1 | equemene | is necessary to keep the rolling phase optimal even when the pivot rows |
240 | 1 | equemene | are not equally distributed in process rows. This algorithm has a |
241 | 1 | equemene | complexity in terms of communication volume that solely depends on the |
242 | 1 | equemene | size of U. In particular, the number of process rows only impacts the |
243 | 1 | equemene | number of messages exchanged. It will thus outperforms the previous |
244 | 1 | equemene | variant for large problems on large machine configurations.<BR><BR> |
245 | 1 | equemene | |
246 | 1 | equemene | </UL>
|
247 | 1 | equemene | |
248 | 1 | equemene | The user can select any of the two variants above. In addition, a mix is |
249 | 1 | equemene | possible as well. The "binary-exchange" algorithm will be used when U |
250 | 1 | equemene | contains at most a certain number of columns. Choosing at least the block |
251 | 1 | equemene | size nb as the threshold value is clearly recommended when look-ahead is |
252 | 1 | equemene | on.<BR><BR> |
253 | 1 | equemene | <HR NOSHADE |
254 | 1 | equemene | |
255 | 1 | equemene | <H3<A ="trsv">Backward Substitution</A></H3> |
256 | 1 | equemene | |
257 | 1 | equemene | The factorization has just now ended, the back-substitution remains to be |
258 | 1 | equemene | done. For this, we choose a look-ahead of depth one variant. The |
259 | 1 | equemene | right-hand-side is forwarded in process rows in a decreasing-ring |
260 | 1 | equemene | fashion, so that we solve Q * nb entries at a time. At each step, this |
261 | 1 | equemene | shrinking piece of the right-hand-side is updated. The process just above |
262 | 1 | equemene | the one owning the current diagonal block of the matrix A updates first |
263 | 1 | equemene | its last nb piece of x, forwards it to the previous process column, then |
264 | 1 | equemene | broadcast it in the process column in a decreasing-ring fashion as well. |
265 | 1 | equemene | The solution is then updated and sent to the previous process column. The |
266 | 1 | equemene | solution of the linear system is left replicated in every process row.<BR><BR> |
267 | 1 | equemene | <HR NOSHADE |
268 | 1 | equemene | |
269 | 1 | equemene | <H3<A ="check">Checking the Solution</A></H3> |
270 | 1 | equemene | |
271 | 1 | equemene | To verify the result obtained, the input matrix and right-hand side are |
272 | 1 | equemene | regenerated. The normwise backward error (see formula below) is then |
273 | 1 | equemene | computed. A solution is considered as "numerically correct" when this |
274 | 1 | equemene | quantity is less than a threshold value of the order of 1.0. In the |
275 | 1 | equemene | expression below, eps is the relative (distributed-memory) machine |
276 | 1 | equemene | precision. |
277 | 1 | equemene | |
278 | 1 | equemene | <UL>
|
279 | 1 | equemene | <LI>|| Ax - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )
|
280 | 1 | equemene | </UL>
|
281 | 1 | equemene | |
282 | 1 | equemene | <HR NOSHADE |
283 | 1 | equemene | <CENTER |
284 | 1 | equemene | <A = "index.html"> [Home]</A> |
285 | 1 | equemene | <A HREF = "copyright.html"> [Copyright and Licensing Terms]</A> |
286 | 1 | equemene | <A HREF = "algorithm.html"> [Algorithm]</A> |
287 | 1 | equemene | <A HREF = "scalability.html"> [Scalability]</A> |
288 | 1 | equemene | <A HREF = "results.html"> [Performance Results]</A> |
289 | 1 | equemene | <A HREF = "documentation.html"> [Documentation]</A> |
290 | 1 | equemene | <A HREF = "software.html"> [Software]</A> |
291 | 1 | equemene | <A HREF = "faqs.html"> [FAQs]</A> |
292 | 1 | equemene | <A HREF = "tuning.html"> [Tuning]</A> |
293 | 1 | equemene | <A HREF = "errata.html"> [Errata-Bugs]</A> |
294 | 1 | equemene | <A HREF = "references.html"> [References]</A> |
295 | 1 | equemene | <A HREF = "links.html"> [Related Links]</A><BR> |
296 | 1 | equemene | </CENTER>
|
297 | 1 | equemene | <HR NOSHADE |
298 | 1 | equemene | </BODY |
299 | 1 | equemene | </HTML |