<HTML>
<HEAD>
<TITLE>HPL Algorithm</TITLE>
</HEAD>

<BODY
BGCOLOR     = "WHITE"
BACKGROUND  = "WHITE"
TEXT        = "#000000"
VLINK       = "#000099"
ALINK       = "#947153"
LINK        = "#0000ff">

<H2>HPL Algorithm</H2>

<STRONG>
This page provides a high-level description of the algorithm used in
this package. As indicated below, HPL in fact provides many possible
variants for various operations. Defaults could have been chosen, or
the variants could even be selected automatically during execution.
Because of the performance requirements, it was instead decided to
leave the choice to the user, so that an "optimal" set of parameters
can easily be determined experimentally for a given machine
configuration. From a numerical accuracy point of view,
<STRONG>all</STRONG> possible combinations are rigorously equivalent
to each other, even though the results may differ slightly (bit-wise).
</STRONG><BR><BR>

<UL>
<LI><A HREF="algorithm.html#main">Main Algorithm</A>
<LI><A HREF="algorithm.html#pfact">Panel Factorization</A>
<LI><A HREF="algorithm.html#bcast">Panel Broadcast</A>
<LI><A HREF="algorithm.html#look_ahead">Look-ahead</A>
<LI><A HREF="algorithm.html#update">Update</A>
<LI><A HREF="algorithm.html#trsv">Backward Substitution</A>
<LI><A HREF="algorithm.html#check">Checking the Solution</A>
</UL>
<HR NOSHADE>

<H3><A NAME="main">Main Algorithm</A></H3>

This software package solves a linear system of order n, A x = b, by
first computing the LU factorization with row partial pivoting of the
n-by-(n+1) coefficient matrix [A b] = [[L,U] y]. Since the lower
triangular factor L is applied to b as the factorization progresses,
the solution x is obtained by solving the upper triangular system
U x = y. The lower triangular matrix L is left unpivoted and the
array of pivots is not returned.<BR><BR>

<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
<TR>
<TD ALIGN=LEFT>
The data is distributed onto a two-dimensional P-by-Q grid of
processes according to the block-cyclic scheme to ensure "good" load
balance as well as the scalability of the algorithm. The n-by-(n+1)
coefficient matrix is first logically partitioned into nb-by-nb
blocks that are cyclically "dealt" onto the P-by-Q process grid. This
is done in both dimensions of the matrix.</TD>
<TD ALIGN=CENTER><IMG SRC = "mat2.jpg" BORDER=0 HEIGHT=165 WIDTH=340></TD>
</TR>
</TABLE>
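The block-cyclic mapping described above can be sketched as follows. This is an illustrative model only, not HPL code; the helper name and its signature are invented for this example.

```python
# Illustrative sketch (not HPL code): which process of a P-by-Q grid
# owns a given global matrix entry when the matrix is dealt out in
# nb-by-nb blocks in a block-cyclic fashion.

def owning_process(i, j, nb, P, Q):
    """Grid coordinates (p, q) of the process owning global entry (i, j)."""
    return (i // nb) % P, (j // nb) % Q
```

For example, with nb = 2 on a 2-by-3 grid, entry (5, 7) lies in block (2, 3), which is dealt to process (2 mod 2, 3 mod 3) = (0, 0).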
<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
<TR>
<TD ALIGN=CENTER><IMG SRC ="main.jpg" BORDER=0 HEIGHT=165 WIDTH=165></TD>
<TD ALIGN=LEFT>
The right-looking variant has been chosen for the main loop of the LU
factorization. This means that at each iteration of the loop a panel
of nb columns is factorized, and the trailing submatrix is updated.
Note that this computation is thus logically partitioned with the
same block size nb that was used for the data distribution.</TD>
</TR>
</TABLE>
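The right-looking loop can be sketched sequentially as follows. This is a minimal sketch, not HPL's code: the row pivoting is omitted for brevity (HPL pivots within each panel), and the matrix here is assumed diagonally dominant so that no pivoting is needed.

```python
# Minimal sequential sketch (not HPL code) of the right-looking
# blocked LU loop: factorize a panel of nb columns, solve the row
# panel, then apply a rank-nb update to the trailing submatrix.
import numpy as np

def blocked_lu(A, nb):
    """In-place right-looking LU without pivoting: A becomes L\\U."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel factorization: unblocked LU of the column panel A[k:, k:e].
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        # Row panel: forward-substitute U12 = L11^{-1} A12.
        for j in range(k, e):
            A[j, e:] -= A[j, k:j] @ A[k:j, e:]
        # Trailing-submatrix rank-nb update.
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```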
<HR NOSHADE>

<H3><A NAME="pfact">Panel Factorization</A></H3>

<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=10>
<TR>
<TD ALIGN=LEFT>
At a given iteration of the main loop, and because of the Cartesian
property of the distribution scheme, each panel factorization occurs
in one column of processes. This particular part of the computation
lies on the critical path of the overall algorithm. The user is
offered the choice of three (Crout, left- and right-looking)
matrix-multiply based recursive variants. The software also allows
the user to choose into how many sub-panels the current panel should
be divided during the recursion. Furthermore, one can also select at
run-time the recursion stopping criterion in terms of the number of
columns left to factorize. When this threshold is reached, the
sub-panel is factorized using one of the three Crout, left- or
right-looking matrix-vector based variants. Finally, for each panel
column, the pivot search, the associated swap, and the broadcast
operation of the pivot row are combined into one single communication
step. A binary-exchange (leave-on-all) reduction performs these three
operations at once.</TD>
<TD ALIGN=CENTER><IMG SRC = "pfact.jpg" BORDER=0 HEIGHT=300 WIDTH=160></TD>
</TR>
</TABLE>
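The recursion just described can be sketched as a control-flow skeleton. This shows only the structure that HPL's tuning parameters steer; the function and parameter names (`panel_factor`, `ndiv`, `nbmin`) are illustrative stand-ins, the actual factorization kernels are elided, and the split rule is a simplification of HPL's.

```python
# Illustrative control-flow sketch (not HPL code): recursively split
# the panel, switching to an unblocked (matrix-vector based) kernel
# once the stopping criterion of nbmin or fewer columns is reached.

def panel_factor(cols, ndiv, nbmin, trace):
    if cols <= nbmin:
        trace.append(("unblocked", cols))   # recursion stopping criterion
        return
    left = (cols + ndiv - 1) // ndiv        # first sub-panel
    panel_factor(left, ndiv, nbmin, trace)
    trace.append(("update", cols - left))   # apply its factor to the rest
    panel_factor(cols - left, ndiv, nbmin, trace)

trace = []
panel_factor(16, 2, 4, trace)
```

Every column of the panel ends up being factorized by exactly one unblocked call, so the "unblocked" column counts in the trace sum to the panel width.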
<HR NOSHADE>

<H3><A NAME="bcast">Panel Broadcast</A></H3>

Once the panel factorization has been computed, this panel of columns
is broadcast to the other process columns. There are many possible
broadcast algorithms, and the software currently offers six variants
to choose from. These variants are described below, assuming for
convenience that process 0 is the source of the broadcast. "->" means
"sends to".
<UL>
<LI><STRONG>Increasing-ring</STRONG>: 0 -> 1; 1 -> 2; 2 -> 3 and so
on. This algorithm is the classic one; it has the caveat that
process 1 has to send a message.
<CENTER>
<IMG SRC="1ring.jpg">
</CENTER>


    
<LI><STRONG>Increasing-ring (modified)</STRONG>: 0 -> 1; 0 -> 2;
2 -> 3 and so on. Process 0 sends two messages and process 1 only
receives one message. This algorithm is almost always better, if not
the best.
<CENTER>
<IMG SRC="1rinM.jpg">
</CENTER>

<LI><STRONG>Increasing-2-ring</STRONG>: The Q processes are divided
into two parts: 0 -> 1 and 0 -> Q/2; then processes 1 and Q/2 act as
sources of two rings: 1 -> 2, Q/2 -> Q/2+1; 2 -> 3, Q/2+1 -> Q/2+2
and so on. This algorithm has the advantage of reducing the time by
which the last process receives the panel, at the cost of process 0
sending two messages.
<CENTER>
<IMG SRC="2ring.jpg">
</CENTER>

<LI><STRONG>Increasing-2-ring (modified)</STRONG>: As one may expect,
first 0 -> 1, then the remaining Q-1 processes are divided into two
equal parts: 0 -> 2 and 0 -> Q/2; processes 2 and Q/2 then act as
sources of two rings: 2 -> 3, Q/2 -> Q/2+1; 3 -> 4, Q/2+1 -> Q/2+2
and so on. This algorithm is probably the most serious competitor of
the increasing-ring (modified) variant.
<CENTER>
<IMG SRC="2rinM.jpg">
</CENTER>

<LI><STRONG>Long (bandwidth reducing)</STRONG>: As opposed to the
previous variants, this algorithm and the next one synchronize all
processes involved in the operation. The message is chopped into Q
equal pieces that are scattered across the Q processes.
<CENTER>
<IMG SRC="spread.jpg">
</CENTER>
The pieces are then rolled in Q-1 steps. The scatter phase uses a
binary tree and the rolling phase exclusively uses mutual message
exchanges. In odd steps 0 <-> 1, 2 <-> 3, 4 <-> 5 and so on; in even
steps Q-1 <-> 0, 1 <-> 2, 3 <-> 4, 5 <-> 6 and so on.
<CENTER>
<IMG SRC="roll.jpg">
</CENTER>
More messages are exchanged, but the total volume of communication is
independent of Q, making this algorithm particularly suitable for
large messages. It becomes competitive when the nodes are "very fast"
and the network (comparatively) "very slow".<BR><BR>

<LI><STRONG>Long (bandwidth reducing, modified)</STRONG>: Same as
above, except that 0 -> 1 first, and then the Long variant is used on
processes 0, 2, 3, 4, ..., Q-1.<BR><BR>
<CENTER>
<IMG SRC="spreadM.jpg">
<IMG SRC="rollM.jpg">
</CENTER>

</UL>
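The two simplest schedules above can be written down explicitly as (source, destination) pairs. This is an illustrative sketch, not HPL's implementation; the function names are invented, and the modified variant assumes Q >= 3.

```python
# Illustrative sketch (not HPL code): per-step send schedules of the
# two simplest ring variants, for Q processes with process 0 as the
# source of the broadcast.

def increasing_ring(Q):
    # 0 -> 1; 1 -> 2; 2 -> 3; ...  (process 1 has to send)
    return [(p, p + 1) for p in range(Q - 1)]

def increasing_ring_modified(Q):
    # 0 -> 1 and 0 -> 2; process 1 never has to send.  Assumes Q >= 3.
    return [(0, 1), (0, 2)] + [(p, p + 1) for p in range(2, Q - 1)]
```

For Q = 5 the classic schedule is 0->1, 1->2, 2->3, 3->4; the modified schedule replaces 1->2 with a second send 0->2, so process 1 only receives.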

The ring variants are driven by a probe mechanism that activates
them: a process involved in the broadcast and different from the
source asynchronously probes for the message to receive. When the
message is available the broadcast proceeds; otherwise the function
returns. This makes it possible to interleave the broadcast operation
with the update phase, which helps reduce the idle time spent by
processes waiting for the factorized panel. Such a mechanism is
necessary to accommodate various computation/communication
performance ratios.<BR><BR>
<HR NOSHADE>

<H3><A NAME="look_ahead">Look-ahead</A></H3>

Once the panel has been broadcast, or even during this broadcast
operation, the trailing submatrix is updated using the last panel in
the look-ahead pipe: as mentioned before, the panel factorization
lies on the critical path, which means that when the kth panel has
been factorized and then broadcast, the next most urgent task to
complete is the factorization and broadcast of the (k+1)th panel.
This technique is often referred to as "look-ahead" or "send-ahead"
in the literature. This package allows the user to select various
depths of look-ahead. By convention, a depth of zero corresponds to
no look-ahead, in which case the trailing submatrix is updated by the
panel currently being broadcast. Look-ahead consumes some extra
memory to keep all the panels of columns currently in the look-ahead
pipe. A look-ahead of depth 1 (maybe 2) is likely to achieve the best
performance gain.<BR><BR>
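The reordering of work that look-ahead produces can be sketched as a toy schedule. This is an illustration of the operation ordering only, not HPL's scheduler; the function name and the string labels are invented for this example.

```python
# Illustrative sketch (not HPL code): with look-ahead depth d, the
# factorization/broadcast of a panel is issued d iterations before
# the bulk update that consumes it.

def lookahead_schedule(npanels, depth):
    ops = []
    for k in range(min(depth, npanels)):     # prime the look-ahead pipe
        ops.append(f"factor+bcast {k}")
    for k in range(npanels):
        if k + depth < npanels:              # most urgent task first
            ops.append(f"factor+bcast {k + depth}")
        ops.append(f"update with {k}")
    return ops
```

With depth 0 the factor/broadcast and update of each panel simply alternate; with depth 1 each panel's factorization is issued one iteration ahead of the update that uses it.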
<HR NOSHADE>

<H3><A NAME="update">Update</A></H3>

The update of the trailing submatrix by the last panel in the
look-ahead pipe consists of two phases. First, the pivots must be
applied to form the current row panel U. U must then be solved by the
upper triangle of the column panel. Finally, U needs to be broadcast
to each process row so that the local rank-nb update can take place.
We choose to combine the swapping and broadcast of U at the cost of
replicating the solve. Two algorithms are available for this
communication operation.
<UL>
<LI><STRONG>Binary-exchange</STRONG>: This is a modified variant of
the binary-exchange (leave-on-all) reduction operation. Every process
column performs the same operation. The algorithm essentially works
as follows. It proceeds as if it were reducing the row panel U, but
at the beginning the only valid copy is owned by the current process
row. The other process rows contribute rows of A they own that should
be copied into U, and replace them with rows that were originally in
the current process row. The complete operation is performed in
log(P) steps. For the sake of simplicity, let us assume that P is a
power of two. At step k, process row p exchanges a message with
process row p+2^k. There are essentially two cases. First, one of
those two process rows has received U in a previous step. The
exchange occurs; one process swaps its local rows of A into U, and
both processes copy remote rows of A into U. Second, neither of those
process rows has received U; the exchange occurs, and both processes
simply add those remote rows to the list they have accumulated so
far. At each step, a message of the size of U is exchanged by at
least one pair of process rows.<BR><BR>

<LI><STRONG>Long</STRONG>: This is a bandwidth-reducing variant
accomplishing the same task. The row panel is first spread (using a
tree) among the process rows with respect to the pivot array. This is
a scatter (V variant for MPI users). Locally, every process row then
swaps these rows with the rows of A it owns that belong to U. These
buffers are then rolled (P-1 steps) to finish the broadcast of U.
Every process row permutes U and proceeds with the computational part
of the update. A couple of notes: process rows are logarithmically
sorted before spreading, so that processes receiving the largest
number of rows are first in the tree. This makes the communication
volume optimal for this phase. Finally, before rolling and after the
local swap, an equilibration phase occurs, during which the local
pieces of U are uniformly spread across the process rows. A
tree-based algorithm is used. This operation is necessary to keep the
rolling phase optimal even when the pivot rows are not equally
distributed across process rows. This algorithm has a complexity in
terms of communication volume that depends solely on the size of U.
In particular, the number of process rows only impacts the number of
messages exchanged. It will thus outperform the previous variant for
large problems on large machine configurations.<BR><BR>

</UL>
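The communication pattern of the binary-exchange variant can be sketched as follows. This is an illustration only, not HPL code: it shows the pairing pattern for P a power of two, using the XOR pairing (partner differs in bit k) so that every process row is paired at every step.

```python
# Illustrative sketch (not HPL code): pairing pattern of the
# binary-exchange variant for P process rows, P a power of two.
# At step k, row p exchanges a message with the row whose index
# differs from p in bit k; there are log2(P) steps in total.
import math

def exchange_steps(P):
    steps = []
    for k in range(int(math.log2(P))):
        pairs = {tuple(sorted((p, p ^ (1 << k)))) for p in range(P)}
        steps.append(sorted(pairs))
    return steps
```

For P = 4 this gives two steps: pairs (0,1), (2,3), then pairs (0,2), (1,3); at each step a message of the size of U crosses every pair.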

The user can select either of the two variants above. In addition, a
mix of the two is possible as well: the binary-exchange algorithm is
used when U contains at most a certain number of columns. Choosing at
least the block size nb as the threshold value is clearly recommended
when look-ahead is on.<BR><BR>
<HR NOSHADE>

<H3><A NAME="trsv">Backward Substitution</A></H3>

Once the factorization has ended, the backward substitution remains
to be done. For this, we choose a look-ahead of depth one variant.
The right-hand side is forwarded in process rows in a decreasing-ring
fashion, so that we solve Q * nb entries at a time. At each step,
this shrinking piece of the right-hand side is updated. The process
just above the one owning the current diagonal block of the matrix A
updates its last nb piece of x first, forwards it to the previous
process column, and then broadcasts it in the process column in a
decreasing-ring fashion as well. The solution is then updated and
sent to the previous process column. The solution of the linear
system is left replicated in every process row.<BR><BR>
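The numerical kernel underlying this phase can be sketched sequentially. This is a minimal sketch of an nb-blocked backward substitution solving U x = y, not HPL's code; the distributed decreasing-ring pipelining described above is elided, and the function name is invented.

```python
# Illustrative sketch (not HPL code): sequential nb-blocked backward
# substitution solving the upper triangular system U x = y, block by
# block from the bottom up.
import numpy as np

def blocked_back_substitution(U, y, nb):
    n = len(y)
    x = y.astype(float).copy()
    for s in range(((n - 1) // nb) * nb, -1, -nb):
        e = min(s + nb, n)
        for j in range(e - 1, s - 1, -1):      # solve the diagonal nb block
            x[j] = (x[j] - U[j, j + 1:e] @ x[j + 1:e]) / U[j, j]
        x[:s] -= U[:s, s:e] @ x[s:e]           # update the rhs above it
    return x
```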
<HR NOSHADE>

<H3><A NAME="check">Checking the Solution</A></H3>

To verify the result obtained, the input matrix and right-hand side
are regenerated. The normwise backward error (see the formula below)
is then computed. A solution is considered "numerically correct" when
this quantity is less than a threshold value of the order of 1.0. In
the expression below, eps is the relative (distributed-memory)
machine precision.

<UL>
<LI>|| Ax - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )
</UL>
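The acceptance test above can be sketched as follows. This is an illustration, not HPL's verification code: eps is taken as the double-precision machine epsilon rather than the distributed value HPL computes, and 16.0 is used as a typical threshold "of the order of 1.0".

```python
# Illustrative sketch (not HPL code) of the normwise backward error
# test:  || Ax - b ||_oo / ( eps * ( ||A||_oo ||x||_oo + ||b||_oo ) * n )
import numpy as np

def solution_is_correct(A, x, b, threshold=16.0):
    eps = np.finfo(float).eps           # relative machine precision
    n = len(b)
    resid = np.linalg.norm(A @ x - b, np.inf)
    scale = eps * (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
                   + np.linalg.norm(b, np.inf)) * n
    return resid / scale < threshold
```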

<HR NOSHADE>
<CENTER>
<A HREF = "index.html">            [Home]</A>
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html">        [Algorithm]</A>
<A HREF = "scalability.html">      [Scalability]</A>
<A HREF = "results.html">          [Performance Results]</A>
<A HREF = "documentation.html">    [Documentation]</A>
<A HREF = "software.html">         [Software]</A>
<A HREF = "faqs.html">             [FAQs]</A>
<A HREF = "tuning.html">           [Tuning]</A>
<A HREF = "errata.html">           [Errata-Bugs]</A>
<A HREF = "references.html">       [References]</A>
<A HREF = "links.html">            [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>