Statistiques
| Révision :

root / www / algorithm.html

Historique | Voir | Annoter | Télécharger (13,99 ko)

1 1 equemene
<HTML>
2 1 equemene
<HEAD>
3 1 equemene
<TITLE>HPL Algorithm</TITLE>
4 1 equemene
</HEAD>
5 1 equemene
6 1 equemene
<BODY
7 1 equemene
BGCOLOR     = "WHITE"
8 1 equemene
BACKGROUND  = "WHITE"
9 1 equemene
TEXT        = "#000000"
10 1 equemene
VLINK       = "#000099"
11 1 equemene
ALINK       = "#947153"
12 1 equemene
LINK        = "#0000ff">
13 1 equemene
14 1 equemene
<H2>HPL Algorithm</H2>
15 1 equemene
16 1 equemene
<STRONG>
17 1 equemene
This  page provides  a high-level description of the algorithm used in
18 1 equemene
this package. As indicated below,  HPL  contains in fact many possible
19 1 equemene
variants for various operations.  Defaults could have been chosen,  or
20 1 equemene
even  variants  could  be selected  during  the execution.  Due to the
21 1 equemene
performance requirements,  it was  decided  to leave the user with the
22 1 equemene
opportunity of choosing,  so that an "optimal" set of parameters could
23 1 equemene
easily be experimentally determined for a given machine configuration.
24 1 equemene
From a numerical accuracy point of view, <STRONG>all</STRONG> possible
25 1 equemene
combinations are rigorously equivalent  to each other  even though the
26 1 equemene
result may slightly differ (bit-wise).
27 1 equemene
</STRONG><BR><BR>
28 1 equemene
29 1 equemene
<UL>
30 1 equemene
<LI><A HREF="algorithm.html#main">Main Algorithm</A>
31 1 equemene
<LI><A HREF="algorithm.html#pfact">Panel Factorization</A>
32 1 equemene
<LI><A HREF="algorithm.html#bcast">Panel Broadcast</A>
33 1 equemene
<LI><A HREF="algorithm.html#look_ahead">Look-ahead</A>
34 1 equemene
<LI><A HREF="algorithm.html#update">Update</A>
35 1 equemene
<LI><A HREF="algorithm.html#trsv">Backward Substitution</A>
36 1 equemene
<LI><A HREF="algorithm.html#check">Checking the Solution</A>
37 1 equemene
</UL>
38 1 equemene
<HR NOSHADE
39 1 equemene
40 1 equemene
<H3<A ="main">Main Algorithm</A></H3>
41 1 equemene
42 1 equemene
This  software  package  solves  a linear system  of order n:  A x = b by
43 1 equemene
first  computing  the  LU  factorization with row partial pivoting of the
44 1 equemene
n-by-n+1 coefficient matrix [A b] = [[L,U] y]. Since the lower triangular
45 1 equemene
factor L is applied to b as the factorization progresses, the solution  x
46 1 equemene
is obtained  by  solving  the upper triangular system U x = y.  The lower
47 1 equemene
triangular  matrix  L  is left unpivoted  and  the array of pivots is not
48 1 equemene
returned.<BR><BR>
49 1 equemene
50 1 equemene
<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
51 1 equemene
<TR>
52 1 equemene
<TD ALIGN=LEFT>
53 1 equemene
The  data  is distributed onto a two-dimensional P-by-Q grid of processes
54 1 equemene
according  to  the  block-cyclic  scheme  to ensure  "good"  load balance
55 1 equemene
as well as  the scalability  of the algorithm.  The  n-by-n+1 coefficient
56 1 equemene
matrix is  first  logically partitioned into  nb-by-nb  blocks,  that are
57 1 equemene
cyclically "dealt" onto the  P-by-Q  process grid.  This is done  in both
58 1 equemene
dimensions of the matrix.</TD>
59 1 equemene
<TD ALIGN=CENTER><IMG SRC = "mat2.jpg" BORDER=0 HEIGHT=165 WIDTH=340></TD>
60 1 equemene
</TR>
61 1 equemene
</TABLE>
62 1 equemene
<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
63 1 equemene
<TR>
64 1 equemene
<TD ALIGN=CENTER><IMG SRC ="main.jpg" BORDER=0 HEIGHT=165 WIDTH=165></TD>
65 1 equemene
<TD ALIGN=LEFT>
66 1 equemene
The  right-looking  variant  has been chosen for the main loop of the  LU
67 1 equemene
factorization.  This  means that at each iteration of the loop a panel of
68 1 equemene
nb columns is factorized,  and  the  trailing submatrix is updated.  Note
69 1 equemene
that this computation is  thus  logically partitioned with the same block
70 1 equemene
size nb that was used for the data distribution.</TD>
71 1 equemene
</TR>
72 1 equemene
</TABLE>
73 1 equemene
<HR NOSHADE
74 1 equemene
75 1 equemene
<H3<A ="pfact">Panel Factorization</A></H3>
76 1 equemene
77 1 equemene
<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=10>
78 1 equemene
<TR>
79 1 equemene
<TD ALIGN=LEFT>
80 1 equemene
At  a given iteration  of the main loop,  and  because of  the  cartesian
81 1 equemene
property of the distribution scheme,  each panel factorization  occurs in
82 1 equemene
one column of processes.   This  particular part of the computation  lies
83 1 equemene
on the critical path of  the overall algorithm.  The user is  offered the
84 1 equemene
choice of three  (Crout, left- and right-looking)  matrix-multiply  based
85 1 equemene
recursive variants. The software also allows the user  to choose  in  how
86 1 equemene
many  sub-panels  the current panel  should be divided  into  during  the
87 1 equemene
recursion.  Furthermore,  one  can also  select at run-time the recursion
88 1 equemene
stopping criterium in terms of the number  of  columns left to factorize.
89 1 equemene
When this  threshold is reached,  the sub-panel will  then be  factorized
90 1 equemene
using one of the three Crout, left- or right-looking matrix-vector  based
91 1 equemene
variant.  Finally, for each panel column the pivot search, the associated
92 1 equemene
swap  and broadcast  operation  of  the pivot row  are combined  into one
93 1 equemene
single communication step.  A   binary-exchange  (leave-on-all) reduction
94 1 equemene
performs these three operations at once.</TD>
95 1 equemene
<TD ALIGN=CENTER><IMG SRC = "pfact.jpg" BORDER=0 HEIGHT=300 WIDTH=160></TD>
96 1 equemene
</TR>
97 1 equemene
</TABLE>
98 1 equemene
<HR NOSHADE
99 1 equemene
100 1 equemene
<H3<A ="bcast">Panel Broadcast</A></H3>
101 1 equemene
102 1 equemene
Once  the panel factorization has been computed,  this  panel  of columns
103 1 equemene
is  broadcast  to the other process columns.   There  are  many  possible
104 1 equemene
broadcast  algorithms  and  the  software currently offers  6 variants to
105 1 equemene
choose from.  These variants are described below assuming  that process 0
106 1 equemene
is the source of the broadcast for convenience. "->" means "sends to".
107 1 equemene
<UL>
108 1 equemene
<LI><STRONG>Increasing-ring</STRONG>:  0 -> 1;  1 -> 2; 2 -> 3 and so on.
109 1 equemene
This algorithm is the classic one;  it has  the caveat that process 1 has
110 1 equemene
to send a message.
111 1 equemene
<CENTER>
112 1 equemene
<IMG SRC="1ring.jpg">
113 1 equemene
</CENTER>
114 1 equemene
115 1 equemene
<LI><STRONG>Increasing-ring (modified)</STRONG>:  0 -> 1;  0 -> 2; 2 -> 3
116 1 equemene
and so on. Process 0 sends two messages and process 1  only  receives one
117 1 equemene
message. This algorithm is almost always better, if not the best.
118 1 equemene
<CENTER>
119 1 equemene
<IMG SRC="1rinM.jpg">
120 1 equemene
</CENTER>
121 1 equemene
122 1 equemene
<LI><STRONG>Increasing-2-ring</STRONG>:  The Q processes are divided into
123 1 equemene
two parts: 0 -> 1 and 0 -> Q/2;  Then processes 1  and Q/2 act as sources
124 1 equemene
of two rings: 1 -> 2, Q/2 -> Q/2+1;  2 -> 3, Q/2+1 -> to Q/2+2 and so on.
125 1 equemene
This  algorithm has the advantage  of reducing the time by which the last
126 1 equemene
process  will  receive  the  panel  at  the  cost  of process 0 sending 2
127 1 equemene
messages.
128 1 equemene
<CENTER>
129 1 equemene
<IMG SRC="2ring.jpg">
130 1 equemene
</CENTER>
131 1 equemene
132 1 equemene
<LI><STRONG>Increasing-2-ring (modified)</STRONG>:  As  one  may  expect,
133 1 equemene
first 0 -> 1,  then  the  Q-1  processes  left are divided into two equal
134 1 equemene
parts: 0 -> 2 and 0 -> Q/2;  Processes  2 and Q/2  act then as sources of
135 1 equemene
two rings:  2 -> 3,  Q/2 -> Q/2+1; 3 -> 4,  Q/2+1 -> to Q/2+2  and so on.
136 1 equemene
This algorithm is probably  the most serious competitor to the increasing
137 1 equemene
ring modified variant.
138 1 equemene
<CENTER>
139 1 equemene
<IMG SRC="2rinM.jpg">
140 1 equemene
</CENTER>
141 1 equemene
142 1 equemene
<LI><STRONG>Long  (bandwidth  reducing)</STRONG>:  as   opposed   to  the
143 1 equemene
previous  variants,  this  algorithm  and  its follower  synchronize  all
144 1 equemene
processes involved in the operation. The message is chopped into  Q equal
145 1 equemene
pieces that are scattered  across the Q processes.
146 1 equemene
<CENTER>
147 1 equemene
<IMG SRC="spread.jpg">
148 1 equemene
</CENTER>
149 1 equemene
The pieces are then rolled in Q-1 steps.  The scatter phase uses a binary
150 1 equemene
tree and the rolling phase exclusively uses mutual message exchanges.  In
151 1 equemene
odd steps 0 <-> 1,  2 <-> 3, 4 <-> 5 and so on;  in even steps Q-1 <-> 0,
152 1 equemene
1 <-> 2, 3 <-> 4, 5 <-> 6 and so on.
153 1 equemene
<CENTER>
154 1 equemene
<IMG SRC="roll.jpg">
155 1 equemene
</CENTER>
156 1 equemene
More messages are exchanged, however the total volume of communication is
157 1 equemene
independent of Q, making this algorithm  particularly suitable for  large
158 1 equemene
messages.  This algorithm  becomes  competitive  when the nodes are "very
159 1 equemene
fast" and the network (comparatively) "very slow".<BR><BR>
160 1 equemene
161 1 equemene
<LI><STRONG>Long (bandwidth reducing modified)</STRONG>:  same  as above,
162 1 equemene
except that 0 -> 1 first,  and then the Long variant is used on processes
163 1 equemene
0,2,3,4 .. Q-1.<BR><BR>
164 1 equemene
<CENTER>
165 1 equemene
<IMG SRC="spreadM.jpg">
166 1 equemene
<IMG SRC="rollM.jpg">
167 1 equemene
</CENTER>
168 1 equemene
169 1 equemene
</UL>
170 1 equemene
171 1 equemene
The rings variants are distinguished by a probe mechanism  that activates
172 1 equemene
them.  In other words,  a process involved in the broadcast and different
173 1 equemene
from  the source asynchronously  probes for the message to receive.  When
174 1 equemene
the  message  is  available  the broadcast proceeds,  and  otherwise  the
175 1 equemene
function returns.  This allows to interleave the broadcast operation with
176 1 equemene
the update phase. This contributes to reduce the idle time spent by those
177 1 equemene
processes waiting for the factorized panel.  This  mechanism is necessary
178 1 equemene
to accomodate for various computation/communication performance ratio.<BR><BR>
179 1 equemene
<HR NOSHADE
180 1 equemene
181 1 equemene
<H3<A ="look_ahead">Look-ahead</A></H3>
182 1 equemene
183 1 equemene
Once the panel has been broadcast or say during this broadcast operation,
184 1 equemene
the trailing submatrix is updated  using the last panel in the look-ahead
185 1 equemene
pipe: as mentioned before,  the panel factorization  lies on the critical
186 1 equemene
path,  which  means  that when the kth panel has been factorized and then
187 1 equemene
broadcast, the next most urgent task to complete is the factorization and
188 1 equemene
broadcast of the k+1 th panel.  This technique  is  often  refered  to as
189 1 equemene
"look-ahead" or "send-ahead" in the literature.  This  package  allows to
190 1 equemene
select various "depth" of look-ahead.  By  convention,  a  depth  of zero
191 1 equemene
corresponds to no lookahead,  in which case  the  trailing  submatrix  is
192 1 equemene
updated by the panel currently broadcast.  Look-ahead consumes some extra
193 1 equemene
memory  to  essentially  keep  all the panels of columns currently in the
194 1 equemene
look-ahead pipe.  A look-ahead  of depth 1 (maybe 2) is likely to achieve
195 1 equemene
the best performance gain.<BR><BR>
196 1 equemene
<HR NOSHADE
197 1 equemene
198 1 equemene
<H3<A ="update">Update</A></H3>
199 1 equemene
200 1 equemene
The update of the trailing submatrix by the last panel in the  look-ahead
201 1 equemene
pipe is made of two phases. First, the pivots must be applied to form the
202 1 equemene
current row panel U. U should then be solved by the upper triangle of the
203 1 equemene
column panel. U finally needs to be broadcast to each process row so that
204 1 equemene
the  local  rank-nb  update  can take place.  We choose  to  combine  the
205 1 equemene
swapping and broadcast of  U  at the cost of  replicating the solve.  Two
206 1 equemene
algorithms are available for this communication operation.
207 1 equemene
<UL>
208 1 equemene
<LI><STRONG>Binary-exchange</STRONG>:  this is a modified variant  of the
209 1 equemene
binary-exchange (leave on all) reduction operation.  Every process column
210 1 equemene
performs the same operation.  The algorithm essentially works as follows.
211 1 equemene
It pretends reducing the row panel U, but at the beginning the only valid
212 1 equemene
copy is owned by the current process row.  The  other process  rows  will
213 1 equemene
contribute rows of A they own that should be copied in U and replace them
214 1 equemene
with rows that were originally in the current process row.  The  complete
215 1 equemene
operation is performed in  log(P) steps.  For the sake of simplicity, let
216 1 equemene
assume that  P  is a power of two.  At step k,  process row p exchanges a
217 1 equemene
message with process row p+2^k.  There are  essentially two cases. First,
218 1 equemene
one of those two process rows  has received  U  in  a previous step.  The
219 1 equemene
exchange occurs.  One process  swaps  its  local rows of  A into U.  Both
220 1 equemene
processes copy in  U remote rows of A. Second, none of those process rows
221 1 equemene
has received U,  the exchange occurs, and both processes simply add those
222 1 equemene
remote rows  to  the list  they have accumulated so far.  At each step, a
223 1 equemene
message  of  the size of  U  is exchanged by at least one pair of process
224 1 equemene
rows.<BR><BR>
225 1 equemene
226 1 equemene
<LI><STRONG>Long</STRONG>:   this  is   a   bandwidth   reducing  variant
227 1 equemene
accomplishing the same task. The row panel is first spread (using a tree)
228 1 equemene
among the process rows with respect to the pivot array. This is a scatter
229 1 equemene
(V variant for MPI users).  Locally,  every process row  then swaps these
230 1 equemene
rows with the the rows of A it owns and that belong to U.  These  buffers
231 1 equemene
are then rolled  (P-1 steps) to finish the broadcast of U.  Every process
232 1 equemene
row permutes U and proceed  with the computational part of the update.  A
233 1 equemene
couple  of  notes:   process  rows  are  logarithmically   sorted  before
234 1 equemene
spreading,  so  that  processes  receiving the largest number of rows are
235 1 equemene
first in the tree.  This makes  the communication volume optimal for this
236 1 equemene
phase. Finally, before rolling and after the local swap, an equilibration
237 1 equemene
phase occurs during  which the local pieces of  U  are  uniformly  spread
238 1 equemene
across  the process rows.  A tree-based algorithm is used. This operation
239 1 equemene
is necessary to keep the rolling phase optimal  even  when the pivot rows
240 1 equemene
are  not  equally distributed  in  process rows.  This  algorithm  has  a
241 1 equemene
complexity  in  terms  of communication volume that solely depends on the
242 1 equemene
size of U.  In particular,  the number of process rows  only  impacts the
243 1 equemene
number of messages exchanged.  It  will  thus  outperforms  the  previous
244 1 equemene
variant for large problems on large machine configurations.<BR><BR>
245 1 equemene
246 1 equemene
</UL>
247 1 equemene
248 1 equemene
The user can select any of the two variants above.  In addition, a mix is
249 1 equemene
possible as well.  The  "binary-exchange"  algorithm will be used when  U
250 1 equemene
contains at most a certain number of columns. Choosing at least the block
251 1 equemene
size  nb as the threshold value is clearly recommended when look-ahead is
252 1 equemene
on.<BR><BR>
253 1 equemene
<HR NOSHADE
254 1 equemene
255 1 equemene
<H3<A ="trsv">Backward Substitution</A></H3>
256 1 equemene
257 1 equemene
The factorization has just now ended, the back-substitution remains to be
258 1 equemene
done.  For this,  we  choose  a look-ahead  of  depth  one  variant.  The
259 1 equemene
right-hand-side  is  forwarded  in  process  rows  in  a  decreasing-ring
260 1 equemene
fashion,  so that  we solve Q * nb entries at a time.  At each step, this
261 1 equemene
shrinking piece of the right-hand-side is updated. The process just above
262 1 equemene
the one owning the current diagonal block of the matrix  A  updates first
263 1 equemene
its last nb piece of x,  forwards it to the previous process column, then
264 1 equemene
broadcast  it in the process column in a decreasing-ring fashion as well.
265 1 equemene
The solution is then updated and sent to the previous process column. The
266 1 equemene
solution of the linear system is left replicated in every process row.<BR><BR>
267 1 equemene
<HR NOSHADE
268 1 equemene
269 1 equemene
<H3<A ="check">Checking the Solution</A></H3>
270 1 equemene
271 1 equemene
To verify the result obtained,  the input matrix  and right-hand side are
272 1 equemene
regenerated.  The  normwise  backward  error  (see formula below) is then
273 1 equemene
computed.  A solution  is  considered  as "numerically correct" when this
274 1 equemene
quantity  is  less  than  a  threshold  value of the order of 1.0. In the
275 1 equemene
expression   below,  eps  is  the  relative  (distributed-memory)  machine
276 1 equemene
precision.
277 1 equemene
278 1 equemene
<UL>
279 1 equemene
<LI>|| Ax - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )
280 1 equemene
</UL>
281 1 equemene
282 1 equemene
<HR NOSHADE
283 1 equemene
<CENTER
284 1 equemene
<A  = "index.html">            [Home]</A>
285 1 equemene
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
286 1 equemene
<A HREF = "algorithm.html">        [Algorithm]</A>
287 1 equemene
<A HREF = "scalability.html">      [Scalability]</A>
288 1 equemene
<A HREF = "results.html">          [Performance Results]</A>
289 1 equemene
<A HREF = "documentation.html">    [Documentation]</A>
290 1 equemene
<A HREF = "software.html">         [Software]</A>
291 1 equemene
<A HREF = "faqs.html">             [FAQs]</A>
292 1 equemene
<A HREF = "tuning.html">           [Tuning]</A>
293 1 equemene
<A HREF = "errata.html">           [Errata-Bugs]</A>
294 1 equemene
<A HREF = "references.html">       [References]</A>
295 1 equemene
<A HREF = "links.html">            [Related Links]</A><BR>
296 1 equemene
</CENTER>
297 1 equemene
<HR NOSHADE
298 1 equemene
</BODY
299 1 equemene
</HTML