/www/algorithm.html - Annoter - HPL sur GPU - Forge du Centre Blaise Pascal

1

equemene

<HTML>

2

1

equemene

<HEAD>

3

1

equemene

<TITLE>HPL Algorithm</TITLE>

4

1

equemene

</HEAD>

5

1

equemene

6

1

equemene

<BODY

7

1

equemene

BGCOLOR     = "WHITE"

8

1

equemene

BACKGROUND  = "WHITE"

9

1

equemene

TEXT        = "#000000"

10

1

equemene

VLINK       = "#000099"

11

1

equemene

ALINK       = "#947153"

12

1

equemene

LINK        = "#0000ff">

13

1

equemene

14

1

equemene

<H2>HPL Algorithm</H2>

15

1

equemene

16

1

equemene

<STRONG>

17

1

equemene

This  page provides  a high-level description of the algorithm used in

18

1

equemene

this package. As indicated below,  HPL  contains in fact many possible

19

1

equemene

variants for various operations.  Defaults could have been chosen,  or

20

1

equemene

even  variants  could  be selected  during  the execution.  Due to the

21

1

equemene

performance requirements,  it was  decided  to leave the user with the

22

1

equemene

opportunity of choosing,  so that an "optimal" set of parameters could

23

1

equemene

easily be experimentally determined for a given machine configuration.

24

1

equemene

From a numerical accuracy point of view, <STRONG>all</STRONG> possible

25

1

equemene

combinations are rigorously equivalent  to each other  even though the

26

1

equemene

result may slightly differ (bit-wise).

27

1

equemene

</STRONG><BR><BR>

28

1

equemene

29

1

equemene

<UL>

30

1

equemene

<LI><A HREF="algorithm.html#main">Main Algorithm</A>

31

1

equemene

<LI><A HREF="algorithm.html#pfact">Panel Factorization</A>

32

1

equemene

<LI><A HREF="algorithm.html#bcast">Panel Broadcast</A>

33

1

equemene

<LI><A HREF="algorithm.html#look_ahead">Look-ahead</A>

34

1

equemene

<LI><A HREF="algorithm.html#update">Update</A>

35

1

equemene

<LI><A HREF="algorithm.html#trsv">Backward Substitution</A>

36

1

equemene

<LI><A HREF="algorithm.html#check">Checking the Solution</A>

37

1

equemene

</UL>

38

1

equemene

<HR NOSHADE

39

1

equemene

40

1

equemene

<H3<A ="main">Main Algorithm</A></H3>

41

1

equemene

42

1

equemene

This  software  package  solves  a linear system  of order n:  A x = b by

43

1

equemene

first  computing  the  LU  factorization with row partial pivoting of the

44

1

equemene

n-by-n+1 coefficient matrix [A b] = [[L,U] y]. Since the lower triangular

45

1

equemene

factor L is applied to b as the factorization progresses, the solution  x

46

1

equemene

is obtained  by  solving  the upper triangular system U x = y.  The lower

47

1

equemene

triangular  matrix  L  is left unpivoted  and  the array of pivots is not

48

1

equemene

returned.<BR><BR>

49

1

equemene

50

1

equemene

<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>

51

1

equemene

<TR>

52

1

equemene

<TD ALIGN=LEFT>

53

1

equemene

The  data  is distributed onto a two-dimensional P-by-Q grid of processes

54

1

equemene

according  to  the  block-cyclic  scheme  to ensure  "good"  load balance

55

1

equemene

as well as  the scalability  of the algorithm.  The  n-by-n+1 coefficient

56

1

equemene

matrix is  first  logically partitioned into  nb-by-nb  blocks,  that are

57

1

equemene

cyclically "dealt" onto the  P-by-Q  process grid.  This is done  in both

58

1

equemene

dimensions of the matrix.</TD>

59

1

equemene

<TD ALIGN=CENTER><IMG SRC = "mat2.jpg" BORDER=0 HEIGHT=165 WIDTH=340></TD>

60

1

equemene

</TR>

61

1

equemene

</TABLE>

62

1

equemene

<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>

63

1

equemene

<TR>

64

1

equemene

<TD ALIGN=CENTER><IMG SRC ="main.jpg" BORDER=0 HEIGHT=165 WIDTH=165></TD>

65

1

equemene

<TD ALIGN=LEFT>

66

1

equemene

The  right-looking  variant  has been chosen for the main loop of the  LU

67

1

equemene

factorization.  This  means that at each iteration of the loop a panel of

68

1

equemene

nb columns is factorized,  and  the  trailing submatrix is updated.  Note

69

1

equemene

that this computation is  thus  logically partitioned with the same block

70

1

equemene

size nb that was used for the data distribution.</TD>

71

1

equemene

</TR>

72

1

equemene

</TABLE>

73

1

equemene

<HR NOSHADE

74

1

equemene

75

1

equemene

<H3<A ="pfact">Panel Factorization</A></H3>

76

1

equemene

77

1

equemene

<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=10>

78

1

equemene

<TR>

79

1

equemene

<TD ALIGN=LEFT>

80

1

equemene

At  a given iteration  of the main loop,  and  because of  the  cartesian

81

1

equemene

property of the distribution scheme,  each panel factorization  occurs in

82

1

equemene

one column of processes.   This  particular part of the computation  lies

83

1

equemene

on the critical path of  the overall algorithm.  The user is  offered the

84

1

equemene

choice of three  (Crout, left- and right-looking)  matrix-multiply  based

85

1

equemene

recursive variants. The software also allows the user  to choose  in  how

86

1

equemene

many  sub-panels  the current panel  should be divided  into  during  the

87

1

equemene

recursion.  Furthermore,  one  can also  select at run-time the recursion

88

1

equemene

stopping criterium in terms of the number  of  columns left to factorize.

89

1

equemene

When this  threshold is reached,  the sub-panel will  then be  factorized

90

1

equemene

using one of the three Crout, left- or right-looking matrix-vector  based

91

1

equemene

variant.  Finally, for each panel column the pivot search, the associated

92

1

equemene

swap  and broadcast  operation  of  the pivot row  are combined  into one

93

1

equemene

single communication step.  A   binary-exchange  (leave-on-all) reduction

94

1

equemene

performs these three operations at once.</TD>

95

1

equemene

<TD ALIGN=CENTER><IMG SRC = "pfact.jpg" BORDER=0 HEIGHT=300 WIDTH=160></TD>

96

1

equemene

</TR>

97

1

equemene

</TABLE>

98

1

equemene

<HR NOSHADE

99

1

equemene

100

1

equemene

<H3<A ="bcast">Panel Broadcast</A></H3>

101

1

equemene

102

1

equemene

Once  the panel factorization has been computed,  this  panel  of columns

103

1

equemene

is  broadcast  to the other process columns.   There  are  many  possible

104

1

equemene

broadcast  algorithms  and  the  software currently offers  6 variants to

105

1

equemene

choose from.  These variants are described below assuming  that process 0

106

1

equemene

is the source of the broadcast for convenience. "->" means "sends to".

107

1

equemene

<UL>

108

1

equemene

<LI><STRONG>Increasing-ring</STRONG>:  0 -> 1;  1 -> 2; 2 -> 3 and so on.

109

1

equemene

This algorithm is the classic one;  it has  the caveat that process 1 has

110

1

equemene

to send a message.

111

1

equemene

<CENTER>

112

1

equemene

<IMG SRC="1ring.jpg">

113

1

equemene

</CENTER>

114

1

equemene

115

1

equemene

<LI><STRONG>Increasing-ring (modified)</STRONG>:  0 -> 1;  0 -> 2; 2 -> 3

116

1

equemene

and so on. Process 0 sends two messages and process 1  only  receives one

117

1

equemene

message. This algorithm is almost always better, if not the best.

118

1

equemene

<CENTER>

119

1

equemene

<IMG SRC="1rinM.jpg">

120

1

equemene

</CENTER>

121

1

equemene

122

1

equemene

<LI><STRONG>Increasing-2-ring</STRONG>:  The Q processes are divided into

123

1

equemene

two parts: 0 -> 1 and 0 -> Q/2;  Then processes 1  and Q/2 act as sources

124

1

equemene

of two rings: 1 -> 2, Q/2 -> Q/2+1;  2 -> 3, Q/2+1 -> to Q/2+2 and so on.

125

1

equemene

This  algorithm has the advantage  of reducing the time by which the last

126

1

equemene

process  will  receive  the  panel  at  the  cost  of process 0 sending 2

127

1

equemene

messages.

128

1

equemene

<CENTER>

129

1

equemene

<IMG SRC="2ring.jpg">

130

1

equemene

</CENTER>

131

1

equemene

132

1

equemene

<LI><STRONG>Increasing-2-ring (modified)</STRONG>:  As  one  may  expect,

133

1

equemene

first 0 -> 1,  then  the  Q-1  processes  left are divided into two equal

134

1

equemene

parts: 0 -> 2 and 0 -> Q/2;  Processes  2 and Q/2  act then as sources of

135

1

equemene

two rings:  2 -> 3,  Q/2 -> Q/2+1; 3 -> 4,  Q/2+1 -> to Q/2+2  and so on.

136

1

equemene

This algorithm is probably  the most serious competitor to the increasing

137

1

equemene

ring modified variant.

138

1

equemene

<CENTER>

139

1

equemene

<IMG SRC="2rinM.jpg">

140

1

equemene

</CENTER>

141

1

equemene

142

1

equemene

<LI><STRONG>Long  (bandwidth  reducing)</STRONG>:  as   opposed   to  the

143

1

equemene

previous  variants,  this  algorithm  and  its follower  synchronize  all

144

1

equemene

processes involved in the operation. The message is chopped into  Q equal

145

1

equemene

pieces that are scattered  across the Q processes.

146

1

equemene

<CENTER>

147

1

equemene

<IMG SRC="spread.jpg">

148

1

equemene

</CENTER>

149

1

equemene

The pieces are then rolled in Q-1 steps.  The scatter phase uses a binary

150

1

equemene

tree and the rolling phase exclusively uses mutual message exchanges.  In

151

1

equemene

odd steps 0 <-> 1,  2 <-> 3, 4 <-> 5 and so on;  in even steps Q-1 <-> 0,

152

1

equemene

1 <-> 2, 3 <-> 4, 5 <-> 6 and so on.

153

1

equemene

<CENTER>

154

1

equemene

<IMG SRC="roll.jpg">

155

1

equemene

</CENTER>

156

1

equemene

More messages are exchanged, however the total volume of communication is

157

1

equemene

independent of Q, making this algorithm  particularly suitable for  large

158

1

equemene

messages.  This algorithm  becomes  competitive  when the nodes are "very

159

1

equemene

fast" and the network (comparatively) "very slow".<BR><BR>

160

1

equemene

161

1

equemene

<LI><STRONG>Long (bandwidth reducing modified)</STRONG>:  same  as above,

162

1

equemene

except that 0 -> 1 first,  and then the Long variant is used on processes

163

1

equemene

0,2,3,4 .. Q-1.<BR><BR>

164

1

equemene

<CENTER>

165

1

equemene

<IMG SRC="spreadM.jpg">

166

1

equemene

<IMG SRC="rollM.jpg">

167

1

equemene

</CENTER>

168

1

equemene

169

1

equemene

</UL>

170

1

equemene

171

1

equemene

The rings variants are distinguished by a probe mechanism  that activates

172

1

equemene

them.  In other words,  a process involved in the broadcast and different

173

1

equemene

from  the source asynchronously  probes for the message to receive.  When

174

1

equemene

the  message  is  available  the broadcast proceeds,  and  otherwise  the

175

1

equemene

function returns.  This allows to interleave the broadcast operation with

176

1

equemene

the update phase. This contributes to reduce the idle time spent by those

177

1

equemene

processes waiting for the factorized panel.  This  mechanism is necessary

178

1

equemene

to accomodate for various computation/communication performance ratio.<BR><BR>

179

1

equemene

<HR NOSHADE

180

1

equemene

181

1

equemene

<H3<A ="look_ahead">Look-ahead</A></H3>

182

1

equemene

183

1

equemene

Once the panel has been broadcast or say during this broadcast operation,

184

1

equemene

the trailing submatrix is updated  using the last panel in the look-ahead

185

1

equemene

pipe: as mentioned before,  the panel factorization  lies on the critical

186

1

equemene

path,  which  means  that when the kth panel has been factorized and then

187

1

equemene

broadcast, the next most urgent task to complete is the factorization and

188

1

equemene

broadcast of the k+1 th panel.  This technique  is  often  refered  to as

189

1

equemene

"look-ahead" or "send-ahead" in the literature.  This  package  allows to

190

1

equemene

select various "depth" of look-ahead.  By  convention,  a  depth  of zero

191

1

equemene

corresponds to no lookahead,  in which case  the  trailing  submatrix  is

192

1

equemene

updated by the panel currently broadcast.  Look-ahead consumes some extra

193

1

equemene

memory  to  essentially  keep  all the panels of columns currently in the

194

1

equemene

look-ahead pipe.  A look-ahead  of depth 1 (maybe 2) is likely to achieve

195

1

equemene

the best performance gain.<BR><BR>

196

1

equemene

<HR NOSHADE

197

1

equemene

198

1

equemene

<H3<A ="update">Update</A></H3>

199

1

equemene

200

1

equemene

The update of the trailing submatrix by the last panel in the  look-ahead

201

1

equemene

pipe is made of two phases. First, the pivots must be applied to form the

202

1

equemene

current row panel U. U should then be solved by the upper triangle of the

203

1

equemene

column panel. U finally needs to be broadcast to each process row so that

204

1

equemene

the  local  rank-nb  update  can take place.  We choose  to  combine  the

205

1

equemene

swapping and broadcast of  U  at the cost of  replicating the solve.  Two

206

1

equemene

algorithms are available for this communication operation.

207

1

equemene

<UL>

208

1

equemene

<LI><STRONG>Binary-exchange</STRONG>:  this is a modified variant  of the

209

1

equemene

binary-exchange (leave on all) reduction operation.  Every process column

210

1

equemene

performs the same operation.  The algorithm essentially works as follows.

211

1

equemene

It pretends reducing the row panel U, but at the beginning the only valid

212

1

equemene

copy is owned by the current process row.  The  other process  rows  will

213

1

equemene

contribute rows of A they own that should be copied in U and replace them

214

1

equemene

with rows that were originally in the current process row.  The  complete

215

1

equemene

operation is performed in  log(P) steps.  For the sake of simplicity, let

216

1

equemene

assume that  P  is a power of two.  At step k,  process row p exchanges a

217

1

equemene

message with process row p+2^k.  There are  essentially two cases. First,

218

1

equemene

one of those two process rows  has received  U  in  a previous step.  The

219

1

equemene

exchange occurs.  One process  swaps  its  local rows of  A into U.  Both

220

1

equemene

processes copy in  U remote rows of A. Second, none of those process rows

221

1

equemene

has received U,  the exchange occurs, and both processes simply add those

222

1

equemene

remote rows  to  the list  they have accumulated so far.  At each step, a

223

1

equemene

message  of  the size of  U  is exchanged by at least one pair of process

224

1

equemene

rows.<BR><BR>

225

1

equemene

226

1

equemene

<LI><STRONG>Long</STRONG>:   this  is   a   bandwidth   reducing  variant

227

1

equemene

accomplishing the same task. The row panel is first spread (using a tree)

228

1

equemene

among the process rows with respect to the pivot array. This is a scatter

229

1

equemene

(V variant for MPI users).  Locally,  every process row  then swaps these

230

1

equemene

rows with the the rows of A it owns and that belong to U.  These  buffers

231

1

equemene

are then rolled  (P-1 steps) to finish the broadcast of U.  Every process

232

1

equemene

row permutes U and proceed  with the computational part of the update.  A

233

1

equemene

couple  of  notes:   process  rows  are  logarithmically   sorted  before

234

1

equemene

spreading,  so  that  processes  receiving the largest number of rows are

235

1

equemene

first in the tree.  This makes  the communication volume optimal for this

236

1

equemene

phase. Finally, before rolling and after the local swap, an equilibration

237

1

equemene

phase occurs during  which the local pieces of  U  are  uniformly  spread

238

1

equemene

across  the process rows.  A tree-based algorithm is used. This operation

239

1

equemene

is necessary to keep the rolling phase optimal  even  when the pivot rows

240

1

equemene

are  not  equally distributed  in  process rows.  This  algorithm  has  a

241

1

equemene

complexity  in  terms  of communication volume that solely depends on the

242

1

equemene

size of U.  In particular,  the number of process rows  only  impacts the

243

1

equemene

number of messages exchanged.  It  will  thus  outperforms  the  previous

244

1

equemene

variant for large problems on large machine configurations.<BR><BR>

245

1

equemene

246

1

equemene

</UL>

247

1

equemene

248

1

equemene

The user can select any of the two variants above.  In addition, a mix is

249

1

equemene

possible as well.  The  "binary-exchange"  algorithm will be used when  U

250

1

equemene

contains at most a certain number of columns. Choosing at least the block

251

1

equemene

size  nb as the threshold value is clearly recommended when look-ahead is

252

1

equemene

on.<BR><BR>

253

1

equemene

<HR NOSHADE

254

1

equemene

255

1

equemene

<H3<A ="trsv">Backward Substitution</A></H3>

256

1

equemene

257

1

equemene

The factorization has just now ended, the back-substitution remains to be

258

1

equemene

done.  For this,  we  choose  a look-ahead  of  depth  one  variant.  The

259

1

equemene

right-hand-side  is  forwarded  in  process  rows  in  a  decreasing-ring

260

1

equemene

fashion,  so that  we solve Q * nb entries at a time.  At each step, this

261

1

equemene

shrinking piece of the right-hand-side is updated. The process just above

262

1

equemene

the one owning the current diagonal block of the matrix  A  updates first

263

1

equemene

its last nb piece of x,  forwards it to the previous process column, then

264

1

equemene

broadcast  it in the process column in a decreasing-ring fashion as well.

265

1

equemene

The solution is then updated and sent to the previous process column. The

266

1

equemene

solution of the linear system is left replicated in every process row.<BR><BR>

267

1

equemene

<HR NOSHADE

268

1

equemene

269

1

equemene

<H3<A ="check">Checking the Solution</A></H3>

270

1

equemene

271

1

equemene

To verify the result obtained,  the input matrix  and right-hand side are

272

1

equemene

regenerated.  The  normwise  backward  error  (see formula below) is then

273

1

equemene

computed.  A solution  is  considered  as "numerically correct" when this

274

1

equemene

quantity  is  less  than  a  threshold  value of the order of 1.0. In the

275

1

equemene

expression   below,  eps  is  the  relative  (distributed-memory)  machine

276

1

equemene

precision.

277

1

equemene

278

1

equemene

<UL>

279

1

equemene

<LI>|| Ax - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )

280

1

equemene

</UL>

281

1

equemene

282

1

equemene

<HR NOSHADE

283

1

equemene

<CENTER

284

1

equemene

<A  = "index.html">            [Home]</A>

285

1

equemene

<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>

286

1

equemene

<A HREF = "algorithm.html">        [Algorithm]</A>

287

1

equemene

<A HREF = "scalability.html">      [Scalability]</A>

288

1

equemene

<A HREF = "results.html">          [Performance Results]</A>

289

1

equemene

<A HREF = "documentation.html">    [Documentation]</A>

290

1

equemene

<A HREF = "software.html">         [Software]</A>

291

1

equemene

<A HREF = "faqs.html">             [FAQs]</A>

292

1

equemene

<A HREF = "tuning.html">           [Tuning]</A>

293

1

equemene

<A HREF = "errata.html">           [Errata-Bugs]</A>

294

1

equemene

<A HREF = "references.html">       [References]</A>

295

1

equemene

<A HREF = "links.html">            [Related Links]</A><BR>

296

1

equemene

</CENTER>

297

1

equemene

<HR NOSHADE

298

1

equemene

</BODY

299

1

equemene

</HTML

Centre Blaise Pascal » HPL sur GPU

root / www / algorithm.html

-equemene
+<HTML>
-equemene
+<HEAD>
-equemene
+<TITLE>HPL Algorithm</TITLE>
-equemene
+</HEAD>
 equemene
-equemene
+<BODY
-equemene
+BGCOLOR     = "WHITE"
-equemene
+BACKGROUND  = "WHITE"
-equemene
+TEXT        = "#000000"
-equemene
+VLINK       = "#000099"
-equemene
+ALINK       = "#947153"
-equemene
+LINK        = "#0000ff">
 equemene
-equemene
+<H2>HPL Algorithm</H2>
 equemene
-equemene
+<STRONG>
-equemene
+This  page provides  a high-level description of the algorithm used in
-equemene
+this package. As indicated below,  HPL  contains in fact many possible
-equemene
+variants for various operations.  Defaults could have been chosen,  or
-equemene
+even  variants  could  be selected  during  the execution.  Due to the
-equemene
+performance requirements,  it was  decided  to leave the user with the
-equemene
+opportunity of choosing,  so that an "optimal" set of parameters could
-equemene
+easily be experimentally determined for a given machine configuration.
-equemene
+From a numerical accuracy point of view, <STRONG>all</STRONG> possible
-equemene
+combinations are rigorously equivalent  to each other  even though the
-equemene
+result may slightly differ (bit-wise).
-equemene
+</STRONG><BR><BR>
 equemene
-equemene
+<UL>
-equemene
+<LI><A HREF="algorithm.html#main">Main Algorithm</A>
-equemene
+<LI><A HREF="algorithm.html#pfact">Panel Factorization</A>
-equemene
+<LI><A HREF="algorithm.html#bcast">Panel Broadcast</A>
-equemene
+<LI><A HREF="algorithm.html#look_ahead">Look-ahead</A>
-equemene
+<LI><A HREF="algorithm.html#update">Update</A>
-equemene
+<LI><A HREF="algorithm.html#trsv">Backward Substitution</A>
-equemene
+<LI><A HREF="algorithm.html#check">Checking the Solution</A>
-equemene
+</UL>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="main">Main Algorithm</A></H3>
 equemene
-equemene
+This  software  package  solves  a linear system  of order n:  A x = b by
-equemene
+first  computing  the  LU  factorization with row partial pivoting of the
-equemene
+n-by-n+1 coefficient matrix [A b] = [[L,U] y]. Since the lower triangular
-equemene
+factor L is applied to b as the factorization progresses, the solution  x
-equemene
+is obtained  by  solving  the upper triangular system U x = y.  The lower
-equemene
+triangular  matrix  L  is left unpivoted  and  the array of pivots is not
-equemene
+returned.<BR><BR>
 equemene
-equemene
+<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
-equemene
+<TR>
-equemene
+<TD ALIGN=LEFT>
-equemene
+The  data  is distributed onto a two-dimensional P-by-Q grid of processes
-equemene
+according  to  the  block-cyclic  scheme  to ensure  "good"  load balance
-equemene
+as well as  the scalability  of the algorithm.  The  n-by-n+1 coefficient
-equemene
+matrix is  first  logically partitioned into  nb-by-nb  blocks,  that are
-equemene
+cyclically "dealt" onto the  P-by-Q  process grid.  This is done  in both
-equemene
+dimensions of the matrix.</TD>
-equemene
+<TD ALIGN=CENTER><IMG SRC = "mat2.jpg" BORDER=0 HEIGHT=165 WIDTH=340></TD>
-equemene
+</TR>
-equemene
+</TABLE>
-equemene
+<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=0>
-equemene
+<TR>
-equemene
+<TD ALIGN=CENTER><IMG SRC ="main.jpg" BORDER=0 HEIGHT=165 WIDTH=165></TD>
-equemene
+<TD ALIGN=LEFT>
-equemene
+The  right-looking  variant  has been chosen for the main loop of the  LU
-equemene
+factorization.  This  means that at each iteration of the loop a panel of
-equemene
+nb columns is factorized,  and  the  trailing submatrix is updated.  Note
-equemene
+that this computation is  thus  logically partitioned with the same block
-equemene
+size nb that was used for the data distribution.</TD>
-equemene
+</TR>
-equemene
+</TABLE>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="pfact">Panel Factorization</A></H3>
 equemene
-equemene
+<TABLE HSPACE=0 VSPACE=0 WIDTH=100% BORDER=0 CELLSPACING=1 CELLPADDING=10>
-equemene
+<TR>
-equemene
+<TD ALIGN=LEFT>
-equemene
+At  a given iteration  of the main loop,  and  because of  the  cartesian
-equemene
+property of the distribution scheme,  each panel factorization  occurs in
-equemene
+one column of processes.   This  particular part of the computation  lies
-equemene
+on the critical path of  the overall algorithm.  The user is  offered the
-equemene
+choice of three  (Crout, left- and right-looking)  matrix-multiply  based
-equemene
+recursive variants. The software also allows the user  to choose  in  how
-equemene
+many  sub-panels  the current panel  should be divided  into  during  the
-equemene
+recursion.  Furthermore,  one  can also  select at run-time the recursion
-equemene
+stopping criterium in terms of the number  of  columns left to factorize.
-equemene
+When this  threshold is reached,  the sub-panel will  then be  factorized
-equemene
+using one of the three Crout, left- or right-looking matrix-vector  based
-equemene
+variant.  Finally, for each panel column the pivot search, the associated
-equemene
+swap  and broadcast  operation  of  the pivot row  are combined  into one
-equemene
+single communication step.  A   binary-exchange  (leave-on-all) reduction
-equemene
+performs these three operations at once.</TD>
-equemene
+<TD ALIGN=CENTER><IMG SRC = "pfact.jpg" BORDER=0 HEIGHT=300 WIDTH=160></TD>
-equemene
+</TR>
-equemene
+</TABLE>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="bcast">Panel Broadcast</A></H3>
 equemene
-equemene
+Once  the panel factorization has been computed,  this  panel  of columns
-equemene
+is  broadcast  to the other process columns.   There  are  many  possible
-equemene
+broadcast  algorithms  and  the  software currently offers  6 variants to
-equemene
+choose from.  These variants are described below assuming  that process 0
-equemene
+is the source of the broadcast for convenience. "->" means "sends to".
-equemene
+<UL>
-equemene
+<LI><STRONG>Increasing-ring</STRONG>:  0 -> 1;  1 -> 2; 2 -> 3 and so on.
-equemene
+This algorithm is the classic one;  it has  the caveat that process 1 has
-equemene
+to send a message.
-equemene
+<CENTER>
-equemene
+<IMG SRC="1ring.jpg">
-equemene
+</CENTER>
 equemene
-equemene
+<LI><STRONG>Increasing-ring (modified)</STRONG>:  0 -> 1;  0 -> 2; 2 -> 3
-equemene
+and so on. Process 0 sends two messages and process 1  only  receives one
-equemene
+message. This algorithm is almost always better, if not the best.
-equemene
+<CENTER>
-equemene
+<IMG SRC="1rinM.jpg">
-equemene
+</CENTER>
 equemene
-equemene
+<LI><STRONG>Increasing-2-ring</STRONG>:  The Q processes are divided into
-equemene
+two parts: 0 -> 1 and 0 -> Q/2;  Then processes 1  and Q/2 act as sources
-equemene
+of two rings: 1 -> 2, Q/2 -> Q/2+1;  2 -> 3, Q/2+1 -> to Q/2+2 and so on.
-equemene
+This  algorithm has the advantage  of reducing the time by which the last
-equemene
+process  will  receive  the  panel  at  the  cost  of process 0 sending 2
-equemene
+messages.
-equemene
+<CENTER>
-equemene
+<IMG SRC="2ring.jpg">
-equemene
+</CENTER>
 equemene
-equemene
+<LI><STRONG>Increasing-2-ring (modified)</STRONG>:  As  one  may  expect,
-equemene
+first 0 -> 1,  then  the  Q-1  processes  left are divided into two equal
-equemene
+parts: 0 -> 2 and 0 -> Q/2;  Processes  2 and Q/2  act then as sources of
-equemene
+two rings:  2 -> 3,  Q/2 -> Q/2+1; 3 -> 4,  Q/2+1 -> to Q/2+2  and so on.
-equemene
+This algorithm is probably  the most serious competitor to the increasing
-equemene
+ring modified variant.
-equemene
+<CENTER>
-equemene
+<IMG SRC="2rinM.jpg">
-equemene
+</CENTER>
 equemene
-equemene
+<LI><STRONG>Long  (bandwidth  reducing)</STRONG>:  as   opposed   to  the
-equemene
+previous  variants,  this  algorithm  and  its follower  synchronize  all
-equemene
+processes involved in the operation. The message is chopped into  Q equal
-equemene
+pieces that are scattered  across the Q processes.
-equemene
+<CENTER>
-equemene
+<IMG SRC="spread.jpg">
-equemene
+</CENTER>
-equemene
+The pieces are then rolled in Q-1 steps.  The scatter phase uses a binary
-equemene
+tree and the rolling phase exclusively uses mutual message exchanges.  In
-equemene
+odd steps 0 <-> 1,  2 <-> 3, 4 <-> 5 and so on;  in even steps Q-1 <-> 0,
-equemene
+<-> 2, 3 <-> 4, 5 <-> 6 and so on.
-equemene
+<CENTER>
-equemene
+<IMG SRC="roll.jpg">
-equemene
+</CENTER>
-equemene
+More messages are exchanged, however the total volume of communication is
-equemene
+independent of Q, making this algorithm  particularly suitable for  large
-equemene
+messages.  This algorithm  becomes  competitive  when the nodes are "very
-equemene
+fast" and the network (comparatively) "very slow".<BR><BR>
 equemene
-equemene
+<LI><STRONG>Long (bandwidth reducing modified)</STRONG>:  same  as above,
-equemene
+except that 0 -> 1 first,  and then the Long variant is used on processes
-equemene
+,2,3,4 .. Q-1.<BR><BR>
-equemene
+<CENTER>
-equemene
+<IMG SRC="spreadM.jpg">
-equemene
+<IMG SRC="rollM.jpg">
-equemene
+</CENTER>
 equemene
-equemene
+</UL>
 equemene
-equemene
+The rings variants are distinguished by a probe mechanism  that activates
-equemene
+them.  In other words,  a process involved in the broadcast and different
-equemene
+from  the source asynchronously  probes for the message to receive.  When
-equemene
+the  message  is  available  the broadcast proceeds,  and  otherwise  the
-equemene
+function returns.  This allows to interleave the broadcast operation with
-equemene
+the update phase. This contributes to reduce the idle time spent by those
-equemene
+processes waiting for the factorized panel.  This  mechanism is necessary
-equemene
+to accomodate for various computation/communication performance ratio.<BR><BR>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="look_ahead">Look-ahead</A></H3>
 equemene
-equemene
+Once the panel has been broadcast or say during this broadcast operation,
-equemene
+the trailing submatrix is updated  using the last panel in the look-ahead
-equemene
+pipe: as mentioned before,  the panel factorization  lies on the critical
-equemene
+path,  which  means  that when the kth panel has been factorized and then
-equemene
+broadcast, the next most urgent task to complete is the factorization and
-equemene
+broadcast of the k+1 th panel.  This technique  is  often  refered  to as
-equemene
+"look-ahead" or "send-ahead" in the literature.  This  package  allows to
-equemene
+select various "depth" of look-ahead.  By  convention,  a  depth  of zero
-equemene
+corresponds to no lookahead,  in which case  the  trailing  submatrix  is
-equemene
+updated by the panel currently broadcast.  Look-ahead consumes some extra
-equemene
+memory  to  essentially  keep  all the panels of columns currently in the
-equemene
+look-ahead pipe.  A look-ahead  of depth 1 (maybe 2) is likely to achieve
-equemene
+the best performance gain.<BR><BR>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="update">Update</A></H3>
 equemene
-equemene
+The update of the trailing submatrix by the last panel in the  look-ahead
-equemene
+pipe is made of two phases. First, the pivots must be applied to form the
-equemene
+current row panel U. U should then be solved by the upper triangle of the
-equemene
+column panel. U finally needs to be broadcast to each process row so that
-equemene
+the  local  rank-nb  update  can take place.  We choose  to  combine  the
-equemene
+swapping and broadcast of  U  at the cost of  replicating the solve.  Two
-equemene
+algorithms are available for this communication operation.
-equemene
+<UL>
-equemene
+<LI><STRONG>Binary-exchange</STRONG>:  this is a modified variant  of the
-equemene
+binary-exchange (leave on all) reduction operation.  Every process column
-equemene
+performs the same operation.  The algorithm essentially works as follows.
-equemene
+It pretends reducing the row panel U, but at the beginning the only valid
-equemene
+copy is owned by the current process row.  The  other process  rows  will
-equemene
+contribute rows of A they own that should be copied in U and replace them
-equemene
+with rows that were originally in the current process row.  The  complete
-equemene
+operation is performed in  log(P) steps.  For the sake of simplicity, let
-equemene
+assume that  P  is a power of two.  At step k,  process row p exchanges a
-equemene
+message with process row p+2^k.  There are  essentially two cases. First,
-equemene
+one of those two process rows  has received  U  in  a previous step.  The
-equemene
+exchange occurs.  One process  swaps  its  local rows of  A into U.  Both
-equemene
+processes copy in  U remote rows of A. Second, none of those process rows
-equemene
+has received U,  the exchange occurs, and both processes simply add those
-equemene
+remote rows  to  the list  they have accumulated so far.  At each step, a
-equemene
+message  of  the size of  U  is exchanged by at least one pair of process
-equemene
+rows.<BR><BR>
 equemene
-equemene
+<LI><STRONG>Long</STRONG>:   this  is   a   bandwidth   reducing  variant
-equemene
+accomplishing the same task. The row panel is first spread (using a tree)
-equemene
+among the process rows with respect to the pivot array. This is a scatter
-equemene
+(V variant for MPI users).  Locally,  every process row  then swaps these
-equemene
+rows with the the rows of A it owns and that belong to U.  These  buffers
-equemene
+are then rolled  (P-1 steps) to finish the broadcast of U.  Every process
-equemene
+row permutes U and proceed  with the computational part of the update.  A
-equemene
+couple  of  notes:   process  rows  are  logarithmically   sorted  before
-equemene
+spreading,  so  that  processes  receiving the largest number of rows are
-equemene
+first in the tree.  This makes  the communication volume optimal for this
-equemene
+phase. Finally, before rolling and after the local swap, an equilibration
-equemene
+phase occurs during  which the local pieces of  U  are  uniformly  spread
-equemene
+across  the process rows.  A tree-based algorithm is used. This operation
-equemene
+is necessary to keep the rolling phase optimal  even  when the pivot rows
-equemene
+are  not  equally distributed  in  process rows.  This  algorithm  has  a
-equemene
+complexity  in  terms  of communication volume that solely depends on the
-equemene
+size of U.  In particular,  the number of process rows  only  impacts the
-equemene
+number of messages exchanged.  It  will  thus  outperforms  the  previous
-equemene
+variant for large problems on large machine configurations.<BR><BR>
 equemene
-equemene
+</UL>
 equemene
-equemene
+The user can select any of the two variants above.  In addition, a mix is
-equemene
+possible as well.  The  "binary-exchange"  algorithm will be used when  U
-equemene
+contains at most a certain number of columns. Choosing at least the block
-equemene
+size  nb as the threshold value is clearly recommended when look-ahead is
-equemene
+on.<BR><BR>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="trsv">Backward Substitution</A></H3>
 equemene
-equemene
+The factorization has just now ended, the back-substitution remains to be
-equemene
+done.  For this,  we  choose  a look-ahead  of  depth  one  variant.  The
-equemene
+right-hand-side  is  forwarded  in  process  rows  in  a  decreasing-ring
-equemene
+fashion,  so that  we solve Q * nb entries at a time.  At each step, this
-equemene
+shrinking piece of the right-hand-side is updated. The process just above
-equemene
+the one owning the current diagonal block of the matrix  A  updates first
-equemene
+its last nb piece of x,  forwards it to the previous process column, then
-equemene
+broadcast  it in the process column in a decreasing-ring fashion as well.
-equemene
+The solution is then updated and sent to the previous process column. The
-equemene
+solution of the linear system is left replicated in every process row.<BR><BR>
-equemene
+<HR NOSHADE
 equemene
-equemene
+<H3<A ="check">Checking the Solution</A></H3>
 equemene
-equemene
+To verify the result obtained,  the input matrix  and right-hand side are
-equemene
+regenerated.  The  normwise  backward  error  (see formula below) is then
-equemene
+computed.  A solution  is  considered  as "numerically correct" when this
-equemene
+quantity  is  less  than  a  threshold  value of the order of 1.0. In the
-equemene
+expression   below,  eps  is  the  relative  (distributed-memory)  machine
-equemene
+precision.
 equemene
-equemene
+<UL>
-equemene
+<LI>|| Ax - b ||_oo / ( eps * ( || A ||_oo * || x ||_oo + || b ||_oo ) * n )
-equemene
+</UL>
 equemene
-equemene
+<HR NOSHADE
-equemene
+<CENTER
-equemene
+<A  = "index.html">            [Home]</A>
-equemene
+<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
-equemene
+<A HREF = "algorithm.html">        [Algorithm]</A>
-equemene
+<A HREF = "scalability.html">      [Scalability]</A>
-equemene
+<A HREF = "results.html">          [Performance Results]</A>
-equemene
+<A HREF = "documentation.html">    [Documentation]</A>
-equemene
+<A HREF = "software.html">         [Software]</A>
-equemene
+<A HREF = "faqs.html">             [FAQs]</A>
-equemene
+<A HREF = "tuning.html">           [Tuning]</A>
-equemene
+<A HREF = "errata.html">           [Errata-Bugs]</A>
-equemene
+<A HREF = "references.html">       [References]</A>
-equemene
+<A HREF = "links.html">            [Related Links]</A><BR>
-equemene
+</CENTER>
-equemene
+<HR NOSHADE
-equemene
+</BODY
-equemene
+</HTML