==============================================================
 Performance Tuning and setting up the input data file HPL.dat

 Current as of release 2.0 - September 10, 2008
==============================================================
 Check out the website www.netlib.org/benchmark/hpl for the
 latest information.

    
 After having built the executable hpl/bin/<arch>/xhpl, one
 may want to modify the input data file HPL.dat. This file
 should reside in the same directory as the executable
 hpl/bin/<arch>/xhpl. An example HPL.dat file is provided by
 default. This file contains information about the problem
 sizes, machine configuration, and algorithm features to be
 used by the executable. It is 31 lines long. All the selected
 parameters will be printed in the output generated by the
 executable.

    
 At the end of this file, there are a couple of experimental
 guidelines that you may find useful.

    
==============================================================
 File HPL.dat (description):

 Line 1: (unused) Typically one would use this line for one's
 own purposes. For example, it could be used to summarize the
 content of the input file. By default this line reads:

 HPL Linpack benchmark input file

 Line 2: (unused) same as line 1. By default this line reads:

 Innovative Computing Laboratory, University of Tennessee

 Line 3: the user can choose where the output should be
 redirected to. In the case of a file, a name is necessary,
 and this is the line where one wants to specify it. Only the
 first name on this line is significant. By default, the line
 reads:

 HPL.out  output file name (if any)

 This means that if one chooses to redirect the output to a
 file, the file will be called "HPL.out". The rest of the
 line is unused, and this space can be used to put an
 informative comment on the meaning of this line.
 
 Line 4: This line specifies where the output should go. The
 line is formatted; it must contain a positive integer, and
 the rest is ignored. 3 choices are possible for the positive
 integer: 6 means that the output will go to the standard
 output, 7 means that the output will go to the standard
 error. Any other integer means that the output should be
 redirected to a file, whose name has been specified in the
 line above. This line by default reads:

 6        device out (6=stdout,7=stderr,file)

 which means that the output generated by the executable
 should be redirected to the standard output.

 Line 5: This line specifies the number of problem sizes to
 be executed. This number should be less than or equal to 20.
 The first integer is significant, the rest is ignored. If
 the line reads:

 3        # of problems sizes (N)

 this means that the user is willing to run 3 problem sizes
 that will be specified in the next line.
 
 Line 6: This line specifies the problem sizes one wants to
 run. Assuming the line above started with 3, the first 3
 positive integers are significant, the rest is ignored. For
 example:

 3000 6000 10000    Ns

 means that one wants xhpl to run 3 (specified in line 5)
 problem sizes, namely 3000, 6000 and 10000.
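 Since HPL factors one dense N x N matrix of 8-byte doubles,
 a quick sanity check of whether a candidate problem size
 fits in memory can be sketched as follows (the function name
 is illustrative; small workspace overheads are ignored):

```python
# Rough memory footprint of problem size N: one dense N x N
# matrix of doubles (8 bytes each); workspace overhead ignored.
def matrix_gib(n):
    return 8.0 * n * n / 2**30

for n in (3000, 6000, 10000):
    print(n, round(matrix_gib(n), 3))
```

 Doubling N quadruples the memory needed, so the largest
 problem size is usually bounded by the machine's memory.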
 
 Line 7: This line specifies the number of block sizes to be
 run. This number should be less than or equal to 20. The
 first integer is significant, the rest is ignored. If the
 line reads:

 5        # of NBs

 this means that the user is willing to use 5 block sizes
 that will be specified in the next line.

 Line 8: This line specifies the block sizes one wants to
 run. Assuming the line above started with 5, the first 5
 positive integers are significant, the rest is ignored. For
 example:

 80 100 120 140 160 NBs

 means that one wants xhpl to use 5 (specified in line 7)
 block sizes, namely 80, 100, 120, 140 and 160.

    
 Line 9: This line specifies how the MPI processes should be
 mapped onto the nodes of your platform. There are currently
 two possible mappings, namely row- and column-major. This
 feature is mainly useful when these nodes are themselves
 multi-processor computers. A row-major mapping is
 recommended.
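 The difference between the two mappings can be illustrated
 with a small sketch (this mirrors the usual convention for a
 P x Q grid; it is illustrative, not HPL source): row-major
 places consecutive MPI ranks along a grid row, column-major
 along a grid column.

```python
# Illustrative sketch (not HPL source): where MPI rank r lands
# in a P x Q process grid under the two mappings.
def row_major(r, p, q):
    return divmod(r, q)       # (row, col): consecutive ranks fill a row

def col_major(r, p, q):
    col, row = divmod(r, p)   # consecutive ranks fill a column
    return (row, col)

# On a 2 x 4 grid, rank 1 sits at (0, 1) row-major but (1, 0)
# column-major, so the two ranks of a dual-processor node end
# up in the same process row or column respectively.
print(row_major(1, 2, 4), col_major(1, 2, 4))
```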
 
 Line 10: This line specifies the number of process grids to
 be run. This number should be less than or equal to 20. The
 first integer is significant, the rest is ignored. If the
 line reads:

 2        # of process grids (P x Q)

 this means that you are willing to try 2 process grid sizes
 that will be specified in the next line.
 
 Lines 11-12: These two lines specify the number of process
 rows and columns of each grid you want to run on. Assuming
 the line above (10) started with 2, the first 2 positive
 integers of those two lines are significant, the rest is
 ignored. For example:

 1 2          Ps
 6 8          Qs

 means that one wants to run xhpl on 2 process grids (line
 10), namely 1 by 6 and 2 by 8. Note: In this example, it is
 then required to start xhpl on at least 16 nodes (max of
 P_i x Q_i). The runs on the two grids will be consecutive.
 If one was starting xhpl on more than 16 nodes, say 52, only
 6 would be used for the first grid (1x6) and then 16 (2x8)
 would be used for the second grid. The fact that you started
 the MPI job on 52 nodes will not make HPL use all of them;
 in this example, only 16 would be used. If one wants to run
 xhpl with 52 processes, one needs to specify a grid of 52
 processes; for example, the following lines would do the
 job:

 4  2         Ps
 13 8         Qs
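 The arithmetic in this example can be sketched as follows
 (illustrative, not HPL code):

```python
# Illustrative: how many processes each grid in HPL.dat uses,
# and the minimum number of MPI processes the job must be
# started on (the maximum over all grids).
Ps = [4, 2]    # Line 11
Qs = [13, 8]   # Line 12

sizes = [p * q for p, q in zip(Ps, Qs)]   # processes per grid
print(sizes, max(sizes))
```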
 
 Line 13: This line specifies the threshold to which the
 residuals should be compared. The residuals should be of
 order 1, but are in practice slightly less than this,
 typically 0.001. This line is made of a real number, the
 rest is ignored. For example:

 16.0         threshold

 In practice, a value of 16.0 will cover most cases. For
 various reasons, it is possible that some of the residuals
 become slightly larger, say for example 35.6. xhpl will flag
 those runs as failed; however, they can be considered
 correct. A run can be considered failed if the residual is a
 few orders of magnitude bigger than 1, for example 10^6 or
 more. Note: if one were to specify a threshold of 0.0, all
 tests would be flagged as failed, even though the answer is
 likely to be correct. It is allowed to specify a negative
 value for this threshold, in which case the checks will be
 bypassed, no matter what the residual values are. This
 feature allows one to save time when performing a lot of
 experiments, say for instance during the tuning phase.
 Example:

 -16.0        threshold
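 The pass/fail test described above amounts to comparing a
 scaled residual against the threshold. A minimal sketch of
 the idea (this mirrors the scaled infinity-norm residual
 reported by HPL 2.x, not its actual code):

```python
# Sketch of the check described above (mirrors the scaled
# infinity-norm residual of HPL 2.x; not the HPL source).
def inf_norm_vec(v):
    return max(abs(x) for x in v)

def inf_norm_mat(A):                 # max absolute row sum
    return max(sum(abs(a) for a in row) for row in A)

def run_passes(A, x, b, threshold, eps=2.0**-52):
    if threshold < 0.0:              # negative threshold: bypass checks
        return True
    n = len(A)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
         for i in range(n)]
    scale = eps * (inf_norm_mat(A) * inf_norm_vec(x)
                   + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / scale < threshold

# A 2x2 system solved exactly: residual 0 passes 16.0, and any
# run is accepted when the threshold is negative.
A = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, 1.0]
b = [3.0, 4.0]
print(run_passes(A, x, b, 16.0), run_passes(A, x, b, -16.0))
```

 Note that with a threshold of 0.0 even a zero residual fails
 the strict comparison, which matches the remark above.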
 
 The remaining lines allow one to specify algorithmic
 features. xhpl will run all possible combinations of those
 for each problem size, block size and process grid
 combination. This is handy when one looks for an "optimal"
 set of parameters. To understand this a little better, let
 us first say a few words about the algorithm implemented in
 HPL. Basically this is a right-looking version with
 row-partial pivoting. The panel factorization is
 matrix-matrix operation based and recursive, dividing the
 panel into NDIV subpanels at each step. This part of the
 panel factorization is denoted below by "recursive panel
 fact. (RFACT)". The recursion stops when the current panel
 is made of less than or equal to NBMIN columns. At that
 point, xhpl uses a matrix-vector operation based
 factorization denoted below by "PFACTs". Classic recursion
 would then use NDIV=2, NBMIN=1. There are essentially 3
 numerically equivalent LU factorization algorithm variants
 (left-looking, Crout and right-looking). In HPL, one can
 choose any one of those for the RFACT, as well as for the
 PFACT. The following lines of HPL.dat allow you to set
 those parameters.

 Lines 14-21: (Example 1)
 3       # of panel fact
 0 1 2   PFACTs (0=left, 1=Crout, 2=Right)
 4       # of recursive stopping criterium
 1 2 4 8 NBMINs (>= 1)
 3       # of panels in recursion
 2 3 4   NDIVs
 3       # of recursive panel fact.
 0 1 2   RFACTs (0=left, 1=Crout, 2=Right)

 This example would try all variants of PFACT, 4 values for
 NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV, namely 2, 3
 and 4, and all variants for RFACT. Lines 14-21: (Example 2)

    
 2       # of panel fact
 2 0     PFACTs (0=left, 1=Crout, 2=Right)
 2       # of recursive stopping criterium
 4 8     NBMINs (>= 1)
 1       # of panels in recursion
 2       NDIVs
 1       # of recursive panel fact.
 2       RFACTs (0=left, 1=Crout, 2=Right)

 This example would try 2 variants of PFACT, namely
 right-looking and left-looking, 2 values for NBMIN, namely 4
 and 8, 1 value for NDIV, namely 2, and one variant for
 RFACT.
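 Because xhpl runs every combination of these choices for
 each (N, NB, P x Q) triple, the total number of runs
 multiplies out quickly. Counting only the parameters shown
 in Example 1 above (and ignoring the broadcast, look-ahead
 and swapping variants of the later lines):

```python
# Illustrative: total runs xhpl performs for Example 1 above,
# given 3 problem sizes, 5 block sizes and 2 process grids
# (lines 5-12), ignoring broadcast/look-ahead/swap variants.
n_N, n_NB, n_grids = 3, 5, 2
n_pfact, n_nbmin, n_ndiv, n_rfact = 3, 4, 3, 3   # lines 14-21
runs = n_N * n_NB * n_grids * n_pfact * n_nbmin * n_ndiv * n_rfact
print(runs)
```

 This is why a sweeping Example 1 setup is only sensible with
 a negative threshold or small problem sizes; Example 2 keeps
 the count manageable.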
 
 In the main loop of the algorithm, the current panel of
 columns is broadcast in process rows using a virtual ring
 topology. HPL offers various choices, and one most likely
 wants to use the increasing ring modified topology, encoded
 as 1. 4 is also a good choice. Lines 22-23: (Example 1):

 1       # of broadcast
 1       BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

 This will cause HPL to broadcast the current panel using the
 increasing ring modified topology. Lines 22-23: (Example 2):

 2       # of broadcast
 0 4     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

 This will cause HPL to broadcast the current panel using the
 increasing ring virtual topology and the long message
 algorithm.
 
 Lines 24-25 allow one to specify the look-ahead depth used
 by HPL. A depth of 0 means that the next panel is factorized
 after the update by the current panel is completely
 finished. A depth of 1 means that the next panel is
 factorized immediately after being updated; the update by
 the current panel is then finished. A depth of k means that
 the k next panels are factorized immediately after being
 updated; the update by the current panel is then finished.
 It turns out that a depth of 1 seems to give the best
 results, but it may need a large problem size before one can
 see the performance gain. So use 1 if you do not know
 better; otherwise you may want to try 0. Look-ahead of
 depths 2 and larger will probably not give you better
 results. Lines 24-25: (Example 1):

 1       # of lookahead depth
 1       DEPTHs (>=0)

 This will cause HPL to use a look-ahead of depth 1.
 Lines 24-25: (Example 2):

 2       # of lookahead depth
 0 1     DEPTHs (>=0)

 This will cause HPL to use a look-ahead of depths 0 and 1.

    
 Lines 26-27 allow one to specify the swapping algorithm used
 by HPL for all tests. There are currently two swapping
 algorithms available, one based on "binary exchange" and the
 other one based on a "spread-roll" procedure (also called
 "long" below). For large problem sizes, the latter is likely
 to be more efficient. The user can also choose to mix both
 variants, that is, "binary-exchange" for a number of columns
 less than a threshold value, and then the "spread-roll"
 algorithm. This threshold value is then specified on Line
 27. Lines 26-27: (Example 1):

 1       SWAP (0=bin-exch,1=long,2=mix)
 60      swapping threshold

 This will cause HPL to use the "long" or "spread-roll"
 swapping algorithm. Note that a threshold is specified in
 that example but not used by HPL. Lines 26-27: (Example 2):

 2       SWAP (0=bin-exch,1=long,2=mix)
 60      swapping threshold

 This will cause HPL to use the "long" or "spread-roll"
 swapping algorithm as soon as there are more than 60 columns
 in the row panel. Otherwise, the "binary-exchange" algorithm
 will be used instead.

    
 Line 28 allows one to specify whether the upper triangle of
 the panel of columns should be stored in no-transposed or
 transposed form. Example:

 0            L1 in (0=transposed,1=no-transposed) form

 Line 29 allows one to specify whether the panel of rows U
 should be stored in no-transposed or transposed form.
 Example:

 0            U  in (0=transposed,1=no-transposed) form

 Line 30 enables/disables the equilibration phase. This
 option will not be used unless you selected 1 or 2 in Line
 26. Ex:

 1            Equilibration (0=no,1=yes)

 Line 31 allows one to specify the alignment in memory for
 the memory space allocated by HPL. On modern machines, one
 probably wants to use 4, 8 or 16. This may result in a tiny
 amount of memory wasted. Example:

 4       memory alignment in double (> 0)
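 Putting the 31 lines together, a complete HPL.dat assembled
 from the values discussed above might look as follows. The
 values (in particular Ns, NBs and the P x Q grid) are
 illustrative, meant for a quick 4-process sanity run rather
 than performance, and the free-form comment text after each
 value may differ from the stock file:

```text
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```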

==============================================================
 Guidelines:

 1) Figure out a good block size for the matrix-matrix
 multiply routine. The best method is to try a few out. If
 you happen to know the block size used by the matrix-matrix
 multiply routine, a small multiple of that block size will
 do fine.

    
 HPL uses the block size NB for the data distribution as well
 as for the computational granularity. From a data
 distribution point of view, the smaller NB, the better the
 load balance. You definitely want to stay away from very
 large values of NB. From a computation point of view, too
 small a value of NB may limit the computational performance
 by a large factor because almost no data reuse will occur in
 the highest level of the memory hierarchy. The number of
 messages will also increase. Efficient matrix-multiply
 routines are often internally blocked. Small multiples of
 this blocking factor are likely to be good block sizes for
 HPL. The bottom line is that "good" block sizes are almost
 always in the [32..256] interval. The best values depend on
 the computation / communication performance ratio of your
 system. To a much lesser extent, the problem size matters as
 well. Say, for example, you empirically found that 44 was a
 good block size with respect to performance. 88 or 132 are
 likely to give slightly better results for large problem
 sizes because of a slightly higher flop rate.
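 The "small multiples" advice above can be turned into a
 short list of candidates to try (illustrative helper; the
 kernel's internal blocking factor is assumed known, and real
 HPL runs remain the final judge):

```python
# Sketch of the guideline above: given the internal blocking
# factor of your matrix-multiply kernel, candidate HPL block
# sizes are its small multiples within the usual [32..256].
def candidate_nbs(kernel_block, lo=32, hi=256):
    return [k * kernel_block
            for k in range(1, hi // kernel_block + 1)
            if lo <= k * kernel_block <= hi]

print(candidate_nbs(44))   # e.g. the empirical 44 from the text
```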

    
 2) The process mapping should not matter if the nodes of
 your platform are single-processor computers. If these nodes
 are multi-processors, a row-major mapping is recommended.

 3) HPL likes "square" or slightly flat process grids. Unless
 you are using a very small process grid, stay away from the
 1-by-Q and P-by-1 process grids.

    
 4) Panel factorization parameters: a good start is the
 following for lines 14-21:

 1       # of panel fact
 1       PFACTs (0=left, 1=Crout, 2=Right)
 2       # of recursive stopping criterium
 4 8     NBMINs (>= 1)
 1       # of panels in recursion
 2       NDIVs
 1       # of recursive panel fact.
 2       RFACTs (0=left, 1=Crout, 2=Right)

    
 5) Broadcast parameters: at this time, it is far from
 obvious to me what the best setting is, so I would probably
 try them all. If I had to guess, I would probably start with
 the following for lines 22-23:

 2       # of broadcast
 1 3     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

 The best broadcast depends on your problem size and hardware
 performance. My take is that 4 or 5 may be competitive for
 machines featuring very fast nodes relative to the network.

    
 6) Look-ahead depth: as mentioned above, 0 or 1 are likely
 to be the best choices. This also depends on the problem
 size and machine configuration, so I would try "no
 look-ahead (0)" and "look-ahead of depth 1 (1)". That is,
 for lines 24-25:

 2       # of lookahead depth
 0 1     DEPTHs (>=0)

    
 7) Swapping: one can select only one of the three algorithms
 in the input file. Theoretically, mix (2) should win;
 however, long (1) might just be good enough. The difference
 should be small between those two, assuming a swapping
 threshold of the order of the block size (NB) selected. If
 this threshold is very large, HPL will use bin-exch (0) most
 of the time, and if it is very small (< NB), long (1) will
 always be used. In short, and assuming the block size (NB)
 used is, say, 60, I would choose for lines 26-27:

 2       SWAP (0=bin-exch,1=long,2=mix)
 60      swapping threshold

    
 I would also try the long variant. For a very small number
 of processes in every column of the process grid (say < 4),
 very little performance difference should be observable.

    
 8) Local storage: I do not think Line 28 matters. Pick 0 if
 in doubt. Line 29 is more important. It controls how the
 panel of rows should be stored. No doubt 0 is better. The
 caveat is that in that case the matrix-multiply function is
 called with ( Notrans, Trans, ... ), that is C := C - A B^T.
 Unless the computational kernel you are using has a very
 poor (with respect to performance) implementation of that
 case and is much more efficient with ( Notrans, Notrans,
 ... ), just pick 0 as well. So, my choice:

 0       L1 in (0=transposed,1=no-transposed) form
 0       U  in (0=transposed,1=no-transposed) form

    
 9) Equilibration: It is hard to tell whether equilibration
 should always be performed or not. Not knowing much about
 the random matrix generated, and because the overhead is so
 small compared to the possible gain, I turn it on all the
 time.

 1       Equilibration (0=no,1=yes)

    
 10) For alignment, 4 should be plenty, but just to be safe,
 one may want to pick 8 instead.

 8       memory alignment in double (> 0)

==============================================================