<HTML>
<HEAD>
<TITLE>HPL Tuning</TITLE>
</HEAD>

<BODY 
BGCOLOR     = "WHITE"
BACKGROUND  = "WHITE"
TEXT        = "#000000"
VLINK       = "#000099"
ALINK       = "#947153"
LINK        = "#0000ff">

<H2>HPL Tuning</H2>

After having built the executable hpl/bin/&lt;arch&gt;/xhpl,
one may want to modify the input data file HPL.dat. This file
should reside in the same directory as the executable
hpl/bin/&lt;arch&gt;/xhpl. An example HPL.dat file is
provided by default. This file contains information about the
problem sizes, machine configuration, and algorithm features
to be used by the executable. It is 31 lines long. All the
selected parameters will be printed in the output generated
by the executable.<BR><BR>

We first describe the meaning of each line of this input file
below. Finally, <A HREF="tuning.html#tips">a few useful
experimental guidelines</A> to set up the file are given at
the end of this page.<BR><BR>
<HR NOSHADE>

<H3><A NAME="desc">Description of the HPL.dat File</A></H3>

<STRONG>Line 1</STRONG>: (unused) Typically one would use
this line for one's own purposes. For example, it could be used
to summarize the content of the input file. By default this
line reads:
<TT><PRE>
HPL Linpack benchmark input file
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 2</STRONG>: (unused) same as the line above. By
default this line reads:
<TT><PRE>
Innovative Computing Laboratory, University of Tennessee
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 3</STRONG>: the user can choose where the
output should be redirected to. In the case of a file, a
name is necessary, and this is the line where one specifies
it. Only the first name on this line is significant.
By default, the line reads:
<TT><PRE>
HPL.out  output file name (if any)
</PRE></TT>

This means that if one chooses to redirect the output to a
file, the file will be called "HPL.out". The rest of the line
is unused, and this space can hold an informative comment on
the meaning of this line.<BR><BR>

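<HR NOSHADE>
<STRONG>Line 4</STRONG>: This line specifies where the output
should go. The line is formatted: it must begin with a
positive integer, and the rest is ignored. 3 choices are
possible for the positive integer: 6 means that the output
will go to the standard output, 7 means that the output will
go to the standard error. Any other integer means that the
output should be redirected to a file, whose name has been
specified in the line above. This line by default reads:
<TT><PRE>
6        device out (6=stdout,7=stderr,file)
</PRE></TT>
which means that the output generated by the executable
should be redirected to the standard output.<BR><BR>
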
<HR NOSHADE>
<STRONG>Line 5</STRONG>: This line specifies the number of
problem sizes to be executed. This number should be less than
or equal to 20. The first integer is significant, the rest
is ignored. If the line reads:
<TT><PRE>
3        # of problems sizes (N)
</PRE></TT>
this means that the user is willing to run 3 problem sizes
that will be specified in the next line.<BR><BR>

<HR NOSHADE>
<STRONG>Line 6</STRONG>: This line specifies the problem sizes
one wants to run. Assuming the line above started with 3,
the first 3 positive integers are significant, the rest is
ignored. For example:
<TT><PRE>
3000 6000 10000    Ns
</PRE></TT>
means that one wants xhpl to run 3 (specified in line 5)
problem sizes, namely 3000, 6000 and 10000.<BR><BR>

<HR NOSHADE>
<STRONG>Line 7</STRONG>: This line specifies the number of
block sizes to be run. This number should be less than or
equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
<TT><PRE>
5        # of NBs
</PRE></TT>
this means that the user is willing to use 5 block sizes that
will be specified in the next line.<BR><BR>

<HR NOSHADE>
<STRONG>Line 8</STRONG>: This line specifies the block sizes
one wants to run. Assuming the line above started with 5,
the first 5 positive integers are significant, the rest is
ignored. For example:
<TT><PRE>
80 100 120 140 160 NBs
</PRE></TT>
means that one wants xhpl to use 5 (specified in line 7)
block sizes, namely 80, 100, 120, 140 and 160.<BR><BR>
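As a rough illustration of the convention shared by lines 5-6
and lines 7-8 (a count line followed by a values line), here is
a minimal Python sketch. The helper names <TT>first_int</TT> and
<TT>first_values</TT> are hypothetical, and this is not how xhpl
itself reads the file; the sketch only mirrors the rule that the
leading integers are significant and the rest of each line is
treated as a comment.
<TT><PRE>
```python
def first_int(line):
    """Return the first integer on a count line; the rest is a comment."""
    return int(line.split()[0])


def first_values(line, count):
    """Return the first `count` integers found on a values line."""
    return [int(tok) for tok in line.split()[:count]]


# Sample count and values lines, trailing comments included.
n_count = first_int("3        # of problems sizes (N)")
ns = first_values("3000 6000 10000    Ns", n_count)
nb_count = first_int("5        # of NBs")
nbs = first_values("80 100 120 140 160 NBs", nb_count)

print(ns)   # [3000, 6000, 10000]
print(nbs)  # [80, 100, 120, 140, 160]
```
</PRE></TT>
Because only the first <TT>count</TT> tokens are converted, the
trailing labels never reach <TT>int()</TT>.<BR><BR>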

<HR NOSHADE>
<STRONG>Line 9</STRONG>: This line specifies how the MPI
processes should be mapped onto the nodes of your platform.
There are currently two possible mappings, namely row- and
column-major. This feature is mainly useful when these nodes
are themselves multi-processor computers. A row-major mapping
is recommended.<BR><BR>

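<HR NOSHADE>
<STRONG>Line 10</STRONG>: This line specifies the number of
process grids to be run. This number should be less than
or equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
<TT><PRE>
2        # of process grids (P x Q)
</PRE></TT>
this means that you are willing to try 2 process grid sizes
that will be specified in the next line.<BR><BR>
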
<HR NOSHADE>
<STRONG>Lines 11-12</STRONG>: These two lines specify the
number of process rows and columns of each grid you want to
run on. Assuming the line above (10) started with 2, the first
2 positive integers of those two lines are significant,
the rest is ignored. For example:
<TT><PRE>
1 2          Ps
6 8          Qs
</PRE></TT>
means that one wants to run xhpl on 2 process grids (line
10), namely 1-by-6 and 2-by-8. Note: in this example, it is
required to start xhpl on at least 16 nodes (the maximum
of Pi-by-Qi). The runs on the two grids will be consecutive.
If one was starting xhpl on more than 16 nodes, say 52, only
6 would be used for the first grid (1x6) and then 16 (2x8)
would be used for the second grid. The fact that you started
the MPI job on 52 nodes will not make HPL use all of them.
In this example, only 16 would be used. If one wants to run
xhpl with 52 processes, one needs to specify a grid of 52
processes; for example, the following lines would do the job:
<TT><PRE>
4  2         Ps
13 8         Qs
</PRE></TT>
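The arithmetic behind these two examples can be sketched in a
few lines of Python. The function names
<TT>process_grids</TT> and <TT>processes_needed</TT> are
hypothetical helpers written for this page, not part of HPL;
they only pair up the Ps and Qs entries and take the largest
P-by-Q product.
<TT><PRE>
```python
def process_grids(ps, qs, count):
    """Pair the first `count` entries of the Ps and Qs lines into (P, Q) grids."""
    return list(zip(ps[:count], qs[:count]))


def processes_needed(grids):
    """Smallest job size that can host every grid: the maximum of P*Q."""
    return max(p * q for p, q in grids)


grids = process_grids([1, 2], [6, 8], 2)
print(grids)                    # [(1, 6), (2, 8)]
print(processes_needed(grids))  # 16

# The second example contains a 4-by-13 grid, which uses all 52 processes.
print(processes_needed(process_grids([4, 2], [13, 8], 2)))  # 52
```
</PRE></TT>
Any processes started beyond that maximum would simply sit idle
during the runs.<BR><BR>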
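
<HR NOSHADE>
<STRONG>Line 13</STRONG>: This line specifies the threshold
to which the residuals should be compared. The residuals
should be of order 1, but are in practice slightly less than
this, typically 0.001. This line is made of a real number,
the rest is not significant. For example:
<TT><PRE>
16.0         threshold
</PRE></TT>
In practice, a value of 16.0 will cover most cases. For
various reasons, it is possible that some of the residuals
become slightly larger, say for example 35.6. xhpl will flag
those runs as failed, however they can be considered as
correct. A run should be considered as failed if the residual
is a few orders of magnitude bigger than 1, for example 10^6
or more. Note: if one was to specify a threshold of 0.0, all
tests would be flagged as failed, even though the answer is
likely to be correct. It is allowed to specify a negative
value for this threshold, in which case the checks will be
by-passed, no matter what the threshold value is, as soon as
it is negative. This feature allows one to save time when
performing a lot of experiments, say for instance during the
tuning phase. Example:
<TT><PRE>
-16.0        threshold
</PRE></TT>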
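The threshold rule can be summarized by the following Python
sketch. It is an illustration of the behaviour described on
this page, not HPL's own code: the function name
<TT>check_residual</TT>, the returned strings and the sample
residuals are all assumptions made for the example.
<TT><PRE>
```python
def check_residual(residual, threshold):
    """Classify one run according to the rule described for line 13."""
    if 0.0 > threshold:
        return "SKIPPED"  # negative threshold: the check is by-passed
    if threshold > residual:
        return "PASSED"   # residual strictly below the threshold
    return "FAILED"


print(check_residual(0.002, 16.0))   # PASSED
print(check_residual(35.6, 16.0))    # FAILED, though likely still correct
print(check_residual(0.002, 0.0))    # FAILED: nothing is below 0.0
print(check_residual(1.0e6, -16.0))  # SKIPPED
```
</PRE></TT>
The strict comparison is what makes a threshold of 0.0 flag
every run as failed.<BR><BR>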

<HR NOSHADE>
The remaining lines allow one to specify algorithmic features.
xhpl will run all possible combinations of those for each
problem size, block size, process grid combination. This is
handy when one looks for an "optimal" set of parameters. To
understand this a little better, let us first say a few words
about the algorithm implemented in HPL. Basically this is a
right-looking version with row partial pivoting. The panel
factorization is matrix-matrix operation based and recursive,
dividing the panel into NDIV subpanels at each step. This
part of the panel factorization is denoted below by
"recursive panel fact. (RFACT)". The recursion stops when
the current panel is made of less than or equal to NBMIN
columns. At that point, xhpl uses a matrix-vector operation
based factorization denoted below by "PFACTs". Classic
recursion would then use NDIV=2, NBMIN=1. There are
essentially 3 numerically equivalent LU factorization
algorithm variants (left-looking, Crout and right-looking).
In HPL, one can choose any one of those for the RFACT, as
well as the PFACT. The following lines of HPL.dat allow you
to set those parameters.<BR><BR>
<STRONG>Lines 14-21: (Example 1)</STRONG>
<TT><PRE>
3       # of panel fact
0 1 2   PFACTs (0=left, 1=Crout, 2=Right)
4       # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3       # of panels in recursion
2 3 4   NDIVs
3       # of recursive panel fact.
0 1 2   RFACTs (0=left, 1=Crout, 2=Right)
</PRE></TT>

This example would try all variants of PFACT, 4 values for
NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV, namely 2, 3
and 4, and all variants for RFACT.<BR><BR>
<STRONG>Lines 14-21: (Example 2)</STRONG>
<TT><PRE>
2       # of panel fact
2 0     PFACTs (0=left, 1=Crout, 2=Right)
2       # of recursive stopping criterium
4 8     NBMINs (>= 1)
1       # of panels in recursion
2       NDIVs
1       # of recursive panel fact.
2       RFACTs (0=left, 1=Crout, 2=Right)
</PRE></TT>
This example would try 2 variants of PFACT, namely right
looking and left looking, 2 values for NBMIN, namely 4 and 8,
1 value for NDIV, namely 2, and one variant for RFACT.<BR><BR>
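Since xhpl runs every combination, the number of runs grows
quickly. The Python sketch below counts the combinations of
Example 1; it only illustrates the counting and makes no claim
about the order in which xhpl actually iterates. The totals for
3 problem sizes, 5 block sizes and 2 process grids assume a
single broadcast variant and a single look-ahead depth.
<TT><PRE>
```python
from itertools import product

# Values taken from Example 1 (lines 14-21).
pfacts = [0, 1, 2]
nbmins = [1, 2, 4, 8]
ndivs = [2, 3, 4]
rfacts = [0, 1, 2]

combos = list(product(pfacts, nbmins, ndivs, rfacts))
print(len(combos))  # 108

# Every combination is repeated for each problem size, block size and grid.
runs = 3 * 5 * 2 * len(combos)
print(runs)  # 3240
```
</PRE></TT>
By the same count, Example 2 yields only 2 x 2 x 1 x 1 = 4
combinations, which is why narrowing the lists early saves a
lot of machine time.<BR><BR>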

<HR NOSHADE>
In the main loop of the algorithm, the current panel of
columns is broadcast in process rows using a virtual ring
topology. HPL offers various choices, and one most likely
wants to use the increasing ring modified, encoded as 1. 3 and
4 are also good choices.<BR><BR>
<STRONG>Lines 22-23: (Example 1)</STRONG>
<TT><PRE>
1       # of broadcast
1       BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
</PRE></TT>
This will cause HPL to broadcast the current panel using the
increasing ring modified topology.<BR><BR>
<STRONG>Lines 22-23: (Example 2)</STRONG>
<TT><PRE>
2       # of broadcast
0 4     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
</PRE></TT>
This will cause HPL to broadcast the current panel using the
increasing ring virtual topology and the long message
algorithm.<BR><BR>

<HR NOSHADE>
<STRONG>Lines 24-25</STRONG> allow one to specify the look-ahead
depth used by HPL. A depth of 0 means that the next panel
is factorized after the update by the current panel is
completely finished. A depth of 1 means that the next
panel is immediately factorized after being updated. The
update by the current panel is then finished. A depth of k
means that the k next panels are factorized immediately after
being updated. The update by the current panel is then
finished. It turns out that a depth of 1 seems to give the
best results, but may need a large problem size before one
can see the performance gain. So use 1 if you do not know
better; otherwise you may want to try 0. Look-ahead of
depths 3 and larger will probably not give you better
results.<BR><BR>
<STRONG>Lines 24-25: (Example 1):</STRONG>
<TT><PRE>
1       # of lookahead depth
1       DEPTHs (>=0)
</PRE></TT>
This will cause HPL to use a look-ahead of depth 1.<BR><BR>
<STRONG>Lines 24-25: (Example 2):</STRONG>
<TT><PRE>
2       # of lookahead depth
0 1     DEPTHs (>=0)
</PRE></TT>
This will cause HPL to use a look-ahead of depths 0 and 1.<BR><BR>

<HR NOSHADE>
<STRONG>Lines 26-27</STRONG> allow one to specify the swapping
algorithm used by HPL for all tests. There are currently
two swapping algorithms available, one based on "binary
exchange" and the other one based on a "spread-roll"
procedure (also called "long" below). For large problem
sizes, this last one is likely to be more efficient. The user
can also choose to mix both variants, that is "binary-exchange"
for a number of columns less than a threshold value, and then
the "spread-roll" algorithm. This threshold value is then
specified on Line 27.<BR><BR>
<STRONG>Lines 26-27: (Example 1):</STRONG>
<TT><PRE>
1       SWAP (0=bin-exch,1=long,2=mix)
60      swapping threshold
</PRE></TT>
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm. Note that a threshold is specified in
that example but not used by HPL.<BR><BR>
<STRONG>Lines 26-27: (Example 2):</STRONG>
<TT><PRE>
2       SWAP (0=bin-exch,1=long,2=mix)
60      swapping threshold
</PRE></TT>
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm as soon as there are more than 60 columns
in the row panel. Otherwise, the "binary-exchange" algorithm
will be used instead.<BR><BR>
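The selection rule for the mixed variant can be sketched in
Python as follows. This is a reading of the description on this
page rather than HPL's implementation; the constant names and
the function <TT>swap_algorithm</TT> are hypothetical.
<TT><PRE>
```python
# SWAP encodings from line 26.
BIN_EXCH, LONG, MIX = 0, 1, 2


def swap_algorithm(swap, threshold, columns):
    """Return the algorithm (BIN_EXCH or LONG) used for a row panel."""
    if swap == MIX:
        # "long" as soon as there are more columns than the threshold
        return LONG if columns > threshold else BIN_EXCH
    return swap  # the threshold is ignored for 0 and 1


print(swap_algorithm(MIX, 60, 61))        # 1 (long)
print(swap_algorithm(MIX, 60, 60))        # 0 (bin-exch)
print(swap_algorithm(LONG, 60, 10))       # 1, threshold not used
print(swap_algorithm(BIN_EXCH, 60, 100))  # 0, threshold not used
```
</PRE></TT>
With a very large threshold the mixed variant therefore behaves
like binary-exchange almost all the time.<BR><BR>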

<HR NOSHADE>
<STRONG>Line 28</STRONG> allows one to specify whether the upper
triangle of the panel of columns should be stored in
no-transposed or transposed form. Example:
<TT><PRE>
0            L1 in (0=transposed,1=no-transposed) form
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 29</STRONG> allows one to specify whether the panel
of rows U should be stored in no-transposed or transposed
form. Example:
<TT><PRE>
0            U  in (0=transposed,1=no-transposed) form
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 30</STRONG> enables / disables the equilibration
phase. This option will not be used unless you selected 1 or
2 in Line 26. Example:
<TT><PRE>
1            Equilibration (0=no,1=yes)
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 31</STRONG> allows one to specify the alignment in
memory for the memory space allocated by HPL. On modern
machines, one probably wants to use 4, 8 or 16. This may
result in a tiny amount of memory wasted. Example:
<TT><PRE>
8       memory alignment in double (> 0)
</PRE></TT>

<HR NOSHADE>
<H3><A NAME="tips">Guidelines</A></H3>

<OL>
<LI>Figure out a good block size for the matrix multiply
routine. The best method is to try a few out. If you happen
to know the block size used by the matrix-matrix multiply
routine, a small multiple of that block size will do fine.
This particular topic is discussed in the
<A HREF="faqs.html#blsize">FAQs</A> section.<BR><BR>

<LI>The process mapping should not matter if the nodes of
your platform are single processor computers. If these nodes
are multi-processors, a row-major mapping is recommended.<BR><BR>

<LI>HPL likes "square" or slightly flat process grids. Unless
you are using a very small process grid, stay away from the
1-by-Q and P-by-1 process grids. This particular topic is also
discussed in the <A HREF="faqs.html#grid">FAQs</A> section.<BR><BR>

<LI>Panel factorization parameters: a good start is the
following for lines 14-21:
<TT><PRE>
1       # of panel fact
1       PFACTs (0=left, 1=Crout, 2=Right)
2       # of recursive stopping criterium
4 8     NBMINs (>= 1)
1       # of panels in recursion
2       NDIVs
1       # of recursive panel fact.
2       RFACTs (0=left, 1=Crout, 2=Right)
</PRE></TT>

<LI>Broadcast parameters: at this time it is far from obvious
to me what the best setting is, so I would probably try them
all. If I had to guess, I would probably start with the
following for lines 22-23:
<TT><PRE>
2       # of broadcast
1 3     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
</PRE></TT>
The best broadcast depends on your problem size and hardware
performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes compared to the
network.<BR><BR>

<LI>Look-ahead depth: as mentioned above, 0 or 1 are likely to
be the best choices. This also depends on the problem size
and machine configuration, so I would try "no look-ahead (0)"
and "look-ahead of depth 1 (1)". That is, for lines 24-25:
<TT><PRE>
2       # of lookahead depth
0 1     DEPTHs (>=0)
</PRE></TT>

<LI>Swapping: one can select only one of the three algorithms
in the input file. Theoretically, mix (2) should win; however,
long (1) might just be good enough. The difference should be
small between those two, assuming a swapping threshold of the
order of the block size (NB) selected. If this threshold is
very large, HPL will use bin_exch (0) most of the time, and if
it is very small (&lt; NB), long (1) will always be used. In
short, and assuming the block size (NB) used is say 60, I
would choose for lines 26-27:
<TT><PRE>
2       SWAP (0=bin-exch,1=long,2=mix)
60      swapping threshold 
</PRE></TT>
I would also try the long variant. For a very small number
of processes in every column of the process grid (say &lt; 4),
very little performance difference should be observable.<BR><BR>

<LI>Local storage: I do not think Line 28 matters. When in
doubt, pick 0. Line 29 is more important. It controls how the
panel of rows should be stored. No doubt 0 is better. The
caveat is that in that case the matrix-multiply function is
called with ( Notrans, Trans, ... ), that is C := C - A B^T.
Unless the computational kernel you are using has a very poor
(with respect to performance) implementation of that case, and
is much more efficient with ( Notrans, Notrans, ... ), just
pick 0 as well. So, my choice:
<TT><PRE>
0       L1 in (0=transposed,1=no-transposed) form
0       U  in (0=transposed,1=no-transposed) form
</PRE></TT>

<LI>Equilibration: It is hard to tell whether equilibration
should always be performed or not. Not knowing much about the
random matrix generated, and because the overhead is so small
compared to the possible gain, I turn it on all the time.
<TT><PRE>
1       Equilibration (0=no,1=yes)
</PRE></TT>

<LI>For alignment, 4 should be plenty, but just to be safe,
one may want to pick 8 instead.
<TT><PRE>
8       memory alignment in double (> 0)
</PRE></TT>
</OL>

<HR NOSHADE>
<CENTER>
<A HREF = "index.html">            [Home]</A>
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html">        [Algorithm]</A>
<A HREF = "scalability.html">      [Scalability]</A>
<A HREF = "results.html">          [Performance Results]</A>
<A HREF = "documentation.html">    [Documentation]</A>
<A HREF = "software.html">         [Software]</A>
<A HREF = "faqs.html">             [FAQs]</A>
<A HREF = "tuning.html">           [Tuning]</A>
<A HREF = "errata.html">           [Errata-Bugs]</A>
<A HREF = "references.html">       [References]</A>
<A HREF = "links.html">            [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>