<HTML>
<HEAD>
<TITLE>HPL Tuning</TITLE>
</HEAD>

<BODY
BGCOLOR     = "WHITE"
BACKGROUND  = "WHITE"
TEXT        = "#000000"
VLINK       = "#000099"
ALINK       = "#947153"
LINK        = "#0000ff">

<H2>HPL Tuning</H2>

After building the executable hpl/bin/&lt;arch&gt;/xhpl, one may
want to modify the input data file HPL.dat. This file should
reside in the same directory as the executable
hpl/bin/&lt;arch&gt;/xhpl. An example HPL.dat file is provided by
default. It contains information about the problem sizes,
machine configuration, and algorithm features to be used by
the executable, and is 31 lines long. All the selected
parameters are printed in the output generated by the
executable.<BR><BR>

The meaning of each line of this input file is described
first. <A HREF="tuning.html#tips">A few useful experimental
guidelines</A> for setting up the file are given at the end
of this page.<BR><BR>
<HR NOSHADE>

<H3><A NAME="desc">Description of the HPL.dat File</A></H3>

<STRONG>Line 1</STRONG>: (unused) This line is free for the
user's own purposes. For example, it could be used to
summarize the content of the input file. By default this
line reads:
<TT><PRE>
HPL Linpack benchmark input file
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 2</STRONG>: (unused) same as line 1. By default
this line reads:
<TT><PRE>
Innovative Computing Laboratory, University of Tennessee
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 3</STRONG>: the user can choose where the
output should be redirected to. In the case of a file, a
name is necessary, and this is the line where one specifies
it. Only the first name on this line is significant.
By default, the line reads:
<TT><PRE>
HPL.out  output file name (if any)
</PRE></TT>

This means that if one chooses to redirect the output to a
file, the file will be called "HPL.out". The rest of the line
is unused, and this space can be used to put some informative
comment on the meaning of this line.<BR><BR>

<HR NOSHADE>
<STRONG>Line 4</STRONG>: This line specifies where the output
should go. The line is formatted: it must begin with a
positive integer, the rest is not significant. 3 choices are
possible for the positive integer: 6 means that the output
will go to the standard output, 7 means that the output will
go to the standard error. Any other integer means that the
output should be redirected to a file, whose name has been
specified in the line above. This line by default reads:
<TT><PRE>
6        device out (6=stdout,7=stderr,file)
</PRE></TT>
which means that the output generated by the executable will
be redirected to the standard output.<BR><BR>
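As a rough illustration of the rule described for lines 3 and
4 (this is only a sketch, not the code used by xhpl), the
choice of output target could be expressed as follows in
Python. The function name <TT>output_target</TT> is
hypothetical.<BR><BR>
<TT><PRE>
```python
def output_target(device, filename):
    """Map the integer on line 4 (and the name on line 3) to an output target."""
    if device == 6:
        return "standard output"
    if device == 7:
        return "standard error"
    # Any other positive integer: the file named on line 3 is used.
    return filename


print(output_target(6, "HPL.out"))  # standard output
print(output_target(8, "HPL.out"))  # HPL.out
```
</PRE></TT>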

<HR NOSHADE>
<STRONG>Line 5</STRONG>: This line specifies the number of
problem sizes to be executed. This number should be less than
or equal to 20. The first integer is significant, the rest
is ignored. If the line reads:
<TT><PRE>
3        # of problems sizes (N)
</PRE></TT>
this means that the user is willing to run 3 problem sizes
that will be specified in the next line.<BR><BR>

<HR NOSHADE>
<STRONG>Line 6</STRONG>: This line specifies the problem sizes
one wants to run. Assuming the line above started with 3,
the first 3 positive integers are significant, the rest is
ignored. For example:
<TT><PRE>
3000 6000 10000    Ns
</PRE></TT>
means that one wants xhpl to run 3 (specified in line 5)
problem sizes, namely 3000, 6000 and 10000.<BR><BR>
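The "count on one line, values on the next, rest ignored"
convention used by lines 5-6 (and again by lines 7-8 and
10-12) can be sketched in Python as shown here. This is an
illustration of the rule only, not the parser used by xhpl,
and the helper name <TT>leading_positive_ints</TT> is
hypothetical.<BR><BR>
<TT><PRE>
```python
def leading_positive_ints(line, count):
    """Return the first `count` positive integers at the start of `line`.

    Whatever follows those integers (labels, comments) is ignored.
    """
    values = []
    for token in line.split():
        if len(values) == count:
            break
        if not token.isdigit() or int(token) == 0:
            raise ValueError(f"expected a positive integer, got {token!r}")
        values.append(int(token))
    if len(values) != count:
        raise ValueError(f"expected {count} integers, found {len(values)}")
    return values


# Line 5 gives the count, line 6 the values; trailing labels are ignored.
count = leading_positive_ints("3        # of problems sizes (N)", 1)[0]
sizes = leading_positive_ints("3000 6000 10000    Ns", count)
print(count, sizes)  # 3 [3000, 6000, 10000]
```
</PRE></TT>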

<HR NOSHADE>
<STRONG>Line 7</STRONG>: This line specifies the number of
block sizes to be run. This number should be less than or
equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
<TT><PRE>
5        # of NBs
</PRE></TT>
this means that the user is willing to use 5 block sizes that
will be specified in the next line.<BR><BR>

<HR NOSHADE>
<STRONG>Line 8</STRONG>: This line specifies the block sizes
one wants to run. Assuming the line above started with 5,
the first 5 positive integers are significant, the rest is
ignored. For example:
<TT><PRE>
80 100 120 140 160 NBs
</PRE></TT>
means that one wants xhpl to use 5 (specified in line 7)
block sizes, namely 80, 100, 120, 140 and 160.<BR><BR>

<HR NOSHADE>
<STRONG>Line 9</STRONG>: This line specifies how the MPI
processes should be mapped onto the nodes of your platform.
There are currently two possible mappings, namely row- and
column-major. This feature is mainly useful when these nodes
are themselves multi-processor computers. A row-major mapping
is recommended.<BR><BR>

<HR NOSHADE>
<STRONG>Line 10</STRONG>: This line specifies the number of
process grids to be run. This number should be less than
or equal to 20. The first integer is significant, the rest is
ignored. If the line reads:
<TT><PRE>
2        # of process grids (P x Q)
</PRE></TT>
this means that you are willing to try 2 process grid sizes
that will be specified in the next lines.<BR><BR>

<HR NOSHADE>
<STRONG>Line 11-12</STRONG>: These two lines specify the
number of process rows and columns of each grid you want to
run on. Assuming the line above (10) started with 2, the
first 2 positive integers of those two lines are significant,
the rest is ignored. For example:
<TT><PRE>
1 2          Ps
6 8          Qs
</PRE></TT>
means that one wants to run xhpl on 2 process grids (line
10), namely 1-by-6 and 2-by-8. Note: In this example, xhpl
must be started on at least 16 nodes (the maximum of
Pi-by-Qi). The runs on the two grids will be consecutive.
If one started xhpl on more than 16 nodes, say 52, only
6 would be used for the first grid (1x6) and then 16 (2x8)
would be used for the second grid. The fact that you started
the MPI job on 52 nodes will not make HPL use all of them.
In this example, at most 16 would be used. If one wants to
run xhpl with 52 processes, one needs to specify a grid of 52
processes; for example, the following lines would do the job:
<TT><PRE>
4  2         Ps
13 8         Qs
</PRE></TT>
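The arithmetic of the note above can be checked with a short
Python sketch. The function name
<TT>grid_requirements</TT> is hypothetical, and the value 52
is simply the number of processes used in the example.<BR><BR>
<TT><PRE>
```python
def grid_requirements(ps, qs):
    """Return (processes used by each grid, minimum processes to launch)."""
    if len(ps) != len(qs):
        raise ValueError("Ps and Qs must list the same number of grids")
    used = [p * q for p, q in zip(ps, qs)]
    return used, max(used)


used, needed = grid_requirements([1, 2], [6, 8])
print(used, needed)  # [6, 16] 16

launched = 52  # processes started by the MPI job in the example
idle = [launched - u for u in used]
print(idle)  # [46, 36]
```
</PRE></TT>
With the second pair of lines (Ps = 4 2, Qs = 13 8) the same
function reports 52 and 16 processes, so all 52 processes are
used by the first grid.<BR><BR>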

<HR NOSHADE>
<STRONG>Line 13</STRONG>: This line specifies the threshold
to which the residuals should be compared. The residuals
should be of order 1, but are in practice slightly smaller
than this. This line is made of a real number,
the rest is not significant. For example:
<TT><PRE>
16.0         threshold
</PRE></TT>
In practice, a value of 16.0 will cover most cases. For
various reasons, it is possible that some of the residuals
become slightly larger than the threshold. xhpl will flag
those runs as failed, however they can still be considered
correct. A run should be considered as failed if the residual
is a few orders of magnitude larger than 1, for example 10^6
or more. Note: if one was to specify a threshold of 0.0, all
tests would be flagged as failed, even though the answer is
likely to be correct. It is allowed to specify a negative
value for this threshold, in which case the checks will be
bypassed, no matter what the threshold value is, as soon as
it is negative. This feature allows one to save time when
performing a lot of experiments, say for instance during the
tuning phase. Example:
<TT><PRE>
-16.0        threshold
</PRE></TT>
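The behaviour described for the threshold can be summarized
by the following Python sketch. It is an illustration under
assumptions, not the test performed by xhpl: in particular
the strict comparison and the function name
<TT>check_residual</TT> are assumed here.<BR><BR>
<TT><PRE>
```python
def check_residual(residual, threshold):
    """Classify one residual the way the text above describes.

    A negative threshold means the check is skipped entirely; otherwise
    the run is flagged as failed unless the residual is below the threshold.
    """
    if threshold < 0.0:
        return "SKIPPED"
    # Assumption: strict comparison, so a threshold of 0.0 fails every run.
    return "PASSED" if residual < threshold else "FAILED"


print(check_residual(0.5, 16.0))    # PASSED
print(check_residual(20.0, 16.0))   # FAILED (flagged, though possibly correct)
print(check_residual(0.5, 0.0))     # FAILED
print(check_residual(1.0e6, -16.0)) # SKIPPED
```
</PRE></TT>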

<HR NOSHADE>
The remaining lines specify algorithmic features.
xhpl will run all possible combinations of those for each
problem size, block size, process grid combination. This is
handy when one looks for an "optimal" set of parameters. To
understand this a little better, let us first say a few words
about the algorithm implemented in HPL. Basically this is a
right-looking version with row-partial pivoting. The panel
factorization is matrix-matrix operation based and recursive,
dividing the panel into NDIV subpanels at each step. This
part of the panel factorization is denoted below by
"recursive panel fact. (RFACT)". The recursion stops when
the current panel is made of NBMIN columns or fewer.
At that point, xhpl uses a matrix-vector operation
based factorization denoted below by "PFACTs". Classic
recursion would then use NDIV=2, NBMIN=1. There are
essentially 3 numerically equivalent LU factorization
algorithm variants (left-looking, Crout and right-looking).
In HPL, one can choose any of those for the RFACT, as
well as the PFACT. The following lines of HPL.dat allow you
to set those parameters.<BR><BR>
<STRONG>Lines 14-21: (Example 1)</STRONG>
<TT><PRE>
3       # of panel fact
0 1 2   PFACTs (0=left, 1=Crout, 2=Right)
4       # of recursive stopping criterium
1 2 4 8 NBMINs (&gt;= 1)
3       # of panels in recursion
2 3 4   NDIVs
3       # of recursive panel fact.
0 1 2   RFACTs (0=left, 1=Crout, 2=Right)
</PRE></TT>

This example would try all variants of PFACT, 4 values for
NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV, namely 2, 3
and 4, and all variants for RFACT.<BR><BR>
<STRONG>Lines 14-21: (Example 2)</STRONG>
<TT><PRE>
2       # of panel fact
2 0     PFACTs (0=left, 1=Crout, 2=Right)
2       # of recursive stopping criterium
4 8     NBMINs (&gt;= 1)
1       # of panels in recursion
2       NDIVs
1       # of recursive panel fact.
2       RFACTs (0=left, 1=Crout, 2=Right)
</PRE></TT>
This example would try 2 variants of PFACT, namely right
looking and left looking, 2 values for NBMIN, namely 4 and 8,
1 value for NDIV, namely 2, and one variant for RFACT.<BR><BR>
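Since xhpl runs every combination of these parameters, the
number of runs grows quickly. The Python sketch below counts
the panel factorization combinations of Example 1; the order
in which xhpl actually iterates over them is not implied
here, and the variable names are only illustrative.<BR><BR>
<TT><PRE>
```python
from itertools import product

# Values taken from Example 1 of lines 14-21.
pfacts = [0, 1, 2]
nbmins = [1, 2, 4, 8]
ndivs = [2, 3, 4]
rfacts = [0, 1, 2]

combos = list(product(pfacts, nbmins, ndivs, rfacts))
print(len(combos))  # 108
print(combos[0])    # (0, 1, 2, 0)
```
</PRE></TT>
These 108 combinations are tried for every problem size,
block size and process grid, so Example 2, with only
2 x 2 x 1 x 1 = 4 combinations, is a far cheaper starting
point.<BR><BR>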

<HR NOSHADE>
In the main loop of the algorithm, the current panel of
columns is broadcast in process rows using a virtual ring
topology. HPL offers various choices, and one most likely
wants to use the increasing ring modified variant, encoded
as 1. Other variants are also good choices.<BR><BR>
<STRONG>Lines 22-23: (Example 1)</STRONG>
<TT><PRE>
1       # of broadcast
1       BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
</PRE></TT>
This will cause HPL to broadcast the current panel using the
increasing ring modified topology.<BR><BR>
<STRONG>Lines 22-23: (Example 2)</STRONG>
<TT><PRE>
2       # of broadcast
0 4     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
</PRE></TT>
This will cause HPL to broadcast the current panel using the
increasing ring virtual topology and the long message
algorithm.<BR><BR>

<HR NOSHADE>
<STRONG>Lines 24-25</STRONG> specify the look-ahead
depth used by HPL. A depth of 0 means that the next panel
is factorized after the update by the current panel is
completely finished. A depth of 1 means that the next
panel is immediately factorized after being updated. The
update by the current panel is then finished. A depth of k
means that the k next panels are factorized immediately after
being updated. The update by the current panel is then
finished. It turns out that a depth of 1 seems to give the
best results, but may need a large problem size before one
can see the performance gain. So use 1 if you do not know
better, otherwise you may want to try 0. Look-ahead
depths of 3 and larger will probably not give you better
results.<BR><BR>
<STRONG>Lines 24-25: (Example 1):</STRONG>
<TT><PRE>
1       # of lookahead depth
1       DEPTHs (&gt;=0)
</PRE></TT>
This will cause HPL to use a look-ahead of depth 1.<BR><BR>
<STRONG>Lines 24-25: (Example 2):</STRONG>
<TT><PRE>
2       # of lookahead depth
0 1     DEPTHs (&gt;=0)
</PRE></TT>
This will cause HPL to use look-ahead depths 0 and 1.<BR><BR>

<HR NOSHADE>
<STRONG>Lines 26-27</STRONG> specify the swapping algorithm
used by HPL for all tests. There are currently two
swapping algorithms available, one based on "binary
exchange" and the other one based on a "spread-roll"
procedure (also called "long" below). For large problem
sizes, this last one is likely to be more efficient. The user
can also choose to mix both variants, that is "binary-exchange"
for a number of columns less than a threshold value, and
then the "spread-roll" algorithm. This threshold value is
specified on Line 27.<BR><BR>
<STRONG>Lines 26-27: (Example 1):</STRONG>
<TT><PRE>
1       SWAP (0=bin-exch,1=long,2=mix)
60      swapping threshold
</PRE></TT>
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm. Note that a threshold is specified in
that example but not used by HPL.<BR><BR>
<STRONG>Lines 26-27: (Example 2):</STRONG>
<TT><PRE>
2       SWAP (0=bin-exch,1=long,2=mix)
60      swapping threshold
</PRE></TT>
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm as soon as there are more than 60 columns
in the row panel. Otherwise, the "binary-exchange" algorithm
will be used instead.<BR><BR>
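The selection rule for the three SWAP settings can be
sketched in Python as follows. This only restates the
description above and is not the HPL implementation; the
function name <TT>swap_algorithm</TT> and the handling of the
exact boundary (a column count equal to the threshold) are
assumptions.<BR><BR>
<TT><PRE>
```python
BIN_EXCH, LONG, MIX = 0, 1, 2  # encodings from line 26


def swap_algorithm(swap, threshold, columns):
    """Return the swapping variant used for a row panel of `columns` columns."""
    if swap == BIN_EXCH:
        return "binary-exchange"
    if swap == LONG:
        return "spread-roll"  # the threshold on line 27 is ignored
    if swap == MIX:
        # Assumption: "more than threshold" columns switches to spread-roll.
        return "spread-roll" if columns > threshold else "binary-exchange"
    raise ValueError(f"unknown SWAP value: {swap}")


print(swap_algorithm(LONG, 60, 10))   # spread-roll
print(swap_algorithm(MIX, 60, 120))   # spread-roll
print(swap_algorithm(MIX, 60, 40))    # binary-exchange
```
</PRE></TT>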

<HR NOSHADE>
<STRONG>Line 28</STRONG> specifies whether the upper
triangle of the panel of columns should be stored in
no-transposed or transposed form. Example:
<TT><PRE>
0       L1 in (0=transposed,1=no-transposed) form
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 29</STRONG> specifies whether the panel
of rows U should be stored in no-transposed or transposed
form. Example:
<TT><PRE>
0       U  in (0=transposed,1=no-transposed) form
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 30</STRONG> enables / disables the equilibration
phase. This option will not be used unless you selected 1 or
2 in Line 26. Example:
<TT><PRE>
1       Equilibration (0=no,1=yes)
</PRE></TT>

<HR NOSHADE>
<STRONG>Line 31</STRONG> specifies the alignment in
memory for the memory space allocated by HPL. On modern
machines, one probably wants to use 4, 8 or 16. This may
result in a tiny amount of memory wasted. Example:
<TT><PRE>
8       memory alignment in double (&gt; 0)
</PRE></TT>

<HR NOSHADE>
<H3><A NAME="tips">Guidelines</A></H3>

<OL>
<LI>Figure out a good block size for the matrix multiply
routine. The best method is to try a few out. If you happen
to know the block size used by the matrix-matrix multiply
routine, a small multiple of that block size will do fine.
This particular topic is discussed in the
<A HREF="faqs.html#blsize">FAQs</A> section.<BR><BR>

<LI>The process mapping should not matter if the nodes of
your platform are single processor computers. If these nodes
are multi-processors, a row-major mapping is recommended.<BR><BR>

<LI>HPL likes "square" or slightly flat process grids. Unless
you are using a very small process grid, stay away from the
1-by-Q and P-by-1 process grids. This particular topic is also
discussed in the <A HREF="faqs.html#grid">FAQs</A> section.<BR><BR>

<LI>Panel factorization parameters: a good starting point is
the following for lines 14-21:
<TT><PRE>
1       # of panel fact
1       PFACTs (0=left, 1=Crout, 2=Right)
2       # of recursive stopping criterium
4 8     NBMINs (&gt;= 1)
1       # of panels in recursion
2       NDIVs
1       # of recursive panel fact.
2       RFACTs (0=left, 1=Crout, 2=Right)
</PRE></TT>

<LI>Broadcast parameters: at this time it is far from obvious
to me what the best setting is, so I would probably try them
all. If I had to guess, I would probably start with the
following for lines 22-23:
<TT><PRE>
2       # of broadcast
1 3     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
</PRE></TT>
The best broadcast depends on your problem size and hardware
performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes compared to the
network.<BR><BR>

<LI>Look-ahead depth: as mentioned above, 0 or 1 are likely to
be the best choices. This also depends on the problem size
and machine configuration, so I would try "no look-ahead (0)"
and "look-ahead of depth 1 (1)". That is, for lines 24-25:
<TT><PRE>
2       # of lookahead depth
0 1     DEPTHs (&gt;=0)
</PRE></TT>

<LI>Swapping: one can select only one of the three algorithms
in the input file. Theoretically, mix (2) should win, however
long (1) might just be good enough. The difference should be
small between those two assuming a swapping threshold of the
order of the block size (NB) selected. If this threshold is
very large, HPL will use bin_exch (0) most of the time, and if
it is very small (&lt; NB), long (1) will always be used. In
short, and assuming the block size (NB) used is say 60, I
would choose for lines 26-27:
<TT><PRE>
2       SWAP (0=bin-exch,1=long,2=mix)
60      swapping threshold
</PRE></TT>
I would also try the long variant. For a very small number
of processes in every column of the process grid (say &lt; 4),
very little performance difference should be observable.<BR><BR>

<LI>Local storage: I do not think Line 28 matters. When in
doubt, pick 0. Line 29 is more important. It controls how the
panel of rows should be stored. No doubt 0 is better. The
caveat is that in that case the matrix-multiply function is
called with ( Notrans, Trans, ... ), that is C := C - A B^T.
Unless the computational kernel you are using has a very poor
(with respect to performance) implementation of that case, and
is much more efficient with ( Notrans, Notrans, ... ), just
pick 0 as well. So, my choice:
<TT><PRE>
0       L1 in (0=transposed,1=no-transposed) form
0       U  in (0=transposed,1=no-transposed) form
</PRE></TT>

<LI>Equilibration: It is hard to tell whether equilibration
should always be performed or not. Not knowing much about the
random matrix generated and because the overhead is so small
compared to the possible gain, I turn it on all the time.
<TT><PRE>
1       Equilibration (0=no,1=yes)
</PRE></TT>

<LI>For alignment, 4 should be plenty, but just to be safe,
one may want to pick 8 instead.
<TT><PRE>
8       memory alignment in double (&gt; 0)
</PRE></TT>
</OL>
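The starting values suggested in the guidelines can be
collected into a complete 31-line input file. The Python
sketch below is one way to do so, under several assumptions:
the function name <TT>build_hpl_dat</TT> is hypothetical, the
problem size, block size and grid passed in the usage line
are placeholders to be replaced by your own, the swapping
threshold is simply set to the first block size, and the
label and value used for line 9 (0 for the recommended
row-major mapping) follow the sample HPL.dat shipped with
HPL rather than this page.<BR><BR>
<TT><PRE>
```python
def build_hpl_dat(ns, nbs, ps, qs):
    """Return the text of a 31-line HPL.dat using the guideline values."""

    def row(value, label):
        # Lists become space-separated values; pad, then append the label.
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        return str(value).ljust(12) + " " + label

    lines = [
        "HPL Linpack benchmark input file",
        "Innovative Computing Laboratory, University of Tennessee",
        row("HPL.out", "output file name (if any)"),
        row(6, "device out (6=stdout,7=stderr,file)"),
        row(len(ns), "# of problems sizes (N)"),
        row(ns, "Ns"),
        row(len(nbs), "# of NBs"),
        row(nbs, "NBs"),
        row(0, "PMAP process mapping (0=Row-,1=Column-major)"),
        row(len(ps), "# of process grids (P x Q)"),
        row(ps, "Ps"),
        row(qs, "Qs"),
        row(16.0, "threshold"),
        row(1, "# of panel fact"),
        row(1, "PFACTs (0=left, 1=Crout, 2=Right)"),
        row(2, "# of recursive stopping criterium"),
        row([4, 8], "NBMINs (>= 1)"),
        row(1, "# of panels in recursion"),
        row(2, "NDIVs"),
        row(1, "# of recursive panel fact."),
        row(2, "RFACTs (0=left, 1=Crout, 2=Right)"),
        row(2, "# of broadcast"),
        row([1, 3], "BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)"),
        row(2, "# of lookahead depth"),
        row([0, 1], "DEPTHs (>=0)"),
        row(2, "SWAP (0=bin-exch,1=long,2=mix)"),
        row(nbs[0], "swapping threshold"),  # assumption: of the order of NB
        row(0, "L1 in (0=transposed,1=no-transposed) form"),
        row(0, "U  in (0=transposed,1=no-transposed) form"),
        row(1, "Equilibration (0=no,1=yes)"),
        row(8, "memory alignment in double (> 0)"),
    ]
    return "\n".join(lines) + "\n"


# Placeholder problem size, block size and 2-by-8 process grid.
text = build_hpl_dat(ns=[10000], nbs=[60], ps=[2], qs=[8])
print(text, end="")
```
</PRE></TT>
The resulting text can be written to HPL.dat next to the
xhpl executable; the guideline values are only a starting
point for further tuning.<BR><BR>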

<HR NOSHADE>
<CENTER>
<A HREF = "index.html">            [Home]</A>
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html">        [Algorithm]</A>
<A HREF = "scalability.html">      [Scalability]</A>
<A HREF = "results.html">          [Performance Results]</A>
<A HREF = "documentation.html">    [Documentation]</A>
<A HREF = "software.html">         [Software]</A>
<A HREF = "faqs.html">             [FAQs]</A>
<A HREF = "tuning.html">           [Tuning]</A>
<A HREF = "errata.html">           [Errata-Bugs]</A>
<A HREF = "references.html">       [References]</A>
<A HREF = "links.html">            [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>