==============================================================
Performance Tuning and setting up the input data file HPL.dat

Current as of release 2.0 - September 10, 2008
==============================================================
Check out the website www.netlib.org/benchmark/hpl for the
latest information.

After having built the executable hpl/bin/<arch>/xhpl, one
may want to modify the input data file HPL.dat. This file
should reside in the same directory as the executable
hpl/bin/<arch>/xhpl. An example HPL.dat file is provided by
default. This file contains information about the problem
sizes, machine configuration, and algorithm features to be
used by the executable. It is 30 lines long. All the selected
parameters will be printed in the output generated by the
executable.

At the end of this file, there are a few experimental
guidelines that you may find useful.

==============================================================
File HPL.dat (description):

Line 1: (unused) Typically one would use this line for one's
own purposes; for example, it could be used to summarize the
content of the input file. By default this line reads:

HPL Linpack benchmark input file

Line 2: (unused) same as line 1. By default this line reads:

Innovative Computing Laboratory, University of Tennessee

Line 3: the user can choose where the output should be
redirected to. In the case of a file, a name is necessary,
and this is the line where one wants to specify it. Only the
first name on this line is significant. By default, the line
reads:

HPL.out  output file name (if any)

This means that if one chooses to redirect the output to a
file, the file will be called "HPL.out". The rest of the line
is unused, and this space can hold an informative comment on
the meaning of this line.

Line 4: This line specifies where the output should go. The
line is formatted; it must begin with a positive integer, and
the rest is ignored. Three choices are possible for this
integer: 6 means that the output will go to the standard
output, 7 means that the output will go to the standard
error, and any other integer means that the output should be
redirected to a file, whose name has been specified in the
line above. This line by default reads:

6  device out (6=stdout,7=stderr,file)

which means that the output generated by the executable
should be redirected to the standard output.

Line 5: This line specifies the number of problem sizes to be
executed. This number should be less than or equal to 20. The
first integer is significant, the rest is ignored. If the
line reads:

3  # of problems sizes (N)

this means that the user is willing to run 3 problem sizes
that will be specified in the next line.

Line 6: This line specifies the problem sizes one wants to
run. Assuming the line above started with 3, the first 3
positive integers are significant, the rest is ignored. For
example:

3000 6000 10000  Ns

means that one wants xhpl to run 3 (specified in line 5)
problem sizes, namely 3000, 6000 and 10000.

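As a rough rule of thumb (an assumption on my part, not
stated in this file), the problem size N is usually chosen so
that the N x N matrix of doubles fills a large fraction of
the aggregate memory. A minimal sketch, where the 80%
fraction and the NB=128 default are illustrative assumptions:

```python
import math

def suggest_n(total_mem_bytes, fraction=0.80, nb=128):
    # The dense matrix needs 8 * N^2 bytes (double precision),
    # so N is roughly sqrt(fraction * memory / 8).
    n = int(math.sqrt(fraction * total_mem_bytes / 8))
    # Round down to a multiple of the block size NB.
    return (n // nb) * nb
```

For example, four nodes with 16 GiB each give
suggest_n(4 * 16 * 2**30) = 82816, a multiple of 128.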
Line 7: This line specifies the number of block sizes to be
run. This number should be less than or equal to 20. The
first integer is significant, the rest is ignored. If the
line reads:

5  # of NBs

this means that the user is willing to use 5 block sizes that
will be specified in the next line.

Line 8: This line specifies the block sizes one wants to run.
Assuming the line above started with 5, the first 5 positive
integers are significant, the rest is ignored. For example:

80 100 120 140 160  NBs

means that one wants xhpl to use 5 (specified in line 7)
block sizes, namely 80, 100, 120, 140 and 160.

Line 9: This line specifies how the MPI processes should be
mapped onto the nodes of your platform. There are currently
two possible mappings, namely row- and column-major. This
feature is mainly useful when these nodes are themselves
multi-processor computers. A row-major mapping is
recommended.

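To illustrate the two orderings (a sketch of the general
idea, not HPL's internals):

```python
def row_major(rank, p, q):
    # Consecutive ranks fill a process row first:
    # rank = row * q + col.
    return rank // q, rank % q

def column_major(rank, p, q):
    # Consecutive ranks fill a process column first:
    # rank = col * p + row.
    return rank % p, rank // p
```

On a 2 x 4 grid, ranks 0..3 form the first row under
row-major ordering, so consecutive ranks, which MPI often
places on the same node, end up in the same process row.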
Line 10: This line specifies the number of process grids to
be run. This number should be less than or equal to 20. The
first integer is significant, the rest is ignored. If the
line reads:

2  # of process grids (P x Q)

this means that you are willing to try 2 process grid sizes
that will be specified in the next two lines.

Lines 11-12: These two lines specify the number of process
rows and columns of each grid you want to run on. Assuming
line 10 started with 2, the first 2 positive integers of
those two lines are significant, the rest is ignored. For
example:

1 2  Ps
6 8  Qs

means that one wants to run xhpl on 2 process grids (line
10), namely 1 by 6 and 2 by 8. Note: in this example, it is
then required to start xhpl on at least 16 nodes (the max of
P_i x Q_i). The runs on the two grids will be consecutive. If
one were to start xhpl on more than 16 nodes, say 52, only 6
would be used for the first grid (1x6) and then 16 (2x8)
would be used for the second grid. The fact that you started
the MPI job on 52 nodes will not make HPL use all of them; in
this example, only 16 would be used. If one wants to run xhpl
with 52 processes, one needs to specify a grid of 52
processes; for example, the following lines would do the job:

4  2  Ps
13 8  Qs

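The process-count arithmetic above can be sketched as follows
(a hypothetical helper for illustration, not part of HPL):

```python
def processes_needed(ps, qs):
    # Each grid i uses ps[i] * qs[i] processes; the MPI job
    # must be started with at least the maximum of these.
    used = [p * q for p, q in zip(ps, qs)]
    return used, max(used)
```

For the example above, processes_needed([1, 2], [6, 8])
returns ([6, 16], 16): start at least 16 processes, and any
extras stay idle.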
Line 13: This line specifies the threshold the residuals
should be compared to. The residuals should be of order 1,
but are in practice slightly less than this, typically
0.001. This line is made of a real number, the rest is
ignored. For example:

16.0  threshold

In practice, a value of 16.0 will cover most cases. For
various reasons, it is possible that some of the residuals
become slightly larger, say for example 35.6. xhpl will flag
those runs as failed, however they can be considered as
correct. A run can be considered as failed if the residual is
a few orders of magnitude bigger than 1, for example 10^6 or
more. Note: if one were to specify a threshold of 0.0, all
tests would be flagged as failed, even though the answer is
likely to be correct. It is allowed to specify a negative
value for this threshold, in which case the checks will be
bypassed, no matter what the value is, as long as it is
negative. This feature allows one to save time when
performing a lot of experiments, say for instance during the
tuning phase. Example:

-16.0  threshold

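The accept/skip/fail logic described above can be sketched
like this (an illustration of the rules, not HPL's actual
source):

```python
def verdict(residual, threshold):
    # A negative threshold bypasses the check entirely.
    if threshold < 0.0:
        return "skipped"
    # Otherwise the run passes when the scaled residual is
    # at or below the threshold.
    return "passed" if residual <= threshold else "FAILED"
```

So verdict(0.001, 16.0) passes, verdict(35.6, 16.0) is
flagged FAILED even though the answer is probably fine, and
verdict(35.6, -16.0) skips the check.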
The remaining lines allow one to specify algorithmic
features. xhpl will run all possible combinations of those
for each problem size, block size and process grid
combination. This is handy when one looks for an "optimal"
set of parameters. To understand this a little better, let us
first say a few words about the algorithm implemented in HPL.
Basically it is a right-looking version with row-partial
pivoting. The panel factorization is matrix-matrix operation
based and recursive, dividing the panel into NDIV subpanels
at each step. This part of the panel factorization is denoted
below by "recursive panel fact. (RFACT)". The recursion stops
when the current panel is made of less than or equal to NBMIN
columns. At that point, xhpl uses a matrix-vector operation
based factorization denoted below by "PFACTs". Classic
recursion would then use NDIV=2, NBMIN=1. There are
essentially 3 numerically equivalent LU factorization
algorithm variants (left-looking, Crout and right-looking).
In HPL, one can choose any one of those for the RFACT, as
well as for the PFACT. The following lines of HPL.dat allow
you to set those parameters.

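The splitting structure of the recursion can be sketched as
follows (illustrative only; the updates between subpanels are
omitted):

```python
def rfact_structure(cols, ndiv, nbmin):
    # Recursion stops once the panel has NBMIN or fewer
    # columns; those leaves are handled by the matrix-vector
    # based PFACT.
    if cols <= nbmin:
        return [cols]
    # Split the panel into NDIV subpanels of near-equal width.
    base, rem = divmod(cols, ndiv)
    leaves = []
    for i in range(ndiv):
        leaves += rfact_structure(base + (i < rem), ndiv, nbmin)
    return leaves
```

Classic recursion (NDIV=2, NBMIN=1) on an 8-column panel
yields eight 1-column PFACT leaves.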
Lines 14-21: (Example 1)
3        # of panel fact
0 1 2    PFACTs (0=left, 1=Crout, 2=Right)
4        # of recursive stopping criterium
1 2 4 8  NBMINs (>= 1)
3        # of panels in recursion
2 3 4    NDIVs
3        # of recursive panel fact.
0 1 2    RFACTs (0=left, 1=Crout, 2=Right)

This example would try all variants of PFACT, 4 values for
NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV, namely 2, 3
and 4, and all variants for RFACT. Lines 14-21: (Example 2)

2        # of panel fact
2 0      PFACTs (0=left, 1=Crout, 2=Right)
2        # of recursive stopping criterium
4 8      NBMINs (>= 1)
1        # of panels in recursion
2        NDIVs
1        # of recursive panel fact.
2        RFACTs (0=left, 1=Crout, 2=Right)

This example would try 2 variants of PFACT, namely
right-looking and left-looking, 2 values for NBMIN, namely 4
and 8, 1 value for NDIV, namely 2, and one variant for RFACT.

In the main loop of the algorithm, the current panel of
columns is broadcast in process rows using a virtual ring
topology. HPL offers various choices, and one most likely
wants to use the increasing ring modified, encoded as 1. 4 is
also a good choice. Lines 22-23: (Example 1):

1  # of broadcast
1  BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

This will cause HPL to broadcast the current panel using the
increasing ring modified topology. Lines 22-23: (Example 2):

2    # of broadcast
0 4  BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

This will cause HPL to broadcast the current panel using the
increasing ring virtual topology and the long message
algorithm.

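As a sketch of the simplest variant, the plain increasing
ring (1rg) passes the panel around the process row one
neighbor at a time (an illustration of the idea, not HPL
code):

```python
def increasing_ring_order(root, nprocs):
    # The root sends to root+1, which forwards to root+2, and
    # so on around the ring; this is the order in which the
    # other processes receive the panel.
    return [(root + i) % nprocs for i in range(1, nprocs)]
```

For instance, increasing_ring_order(2, 4) gives [3, 0, 1].
The modified variants (1rM, 2rM) and the long-message
algorithms reorder or split these messages differently.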
Lines 24-25 allow one to specify the look-ahead depth used by
HPL. A depth of 0 means that the next panel is factorized
after the update by the current panel is completely finished.
A depth of 1 means that the next panel is factorized
immediately after being updated; the update by the current
panel is then finished. A depth of k means that the k next
panels are factorized immediately after being updated; the
update by the current panel is then finished. It turns out
that a depth of 1 seems to give the best results, but may
need a large problem size before one can see the performance
gain. So use 1 if you do not know better; otherwise you may
want to try 0. Look-aheads of depth 2 and larger will
probably not give you better results. Lines 24-25:
(Example 1):

1  # of lookahead depth
1  DEPTHs (>=0)

This will cause HPL to use a look-ahead of depth 1.
Lines 24-25: (Example 2):

2    # of lookahead depth
0 1  DEPTHs (>=0)

This will cause HPL to use look-aheads of depth 0 and 1.

Lines 26-27 allow one to specify the swapping algorithm used
by HPL for all tests. There are currently two swapping
algorithms available, one based on "binary exchange" and the
other one based on a "spread-roll" procedure (also called
"long" below). For large problem sizes, the latter is likely
to be more efficient. The user can also choose to mix both
variants, that is, "binary exchange" for a number of columns
less than a threshold value, and then the "spread-roll"
algorithm. This threshold value is then specified on Line 27.
Lines 26-27: (Example 1):

1   SWAP (0=bin-exch,1=long,2=mix)
60  swapping threshold

This will cause HPL to use the "long" or "spread-roll"
swapping algorithm. Note that a threshold is specified in
that example but not used by HPL. Lines 26-27: (Example 2):

2   SWAP (0=bin-exch,1=long,2=mix)
60  swapping threshold

This will cause HPL to use the "long" or "spread-roll"
swapping algorithm as soon as there are more than 60 columns
in the row panel. Otherwise, the "binary exchange" algorithm
will be used instead.

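The mix (2) selection rule described above can be sketched as
(illustrative only):

```python
def swap_algorithm(columns, threshold):
    # At or below the threshold use binary exchange; above
    # it, switch to the spread-roll ("long") procedure.
    return "binary-exchange" if columns <= threshold else "long"
```

With a threshold of 60, a 40-column panel is swapped by
binary exchange and a 61-column panel by spread-roll.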
Line 28 allows one to specify whether the upper triangle of
the panel of columns should be stored in non-transposed or
transposed form. Example:

0  L1 in (0=transposed,1=no-transposed) form

Line 29 allows one to specify whether the panel of rows U
should be stored in non-transposed or transposed form.
Example:

0  U  in (0=transposed,1=no-transposed) form

Line 30 enables/disables the equilibration phase. This option
will not be used unless 1 or 2 was selected in Line 26.
Example:

1  Equilibration (0=no,1=yes)

Line 31 allows one to specify the alignment in memory for the
memory space allocated by HPL. On modern machines, one
probably wants to use 4, 8 or 16. This may result in a tiny
amount of wasted memory. Example:

4  memory alignment in double (> 0)

==============================================================
Guidelines:

1) Figure out a good block size for the matrix-matrix
multiply routine. The best method is to try a few out. If you
happen to know the block size used by the matrix-matrix
multiply routine, a small multiple of that block size will do
fine.

HPL uses the block size NB for the data distribution as well
as for the computational granularity. From a data
distribution point of view, the smaller NB, the better the
load balance. You definitely want to stay away from very
large values of NB. From a computation point of view, too
small a value of NB may limit the computational performance
by a large factor because almost no data reuse will occur in
the highest level of the memory hierarchy. The number of
messages will also increase. Efficient matrix-multiply
routines are often internally blocked. Small multiples of
this blocking factor are likely to be good block sizes for
HPL. The bottom line is that "good" block sizes are almost
always in the [32..256] interval. The best values depend on
the computation / communication performance ratio of your
system. To a much lesser extent, the problem size matters as
well. Say for example you empirically found that 44 was a
good block size with respect to performance. Then 88 or 132
are likely to give slightly better results for large problem
sizes because of a slightly higher flop rate.

2) The process mapping should not matter if the nodes of your
platform are single-processor computers. If these nodes are
multi-processors, a row-major mapping is recommended.

3) HPL likes "square" or slightly flat process grids. Unless
you are using a very small process grid, stay away from the
1-by-Q and P-by-1 process grids.

4) Panel factorization parameters: a good starting point for
lines 14-21 is the following:

1    # of panel fact
1    PFACTs (0=left, 1=Crout, 2=Right)
2    # of recursive stopping criterium
4 8  NBMINs (>= 1)
1    # of panels in recursion
2    NDIVs
1    # of recursive panel fact.
2    RFACTs (0=left, 1=Crout, 2=Right)

5) Broadcast parameters: at this time, it is far from obvious
to me what the best setting is, so I would probably try them
all. If I had to guess, I would probably start with the
following for lines 22-23:

2    # of broadcast
1 3  BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

The best broadcast depends on your problem size and hardware
performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes compared to the network.

6) Look-ahead depth: as mentioned above, 0 or 1 are likely to
be the best choices. This also depends on the problem size
and machine configuration, so I would try "no look-ahead (0)"
and "look-ahead of depth 1 (1)". That is, for lines 24-25:

2    # of lookahead depth
0 1  DEPTHs (>=0)

7) Swapping: one can select only one of the three algorithms
in the input file. Theoretically, mix (2) should win, however
long (1) might just be good enough. The difference should be
small between those two assuming a swapping threshold of the
order of the block size (NB) selected. If this threshold is
very large, HPL will use bin-exch (0) most of the time, and
if it is very small (< NB), long (1) will always be used. In
short, and assuming the block size (NB) used is say 60, I
would choose the following for lines 26-27:

2   SWAP (0=bin-exch,1=long,2=mix)
60  swapping threshold

I would also try the long variant. For a very small number of
processes in every column of the process grid (say < 4), very
little performance difference should be observable.

8) Local storage: I do not think Line 28 matters. Pick 0 if
in doubt. Line 29 is more important. It controls how the
panel of rows should be stored. No doubt 0 is better. The
caveat is that in that case the matrix-multiply function is
called with ( Notrans, Trans, ... ), that is C := C - A B^T.
Unless the computational kernel you are using has a very poor
(with respect to performance) implementation of that case and
is much more efficient with ( Notrans, Notrans, ... ), just
pick 0 as well. So, my choice:

0  L1 in (0=transposed,1=no-transposed) form
0  U  in (0=transposed,1=no-transposed) form

9) Equilibration: it is hard to tell whether equilibration
should always be performed or not. Not knowing much about the
random matrix generated, and because the overhead is so small
compared to the possible gain, I turn it on all the time.

1  Equilibration (0=no,1=yes)

10) For alignment, 4 should be plenty, but just to be safe,
one may want to pick 8 instead.

8  memory alignment in double (> 0)

==============================================================