<HTML>
<HEAD>
<TITLE>HPL Scalability Analysis</TITLE>
</HEAD>

<BODY
 BGCOLOR    = "WHITE"
 BACKGROUND = "WHITE"
 TEXT       = "#000000"
 VLINK      = "#000099"
 ALINK      = "#947153"
 LINK       = "#0000ff">

<H2>HPL Scalability Analysis</H2>

The <A HREF = "scalability.html#model">machine model</A> used for the
analysis is described first. This crude model is then used to
estimate the parallel running time of the various phases of the
algorithm, namely
<UL>
<LI><A HREF="scalability.html#pfact">panel factorization and broadcast</A>,
<LI><A HREF="scalability.html#updat">trailing submatrix update</A>,
<LI><A HREF="scalability.html#backs">backward substitution</A>.
</UL>
Finally, <A HREF="scalability.html#total">the parallel efficiency</A>
of the entire algorithm is estimated according to this machine model.
We show that for a given set of parameters HPL is <STRONG>scalable</STRONG>
not only with respect to the amount of computation, but also with
respect to the communication volume.<BR><BR>
<HR NOSHADE>

<H3><A NAME="model">The Machine Model</A></H3>

Distributed-memory computers consist of processors that are connected
by a message-passing interconnection network. Each processor has
its own memory, called the local memory, which is accessible only to
that processor. As the time to access a remote memory is longer than
the time to access a local one, such computers are often referred to
as Non-Uniform Memory Access (NUMA) machines.<BR><BR>

The interconnection network of our machine model is static, meaning
that it consists of point-to-point communication links among
processors. This type of network is also referred to as a direct
network, as opposed to dynamic networks. The latter are constructed
from switches and communication links; these links are dynamically
connected to one another by the switching elements to establish, at
run time, the paths between processors' memories.<BR><BR>

The interconnection network of the two-dimensional machine model
considered here is a static, fully connected physical topology. It
is also assumed that the processors can be treated equally in terms
of local performance and that the communication rate between two
processors does not depend on the processors considered.<BR><BR>

Our model assumes that a processor can send or receive data on only
one of its communication ports at a time (assuming it has more than
one). In the literature, this assumption is also referred to as the
one-port communication model.<BR><BR>

The time spent to communicate a message between two given processors
is called the communication time Tc. In our machine model, Tc is
approximated by a linear function of the number L of double
precision (64-bit) items communicated. Tc is the sum of the time to
prepare the message for transmission (alpha) and the time (beta * L)
taken by the message of length L to traverse the network to its
destination, i.e.,<BR><BR>
<CENTER>
Tc = alpha + beta L.<BR><BR>
</CENTER>

Finally, the model assumes that the communication links are
bi-directional, that is, the time for two processors to send each
other a message of length L is also Tc. A processor can send and/or
receive a message on only one of its communication links at a time.
In particular, a processor can send a message while receiving another
message from the processor it is sending to at the same time.<BR><BR>
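
To make the machine model concrete, the communication cost Tc = alpha +
beta L translates directly into a few lines of C. The sketch below is an
illustration of the model only, not code from the HPL package, and the
values chosen for alpha and beta are hypothetical placeholders that would
have to be measured on the target machine.<BR><BR>
<PRE>
#include &lt;stdio.h&gt;

/*
 * Communication time model: Tc = alpha + beta * L, where L is the
 * number of double precision (64-bit) items communicated, alpha is
 * the message preparation time and beta the time per item.
 */
double Tc( double alpha, double beta, double L )
{
   return( alpha + beta * L );
}

int main( void )
{
   /* Hypothetical constants: 10 microseconds of latency and 8.0e-8
    * seconds per item (roughly 100 MB/s); measure these in practice. */
   double alpha = 1.0e-5, beta = 8.0e-8;

   printf( "Tc( 1000 items ) = %.6e seconds\n", Tc( alpha, beta, 1000.0 ) );
   return( 0 );
}
</PRE>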

Since this document is only concerned with regular local dense linear
algebra operations, the time taken to perform one floating point
operation is assumed to be summarized by three constants gam1,
gam2 and gam3. These quantities are approximations of the flop rates
of vector-vector, matrix-vector and matrix-matrix operations on each
processor. This very crude approximation summarizes all the steps
performed by a processor to achieve such a computation. Obviously,
such a model neglects all the phenomena occurring in the processor
components, such as cache misses, pipeline startups, memory loads and
stores, floating point arithmetic and so on, that may influence the
value of these constants as a function of the problem size, for
example.<BR><BR>

Similarly, the model does not make any assumption on the amount of
physical memory per node. It is assumed that if a process has been
spawned on a processor, enough memory is available on that
processor. In other words, swapping will not occur during the
modeled computation.<BR><BR>

<STRONG>
This machine model is a very crude approximation that is designed
specifically to illustrate the cost of the dominant factors of our
particular case.<BR><BR>
</STRONG>
<HR NOSHADE>

<H3><A NAME="pfact">Panel Factorization and Broadcast</A></H3>

Let us consider an M-by-N panel distributed over a P-process column.
Because of the recursive formulation of the panel factorization, it
is reasonable to consider that the floating point operations will
be performed at matrix-matrix multiply "speed". For every column in
the panel a binary-exchange is performed on 2*N data items. When this
panel is broadcast, what matters is the time that the next process
column will spend in this communication operation. Assuming one
chooses the <A HREF="algorithm.html#bcast">increasing-ring (modified)
variant</A>, only one message needs to be taken into account. The
execution time of the panel factorization and broadcast can thus be
approximated by:<BR><BR>
<CENTER>
Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + 2 beta N ) +
alpha + beta M N / P.<BR><BR>
</CENTER>
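
For reference, this estimate can be written as a small C function. The
sketch below is not part of the HPL software; it simply evaluates the
model formula, assuming that gam3 is expressed as time per floating point
operation (the inverse of the matrix-matrix multiply rate) and that
log(P) denotes the base-2 logarithm, consistent with the binary-exchange
performed on each column.<BR><BR>
<PRE>
#include &lt;math.h&gt;

/*
 * Model of the panel factorization and broadcast time of an M-by-N
 * panel distributed over a P-process column:
 *
 *    Tpfact( M, N ) = ( M/P - N/3 ) N^2 gam3
 *                   + N log2( P ) ( alpha + 2 beta N )
 *                   + alpha + beta M N / P.
 */
double Tpfact( double M, double N, double P,
               double alpha, double beta, double gam3 )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( P ) * ( alpha + 2.0 * beta * N ) +
           alpha + beta * M * N / P );
}
</PRE>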
<HR NOSHADE>

<H3><A NAME="updat">Trailing Submatrix Update</A></H3>

Let us consider the update phase of an N-by-N trailing submatrix
distributed on a P-by-Q process grid. From a computational point of
view, one has to (triangular) solve N right-hand sides and perform a
local rank-NB update of this trailing submatrix. Assuming one chooses
the <A HREF="algorithm.html#update">long variant</A>, the execution
time of the update operation can be approximated by:<BR><BR>
<CENTER>
Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) +
alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.<BR><BR>
</CENTER>
The constant "3" in front of the "beta" term is obtained by counting
one for the (logarithmic) spread phase and two for the rolling phase;
in the case of bi-directional links this constant 3 therefore reduces
to 2.<BR><BR>
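
Again for illustration only, a C sketch of this estimate is given below,
under the same assumptions as before (gam3 expressed as time per floating
point operation, log(P) taken as the base-2 logarithm).<BR><BR>
<PRE>
#include &lt;math.h&gt;

/*
 * Model of the update time of an N-by-N trailing submatrix
 * distributed on a P-by-Q process grid (long variant):
 *
 *    Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) )
 *                     + alpha ( log2( P ) + P - 1 ) + 3 beta N NB / Q.
 */
double Tupdate( double N, double NB, double P, double Q,
                double alpha, double beta, double gam3 )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}
</PRE>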
<HR NOSHADE>

<H3><A NAME="backs">Backward Substitution</A></H3>

The number of floating point operations performed during the backward
substitution is given by N^2 / (P*Q). Because of the lookahead, the
communication cost can be approximated at each step by two messages
of length NB, i.e., the time to communicate the NB-piece of the
solution vector from one diagonal block of the matrix to another. It
follows that the execution time of the backward substitution can be
approximated by:<BR><BR>
<CENTER>
Tbacks( N, NB ) = gam2 N^2 / (P Q) + N ( alpha / NB + 2 beta ).<BR><BR>
</CENTER>
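
A matching C sketch of this estimate follows; as above it is only an
illustration of the model formula, with gam2 assumed to be expressed as
time per floating point operation (the inverse of the matrix-vector
rate).<BR><BR>
<PRE>
/*
 * Model of the backward substitution time on a P-by-Q process grid:
 *
 *    Tbacks( N, NB ) = gam2 N^2 / ( P Q ) + N ( alpha / NB + 2 beta ).
 */
double Tbacks( double N, double NB, double P, double Q,
               double alpha, double beta, double gam2 )
{
   return( gam2 * N * N / ( P * Q ) +
           N * ( alpha / NB + 2.0 * beta ) );
}
</PRE>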
<HR NOSHADE>

<H3><A NAME="total">Putting it All Together</A></H3>

The total execution time of the algorithm described above is given by<BR><BR>
<CENTER>
Sum(k=0,N,NB)[ Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB ) ] +
Tbacks( N, NB ).<BR><BR>
</CENTER>
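
A C sketch of this summation is given below; it reuses the Tpfact,
Tupdate and Tbacks model functions of the previous sketches (declared
here as prototypes) and, like them, it only illustrates the model.<BR><BR>
<PRE>
/* Model functions from the sketches of the previous sections. */
double Tpfact ( double M, double N,  double P,
                double alpha, double beta, double gam3 );
double Tupdate( double N, double NB, double P, double Q,
                double alpha, double beta, double gam3 );
double Tbacks ( double N, double NB, double P, double Q,
                double alpha, double beta, double gam2 );

/*
 * Total execution time model:
 *
 *    Sum(k=0,N,NB)[ Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB ) ] +
 *    Tbacks( N, NB ),
 *
 * where N is assumed to be a multiple of NB.
 */
double Ttotal( double N, double NB, double P, double Q, double alpha,
               double beta, double gam2, double gam3 )
{
   double k, t = 0.0;

   for( k = 0.0; k < N; k += NB )      /* loop over the NB-wide panels */
   {
      t += Tpfact ( N - k,      NB, P,    alpha, beta, gam3 );
      t += Tupdate( N - k - NB, NB, P, Q, alpha, beta, gam3 );
   }
   return( t + Tbacks( N, NB, P, Q, alpha, beta, gam2 ) );
}
</PRE>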
That is, by considering only the dominant terms in alpha, beta and
gam3:<BR><BR>
<CENTER>
Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) +
alpha N ((NB + 1) log(P) + P) / NB.<BR><BR>
</CENTER>
The serial execution time is given by Tser = 2 gam3 N^3 / 3. If we
define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we
obtain:<BR><BR>
<CENTER>
E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) +
3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).<BR><BR>
</CENTER>
This last equality shows that when the memory usage per processor
N^2 / (P Q) is maintained constant, the parallel efficiency slowly
decreases only because of the alpha term. The communication volume
(the beta term), however, remains constant. For these reasons, HPL
is said to be <STRONG>scalable</STRONG> not only with respect to the
amount of computation, but also with respect to the communication
volume.<BR><BR>
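
The short C sketch below evaluates Thpl and the efficiency E for a given
set of parameters. The machine constants and problem sizes used in main()
are hypothetical placeholders, not measurements; they only show how the
two estimates above would be computed for a particular machine and
process grid.<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Dominant terms of the HPL execution time model. */
double Thpl( double N, double NB, double P, double Q,
             double alpha, double beta, double gam3 )
{
   return( 2.0 * gam3 * N * N * N / ( 3.0 * P * Q ) +
           beta * N * N * ( 3.0 * P + Q ) / ( 2.0 * P * Q ) +
           alpha * N * ( ( NB + 1.0 ) * log2( P ) + P ) / NB );
}

int main( void )
{
   /* Hypothetical machine parameters (assumptions, not measurements):
    * 2 Gflop/s per process, 10 us latency, 8.0e-8 s per item.        */
   double gam3 = 1.0 / 2.0e+9, alpha = 1.0e-5, beta = 8.0e-8;
   double N = 50000.0, NB = 128.0, P = 8.0, Q = 16.0;

   double tser = 2.0 * gam3 * N * N * N / 3.0;         /* serial time  */
   double thpl = Thpl( N, NB, P, Q, alpha, beta, gam3 );
   double e    = tser / ( P * Q * thpl );              /* efficiency E */

   printf( "Thpl = %.2f seconds, E = %.3f\n", thpl, e );
   return( 0 );
}
</PRE>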

<HR NOSHADE>
<CENTER>
<A HREF = "index.html"> [Home]</A>
<A HREF = "copyright.html"> [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html"> [Algorithm]</A>
<A HREF = "scalability.html"> [Scalability]</A>
<A HREF = "results.html"> [Performance Results]</A>
<A HREF = "documentation.html"> [Documentation]</A>
<A HREF = "software.html"> [Software]</A>
<A HREF = "faqs.html"> [FAQs]</A>
<A HREF = "tuning.html"> [Tuning]</A>
<A HREF = "errata.html"> [Errata-Bugs]</A>
<A HREF = "references.html"> [References]</A>
<A HREF = "links.html"> [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>