<HTML>
<HEAD>
<TITLE>HPL Scalability Analysis</TITLE>
</HEAD>

<BODY
BGCOLOR = "WHITE"
BACKGROUND = "WHITE"
TEXT = "#000000"
VLINK = "#000099"
ALINK = "#947153"
LINK = "#0000ff">

<H2>HPL Scalability Analysis</H2>

The <A HREF = "scalability.html#model">machine model</A> used for the
analysis is first described. This crude model is then used to
estimate the parallel running time of the various phases of the
algorithm, namely
<UL>
<LI><A HREF="scalability.html#pfact">panel factorization and broadcast</A>,
<LI><A HREF="scalability.html#updat">trailing submatrix update</A>,
<LI><A HREF="scalability.html#backs">backward substitution</A>.
</UL>
Finally, <A HREF="scalability.html#total">the parallel efficiency</A>
of the entire algorithm is estimated according to this machine model.
We show that for a given set of parameters HPL is <STRONG>scalable</STRONG>
not only with respect to the amount of computation, but also with
respect to the communication volume.<BR><BR>
<HR NOSHADE>

<H3><A NAME = "model">The Machine Model</A></H3>

Distributed-memory computers consist of processors that are connected
by a message-passing interconnection network. Each processor has
its own memory, called the local memory, which is accessible only to
that processor. As the time to access a remote memory is longer than
the time to access a local one, such computers are often referred to
as Non-Uniform Memory Access (NUMA) machines.<BR><BR>

The interconnection network of our machine model is static, meaning
that it consists of point-to-point communication links among
processors. This type of network is also referred to as a direct
network, as opposed to dynamic networks. The latter are constructed
from switches and communication links. These links are dynamically
connected to one another by the switching elements to establish, at
run time, the paths between processors' memories.<BR><BR>

The interconnection network of the two-dimensional machine model
considered here is a static, fully connected physical topology. It
is also assumed that processors can be treated equally in terms
of local performance and that the communication rate between two
processors does not depend on the processors considered.<BR><BR>

Our model assumes that a processor can send or receive data on only
one of its communication ports at a time (assuming it has more than
one). In the literature, this assumption is also referred to as the
one-port communication model.<BR><BR>

The time spent to communicate a message between two given processors
is called the communication time Tc. In our machine model, Tc is
approximated by a linear function of the number L of double
precision (64-bit) items communicated. Tc is the sum of the time to
prepare the message for transmission (alpha) and the time (beta * L)
taken by the message of length L to traverse the network to its
destination, i.e.,<BR><BR>
<CENTER>
Tc = alpha + beta L.<BR><BR>
</CENTER>
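As a minimal sketch only, this linear cost model can be evaluated numerically; the default alpha and beta values below are illustrative placeholders, not measured machine constants:<BR><BR>

```python
def comm_time(L, alpha=1.0e-5, beta=1.0e-8):
    """Communication time Tc = alpha + beta * L for L double precision items.

    alpha -- message preparation (startup) time in seconds, placeholder value
    beta  -- per-item transfer time in seconds, placeholder value
    """
    return alpha + beta * L

# The startup cost alpha dominates short messages, while the
# bandwidth term beta * L dominates long ones.
print(comm_time(8))        # short message: essentially alpha
print(comm_time(1000000))  # long message: essentially beta * L
```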

Finally, the model assumes that the communication links are
bi-directional, that is, the time for two processors to send each
other a message of length L is also Tc. A processor can send and/or
receive a message on only one of its communication links at a time.
In particular, a processor can send a message while receiving another
message from the processor it is sending to at the same time.<BR><BR>

Since this document is only concerned with regular local dense linear
algebra operations, the time taken to perform one floating point
operation is assumed to be summarized by three constants gam1,
gam2 and gam3. These quantities approximate the flop rates of the
vector-vector, matrix-vector and matrix-matrix operations for each
processor. This very crude approximation summarizes all the steps
performed by a processor to achieve such a computation. Obviously,
such a model neglects all the phenomena occurring in the processor
components, such as cache misses, pipeline startups, memory loads or
stores, floating point arithmetic and so on, that may influence the
value of these constants as a function of the problem size, for
example.<BR><BR>

Similarly, the model does not make any assumption on the amount of
physical memory per node. It is assumed that if a process has been
spawned on a processor, enough memory was available on that
processor. In other words, swapping will not occur during the
modeled computation.<BR><BR>
<STRONG>
This machine model is a very crude approximation that is designed
specifically to illustrate the cost of the dominant factors of our
particular case.<BR><BR>
</STRONG>

<HR NOSHADE>

<H3><A NAME = "pfact">Panel Factorization and Broadcast</A></H3>

Consider an M-by-N panel distributed over a P-process column.
Because of the recursive formulation of the panel factorization, it
is reasonable to consider that the floating point operations will
be performed at matrix-matrix multiply "speed". For every column in
the panel, a binary exchange is performed on 2*N data items. When this
panel is broadcast, what matters is the time that the next process
column will spend in this communication operation. Assuming one
chooses the <A HREF="algorithm.html#bcast">increasing-ring (modified)
variant</A>, only one message needs to be taken into account. The
execution time of the panel factorization and broadcast can thus be
approximated by:<BR><BR>
<CENTER>
Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + beta 2 N ) +
alpha + beta M N / P.<BR><BR>
</CENTER>
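This estimate can be sketched as a small helper; taking the logarithm base 2 is an assumption here, consistent with the binary exchange:<BR><BR>

```python
from math import log2

def t_pfact(M, N, P, alpha, beta, gam3):
    """Crude model of the panel factorization and broadcast time:
    Tpfact(M,N) = (M/P - N/3) N^2 gam3          (factorization flops)
                + N log(P) (alpha + 2 beta N)   (binary exchange per column)
                + alpha + beta M N / P          (one broadcast message)
    """
    return ((M / P - N / 3.0) * N**2 * gam3
            + N * log2(P) * (alpha + 2.0 * beta * N)
            + alpha + beta * M * N / P)
```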
<HR NOSHADE>

<H3><A NAME = "updat">Trailing Submatrix Update</A></H3>

Consider the update phase of an N-by-N trailing submatrix
distributed on a P-by-Q process grid. From a computational point of
view, one has to (triangular) solve N right-hand sides and perform a
local rank-NB update of this trailing submatrix. Assuming one chooses
the <A HREF="algorithm.html#update">long variant</A>, the execution
time of the update operation can be approximated by:<BR><BR>
<CENTER>
Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) +
alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.<BR><BR>
</CENTER>
The constant "3" in front of the "beta" term is obtained by counting
one for the (logarithmic) spread phase and two for the rolling phase;
in the case of bi-directional links this constant 3 should therefore
be reduced to 2.<BR><BR>
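The update estimate can likewise be sketched numerically; as above, the base-2 logarithm is an assumption of this sketch:<BR><BR>

```python
from math import log2

def t_update(N, NB, P, Q, alpha, beta, gam3):
    """Crude model of the trailing submatrix update time:
    Tupdate(N,NB) = gam3 (N NB^2 / Q + 2 N^2 NB / (P Q))   (solve + rank-NB update)
                  + alpha (log(P) + P - 1)                 (spread + roll startups)
                  + 3 beta N NB / Q                        (spread + roll volume)
    """
    return (gam3 * (N * NB**2 / Q + 2.0 * N**2 * NB / (P * Q))
            + alpha * (log2(P) + P - 1)
            + 3.0 * beta * N * NB / Q)
```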
<HR NOSHADE>

<H3><A NAME = "backs">Backward Substitution</A></H3>

The number of floating point operations performed during the backward
substitution is given by N^2 / (P*Q). Because of the lookahead, the
communication cost can be approximated at each step by two messages
of length NB, i.e., the time to communicate the NB-piece of the
solution vector from one diagonal block of the matrix to another. It
follows that the execution time of the backward substitution can be
approximated by:<BR><BR>
<CENTER>
Tbacks( N, NB ) = gam2 N^2 / (P Q) + N ( alpha / NB + 2 beta ).<BR><BR>
</CENTER>
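As a sketch, the backward substitution estimate translates directly into a one-line helper:<BR><BR>

```python
def t_backs(N, NB, P, Q, alpha, beta, gam2):
    """Crude model of the backward substitution time:
    Tbacks(N,NB) = gam2 N^2 / (P Q)        (local flops at matrix-vector rate)
                 + N (alpha / NB + 2 beta) (two NB-length messages per step)
    """
    return gam2 * N**2 / (P * Q) + N * (alpha / NB + 2.0 * beta)
```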
<HR NOSHADE>

<H3><A NAME = "total">Putting it All Together</A></H3>

The total execution time of the algorithm described above is given by<BR><BR>
<CENTER>
Sum(k=0,N,NB)[Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB )] +
Tbacks( N, NB ).<BR><BR>
</CENTER>
That is, by considering only the dominant terms in alpha, beta and
gam3:<BR><BR>
<CENTER>
Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) +
alpha N ((NB + 1) log(P) + P) / NB.<BR><BR>
</CENTER>

The serial execution time is given by Tser = 2 gam3 N^3 / 3. If we
define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we
obtain:<BR><BR>
<CENTER>
E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) +
3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).<BR><BR>
</CENTER>

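As an illustrative sketch, this closed form can be evaluated for growing grids while keeping the memory per process constant; the alpha, beta and gam3 values below are placeholders, and the base-2 logarithm is an assumption:<BR><BR>

```python
from math import log2

def efficiency(N, NB, P, Q, alpha, beta, gam3):
    """Parallel efficiency E = Tser / (P Q Thpl) in closed form:
    E = 1 / ( 1 + 3 beta (3P + Q) / (4 gam3 N)
                + 3 alpha P Q ((NB+1) log(P) + P) / (2 N^2 NB gam3) )
    """
    return 1.0 / (1.0
                  + 3.0 * beta * (3 * P + Q) / (4.0 * gam3 * N)
                  + 3.0 * alpha * P * Q * ((NB + 1) * log2(P) + P)
                  / (2.0 * N**2 * NB * gam3))

# Constant memory per process N^2/(P Q): on a grid x grid mesh,
# scale N like sqrt(P Q), i.e. proportionally to grid.
for grid in (2, 4, 8):
    N = 10000 * grid
    print(grid, efficiency(N, 64, grid, grid, 1.0e-5, 1.0e-8, 1.0e-9))
```

The loop illustrates the claim below: the beta contribution stays fixed under this scaling, and only the alpha term slowly erodes the efficiency.<BR><BR>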
This last equality shows that when the memory usage per processor
N^2 / (P Q) is maintained constant, the parallel efficiency slowly
decreases only because of the alpha term. The communication volume
(the beta term) however remains constant. Due to these results, HPL
is said to be <STRONG>scalable</STRONG> not only with respect to the
amount of computation, but also with respect to the communication
volume.<BR><BR>

<HR NOSHADE>
<CENTER>
<A HREF = "index.html"> [Home]</A>
<A HREF = "copyright.html"> [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html"> [Algorithm]</A>
<A HREF = "scalability.html"> [Scalability]</A>
<A HREF = "results.html"> [Performance Results]</A>
<A HREF = "documentation.html"> [Documentation]</A>
<A HREF = "software.html"> [Software]</A>
<A HREF = "faqs.html"> [FAQs]</A>
<A HREF = "tuning.html"> [Tuning]</A>
<A HREF = "errata.html"> [Errata-Bugs]</A>
<A HREF = "references.html"> [References]</A>
<A HREF = "links.html"> [Related Links]</A><BR>
</CENTER>

<HR NOSHADE>
</BODY>
</HTML>