<HTML>
<HEAD>
<TITLE>HPL Scalability Analysis</TITLE>
</HEAD>

<BODY
BGCOLOR     = "WHITE"
BACKGROUND  = "WHITE"
TEXT        = "#000000"
VLINK       = "#000099"
ALINK       = "#947153"
LINK        = "#0000ff">

<H2>HPL Scalability Analysis</H2>
The <A HREF = "scalability.html#model">machine model</A> used for the
analysis is first described.  This crude model is then used to
estimate the parallel running time of the various phases of the
algorithm, namely
<UL>
<LI><A HREF="scalability.html#pfact">panel factorization and broadcast</A>,
<LI><A HREF="scalability.html#updat">trailing submatrix update</A>,
<LI><A HREF="scalability.html#backs">backward substitution</A>.
</UL>
Finally, <A HREF="scalability.html#total">the parallel efficiency</A>
of the entire algorithm is estimated according to this machine model.
We show that, for a given set of parameters, HPL is <STRONG>scalable</STRONG>
not only with respect to the amount of computation, but also with
respect to the communication volume.<BR><BR>
<HR NOSHADE>

<H3><A NAME = "model">The Machine Model</A></H3>
Distributed-memory computers consist of processors that are connected
using a message-passing interconnection network.  Each processor has
its own memory, called the local memory, which is accessible only to
that processor.  As the time to access a remote memory is longer than
the time to access a local one, such computers are often referred to
as Non-Uniform Memory Access (NUMA) machines.<BR><BR>

The interconnection network of our machine model is static, meaning
that it consists of point-to-point communication links among
processors.  This type of network is also referred to as a direct
network, as opposed to dynamic networks.  The latter are constructed
from switches and communication links.  These links are dynamically
connected to one another by the switching elements to establish, at
run time, the paths between processor memories.<BR><BR>

The interconnection network of the two-dimensional machine model
considered here is a static, fully connected physical topology.  It
is also assumed that processors can be treated equally in terms of
local performance and that the communication rate between two
processors does not depend on the particular pair of processors
considered.<BR><BR>
Our model assumes that a processor can send or receive data on only
one of its communication ports at a time (assuming it has more than
one).  In the literature, this assumption is also referred to as the
one-port communication model.<BR><BR>

The time spent to communicate a message between two given processors
is called the communication time Tc.  In our machine model, Tc is
approximated by a linear function of the number L of double
precision (64-bit) items communicated.  Tc is the sum of the time to
prepare the message for transmission (alpha) and the time (beta * L)
taken by the message of length L to traverse the network to its
destination, i.e.,<BR><BR>
<CENTER>
Tc = alpha + beta L.<BR><BR>
</CENTER>
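As a concrete illustration, the short C sketch below evaluates Tc under
this model; the function name Tc and the values chosen for alpha and
beta are illustrative assumptions made for the example only, not part
of HPL itself.<BR><BR>
<PRE>
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants:                         */
/* alpha = message latency in seconds,                                */
/* beta  = time per 64-bit item in seconds.                           */
static const double alpha = 1.0e-5;
static const double beta  = 2.0e-9;

/* Communication time for a message of L double precision items.      */
static double Tc( double L ) { return( alpha + beta * L ); }

int main( void )
{
   printf( "Tc( 1000 items )    = %g s\n", Tc( 1000.0 ) );
   printf( "Tc( 1000000 items ) = %g s\n", Tc( 1.0e6 ) );
   return( 0 );
}
</PRE>
<BR>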
Finally, the model assumes that the communication links are
bi-directional, that is, the time for two processors to send each
other a message of length L is also Tc.  A processor can send and/or
receive a message on only one of its communication links at a time.
In particular, a processor can send a message while receiving another
message from the processor it is sending to.<BR><BR>

Since this document is only concerned with regular local dense linear
algebra operations, the time taken to perform one floating point
operation is assumed to be summarized by three constants gam1,
gam2 and gam3.  These quantities approximate the inverse flop rates
(time per floating point operation) of the vector-vector,
matrix-vector and matrix-matrix operations on each processor.  This
very crude approximation summarizes all the steps performed by a
processor to achieve such a computation.  Obviously, such a model
neglects all the phenomena occurring in the processor components,
such as cache misses, pipeline startups, memory loads and stores,
floating point arithmetic and so on, that may influence the value of
these constants as a function of the problem size, for example.<BR><BR>

Similarly, the model does not make any assumption on the amount of
physical memory per node.  It is assumed that if a process has been
spawned on a processor, one has ensured that enough memory was
available on that processor.  In other words, swapping will not occur
during the modeled computation.<BR><BR>

<STRONG>
This machine model is a very crude approximation that is designed
specifically to illustrate the cost of the dominant factors of our
particular case.<BR><BR>
</STRONG>
<HR NOSHADE>

<H3><A NAME = "pfact">Panel Factorization and Broadcast</A></H3>
Consider an M-by-N panel distributed over a P-process column.
Because of the recursive formulation of the panel factorization, it
is reasonable to consider that the floating point operations will
be performed at matrix-matrix multiply "speed".  For every column in
the panel a binary-exchange is performed on 2*N data items.  When this
panel is broadcast, what matters is the time that the next process
column will spend in this communication operation.  Assuming one
chooses the <A HREF="algorithm.html#bcast">increasing-ring (modified)
variant</A>, only one message needs to be taken into account.  The
execution time of the panel factorization and broadcast can thus be
approximated by:<BR><BR>
<CENTER>
Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + beta 2 N ) +
alpha + beta M N / P.<BR><BR>
</CENTER>
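The sketch below encodes this estimate as a hypothetical C function
Tpfact; the constants alpha, beta and gam3, the choice of a base-2
logarithm, and the panel dimensions used in main are illustrative
assumptions made for the example only.<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) constants: latency, per-item time and       */
/* time per flop at matrix-matrix multiply "speed".                   */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam3 = 2.5e-10;

/* Estimated time of the factorization and broadcast of an M-by-N     */
/* panel distributed over a P-process column (log taken base 2 here). */
static double Tpfact( double M, double N, double P )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( P ) * ( alpha + beta * 2.0 * N ) +
           alpha + beta * M * N / P );
}

int main( void )
{
   printf( "Tpfact = %g s\n", Tpfact( 10000.0, 64.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>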
<HR NOSHADE>

<H3><A NAME = "updat">Trailing Submatrix Update</A></H3>
Consider the update phase of an N-by-N trailing submatrix
distributed on a P-by-Q process grid.  From a computational point of
view one has to (triangular) solve N right-hand sides and perform a
local rank-NB update of this trailing submatrix.  Assuming one chooses
the <A HREF="algorithm.html#update">long variant</A>, the execution
time of the update operation can be approximated by:<BR><BR>
<CENTER>
Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) +
alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.<BR><BR>
</CENTER>
The constant "3" in front of the "beta" term is obtained by counting
one for the (logarithmic) spread phase and two for the rolling phase;
in the case of bi-directional links this constant 3 should therefore
be only a 2.<BR><BR>
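A corresponding hypothetical C sketch of this estimate, using the same
illustrative constants and grid sizes as before, could read:<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam3 = 2.5e-10;

/* Estimated time of the rank-NB update of an N-by-N trailing          */
/* submatrix distributed on a P-by-Q grid (long variant; the factor 3  */
/* on the beta term follows the text above).                           */
static double Tupdate( double N, double NB, double P, double Q )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}

int main( void )
{
   printf( "Tupdate = %g s\n", Tupdate( 10000.0, 64.0, 4.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>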
<HR NOSHADE>

<H3><A NAME = "backs">Backward Substitution</A></H3>
The number of floating point operations performed during the backward
substitution is given by N^2 / (P Q).  Because of the lookahead, the
communication cost can be approximated at each step by two messages
of length NB, i.e., the time to communicate the NB-piece of the
solution vector from one diagonal block of the matrix to another.  It
follows that the execution time of the backward substitution can be
approximated by:<BR><BR>
<CENTER>
Tbacks( N, NB ) = gam2 N^2 / (P Q) + N ( alpha / NB + 2 beta ).<BR><BR>
</CENTER>
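The same kind of hypothetical sketch applies here; gam2, the
illustrative matrix-vector time per flop, and the problem sizes are
assumed values for the example only.<BR><BR>
<PRE>
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam2 = 2.0e-9;

/* Estimated time of the backward substitution for an N-by-N system    */
/* on a P-by-Q grid with blocking factor NB.                           */
static double Tbacks( double N, double NB, double P, double Q )
{
   return( gam2 * N * N / ( P * Q ) + N * ( alpha / NB + 2.0 * beta ) );
}

int main( void )
{
   printf( "Tbacks = %g s\n", Tbacks( 100000.0, 64.0, 4.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>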
<HR NOSHADE>

<H3><A NAME = "total">Putting it All Together</A></H3>
The total execution time of the algorithm described above is given by<BR><BR>
<CENTER>
Sum(k=0,N,NB)[Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB )] +
Tbacks( N, NB ).<BR><BR>
</CENTER>
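Putting the previous sketches together, this sum can be evaluated
directly by iterating over the NB-wide steps, as in the self-contained,
purely illustrative C program below (all constants and sizes are
assumed values, not measurements):<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9;
static const double gam2  = 2.0e-9, gam3 = 2.5e-10;

static double Tpfact( double M, double N, double P )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( P ) * ( alpha + beta * 2.0 * N ) +
           alpha + beta * M * N / P );
}

static double Tupdate( double N, double NB, double P, double Q )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}

static double Tbacks( double N, double NB, double P, double Q )
{
   return( gam2 * N * N / ( P * Q ) + N * ( alpha / NB + 2.0 * beta ) );
}

/* Total estimated time: sum over the NB-wide iterations of the panel  */
/* factorization / broadcast and update costs, plus one backward       */
/* substitution.                                                       */
static double Thpl_sum( double N, double NB, double P, double Q )
{
   double t = 0.0, k;
   for( k = 0.0; k &lt; N; k += NB )
      t += Tpfact( N - k, NB, P ) + Tupdate( N - k - NB, NB, P, Q );
   return( t + Tbacks( N, NB, P, Q ) );
}

int main( void )
{
   printf( "Thpl(sum) = %g s\n", Thpl_sum( 20000.0, 64.0, 4.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>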
That is, by considering only the dominant terms in alpha, beta and
gam3:<BR><BR>
<CENTER>
Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) +
alpha N ((NB + 1) log(P) + P) / NB.<BR><BR>
</CENTER>
The serial execution time is given by Tser = 2 gam3 N^3 / 3.  If we
define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we
obtain:<BR><BR>
<CENTER>
E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) +
3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).<BR><BR>
</CENTER>
This last equality shows that when the memory usage per processor
N^2 / (P Q) is maintained constant, the parallel efficiency slowly
decreases only because of the alpha term.  The communication volume
(the beta term) however remains constant.  Due to these results, HPL
is said to be <STRONG>scalable</STRONG> not only with respect to the
amount of computation, but also with respect to the communication
volume.<BR><BR>
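This behaviour can be checked numerically against the closed-form
expression for E above.  In the hypothetical sketch below, the memory
usage per processor N^2 / (P Q) is held constant while the process
grid grows; the machine constants, the blocking factor and the memory
size are illustrative assumptions only:<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam3 = 2.5e-10;

/* Parallel efficiency from the closed-form expression above.          */
static double E( double N, double NB, double P, double Q )
{
   return( 1.0 / ( 1.0 +
           3.0 * beta * ( 3.0 * P + Q ) / ( 4.0 * gam3 * N ) +
           3.0 * alpha * P * Q * ( ( NB + 1.0 ) * log2( P ) + P ) /
           ( 2.0 * N * N * NB * gam3 ) ) );
}

int main( void )
{
   const double NB = 64.0, mem = 2.5e7; /* N^2 / (P Q) held constant   */
   double       g;

   for( g = 2.0; g &lt;= 32.0; g *= 2.0 ) /* square g-by-g process grids  */
   {
      double N = sqrt( mem * g * g );
      printf( "%2.0fx%-2.0f grid: N = %8.0f  E = %5.3f\n",
              g, g, N, E( N, NB, g, g ) );
   }
   return( 0 );
}
</PRE>
In this experiment the beta term stays constant and the slow decrease
of E with the grid size comes from the alpha term only, as stated
above.<BR><BR>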
<HR NOSHADE>
<CENTER>
<A HREF = "index.html">            [Home]</A>
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html">        [Algorithm]</A>
<A HREF = "scalability.html">      [Scalability]</A>
<A HREF = "results.html">          [Performance Results]</A>
<A HREF = "documentation.html">    [Documentation]</A>
<A HREF = "software.html">         [Software]</A>
<A HREF = "faqs.html">             [FAQs]</A>
<A HREF = "tuning.html">           [Tuning]</A>
<A HREF = "errata.html">           [Errata-Bugs]</A>
<A HREF = "references.html">       [References]</A>
<A HREF = "links.html">            [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>