<HTML>
<HEAD>
<TITLE>HPL Scalability Analysis</TITLE>
</HEAD>

<BODY
BGCOLOR     = "WHITE"
BACKGROUND  = "WHITE"
TEXT        = "#000000"
VLINK       = "#000099"
ALINK       = "#947153"
LINK        = "#0000ff">

<H2>HPL Scalability Analysis</H2>

The <A HREF = "scalability.html#model">machine model</A> used for the
analysis is first described.  This crude model is then used to
estimate the parallel running time of the various phases of the
algorithm, namely
<UL>
<LI><A HREF="scalability.html#pfact">panel factorization and broadcast</A>,
<LI><A HREF="scalability.html#updat">trailing submatrix update</A>,
<LI><A HREF="scalability.html#backs">backward substitution</A>.
</UL>
Finally, <A HREF="scalability.html#total">the parallel efficiency</A>
of the entire algorithm is estimated according to this machine model.
We show that for a given set of parameters HPL is <STRONG>scalable</STRONG>
not only with respect to the amount of computation, but also with
respect to the communication volume.<BR><BR>
<HR NOSHADE>

<H3><A NAME = "model">The Machine Model</A></H3>

Distributed-memory computers consist of processors that are connected
using a message passing interconnection network.  Each processor has
its own memory, called the local memory, which is accessible only to
that processor.  As the time to access a remote memory is longer than
the time to access a local one, such computers are often referred to
as Non-Uniform Memory Access (NUMA) machines.<BR><BR>

The interconnection network of our machine model is static, meaning
that it consists of point-to-point communication links among
processors.  This type of network is also referred to as a direct
network, as opposed to dynamic networks.  The latter are constructed
from switches and communication links.  These links are dynamically
connected to one another by the switching elements to establish, at
run time, the paths between processors' memories.<BR><BR>

The interconnection network of the two-dimensional machine model
considered here is a static, fully connected physical topology.  It
is also assumed that processors can be treated equally in terms
of local performance and that the communication rate between two
processors does not depend on the processors considered.<BR><BR>

Our model assumes that a processor can send or receive data on only
one of its communication ports at a time (assuming it has more than
one).  In the literature, this assumption is also referred to as the
one-port communication model.<BR><BR>

The time spent to communicate a message between two given processors
is called the communication time Tc.  In our machine model, Tc is
approximated by a linear function of the number L of double
precision (64-bit) items communicated.  Tc is the sum of the time to
prepare the message for transmission (alpha) and the time (beta * L)
taken by the message of length L to traverse the network to its
destination, i.e.,<BR><BR>
<CENTER>
Tc = alpha + beta L.<BR><BR>
</CENTER>

Finally, the model assumes that the communication links are
bi-directional, that is, the time for two processors to send each
other a message of length L is also Tc.  A processor can send and/or
receive a message on only one of its communication links at a time.
In particular, a processor can send a message to another processor
while simultaneously receiving a message from it.<BR><BR>
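
For illustration only, this communication cost can be written as a
small C function.  This sketch is not part of HPL; the function and
parameter names are ours:<BR><BR>
<PRE>
/*
 * Illustrative sketch (not part of HPL): communication time of the
 * machine model for a message of L double precision (64-bit) items.
 * alpha is the message preparation time, beta the transfer time per
 * item.
 */
double Tc( double L, double alpha, double beta )
{
   return( alpha + beta * L );
}
</PRE>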

Since this document is only concerned with regular local dense linear
algebra operations, the time taken to perform one floating point
operation is assumed to be summarized by three constants gam1,
gam2 and gam3.  These quantities are flop rate approximations of the
vector-vector, matrix-vector and matrix-matrix operations for each
processor.  This very crude approximation summarizes all the steps
performed by a processor to achieve such a computation.  Obviously,
such a model neglects all the phenomena occurring in the processor
components, such as cache misses, pipeline startups, memory loads
and stores, floating point arithmetic and so on, that may influence
the value of these constants as a function of the problem size, for
example.<BR><BR>

Similarly, the model does not make any assumption on the amount of
physical memory per node.  It is assumed that if a process has been
spawned on a processor, enough memory is available on that processor.
In other words, swapping will not occur during the modeled
computation.<BR><BR>

<STRONG>
This machine model is a very crude approximation that is designed
specifically to illustrate the cost of the dominant factors of our
particular case.<BR><BR>
</STRONG>
<HR NOSHADE>

<H3><A NAME = "pfact">Panel Factorization and Broadcast</A></H3>

Let us consider an M-by-N panel distributed over a P-process column.
Because of the recursive formulation of the panel factorization, it
is reasonable to consider that the floating point operations will
be performed at matrix-matrix multiply "speed".  For every column in
the panel a binary-exchange is performed on 2*N data items.  When this
panel is broadcast, what matters is the time that the next process
column will spend in this communication operation.  Assuming one
chooses the <A HREF="algorithm.html#bcast">increasing-ring (modified)
variant</A>, only one message needs to be taken into account.  The
execution time of the panel factorization and broadcast can thus be
approximated by:<BR><BR>
<CENTER>
Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + beta 2 N ) +
alpha + beta M N / P.<BR><BR>
</CENTER>
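
As an illustration only, this cost function can be sketched in C as
follows.  This code is not part of HPL; the names are ours, and the
base-2 logarithm is assumed for log(P):<BR><BR>
<PRE>
#include &lt;math.h&gt;

/*
 * Illustrative sketch (not part of HPL): time model of the
 * factorization and broadcast of an M-by-N panel distributed over a
 * P-process column.  alpha, beta and gam3 are the machine model
 * parameters; the base-2 logarithm is assumed.
 */
double Tpfact( double M, double N, int P,
               double alpha, double beta, double gam3 )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( (double)P ) * ( alpha + beta * 2.0 * N ) +
           alpha + beta * M * N / P );
}
</PRE>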
<HR NOSHADE>

<H3><A NAME = "updat">Trailing Submatrix Update</A></H3>

Let us consider the update phase of an N-by-N trailing submatrix
distributed on a P-by-Q process grid.  From a computational point of
view, one has to (triangular) solve N right-hand sides and perform a
local rank-NB update of this trailing submatrix.  Assuming one chooses
the <A HREF="algorithm.html#update">long variant</A>, the execution
time of the update operation can be approximated by:<BR><BR>
<CENTER>
Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) +
alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.<BR><BR>
</CENTER>
The constant "3" in front of the "beta" term is obtained by counting
one for the (logarithmic) spread phase and two for the rolling phase.
In the case of bi-directional links, this constant 3 would therefore
reduce to 2.<BR><BR>
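
Again for illustration only, a C sketch of this cost function could
read as follows (not part of HPL; the base-2 logarithm is assumed for
log(P)):<BR><BR>
<PRE>
#include &lt;math.h&gt;

/*
 * Illustrative sketch (not part of HPL): time model of the update of
 * an N-by-N trailing submatrix distributed on a P-by-Q process grid
 * with blocking factor NB, assuming the long variant as above and
 * the base-2 logarithm.
 */
double Tupdate( double N, double NB, int P, int Q,
                double alpha, double beta, double gam3 )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( (double)P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}
</PRE>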
<HR NOSHADE>

<H3><A NAME = "backs">Backward Substitution</A></H3>

The number of floating point operations performed during the backward
substitution is given by N^2 / ( P Q ).  Because of the lookahead, the
communication cost can be approximated at each step by two messages
of length NB, i.e., the time to communicate the NB-piece of the
solution vector from one diagonal block of the matrix to another.  It
follows that the execution time of the backward substitution can be
approximated by:<BR><BR>
<CENTER>
Tbacks( N, NB ) = gam2 N^2 / ( P Q ) + N ( alpha / NB + 2 beta ).<BR><BR>
</CENTER>
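
For illustration, a C sketch of this cost function (not part of HPL;
the names are ours) could be:<BR><BR>
<PRE>
/*
 * Illustrative sketch (not part of HPL): time model of the backward
 * substitution on a P-by-Q process grid; gam2 approximates the
 * matrix-vector flop rate of one processor.
 */
double Tbacks( double N, double NB, int P, int Q,
               double alpha, double beta, double gam2 )
{
   return( gam2 * N * N / ( P * Q ) + N * ( alpha / NB + 2.0 * beta ) );
}
</PRE>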
<HR NOSHADE>

<H3><A NAME = "total">Putting it All Together</A></H3>

The total execution time of the algorithm described above is given by<BR><BR>
<CENTER>
Sum(k=0,N,NB)[Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB )] +
Tbacks( N, NB ).<BR><BR>
</CENTER>
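
This sum can be evaluated with a short C loop, where k runs over
0, NB, 2 NB, ... as in the formula above.  The sketch below is not
part of HPL and simply reuses the cost functions sketched in the
previous sections:<BR><BR>
<PRE>
/* Cost functions as sketched in the previous sections */
double Tpfact ( double M, double N,  int P,
                double alpha, double beta, double gam3 );
double Tupdate( double N, double NB, int P, int Q,
                double alpha, double beta, double gam3 );
double Tbacks ( double N, double NB, int P, int Q,
                double alpha, double beta, double gam2 );

/*
 * Illustrative sketch (not part of HPL): total execution time obtained
 * by summing the panel and update costs over k = 0, NB, 2 NB, ... and
 * adding one backward substitution.
 */
double Ttotal( double N, double NB, int P, int Q,
               double alpha, double beta, double gam2, double gam3 )
{
   double t = 0.0, k;

   for( k = 0.0; k < N; k += NB )   /* one iteration per panel */
      t += Tpfact ( N - k,      NB, P,    alpha, beta, gam3 ) +
           Tupdate( N - k - NB, NB, P, Q, alpha, beta, gam3 );

   return( t + Tbacks( N, NB, P, Q, alpha, beta, gam2 ) );
}
</PRE>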
That is, by considering only the dominant terms in alpha, beta and
gam3:<BR><BR>
<CENTER>
Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) +
alpha N ((NB + 1) log(P) + P) / NB.<BR><BR>
</CENTER>
The serial execution time is given by Tser = 2 gam3 N^3 / 3.  If we
define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we
obtain:<BR><BR>
<CENTER>
E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) +
3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).<BR><BR>
</CENTER>
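
These two expressions translate directly into a C sketch (not part of
HPL; the names are ours, and the base-2 logarithm is assumed for
log(P)).  The efficiency is computed here as the ratio
Tser / ( P Q Thpl ), which is equivalent to the expanded expression
above:<BR><BR>
<PRE>
#include &lt;math.h&gt;

/*
 * Illustrative sketch (not part of HPL): dominant-term approximation
 * of the total execution time; the base-2 logarithm is assumed.
 */
double Thpl( double N, double NB, int P, int Q,
             double alpha, double beta, double gam3 )
{
   return( 2.0 * gam3 * N * N * N / ( 3.0 * P * Q ) +
           beta  * N * N * ( 3.0 * P + Q ) / ( 2.0 * P * Q ) +
           alpha * N * ( ( NB + 1.0 ) * log2( (double)P ) + P ) / NB );
}

/*
 * Parallel efficiency E = Tser / ( P Q Thpl ), with
 * Tser = 2 gam3 N^3 / 3; algebraically equivalent to the expanded
 * expression given above.
 */
double Efficiency( double N, double NB, int P, int Q,
                   double alpha, double beta, double gam3 )
{
   double Tser = 2.0 * gam3 * N * N * N / 3.0;

   return( Tser / ( P * Q * Thpl( N, NB, P, Q, alpha, beta, gam3 ) ) );
}
</PRE>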
This last equality shows that when the memory usage per processor,
N^2 / ( P Q ), is maintained constant, the parallel efficiency slowly
decreases only because of the alpha term.  The communication volume
(the beta term) however remains constant.  Because of these results,
HPL is said to be <STRONG>scalable</STRONG> not only with respect to
the amount of computation, but also with respect to the communication
volume.<BR><BR>

<HR NOSHADE>
<CENTER>
<A HREF = "index.html">            [Home]</A>
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html">        [Algorithm]</A>
<A HREF = "scalability.html">      [Scalability]</A>
<A HREF = "results.html">          [Performance Results]</A>
<A HREF = "documentation.html">    [Documentation]</A>
<A HREF = "software.html">         [Software]</A>
<A HREF = "faqs.html">             [FAQs]</A>
<A HREF = "tuning.html">           [Tuning]</A>
<A HREF = "errata.html">           [Errata-Bugs]</A>
<A HREF = "references.html">       [References]</A>
<A HREF = "links.html">            [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>