<HTML>
<HEAD>
<TITLE>HPL Scalability Analysis</TITLE>
</HEAD>

<BODY
BGCOLOR     = "WHITE"
BACKGROUND  = "WHITE"
TEXT        = "#000000"
VLINK       = "#000099"
ALINK       = "#947153"
LINK        = "#0000ff">

<H2>HPL Scalability Analysis</H2>
The <A HREF = "scalability.html#model">machine model</A> used for the
analysis is first described.  This crude model is then used to
estimate the parallel running time of the various phases of the
algorithm, namely
<UL>
<LI><A HREF="scalability.html#pfact">panel factorization and broadcast</A>,
<LI><A HREF="scalability.html#updat">trailing submatrix update</A>,
<LI><A HREF="scalability.html#backs">backward substitution</A>.
</UL>
Finally, <A HREF="scalability.html#total">the parallel efficiency</A>
of the entire algorithm is estimated according to this machine model.
We show that, for a given set of parameters, HPL is <STRONG>scalable</STRONG>
not only with respect to the amount of computation, but also with
respect to the communication volume.<BR><BR>
<HR NOSHADE>

<H3><A NAME = "model">The Machine Model</A></H3>
Distributed-memory computers consist of processors that are connected
using a message-passing interconnection network.  Each processor has
its own memory, called the local memory, which is accessible only to
that processor.  As the time to access a remote memory is longer than
the time to access a local one, such computers are often referred to
as Non-Uniform Memory Access (NUMA) machines.<BR><BR>

The interconnection network of our machine model is static, meaning
that it consists of point-to-point communication links among
processors.  This type of network is also referred to as a direct
network, as opposed to dynamic networks.  The latter are constructed
from switches and communication links.  These links are dynamically
connected to one another by the switching elements to establish, at
run time, the paths between processor memories.<BR><BR>

The interconnection network of the two-dimensional machine model
considered here is a static, fully connected physical topology.  It
is also assumed that processors can be treated equally in terms of
local performance and that the communication rate between two
processors does not depend on the particular pair of processors
considered.<BR><BR>
Our model assumes that a processor can send or receive data on only
one of its communication ports at a time (assuming it has more than
one).  In the literature, this assumption is also referred to as the
one-port communication model.<BR><BR>

The time spent to communicate a message between two given processors
is called the communication time Tc.  In our machine model, Tc is
approximated by a linear function of the number L of double
precision (64-bit) items communicated.  Tc is the sum of the time to
prepare the message for transmission (alpha) and the time (beta * L)
taken by the message of length L to traverse the network to its
destination, i.e.,<BR><BR>
<CENTER>
Tc = alpha + beta L.<BR><BR>
</CENTER>
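As a concrete illustration, the short C sketch below evaluates Tc under
this model; the function name Tc and the values chosen for alpha and
beta are illustrative assumptions made for the example only, not part
of HPL itself.<BR><BR>
<PRE>
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants:                         */
/* alpha = message latency in seconds,                                */
/* beta  = time per 64-bit item in seconds.                           */
static const double alpha = 1.0e-5;
static const double beta  = 2.0e-9;

/* Communication time for a message of L double precision items.      */
static double Tc( double L ) { return( alpha + beta * L ); }

int main( void )
{
   printf( "Tc( 1000 items )    = %g s\n", Tc( 1000.0 ) );
   printf( "Tc( 1000000 items ) = %g s\n", Tc( 1.0e6 ) );
   return( 0 );
}
</PRE>
<BR>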
Finally, the model assumes that the communication links are
bi-directional, that is, the time for two processors to send each
other a message of length L is also Tc.  A processor can send and/or
receive a message on only one of its communication links at a time.
In particular, a processor can send a message while receiving another
message from the processor it is sending to.<BR><BR>

Since this document is only concerned with regular local dense linear
algebra operations, the time taken to perform one floating point
operation is assumed to be summarized by three constants gam1,
gam2 and gam3.  These quantities approximate the inverse flop rates
(time per floating point operation) of the vector-vector,
matrix-vector and matrix-matrix operations on each processor.  This
very crude approximation summarizes all the steps performed by a
processor to achieve such a computation.  Obviously, such a model
neglects all the phenomena occurring in the processor components,
such as cache misses, pipeline startups, memory loads and stores,
floating point arithmetic and so on, that may influence the value of
these constants as a function of the problem size, for example.<BR><BR>

Similarly, the model does not make any assumption on the amount of
physical memory per node.  It is assumed that if a process has been
spawned on a processor, one has ensured that enough memory was
available on that processor.  In other words, swapping will not occur
during the modeled computation.<BR><BR>

<STRONG>
This machine model is a very crude approximation that is designed
specifically to illustrate the cost of the dominant factors of our
particular case.<BR><BR>
</STRONG>
<HR NOSHADE>

<H3><A NAME = "pfact">Panel Factorization and Broadcast</A></H3>
Consider an M-by-N panel distributed over a P-process column.
Because of the recursive formulation of the panel factorization, it
is reasonable to consider that the floating point operations will
be performed at matrix-matrix multiply "speed".  For every column in
the panel a binary-exchange is performed on 2*N data items.  When this
panel is broadcast, what matters is the time that the next process
column will spend in this communication operation.  Assuming one
chooses the <A HREF="algorithm.html#bcast">increasing-ring (modified)
variant</A>, only one message needs to be taken into account.  The
execution time of the panel factorization and broadcast can thus be
approximated by:<BR><BR>
<CENTER>
Tpfact( M, N ) = (M/P - N/3) N^2 gam3 + N log(P)( alpha + beta 2 N ) +
alpha + beta M N / P.<BR><BR>
</CENTER>
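The sketch below encodes this estimate as a hypothetical C function
Tpfact; the constants alpha, beta and gam3, the choice of a base-2
logarithm, and the panel dimensions used in main are illustrative
assumptions made for the example only.<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) constants: latency, per-item time and       */
/* time per flop at matrix-matrix multiply "speed".                   */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam3 = 2.5e-10;

/* Estimated time of the factorization and broadcast of an M-by-N     */
/* panel distributed over a P-process column (log taken base 2 here). */
static double Tpfact( double M, double N, double P )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( P ) * ( alpha + beta * 2.0 * N ) +
           alpha + beta * M * N / P );
}

int main( void )
{
   printf( "Tpfact = %g s\n", Tpfact( 10000.0, 64.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>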
<HR NOSHADE>

<H3><A NAME = "updat">Trailing Submatrix Update</A></H3>
Consider the update phase of an N-by-N trailing submatrix
distributed on a P-by-Q process grid.  From a computational point of
view one has to (triangular) solve N right-hand sides and perform a
local rank-NB update of this trailing submatrix.  Assuming one chooses
the <A HREF="algorithm.html#update">long variant</A>, the execution
time of the update operation can be approximated by:<BR><BR>
<CENTER>
Tupdate( N, NB ) = gam3 ( N NB^2 / Q + 2 N^2 NB / ( P Q ) ) +
alpha ( log( P ) + P - 1 ) + 3 beta N NB / Q.<BR><BR>
</CENTER>
The constant "3" in front of the "beta" term is obtained by counting
one for the (logarithmic) spread phase and two for the rolling phase;
in the case of bi-directional links this constant 3 should therefore
be only a 2.<BR><BR>
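A corresponding hypothetical C sketch of this estimate, using the same
illustrative constants and grid sizes as before, could read:<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam3 = 2.5e-10;

/* Estimated time of the rank-NB update of an N-by-N trailing          */
/* submatrix distributed on a P-by-Q grid (long variant; the factor 3  */
/* on the beta term follows the text above).                           */
static double Tupdate( double N, double NB, double P, double Q )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}

int main( void )
{
   printf( "Tupdate = %g s\n", Tupdate( 10000.0, 64.0, 4.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>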
<HR NOSHADE>

<H3><A NAME = "backs">Backward Substitution</A></H3>
The number of floating point operations performed during the backward
substitution is given by N^2 / (P Q).  Because of the lookahead, the
communication cost can be approximated at each step by two messages
of length NB, i.e., the time to communicate the NB-piece of the
solution vector from one diagonal block of the matrix to another.  It
follows that the execution time of the backward substitution can be
approximated by:<BR><BR>
<CENTER>
Tbacks( N, NB ) = gam2 N^2 / (P Q) + N ( alpha / NB + 2 beta ).<BR><BR>
</CENTER>
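The same kind of hypothetical sketch applies here; gam2, the
illustrative matrix-vector time per flop, and the problem sizes are
assumed values for the example only.<BR><BR>
<PRE>
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam2 = 2.0e-9;

/* Estimated time of the backward substitution for an N-by-N system    */
/* on a P-by-Q grid with blocking factor NB.                           */
static double Tbacks( double N, double NB, double P, double Q )
{
   return( gam2 * N * N / ( P * Q ) + N * ( alpha / NB + 2.0 * beta ) );
}

int main( void )
{
   printf( "Tbacks = %g s\n", Tbacks( 100000.0, 64.0, 4.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>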
<HR NOSHADE>

<H3><A NAME = "total">Putting it All Together</A></H3>
The total execution time of the algorithm described above is given by<BR><BR>
<CENTER>
Sum(k=0,N,NB)[Tpfact( N-k, NB ) + Tupdate( N-k-NB, NB )] +
Tbacks( N, NB ).<BR><BR>
</CENTER>
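Putting the previous sketches together, this sum can be evaluated
directly by iterating over the NB-wide steps, as in the self-contained,
purely illustrative C program below (all constants and sizes are
assumed values, not measurements):<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9;
static const double gam2  = 2.0e-9, gam3 = 2.5e-10;

static double Tpfact( double M, double N, double P )
{
   return( ( M / P - N / 3.0 ) * N * N * gam3 +
           N * log2( P ) * ( alpha + beta * 2.0 * N ) +
           alpha + beta * M * N / P );
}

static double Tupdate( double N, double NB, double P, double Q )
{
   return( gam3 * ( N * NB * NB / Q + 2.0 * N * N * NB / ( P * Q ) ) +
           alpha * ( log2( P ) + P - 1.0 ) +
           3.0 * beta * N * NB / Q );
}

static double Tbacks( double N, double NB, double P, double Q )
{
   return( gam2 * N * N / ( P * Q ) + N * ( alpha / NB + 2.0 * beta ) );
}

/* Total estimated time: sum over the NB-wide iterations of the panel  */
/* factorization / broadcast and update costs, plus one backward       */
/* substitution.                                                       */
static double Thpl_sum( double N, double NB, double P, double Q )
{
   double t = 0.0, k;
   for( k = 0.0; k &lt; N; k += NB )
      t += Tpfact( N - k, NB, P ) + Tupdate( N - k - NB, NB, P, Q );
   return( t + Tbacks( N, NB, P, Q ) );
}

int main( void )
{
   printf( "Thpl(sum) = %g s\n", Thpl_sum( 20000.0, 64.0, 4.0, 4.0 ) );
   return( 0 );
}
</PRE>
<BR>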
That is, by considering only the dominant terms in alpha, beta and
gam3:<BR><BR>
<CENTER>
Thpl = 2 gam3 N^3 / ( 3 P Q ) + beta N^2 (3 P + Q) / ( 2 P Q ) +
alpha N ((NB + 1) log(P) + P) / NB.<BR><BR>
</CENTER>
The serial execution time is given by Tser = 2 gam3 N^3 / 3.  If we
define the parallel efficiency E as the ratio Tser / ( P Q Thpl ), we
obtain:<BR><BR>
<CENTER>
E = 1 / ( 1 + 3 beta (3 P + Q) / ( 4 gam3 N ) +
3 alpha P Q ((NB + 1) log(P) + P) / (2 N^2 NB gam3) ).<BR><BR>
</CENTER>
This last equality shows that when the memory usage per processor
N^2 / (P Q) is maintained constant, the parallel efficiency slowly
decreases only because of the alpha term.  The communication volume
(the beta term) however remains constant.  Due to these results, HPL
is said to be <STRONG>scalable</STRONG> not only with respect to the
amount of computation, but also with respect to the communication
volume.<BR><BR>
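This behaviour can be checked numerically against the closed-form
expression for E above.  In the hypothetical sketch below, the memory
usage per processor N^2 / (P Q) is held constant while the process
grid grows; the machine constants, the blocking factor and the memory
size are illustrative assumptions only:<BR><BR>
<PRE>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative (made-up) machine constants.                           */
static const double alpha = 1.0e-5, beta = 2.0e-9, gam3 = 2.5e-10;

/* Parallel efficiency from the closed-form expression above.          */
static double E( double N, double NB, double P, double Q )
{
   return( 1.0 / ( 1.0 +
           3.0 * beta * ( 3.0 * P + Q ) / ( 4.0 * gam3 * N ) +
           3.0 * alpha * P * Q * ( ( NB + 1.0 ) * log2( P ) + P ) /
           ( 2.0 * N * N * NB * gam3 ) ) );
}

int main( void )
{
   const double NB = 64.0, mem = 2.5e7; /* N^2 / (P Q) held constant   */
   double       g;

   for( g = 2.0; g &lt;= 32.0; g *= 2.0 ) /* square g-by-g process grids  */
   {
      double N = sqrt( mem * g * g );
      printf( "%2.0fx%-2.0f grid: N = %8.0f  E = %5.3f\n",
              g, g, N, E( N, NB, g, g ) );
   }
   return( 0 );
}
</PRE>
In this experiment the beta term stays constant and the slow decrease
of E with the grid size comes from the alpha term only, as stated
above.<BR><BR>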
<HR NOSHADE>
<CENTER>
<A HREF = "index.html">            [Home]</A>
<A HREF = "copyright.html">        [Copyright and Licensing Terms]</A>
<A HREF = "algorithm.html">        [Algorithm]</A>
<A HREF = "scalability.html">      [Scalability]</A>
<A HREF = "results.html">          [Performance Results]</A>
<A HREF = "documentation.html">    [Documentation]</A>
<A HREF = "software.html">         [Software]</A>
<A HREF = "faqs.html">             [FAQs]</A>
<A HREF = "tuning.html">           [Tuning]</A>
<A HREF = "errata.html">           [Errata-Bugs]</A>
<A HREF = "references.html">       [References]</A>
<A HREF = "links.html">            [Related Links]</A><BR>
</CENTER>
<HR NOSHADE>
</BODY>
</HTML>