<HTML>
<HEAD>
<TITLE>HPL_pdpancrN HPL 2.0 Library Functions September 10, 2008</TITLE>
</HEAD>

<BODY BGCOLOR="WHITE" TEXT = "#000000" LINK = "#0000ff" VLINK = "#000099"
      ALINK = "#ffff00">

<H1>Name</H1>
<B>HPL_pdpancrN</B> Crout panel factorization.

<H1>Synopsis</H1>
<CODE>#include "hpl.h"</CODE><BR><BR>
<CODE>void</CODE>
<CODE>HPL_pdpancrN(</CODE>
<CODE>HPL_T_panel *</CODE>
<CODE>PANEL</CODE>,
<CODE>const int</CODE>
<CODE>M</CODE>,
<CODE>const int</CODE>
<CODE>N</CODE>,
<CODE>const int</CODE>
<CODE>ICOFF</CODE>,
<CODE>double *</CODE>
<CODE>WORK</CODE>
<CODE>);</CODE>

<H1>Description</H1>
<B>HPL_pdpancrN</B>
factorizes a panel of columns that is a sub-array of a larger
one-dimensional panel A, using the Crout variant of the usual
one-dimensional algorithm. The N0-by-N0 lower triangular block at
the top of the panel is stored in no-transpose form (i.e., just like
the input matrix itself).
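
As a rough illustration of the Crout ordering described above, the following serial C sketch factorizes a small column-major panel with partial pivoting. It is not the HPL implementation: it performs no communication, no bi-directional exchange, and no BLAS calls, and the name crout_panel_lu and its argument list are hypothetical.

```c
#include <assert.h>

/* Serial sketch (hypothetical helper, not part of HPL): Crout-ordered LU
 * factorization with partial pivoting of an m-by-n column-major panel,
 * m >= n, leading dimension lda. On return the strictly lower part of a
 * holds the multipliers of a unit lower triangular L, the upper part
 * holds U, and ipiv[j] is the row exchanged with row j at step j.
 * Returns 0 on success, or j + 1 if column j has no nonzero pivot. */
int crout_panel_lu(int m, int n, double *a, int lda, int *ipiv)
{
    assert(m >= n && lda >= m);
    for (int j = 0; j < n; j++) {
        /* Update column j (rows j..m-1) with the columns factored so far. */
        for (int k = 0; k < j; k++) {
            double ukj = a[k + j * lda];
            for (int i = j; i < m; i++)
                a[i + j * lda] -= a[i + k * lda] * ukj;
        }
        /* Search the updated column for the entry of largest magnitude. */
        int p = j;
        double vmax = a[j + j * lda] < 0.0 ? -a[j + j * lda] : a[j + j * lda];
        for (int i = j + 1; i < m; i++) {
            double v = a[i + j * lda] < 0.0 ? -a[i + j * lda] : a[i + j * lda];
            if (v > vmax) { vmax = v; p = i; }
        }
        ipiv[j] = p;
        if (vmax == 0.0)
            return j + 1;
        /* Exchange rows j and p across the whole panel. */
        if (p != j) {
            for (int c = 0; c < n; c++) {
                double t = a[j + c * lda];
                a[j + c * lda] = a[p + c * lda];
                a[p + c * lda] = t;
            }
        }
        /* Update row j of U to the right of the diagonal. */
        for (int c = j + 1; c < n; c++) {
            double s = a[j + c * lda];
            for (int k = 0; k < j; k++)
                s -= a[j + k * lda] * a[k + c * lda];
            a[j + c * lda] = s;
        }
        /* Scale the sub-diagonal part of column j by the pivot. */
        double inv = 1.0 / a[j + j * lda];
        for (int i = j + 1; i < m; i++)
            a[i + j * lda] *= inv;
    }
    return 0;
}
```

Called on the 2-by-2 column-major panel {4, 6, 3, 3}, this sketch exchanges the two rows at step 0 and leaves U with rows (6, 3) and (0, 1), with the multiplier 2/3 stored below the diagonal.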

Bi-directional exchange is used to perform the swap and broadcast
operations at once for one column in the panel. This results in
fewer, slightly larger messages than usual. On P processes, assuming
bi-directional links and N equal to N0, the running time of this
function can be approximated by:
 
   N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
   N0^2 * ( M - N0/3 ) * gam2-3
 
where M is the local number of rows of the panel, lat and bdwth are
the latency and bandwidth of the network for double precision real
words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
rate of execution. The recursive algorithm indeed makes it possible
to almost achieve Level 3 BLAS performance in the panel
factorization. On many modern machines, however, this operation is
latency bound, meaning that its cost can be estimated by the latency
portion alone, N0 * log_2(P) * lat. Mono-directional links will
double this communication cost.
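
The estimate above can be evaluated with a few lines of C. The sketch below is only an illustration: the struct and function names are hypothetical, gam23 is assumed to be expressed as time per floating-point operation so that the second term is a time, and log_2(P) is rounded up to a whole number of exchange steps, which matches the formula exactly only when P is a power of two.

```c
#include <assert.h>

/* Hypothetical container for the machine parameters named in the text. */
typedef struct {
    double lat;    /* network latency, time per message                  */
    double bdwth;  /* network bandwidth, double precision words per time */
    double gam23;  /* assumed: time per flop at Level 2/3 BLAS rate      */
} panel_cost_params;

/* Smallest k such that 2^k >= p (assumption: whole exchange steps). */
int panel_ceil_log2(int p)
{
    int k = 0;
    while ((1 << k) < p)
        k++;
    return k;
}

/* N0 * log_2(P) * (lat + (2*N0 + 4) / bdwth) + N0^2 * (M - N0/3) * gam2-3 */
double panel_time_estimate(int M, int N0, int P, const panel_cost_params *c)
{
    assert(N0 >= 0 && P >= 1 && c->bdwth > 0.0);
    double steps = (double)panel_ceil_log2(P);
    double comm = N0 * steps * (c->lat + (2.0 * N0 + 4.0) / c->bdwth);
    double comp = (double)N0 * N0 * (M - N0 / 3.0) * c->gam23;
    return comm + comp;
}

/* Latency-bound approximation: N0 * log_2(P) * lat. */
double panel_latency_estimate(int N0, int P, double lat)
{
    assert(N0 >= 0 && P >= 1);
    return N0 * (double)panel_ceil_log2(P) * lat;
}
```

Comparing the two return values for a given machine gives a quick indication of how much of the estimated cost is pure latency.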

Note that one iteration of the main loop is unrolled. The local
computation of the maximum absolute value of the next column is
performed just after its update by the current column. This brings
the current column through cache only once at each step. The current
implementation does not perform any blocking for this sequence of
BLAS operations; however, the design allows for plugging in an
optimal (machine-specific) specialized BLAS-like kernel. This idea
was suggested to us by Fred Gustavson, IBM T.J. Watson Research
Center.

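The idea of searching for the maximum right after the update can be sketched as a single fused loop. This is a simplified, hypothetical kernel, not the HPL code: it assumes the update is a plain scaled subtraction of one column from the next, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fused kernel: performs y[i] -= alpha * x[i] for all i in
 * [0, m) and, in the same pass, returns the index of the entry of the
 * updated y with the largest absolute value (first one on ties).
 * Each entry is touched once, so the data goes through cache once. */
size_t update_then_argmax(size_t m, double alpha, const double *x, double *y)
{
    assert(m > 0);
    size_t imax = 0;
    double vmax = -1.0;  /* below any absolute value */
    for (size_t i = 0; i < m; i++) {
        y[i] -= alpha * x[i];
        double v = y[i] < 0.0 ? -y[i] : y[i];
        if (v > vmax) {
            vmax = v;
            imax = i;
        }
    }
    return imax;
}
```

A separate update followed by a separate maximum search would read the updated column twice; the fused loop avoids the second pass.
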
<H1>Arguments</H1>
<PRE>
PANEL   (local input/output)          HPL_T_panel *
        On entry,  PANEL  points to the data structure containing the
        panel information.
</PRE>
<PRE>
M       (local input)                 const int
        On entry,  M specifies the local number of rows of sub(A).
</PRE>
<PRE>
N       (local input)                 const int
        On entry,  N specifies the local number of columns of sub(A).
</PRE>
<PRE>
ICOFF   (global input)                const int
        On entry, ICOFF specifies the row and column offset of sub(A)
        in A.
</PRE>
<PRE>
WORK    (local workspace)             double *
        On entry, WORK  is a work array of size at least 2*(4+2*N0).
</PRE>
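
The minimum size of WORK stated above is easy to get wrong by a factor of two, so a small helper may be convenient. The two functions below are hypothetical and not part of HPL; n0 stands for the N0 used in the description.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimum number of doubles required in WORK: 2*(4+2*N0). */
size_t pancr_work_len(int n0)
{
    assert(n0 >= 0);
    return 2 * (4 + 2 * (size_t)n0);
}

/* Allocates a work array of the minimum size; the caller must free() it.
 * Returns NULL if the allocation fails. */
double *pancr_work_alloc(int n0)
{
    return malloc(pancr_work_len(n0) * sizeof(double));
}
```

For N0 = 64, pancr_work_len returns 264.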

    
<H1>See Also</H1>
<A HREF="HPL_dlocmax.html">HPL_dlocmax</A>,
<A HREF="HPL_dlocswpN.html">HPL_dlocswpN</A>,
<A HREF="HPL_dlocswpT.html">HPL_dlocswpT</A>,
<A HREF="HPL_pdmxswp.html">HPL_pdmxswp</A>,
<A HREF="HPL_pdpancrT.html">HPL_pdpancrT</A>,
<A HREF="HPL_pdpanllN.html">HPL_pdpanllN</A>,
<A HREF="HPL_pdpanllT.html">HPL_pdpanllT</A>,
<A HREF="HPL_pdpanrlN.html">HPL_pdpanrlN</A>,
<A HREF="HPL_pdpanrlT.html">HPL_pdpanrlT</A>.

</BODY>
</HTML>