.TH HPL_pdpancrT 3 "September 10, 2008" "HPL 2.0" "HPL Library Functions"
.SH NAME
HPL_pdpancrT \- Crout panel factorization.
.SH SYNOPSIS
\fB\&#include "hpl.h"\fR

\fB\&void\fR
\fB\&HPL_pdpancrT(\fR
\fB\&HPL_T_panel *\fR
\fI\&PANEL\fR,
\fB\&const int\fR
\fI\&M\fR,
\fB\&const int\fR
\fI\&N\fR,
\fB\&const int\fR
\fI\&ICOFF\fR,
\fB\&double *\fR
\fI\&WORK\fR
\fB\&);\fR
.SH DESCRIPTION
\fB\&HPL_pdpancrT\fR
factorizes a panel of columns that is a sub-array of a
larger one-dimensional panel A using the Crout variant of the usual
one-dimensional algorithm. The lower triangular N0-by-N0 upper block
of the panel is stored in transpose form.
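
As an illustration of what transposed storage of the triangular block
means for indexing, here is a minimal sketch. It assumes a column-major
array with leading dimension ld, as is conventional for BLAS-style
code; the helper names l1_get_transposed and l1_set_transposed are
hypothetical and are not part of HPL.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helpers (not HPL functions).
 * Assumption: l1 is a column-major ld-by-ld array that holds the
 * lower triangular block L in transposed form, so the logical entry
 * L(i,j) with i >= j lives at array position (j,i).
 */
double l1_get_transposed(const double *l1, int ld, int i, int j)
{
    assert(l1 != NULL && ld > 0);
    assert(0 <= j && j <= i && i < ld);   /* lower triangle only */
    return l1[(size_t)j + (size_t)i * (size_t)ld];
}

void l1_set_transposed(double *l1, int ld, int i, int j, double value)
{
    assert(l1 != NULL && ld > 0);
    assert(0 <= j && j <= i && i < ld);   /* lower triangle only */
    l1[(size_t)j + (size_t)i * (size_t)ld] = value;
}
```

With this layout, the entries of one logical row of L are contiguous
in memory, whereas in non-transposed column-major storage the entries
of one logical column would be.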

Bi-directional exchange is used to perform the swap::broadcast
operations at once for one column in the panel. This results in a
lower number of slightly larger messages than usual. On P processes
and assuming bi-directional links, the running time of this function
can be approximated by (when N is equal to N0):

N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
N0^2 * ( M - N0/3 ) * gam2-3

where M is the local number of rows of the panel, lat and bdwth are
the latency and bandwidth of the network for double precision real
words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
rate of execution. The recursive algorithm indeed makes it possible
to achieve nearly Level 3 BLAS performance in the panel factorization.
On many modern machines, however, this operation is latency bound,
meaning that its cost can be estimated by the latency portion alone,
N0 * log_2(P) * lat. Mono-directional links will double this
communication cost.
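
The estimate above can be evaluated numerically with a few lines of C.
The following is only a sketch under stated assumptions: the function
names are hypothetical and not part of HPL, gam23 is taken to be the
time per floating-point operation (so that the second term has units
of time), bdwth is in double precision words per unit time, and
log_2(P) is rounded up to a whole number of exchange steps, which
matches log_2(P) exactly only when P is a power of two.

```c
#include <assert.h>

/* Hypothetical helper: smallest s such that 2^s >= p (assumption:
 * log_2(P) in the model is rounded up to whole exchange steps). */
int pancr_exchange_steps(int p)
{
    assert(p >= 1);
    int steps = 0;
    long long span = 1;
    while (span < p) {
        span *= 2;
        steps++;
    }
    return steps;
}

/* Communication term: N0 * log_2(P) * (lat + (2*N0 + 4) / bdwth).
 * The cost is doubled when the links are mono-directional. */
double pancr_comm_time(int n0, int p, double lat, double bdwth,
                       int bidirectional)
{
    assert(n0 >= 0 && bdwth > 0.0);
    double t = (double)n0 * pancr_exchange_steps(p)
             * (lat + (2.0 * n0 + 4.0) / bdwth);
    return bidirectional ? t : 2.0 * t;
}

/* Computation term: N0^2 * (M - N0/3) * gam2-3, with gam23 assumed
 * to be the time per floating-point operation. */
double pancr_comp_time(int m, int n0, double gam23)
{
    assert(m >= 0 && n0 >= 0);
    return (double)n0 * (double)n0 * ((double)m - (double)n0 / 3.0) * gam23;
}

/* Sum of both terms, valid for the case N == N0. */
double pancr_time_estimate(int m, int n0, int p, double lat, double bdwth,
                           double gam23, int bidirectional)
{
    return pancr_comm_time(n0, p, lat, bdwth, bidirectional)
         + pancr_comp_time(m, n0, gam23);
}
```

Comparing pancr_comm_time with bdwth set very large against the full
estimate gives a quick indication of whether a given configuration is
in the latency-bound regime described above.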

Note that one iteration of the main loop is unrolled. The local
computation of the absolute value max of the next column is performed
just after its update by the current column. This allows the current
column to be brought through cache only once at each step. The current
implementation does not perform any blocking for this sequence of
BLAS operations; however, the design allows plugging in an optimal
(machine-specific) specialized BLAS-like kernel. This idea was
suggested to us by Fred Gustavson, IBM T.J. Watson Research Center.
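
To make the idea of combining the update and the pivot search more
concrete, here is a simplified sketch of what a fused kernel could
look like. It is not HPL's code: the function name update_and_absmax
is hypothetical, and the update is reduced to a single scaled
subtraction per entry, whereas the Crout variant applies a more
general matrix-vector update.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical fused kernel (not an HPL function).
 * Updates next[i] -= u * cur[i] for i = 0..m-1 and, in the same pass,
 * finds the entry of the updated column with the largest absolute
 * value. Returns its index; the value is stored in *absmax if absmax
 * is not NULL. Ties keep the first (lowest) index.
 */
int update_and_absmax(int m, double *next, const double *cur, double u,
                      double *absmax)
{
    assert(m > 0 && next != NULL && cur != NULL);

    int imax = 0;
    double best = -1.0;   /* smaller than any absolute value */

    for (int i = 0; i < m; i++) {
        next[i] -= u * cur[i];                         /* update */
        double a = next[i] < 0.0 ? -next[i] : next[i]; /* |next[i]| */
        if (a > best) {                                /* pivot search */
            best = a;
            imax = i;
        }
    }
    if (absmax != NULL)
        *absmax = best;
    return imax;
}
```

Because both steps happen inside one loop, each entry of the two
columns is loaded once per step instead of once for the update and
again for the search.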
.SH ARGUMENTS
.TP 8
PANEL (local input/output) HPL_T_panel *
On entry, PANEL points to the data structure containing the
panel information.
.TP 8
M (local input) const int
On entry, M specifies the local number of rows of sub(A).
.TP 8
N (local input) const int
On entry, N specifies the local number of columns of sub(A).
.TP 8
ICOFF (global input) const int
On entry, ICOFF specifies the row and column offset of sub(A)
in A.
.TP 8
WORK (local workspace) double *
On entry, WORK is a work array of size at least 2*(4+2*N0).
.SH SEE ALSO
.BR HPL_dlocmax \ (3),
.BR HPL_dlocswpN \ (3),
.BR HPL_dlocswpT \ (3),
.BR HPL_pdmxswp \ (3),
.BR HPL_pdpancrN \ (3),
.BR HPL_pdpanllN \ (3),
.BR HPL_pdpanllT \ (3),
.BR HPL_pdpanrlN \ (3),
.BR HPL_pdpanrlT \ (3).