.TH HPL_pdpanrlN 3 "September 10, 2008" "HPL 2.0" "HPL Library Functions"
.SH NAME
HPL_pdpanrlN \- Right-looking panel factorization.
.SH SYNOPSIS
\fB\&#include "hpl.h"\fR
.PP
\fB\&void\fR
\fB\&HPL_pdpanrlN(\fR
\fB\&HPL_T_panel *\fR
\fI\&PANEL\fR,
\fB\&const int\fR
\fI\&M\fR,
\fB\&const int\fR
\fI\&N\fR,
\fB\&const int\fR
\fI\&ICOFF\fR,
\fB\&double *\fR
\fI\&WORK\fR
\fB\&);\fR
.SH DESCRIPTION
\fB\&HPL_pdpanrlN\fR
factorizes a panel of columns that is a sub-array of a
larger one-dimensional panel A using the Right-looking variant of the
usual one-dimensional algorithm. The lower triangular N0-by-N0 upper
block of the panel is stored in no-transpose form (i.e. just like the
input matrix itself).
.PP
Bi-directional exchange is used to perform the swap and broadcast
operations at once for one column in the panel. This results in a
lower number of slightly larger messages than usual. On P processes
and assuming bi-directional links, the running time of this function
can be approximated by (when N is equal to N0):
.PP
.nf
   N0 * log_2( P ) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
   N0^2 * ( M - N0/3 ) * gam2-3
.fi
.PP
where M is the local number of rows of the panel, lat and bdwth are
the latency and bandwidth of the network for double precision real
words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
rate of execution. The recursive algorithm indeed makes it possible to
almost achieve Level 3 BLAS performance in the panel factorization. On
a large number of modern machines, this operation is, however,
latency-bound, meaning that its cost can be estimated by the latency
portion alone, N0 * log_2(P) * lat. Mono-directional links will double
this communication cost.
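.PP
As a rough illustration, the estimate above can be evaluated with a few
lines of C. The helper names below are hypothetical and are not part of
HPL. The sketch assumes that lat is given in seconds per message, bdwth
in double precision words per second, and gam2-3 in seconds per
floating-point operation, so that the result is a time in seconds. It
also rounds log_2( P ) up to a whole number of exchange rounds, which
agrees with the formula exactly only when P is a power of two.
.nf
```c
#include <assert.h>

/* Hypothetical helper: smallest r such that 2^r >= p.
 * Assumption: log_2(P) is rounded up to whole exchange rounds. */
static int ceil_log2(int p)
{
    int r = 0;
    assert(p >= 1);
    while ((1ULL << r) < (unsigned long long)p) {
        r++;
    }
    return r;
}

/* Communication plus computation estimate for N equal to N0.
 * Units are assumptions: lat in s, bdwth in words/s, gam23 in s/flop. */
double panel_time_estimate(int m, int n0, int p,
                           double lat, double bdwth, double gam23)
{
    double comm, comp;
    assert(m >= 0 && n0 >= 0 && bdwth > 0.0);
    comm = n0 * ceil_log2(p) * (lat + (2.0 * n0 + 4.0) / bdwth);
    comp = (double)n0 * n0 * (m - n0 / 3.0) * gam23;
    return comm + comp;
}

/* Latency-only estimate; mono-directional links double the cost. */
double panel_latency_estimate(int n0, int p, double lat, int mono_directional)
{
    double factor = mono_directional ? 2.0 : 1.0;
    assert(n0 >= 0);
    return factor * n0 * ceil_log2(p) * lat;
}
```
.fi
Under these assumptions, panel_latency_estimate(64, 16, 2.0e-6, 0)
evaluates 64 * 4 * 2.0e-6, that is about 5.1e-4 seconds, and passing a
non-zero last argument doubles that value.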
.PP
Note that one iteration of the main loop is unrolled. The local
computation of the maximum absolute value of the next column is
performed just after its update by the current column. This brings the
current column through cache only once at each step. The current
implementation does not perform any blocking for this sequence of
BLAS operations; however, the design allows for plugging in an optimal
(machine-specific) specialized BLAS-like kernel. This idea was
suggested to us by Fred Gustavson of the IBM T.J. Watson Research
Center.
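.PP
The idea of searching for the pivot right after the update can be
pictured with a small fused kernel. This is only a sketch and is not
taken from the HPL sources: the function name is hypothetical, and the
roles of the arguments follow the conventional right-looking LU step,
which is an assumption here. x stands for the current column below the
diagonal, y for the next column, and alpha for the entry of the next
column in the pivot row.
.nf
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fused kernel: performs y[i] -= alpha * x[i] and, in the
 * same pass over the data, finds the index of the entry of y with the
 * largest absolute value. Ties keep the first index found. */
size_t axpy_then_absmax(size_t m, double alpha, const double *x, double *y)
{
    size_t i, imax = 0;
    double vmax = -1.0;

    assert(m > 0 && x != NULL && y != NULL);
    for (i = 0; i < m; i++) {
        double a;
        y[i] -= alpha * x[i];               /* update by current column */
        a = (y[i] < 0.0) ? -y[i] : y[i];    /* absolute value */
        if (a > vmax) {                     /* running local maximum */
            vmax = a;
            imax = i;
        }
    }
    return imax;
}
```
.fi
The returned index is a local candidate only. In a distributed setting
the local maxima still have to be combined across the process column,
which is the role of the swap and broadcast step described above.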
.SH ARGUMENTS
.TP 8
PANEL (local input/output) HPL_T_panel *
On entry, PANEL points to the data structure containing the
panel information.
.TP 8
M (local input) const int
On entry, M specifies the local number of rows of sub(A).
.TP 8
N (local input) const int
On entry, N specifies the local number of columns of sub(A).
.TP 8
ICOFF (global input) const int
On entry, ICOFF specifies the row and column offset of sub(A)
in A.
.TP 8
WORK (local workspace) double *
On entry, WORK is a work array of size at least 2*(4+2*N0).
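.PP
The minimum size of WORK can be computed before the call. The helpers
below are hypothetical and are not part of HPL. They assume only the
bound 2*(4+2*N0) stated above, with the value of N0 supplied by the
caller.
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: minimum number of doubles required in WORK. */
size_t panel_work_len(int n0)
{
    assert(n0 >= 0);
    return 2 * (4 + 2 * (size_t)n0);
}

/* Hypothetical helper: allocates a workspace of the minimum size.
 * Returns NULL on allocation failure; release it with free(). */
double *panel_work_alloc(int n0)
{
    return malloc(panel_work_len(n0) * sizeof(double));
}
```
.fi
For N0 equal to 64, panel_work_len(64) returns 264, so the workspace
occupies 264 double precision words.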
.SH SEE ALSO
.BR HPL_dlocmax \ (3),
.BR HPL_dlocswpN \ (3),
.BR HPL_dlocswpT \ (3),
.BR HPL_pdmxswp \ (3),
.BR HPL_pdpancrN \ (3),
.BR HPL_pdpancrT \ (3),
.BR HPL_pdpanllN \ (3),
.BR HPL_pdpanllT \ (3),
.BR HPL_pdpanrlT \ (3).