==============================================================
 Performance Tuning and setting up the input data file HPL.dat

 Current as of release 2.0 - September 10, 2008
==============================================================
 Check out the website www.netlib.org/benchmark/hpl for the
 latest information.

    
 After having built the executable hpl/bin/<arch>/xhpl, one
 may want to modify the input data file HPL.dat. This file
 should reside in the same directory as the executable
 hpl/bin/<arch>/xhpl. An example HPL.dat file is provided by
 default. This file contains information about the problem
 sizes, machine configuration, and algorithm features to be
 used by the executable. It is 31 lines long. All the selected
 parameters will be printed in the output generated by the
 executable.

    
 At the end of this file, there are a couple of experimental
 guidelines that you may find useful.

    
==============================================================
 File HPL.dat (description):

 Line 1: (unused) Typically one would use this line for one's
 own purposes. For example, it could be used to summarize the
 content of the input file. By default this line reads:

 HPL Linpack benchmark input file

 Line 2: (unused) same as line 1. By default this line reads:

 Innovative Computing Laboratory, University of Tennessee

 Line 3: the user can choose where the output should be
 redirected to. In the case of a file, a name is necessary,
 and this is the line where one wants to specify it. Only the
 first name on this line is significant. By default, the line
 reads:

 HPL.out  output file name (if any)

 This means that if one chooses to redirect the output to a
 file, the file will be called "HPL.out". The rest of the
 line is unused, and this space can be used to put an
 informative comment on the meaning of this line.
 
 Line 4: This line specifies where the output should go. The
 line is formatted; it must contain a positive integer, and
 the rest is ignored. 3 choices are possible for the positive
 integer: 6 means that the output will go to the standard
 output, 7 means that the output will go to the standard
 error. Any other integer means that the output should be
 redirected to a file, whose name has been specified in the
 line above. This line by default reads:

 6        device out (6=stdout,7=stderr,file)

 which means that the output generated by the executable
 should be redirected to the standard output.

 Line 5: This line specifies the number of problem sizes to
 be executed. This number should be less than or equal to 20.
 The first integer is significant, the rest is ignored. If
 the line reads:

 3        # of problems sizes (N)

 this means that the user is willing to run 3 problem sizes
 that will be specified in the next line.
 
 Line 6: This line specifies the problem sizes one wants to
 run. Assuming the line above started with 3, the first 3
 positive integers are significant, the rest is ignored. For
 example:

 3000 6000 10000    Ns

 means that one wants xhpl to run 3 (specified in line 5)
 problem sizes, namely 3000, 6000 and 10000.
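 Since HPL factors one dense N x N matrix of 8-byte doubles,
 a quick sanity check of whether a candidate problem size
 fits in memory can be sketched as follows (the function name
 is illustrative; small workspace overheads are ignored):

```python
# Rough memory footprint of problem size N: one dense N x N
# matrix of doubles (8 bytes each); workspace overhead ignored.
def matrix_gib(n):
    return 8.0 * n * n / 2**30

for n in (3000, 6000, 10000):
    print(n, round(matrix_gib(n), 3))
```

 Doubling N quadruples the memory needed, so the largest
 problem size is usually bounded by the machine's memory.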
 
 Line 7: This line specifies the number of block sizes to be
 run. This number should be less than or equal to 20. The
 first integer is significant, the rest is ignored. If the
 line reads:

 5        # of NBs

 this means that the user is willing to use 5 block sizes
 that will be specified in the next line.

 Line 8: This line specifies the block sizes one wants to
 run. Assuming the line above started with 5, the first 5
 positive integers are significant, the rest is ignored. For
 example:

 80 100 120 140 160 NBs

 means that one wants xhpl to use 5 (specified in line 7)
 block sizes, namely 80, 100, 120, 140 and 160.

    
 Line 9: This line specifies how the MPI processes should be
 mapped onto the nodes of your platform. There are currently
 two possible mappings, namely row- and column-major. This
 feature is mainly useful when these nodes are themselves
 multi-processor computers. A row-major mapping is
 recommended.
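 The difference between the two mappings can be illustrated
 with a small sketch (this mirrors the usual convention for a
 P x Q grid; it is illustrative, not HPL source): row-major
 places consecutive MPI ranks along a grid row, column-major
 along a grid column.

```python
# Illustrative sketch (not HPL source): where MPI rank r lands
# in a P x Q process grid under the two mappings.
def row_major(r, p, q):
    return divmod(r, q)       # (row, col): consecutive ranks fill a row

def col_major(r, p, q):
    col, row = divmod(r, p)   # consecutive ranks fill a column
    return (row, col)

# On a 2 x 4 grid, rank 1 sits at (0, 1) row-major but (1, 0)
# column-major, so the two ranks of a dual-processor node end
# up in the same process row or column respectively.
print(row_major(1, 2, 4), col_major(1, 2, 4))
```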
 
 Line 10: This line specifies the number of process grids to
 be run. This number should be less than or equal to 20. The
 first integer is significant, the rest is ignored. If the
 line reads:

 2        # of process grids (P x Q)

 this means that you are willing to try 2 process grid sizes
 that will be specified in the next line.
 
 Lines 11-12: These two lines specify the number of process
 rows and columns of each grid you want to run on. Assuming
 the line above (10) started with 2, the first 2 positive
 integers of those two lines are significant, the rest is
 ignored. For example:

 1 2          Ps
 6 8          Qs

 means that one wants to run xhpl on 2 process grids (line
 10), namely 1 by 6 and 2 by 8. Note: In this example, it is
 then required to start xhpl on at least 16 nodes (max of
 P_i x Q_i). The runs on the two grids will be consecutive.
 If one was starting xhpl on more than 16 nodes, say 52, only
 6 would be used for the first grid (1x6) and then 16 (2x8)
 would be used for the second grid. The fact that you started
 the MPI job on 52 nodes will not make HPL use all of them;
 in this example, only 16 would be used. If one wants to run
 xhpl with 52 processes, one needs to specify a grid of 52
 processes; for example, the following lines would do the
 job:

 4  2         Ps
 13 8         Qs
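 The arithmetic in this example can be sketched as follows
 (illustrative, not HPL code):

```python
# Illustrative: how many processes each grid in HPL.dat uses,
# and the minimum number of MPI processes the job must be
# started on (the maximum over all grids).
Ps = [4, 2]    # Line 11
Qs = [13, 8]   # Line 12

sizes = [p * q for p, q in zip(Ps, Qs)]   # processes per grid
print(sizes, max(sizes))
```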
 
 Line 13: This line specifies the threshold to which the
 residuals should be compared. The residuals should be of
 order 1, but are in practice slightly less than this,
 typically 0.001. This line is made of a real number, the
 rest is ignored. For example:

 16.0         threshold

 In practice, a value of 16.0 will cover most cases. For
 various reasons, it is possible that some of the residuals
 become slightly larger, say for example 35.6. xhpl will flag
 those runs as failed; however, they can be considered
 correct. A run can be considered failed if the residual is a
 few orders of magnitude bigger than 1, for example 10^6 or
 more. Note: if one were to specify a threshold of 0.0, all
 tests would be flagged as failed, even though the answer is
 likely to be correct. It is allowed to specify a negative
 value for this threshold, in which case the checks will be
 bypassed, no matter what the residual values are. This
 feature allows one to save time when performing a lot of
 experiments, say for instance during the tuning phase.
 Example:

 -16.0        threshold
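 The pass/fail test described above amounts to comparing a
 scaled residual against the threshold. A minimal sketch of
 the idea (this mirrors the scaled infinity-norm residual
 reported by HPL 2.x, not its actual code):

```python
# Sketch of the check described above (mirrors the scaled
# infinity-norm residual of HPL 2.x; not the HPL source).
def inf_norm_vec(v):
    return max(abs(x) for x in v)

def inf_norm_mat(A):                 # max absolute row sum
    return max(sum(abs(a) for a in row) for row in A)

def run_passes(A, x, b, threshold, eps=2.0**-52):
    if threshold < 0.0:              # negative threshold: bypass checks
        return True
    n = len(A)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
         for i in range(n)]
    scale = eps * (inf_norm_mat(A) * inf_norm_vec(x)
                   + inf_norm_vec(b)) * n
    return inf_norm_vec(r) / scale < threshold

# A 2x2 system solved exactly: residual 0 passes 16.0, and any
# run is accepted when the threshold is negative.
A = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, 1.0]
b = [3.0, 4.0]
print(run_passes(A, x, b, 16.0), run_passes(A, x, b, -16.0))
```

 Note that with a threshold of 0.0 even a zero residual fails
 the strict comparison, which matches the remark above.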
 
 The remaining lines allow one to specify algorithmic
 features. xhpl will run all possible combinations of those
 for each problem size, block size and process grid
 combination. This is handy when one looks for an "optimal"
 set of parameters. To understand this a little better, let
 us first say a few words about the algorithm implemented in
 HPL. Basically this is a right-looking version with
 row-partial pivoting. The panel factorization is
 matrix-matrix operation based and recursive, dividing the
 panel into NDIV subpanels at each step. This part of the
 panel factorization is denoted below by "recursive panel
 fact. (RFACT)". The recursion stops when the current panel
 is made of less than or equal to NBMIN columns. At that
 point, xhpl uses a matrix-vector operation based
 factorization denoted below by "PFACTs". Classic recursion
 would then use NDIV=2, NBMIN=1. There are essentially 3
 numerically equivalent LU factorization algorithm variants
 (left-looking, Crout and right-looking). In HPL, one can
 choose any one of those for the RFACT, as well as for the
 PFACT. The following lines of HPL.dat allow you to set
 those parameters.

 Lines 14-21: (Example 1)
 3       # of panel fact
 0 1 2   PFACTs (0=left, 1=Crout, 2=Right)
 4       # of recursive stopping criterium
 1 2 4 8 NBMINs (>= 1)
 3       # of panels in recursion
 2 3 4   NDIVs
 3       # of recursive panel fact.
 0 1 2   RFACTs (0=left, 1=Crout, 2=Right)

 This example would try all variants of PFACT, 4 values for
 NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV, namely 2, 3
 and 4, and all variants for RFACT. Lines 14-21: (Example 2)

    
 2       # of panel fact
 2 0     PFACTs (0=left, 1=Crout, 2=Right)
 2       # of recursive stopping criterium
 4 8     NBMINs (>= 1)
 1       # of panels in recursion
 2       NDIVs
 1       # of recursive panel fact.
 2       RFACTs (0=left, 1=Crout, 2=Right)

 This example would try 2 variants of PFACT, namely
 right-looking and left-looking, 2 values for NBMIN, namely 4
 and 8, 1 value for NDIV, namely 2, and one variant for
 RFACT.
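 Because xhpl runs every combination of these choices for
 each (N, NB, P x Q) triple, the total number of runs
 multiplies out quickly. Counting only the parameters shown
 in Example 1 above (and ignoring the broadcast, look-ahead
 and swapping variants of the later lines):

```python
# Illustrative: total runs xhpl performs for Example 1 above,
# given 3 problem sizes, 5 block sizes and 2 process grids
# (lines 5-12), ignoring broadcast/look-ahead/swap variants.
n_N, n_NB, n_grids = 3, 5, 2
n_pfact, n_nbmin, n_ndiv, n_rfact = 3, 4, 3, 3   # lines 14-21
runs = n_N * n_NB * n_grids * n_pfact * n_nbmin * n_ndiv * n_rfact
print(runs)
```

 This is why a sweeping Example 1 setup is only sensible with
 a negative threshold or small problem sizes; Example 2 keeps
 the count manageable.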
 
 In the main loop of the algorithm, the current panel of
 columns is broadcast in process rows using a virtual ring
 topology. HPL offers various choices, and one most likely
 wants to use the increasing ring modified topology, encoded
 as 1. 4 is also a good choice. Lines 22-23: (Example 1):

 1       # of broadcast
 1       BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

 This will cause HPL to broadcast the current panel using the
 increasing ring modified topology. Lines 22-23: (Example 2):

 2       # of broadcast
 0 4     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

 This will cause HPL to broadcast the current panel using the
 increasing ring virtual topology and the long message
 algorithm.
 
 Lines 24-25 allow one to specify the look-ahead depth used
 by HPL. A depth of 0 means that the next panel is factorized
 after the update by the current panel is completely
 finished. A depth of 1 means that the next panel is
 factorized immediately after being updated; the update by
 the current panel is then finished. A depth of k means that
 the k next panels are factorized immediately after being
 updated; the update by the current panel is then finished.
 It turns out that a depth of 1 seems to give the best
 results, but it may need a large problem size before one can
 see the performance gain. So use 1 if you do not know
 better; otherwise you may want to try 0. Look-ahead of
 depths 2 and larger will probably not give you better
 results. Lines 24-25: (Example 1):

 1       # of lookahead depth
 1       DEPTHs (>=0)

 This will cause HPL to use a look-ahead of depth 1.
 Lines 24-25: (Example 2):

 2       # of lookahead depth
 0 1     DEPTHs (>=0)

 This will cause HPL to use a look-ahead of depths 0 and 1.

    
 Lines 26-27 allow one to specify the swapping algorithm used
 by HPL for all tests. There are currently two swapping
 algorithms available, one based on "binary exchange" and the
 other one based on a "spread-roll" procedure (also called
 "long" below). For large problem sizes, the latter is likely
 to be more efficient. The user can also choose to mix both
 variants, that is, "binary-exchange" for a number of columns
 less than a threshold value, and then the "spread-roll"
 algorithm. This threshold value is then specified on Line
 27. Lines 26-27: (Example 1):

 1       SWAP (0=bin-exch,1=long,2=mix)
 60      swapping threshold

 This will cause HPL to use the "long" or "spread-roll"
 swapping algorithm. Note that a threshold is specified in
 that example but not used by HPL. Lines 26-27: (Example 2):

 2       SWAP (0=bin-exch,1=long,2=mix)
 60      swapping threshold

 This will cause HPL to use the "long" or "spread-roll"
 swapping algorithm as soon as there are more than 60 columns
 in the row panel. Otherwise, the "binary-exchange" algorithm
 will be used instead.

    
 Line 28 allows one to specify whether the upper triangle of
 the panel of columns should be stored in no-transposed or
 transposed form. Example:

 0            L1 in (0=transposed,1=no-transposed) form

 Line 29 allows one to specify whether the panel of rows U
 should be stored in no-transposed or transposed form.
 Example:

 0            U  in (0=transposed,1=no-transposed) form

 Line 30 enables/disables the equilibration phase. This
 option will not be used unless you selected 1 or 2 in Line
 26. Ex:

 1            Equilibration (0=no,1=yes)

 Line 31 allows one to specify the alignment in memory for
 the memory space allocated by HPL. On modern machines, one
 probably wants to use 4, 8 or 16. This may result in a tiny
 amount of memory wasted. Example:

 4       memory alignment in double (> 0)
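 Putting the 31 lines together, a complete HPL.dat assembled
 from the values discussed above might look as follows. The
 values (in particular Ns, NBs and the P x Q grid) are
 illustrative, meant for a quick 4-process sanity run rather
 than performance, and the free-form comment text after each
 value may differ from the stock file:

```text
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```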

==============================================================
 Guidelines:

 1) Figure out a good block size for the matrix-matrix
 multiply routine. The best method is to try a few out. If
 you happen to know the block size used by the matrix-matrix
 multiply routine, a small multiple of that block size will
 do fine.

    
 HPL uses the block size NB for the data distribution as well
 as for the computational granularity. From a data
 distribution point of view, the smaller NB, the better the
 load balance. You definitely want to stay away from very
 large values of NB. From a computation point of view, too
 small a value of NB may limit the computational performance
 by a large factor because almost no data reuse will occur in
 the highest level of the memory hierarchy. The number of
 messages will also increase. Efficient matrix-multiply
 routines are often internally blocked. Small multiples of
 this blocking factor are likely to be good block sizes for
 HPL. The bottom line is that "good" block sizes are almost
 always in the [32..256] interval. The best values depend on
 the computation / communication performance ratio of your
 system. To a much lesser extent, the problem size matters as
 well. Say, for example, you empirically found that 44 was a
 good block size with respect to performance. 88 or 132 are
 likely to give slightly better results for large problem
 sizes because of a slightly higher flop rate.
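 The "small multiples" advice above can be turned into a
 short list of candidates to try (illustrative helper; the
 kernel's internal blocking factor is assumed known, and real
 HPL runs remain the final judge):

```python
# Sketch of the guideline above: given the internal blocking
# factor of your matrix-multiply kernel, candidate HPL block
# sizes are its small multiples within the usual [32..256].
def candidate_nbs(kernel_block, lo=32, hi=256):
    return [k * kernel_block
            for k in range(1, hi // kernel_block + 1)
            if lo <= k * kernel_block <= hi]

print(candidate_nbs(44))   # e.g. the empirical 44 from the text
```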

    
 2) The process mapping should not matter if the nodes of
 your platform are single-processor computers. If these nodes
 are multi-processors, a row-major mapping is recommended.

 3) HPL likes "square" or slightly flat process grids. Unless
 you are using a very small process grid, stay away from the
 1-by-Q and P-by-1 process grids.

    
 4) Panel factorization parameters: a good start is the
 following for lines 14-21:

 1       # of panel fact
 1       PFACTs (0=left, 1=Crout, 2=Right)
 2       # of recursive stopping criterium
 4 8     NBMINs (>= 1)
 1       # of panels in recursion
 2       NDIVs
 1       # of recursive panel fact.
 2       RFACTs (0=left, 1=Crout, 2=Right)

    
 5) Broadcast parameters: at this time, it is far from
 obvious to me what the best setting is, so I would probably
 try them all. If I had to guess, I would probably start with
 the following for lines 22-23:

 2       # of broadcast
 1 3     BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)

 The best broadcast depends on your problem size and hardware
 performance. My take is that 4 or 5 may be competitive for
 machines featuring very fast nodes relative to the network.

    
 6) Look-ahead depth: as mentioned above, 0 or 1 are likely
 to be the best choices. This also depends on the problem
 size and machine configuration, so I would try "no
 look-ahead (0)" and "look-ahead of depth 1 (1)". That is,
 for lines 24-25:

 2       # of lookahead depth
 0 1     DEPTHs (>=0)

    
 7) Swapping: one can select only one of the three algorithms
 in the input file. Theoretically, mix (2) should win;
 however, long (1) might just be good enough. The difference
 should be small between those two, assuming a swapping
 threshold of the order of the block size (NB) selected. If
 this threshold is very large, HPL will use bin-exch (0) most
 of the time, and if it is very small (< NB), long (1) will
 always be used. In short, and assuming the block size (NB)
 used is, say, 60, I would choose for lines 26-27:

 2       SWAP (0=bin-exch,1=long,2=mix)
 60      swapping threshold

    
 I would also try the long variant. For a very small number
 of processes in every column of the process grid (say < 4),
 very little performance difference should be observable.

    
 8) Local storage: I do not think Line 28 matters. Pick 0 if
 in doubt. Line 29 is more important. It controls how the
 panel of rows should be stored. No doubt 0 is better. The
 caveat is that in that case the matrix-multiply function is
 called with ( Notrans, Trans, ... ), that is C := C - A B^T.
 Unless the computational kernel you are using has a very
 poor (with respect to performance) implementation of that
 case and is much more efficient with ( Notrans, Notrans,
 ... ), just pick 0 as well. So, my choice:

 0       L1 in (0=transposed,1=no-transposed) form
 0       U  in (0=transposed,1=no-transposed) form

    
 9) Equilibration: It is hard to tell whether equilibration
 should always be performed or not. Not knowing much about
 the random matrix generated, and because the overhead is so
 small compared to the possible gain, I turn it on all the
 time.

 1       Equilibration (0=no,1=yes)

    
 10) For alignment, 4 should be plenty, but just to be safe,
 one may want to pick 8 instead.

 8       memory alignment in double (> 0)

==============================================================