Revision c3cc6a4e papers/2014/reservation/paper.tex

\documentclass[times]{speauth}
\usepackage{relsize}
% \usepackage{moreverb}
% \usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
\usepackage{ctable}
\usepackage{cite}
% \usepackage[cmex10]{amsmath}
\usepackage{acronym}
\usepackage{graphicx}
\usepackage{multirow}
% \usepackage{balance}
\usepackage{algorithm2e}

\def\volumeyear{2014}

\acrodef{KWAPI}{KiloWatt API}
\acrodef{KWRanking}{KiloWatt Ranking}
\acrodef{VPC}{Virtual Private Cloud}
......
\acrodef{AWS}{Amazon Web Services}
\acrodef{VM}{Virtual Machine}
\acrodef{REST}{REpresentational State Transfer}
\acrodef{RPC}{Remote Procedure Call}
\acrodef{PUE}{Power Usage Effectiveness}
\acrodef{IPMI}{Intelligent Platform Management Interface}
\acrodef{PDU}{Power Distribution Unit}
\acrodef{ePDU}{enclosure PDU}
\acrodef{JSON}{JavaScript Object Notation}
\acrodef{RRD}{Round-Robin Database}

\begin{document}

\runningheads{F. Rossigneux \textit{et al.}}{A Resource Reservation System for OpenStack}

\title{A Resource Reservation System to Improve the\\Support for HPC Applications in OpenStack}

\author{Fran\c{c}ois Rossigneux, Laurent Lef\`{e}vre and Marcos Dias de Assun\c{c}\~ao}

\address{LIP, ENS de Lyon, University of Lyon, France}

\corraddr{46 all\'{e}e d'Italie, 69364, Lyon, FRANCE}

\begin{abstract}
A reservation system is important to enable users to plan the execution of their applications and for providers to deliver better performance guarantees. We present a reservation framework called Climate that interfaces with Nova, OpenStack's compute scheduler, and enables the provisioning of bare-metal resources. Climate manages reservations and their placement on physical hosts, taking into account several factors, including resource constraints and energy efficiency. For selecting the most energy-efficient resources, Climate uses a software framework termed \ac{KWAPI}. This work describes the overall software system that can be used for managing advance reservations and minimising energy consumption in OpenStack clouds.
\end{abstract}

\keywords{resource reservation; high-performance computing; OpenStack}

\maketitle

\section{Introduction}
\acresetall

Cloud computing \cite{ArmbrustCloud:2009} has become an important model for delivering the IT resources and services that organisations require to run their businesses. Amongst the claimed benefits of the cloud computing model, the most appealing derive from economies of scale and often include resource consolidation, elasticity, good availability, and wide geographical coverage. The on-demand resource provisioning scheme explored since the early days of cloud computing enables customers to request a number of resources --- often virtual machines, or storage and network capacity --- and pay by the hour of use. As technology and resource management techniques matured, more elaborate economic models have become available, where customers can reserve a number of machines for a long period at a discount price, or bid for resources whose price varies dynamically according to current demand (\textit{e.g.} AWS EC2 spot instances \cite{AmazonEC2Spot}).

Current cloud models pose challenges to providers who need to offer the resource elasticity that customers expect. Techniques such as advance reservation, which have been widely explored in other systems such as clusters and grids, are not commonplace in current cloud computing offers. Resource reservation may be advantageous to certain customers as it enables them to specify a start and finish time dictating when they need resources, and can help providers by offering means through which they can obtain better estimates of future resource usage.

In addition, although a large number of current applications and services can deal with the workload consolidation explored by providers via resource virtualisation, certain applications that demand \ac{HPC} are not fully portable to this scenario. They are generally resource intensive and sensitive to performance variations. The means employed by cloud providers to offer customers high and predictable performance mostly consist in deploying bare-metal resources or using specialised virtual machines placed in groups where high network throughput and low latency can be guaranteed. This model may seem in contrast with traditional cloud use cases as it results in large operational costs and provides little flexibility in terms of workload consolidation and resource elasticity. Reservations provide means for reliable allocation and allow customers to plan the execution of their applications. Current reservation models employed by public clouds, however, rely on reserving resources in advance for a long time period (\textit{i.e.} from one to three years).

In this paper, we describe a reservation framework for reserving and provisioning bare-metal resources using a popular open source cloud platform. The proposed solution, implemented as a configurable component of OpenStack\footnote{https://wiki.openstack.org}, provides reservation models that are more sophisticated and flexible than those currently offered by public cloud providers. The framework leverages \ac{KWAPI}, a software framework that monitors the energy consumption of data centre resources and interfaces with OpenStack's telemetry infrastructure.

% ----------------------------------------------------------------------------------------

\section{Background and Related Work}
\label{sec:related_work}

Bursts of requests during a few hours of the day have been noticed in several systems whose logs have been extensively studied \cite{FeitelsonPWA:2014}. Although bare-metal or specialised \acp{VM} minimise performance penalties for running \ac{HPC} applications on a cloud, providing the elasticity with which other cloud users are familiar can be prohibitive, and advance reservations may be explored to help minimise these costs. This section describes some of the current reservation models available for clouds and discusses their limitations. It also provides background information on OpenStack and the main components leveraged by the reservation framework.

\subsection{Reservation in Clouds}

The benefits and drawbacks of resource reservations have been extensively studied for various systems, such as clusters of computers \cite{SmithSchedARs:2000, LawsonMultiQueue:2002, MargoReservations:2000, RoblitzARPlacement:2006}, meta-schedulers \cite{SnellARMetaScheduling:2000}, computational grids \cite{ElmrothBrokerCCPE:2009,FarooqlaxityARs:2005,FosterGara:1999}, and virtual clusters and virtual infrastructure \cite{ChaseCOD:2003}; and have been applied under multiple scenarios, including co-allocation of resources \cite{NettoFlexibleARs:2007} and improving the performance predictability of certain applications \cite{WieczorekARWorkflow:2006}. As of writing, \ac{AWS}\footnote{http://aws.amazon.com/ec2/} offers cloud services that suit several of today's use cases and provides the richest set of reservation options for virtual machine instances. \ac{AWS} offers four models of allocating \ac{VM} instances, namely on-demand, reserved instances, spot instances and dedicated instances --- the latter are allocated within a \ac{VPC}. Under on-demand use, customers are charged on an hourly basis for the instances they allocate. Performance is not guaranteed as resources are shared among customers, and their performance depends on the cloud workload. Reserved instances can be requested at a discount price under the establishment of long-term contracts, and provide more guarantees in terms of performance than their on-demand counterparts. To request spot instances, a customer must specify a limit price --- often referred to as the bid price --- that she is willing to pay for a given instance type. If the bid price surpasses the spot market price, the user receives the requested instances. Existing spot instances can be destroyed if the spot price exceeds a user's bid. Instances created within a \ac{VPC} provide some degree of isolation at the network level, but their physical hosts may contain instances from multiple customers. To improve fault tolerance, it is possible to request dedicated instances at a premium, so that instances will not share the host. Under all the models described here, \ac{AWS} allows users to request \ac{HPC} instances optimised for processing, memory use or I/O, as well as instances with \acp{GPU}.

The OpenNebula cloud management tool \cite{OpenNebula:2011} provides reservation and scheduling services by using Haizea \cite{SotomayorLeases:2008}. Haizea supports multiple types of reservations (\textit{e.g.} immediate, advance and best-effort) and takes into account the time required to prepare and configure the resources occupied by the virtual machines (\textit{e.g.} the time to transfer virtual machine image files) so that a reservation can start at the exact time a user requires. When used in simulation mode, Haizea offers means to evaluate the scheduling impact of accepting a set of reservations. We found the Haizea model --- where a scheduler able to handle reservations is incorporated as a module of a cloud resource manager --- a good starting point for providing reservation and \ac{HPC} support in OpenStack. In the following sections we provide background information on OpenStack's compute management architecture and how we extend it with loosely coupled components to handle resource reservations.

\subsection{OpenStack}

OpenStack is an open source cloud computing platform suitable for both public and private settings. It manages and automates the deployment of virtual machines on pools of compute resources, can work with a range of virtualisation technologies, and can handle a variety of tenants. OpenStack has gained traction and received support from a wide development community, which has incorporated several features, including block storage, an image library, a network provisioning framework, and authentication and authorisation, among other services. In this work we focus mainly on the compute management service, called Nova\footnote{https://wiki.openstack.org/wiki/Nova}, and the telemetry infrastructure called Ceilometer\footnote{http://docs.openstack.org/developer/ceilometer/architecture.html}.

Nova manages a cloud computing fabric and is responsible for instantiating virtual machines on a pool of available hosts running \textit{nova-compute}. Nova uses its \textit{nova-scheduler} service to determine how to place virtual machine requests, and its API (\textit{i.e.} \textit{nova-api}) is used by clients to request virtual machine instances. In order to make a request for a \ac{VM}, a client needs to specify a number of requirements (\textit{e.g.} the flavour), which determine the instance to be created. By default, Nova uses a filter scheduler that first obtains a list of physical hosts, applies a set of filters to the list to select the servers that match the client's criteria, and then ranks the remaining servers according to a pre-configured weighing scheme. Nova comes with a set of pre-configured filters, but the choice of filters and weighing is configurable. Nova can also be configured to accept hints from a client to influence how hosts are filtered and ranked upon a new request.
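
To make the filter-then-weigh flow concrete, the sketch below reproduces its logic in plain Python. It is an illustration of the scheduling pipeline described above rather than Nova code, and the host attributes, filter and weigher used here are hypothetical.

\begin{verbatim}
# Minimal sketch of the filter scheduler logic (illustrative, not Nova code).

def schedule(hosts, request, filters, weighers):
    """Filter out unsuitable hosts, then rank the remaining ones."""
    candidates = [h for h in hosts if all(f(h, request) for f in filters)]
    return sorted(candidates,
                  key=lambda h: sum(w(h, request) for w in weighers),
                  reverse=True)

# Hypothetical filter and weigher.
def enough_ram(host, request):
    return host["free_ram_mb"] >= request["ram_mb"]

def free_ram_weigher(host, request):
    return host["free_ram_mb"]

hosts = [{"name": "node-1", "free_ram_mb": 2048},
         {"name": "node-2", "free_ram_mb": 8192}]
print(schedule(hosts, {"ram_mb": 1024}, [enough_ram], [free_ram_weigher]))
\end{verbatim}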

Ceilometer is OpenStack's framework for collecting performance metrics and information on resource consumption. It supports three data collection methods:

\begin{itemize}
\item \textbf{Bus listener agent}, which picks up events on OpenStack's notification bus and turns them into Ceilometer samples (\textit{e.g.} of cumulative, gauge or delta type) that can then be stored in the database or provided to an external system via the publishing pipeline.

\item \textbf{Push agents}, which are more intrusive, as they must be deployed on the monitored nodes, from where they push data remotely to the collector.

\item \textbf{Polling agents}, which poll APIs or other tools to collect information about monitored resources.
\end{itemize}

The last two methods depend on a combination of a central agent, compute agents and a collector. The compute agents run on nodes and retrieve information about resource usage related to a given virtual machine instance and a resource owner. The central agent, on the other hand, executes \textit{pollsters} on the management server to retrieve data that is not linked to a particular instance. Pollsters are software components executed, for example, to poll resources by using an API or other methods. The Ceilometer database, which can be queried via the Ceilometer API, allows an external system to view the history of a resource's metrics, and enables the system to set and receive alarms.
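
As an illustration of the pollster concept, the sketch below shows the rough shape of a polling plugin, loosely following the Ceilometer plugin interface of that period; the module paths, \texttt{Sample} fields and probe-reading helper are indicative and should be checked against the Ceilometer version in use.

\begin{verbatim}
# Sketch of a Ceilometer pollster yielding power samples (indicative API).
import datetime

from ceilometer import plugin, sample


def read_power_probe(resource_id):
    """Stub standing in for an actual probe query (hypothetical)."""
    return 150.0


class PowerPollster(plugin.PollsterBase):
    """Polls external probes and yields gauge samples in watts."""

    def get_samples(self, manager, cache, resources):
        for resource_id in resources:
            yield sample.Sample(
                name='power',
                type=sample.TYPE_GAUGE,
                unit='W',
                volume=read_power_probe(resource_id),
                user_id=None,
                project_id=None,
                resource_id=resource_id,
                timestamp=datetime.datetime.utcnow().isoformat(),
                resource_metadata={},
            )
\end{verbatim}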

% ----------------------------------------------------------------------------------------

\section{Reservation Framework for OpenStack}
\label{sec:reservation_system}

The reservation system --- termed Climate --- provides means for reserving and deploying bare-metal resources using OpenStack. The framework, whose architecture is depicted in Figure~\ref{fig:reservation_architecture}, has been used as a basis for implementing other types of reservation systems for OpenStack. Climate aims to provide support for scheduling advance reservations, taking into account the energy efficiency of the underlying resources and without being intrusive to Nova. In order to do so, Climate comprises the following components:

\begin{itemize}
\item \textbf{Reservation API}: used by client applications and users to reserve resources and query the status of reservations.
\item \textbf{Climate Inventory}: a service that stores information about the physical nodes that can be used for reservations.
\item \textbf{Climate Scheduler}: responsible for scheduling reservation requests on the available nodes.
\item \textbf{Energy-Consumption Monitoring Framework}: the component responsible for monitoring the energy consumption of physical resources and interfacing with OpenStack's telemetry infrastructure.
\end{itemize}

The next sections provide more details about each Climate component and how they interact with one another.

\begin{figure}[htb]
\centering
\includegraphics[width=0.85\linewidth]{figs/architecture.pdf}
\caption{Architecture of the proposed reservation framework.}
\label{fig:reservation_architecture}
\end{figure}

\subsection{Reservation API} 

Climate provides a \ac{REST} API that enables users and client applications to manage reservation requests. When requesting a reservation --- Step 1 in Figure~\ref{fig:reservation_architecture} --- a client should supply the following parameters:

\begin{itemize}
\item \textbf{host\_properties}: characteristics of the required servers;
\item \textbf{start\_time}: earliest time at which the reservation can start;
\item \textbf{end\_time}: latest time for completing the reservation;
\item \textbf{duration}: time duration of the reservation; and
\item \textbf{quantity}: number of servers required.
\end{itemize}

If \textbf{start\_time} and \textbf{end\_time} are not specified, the request is treated as an immediate reservation, thus starting at the current time. The \ac{REST} API implements the calls described in Table~\ref{tab:climate_api} and relies on two backend services, namely \textbf{Climate Inventory} and \textbf{Climate Scheduler}, for storing information about hosts and managing reservations, respectively. Handling a reservation request is performed in two phases. First, the API queries Climate Inventory to discover the available hosts that match the criteria specified by \textbf{host\_properties}. The Inventory interfaces with OpenStack Nova to keep the list of available hosts up to date --- as shown by Step 2 in Figure~\ref{fig:reservation_architecture}. Second, the filtered list of hosts, along with the other request parameters, is given to Climate Scheduler, which then finds a time period over which the reservation can be granted.
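
For illustration, a reservation request to the API could be issued as in the sketch below (here with Python 2's \texttt{urllib2}, as used by OpenStack at the time); the endpoint path, port and response handling are assumptions and would need to match the deployed Climate API.

\begin{verbatim}
# Hypothetical example of creating a reservation via the Climate REST API.
import json
import urllib2

request_body = {
    "host_properties": [">=", "$vcpus", 24],  # filter on server traits
    "start_time": "2014-06-01T08:00:00",      # earliest start
    "end_time": "2014-06-01T20:00:00",        # latest completion
    "duration": 7200,                         # seconds
    "quantity": 4,                            # number of servers
}

req = urllib2.Request(
    "http://climate-api:1234/v1/reservations",  # assumed endpoint
    json.dumps(request_body),
    {"Content-Type": "application/json",
     "X-Auth-Token": "KEYSTONE_TOKEN"})         # Keystone-secured, as above
print(urllib2.urlopen(req).read())
\end{verbatim}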

\begin{table}[hbt]
\caption{Climate REST API calls}
\label{tab:climate_api}
\centering
......
\end{tabular}
\end{table}

The API is not only an interface for tenants. Nova uses it to find available hosts and to determine the set of resources associated with a reservation when a client claims the reserved resources --- Step 3 in Figure~\ref{fig:reservation_architecture} --- and modules that need to query the reservation schedule do so via the API. Moreover, the API uses the same security infrastructure provided by OpenStack, including messages carrying Keystone tokens\footnote{http://docs.openstack.org/developer/keystone/}, which are used to allow a client application to discover the hosts associated with a reservation.

\subsection{Climate Inventory}

Climate Inventory is an \ac{RPC} service used by the reservation API to discover the hosts that are possible candidates to serve a reservation request. The candidates are servers that both match the host properties specified in the request and are available during the requested time (\textit{i.e.} their \textbf{running\_vm} field in Nova's database is set to 0). To do so, the Inventory uses NovaClient, which queries Nova's API and filters the list of potential candidates using the \textbf{json\_filter} syntax specified by Nova.
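
As an example of this filter syntax, a \textbf{host\_properties} value selecting hosts with at least 24 cores and 48\,GB of free RAM could be expressed as the query below; the exact attribute names depend on the host-state fields exposed by the deployed Nova version.

\begin{verbatim}
# host_properties expressed in Nova's json_filter syntax (Python literal).
# Attribute names such as $vcpus_total and $free_ram_mb follow Nova's
# host-state fields and may differ across releases.
host_properties = [
    "and",
    [">=", "$vcpus_total", 24],
    [">=", "$free_ram_mb", 49152],
]
\end{verbatim}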

As mentioned beforehand, Climate Inventory uses \textbf{host\_properties} as filtering criteria. In order to specify the required hosts, a user needs to create such a filter. To ease this task, the \textbf{/properties/} call of the reservation API provides a catalogue of the properties used for filtering. By default, the call shows the properties used for filtering based on the list of hosts registered with Nova, but an admin can choose to disable or enable certain properties.

\subsection{Climate Scheduler}

This component manages the reservation schedule and extends Nova's filtering scheduler by providing a set of resource filters and ranking (or weighing) criteria for handling reservation requests, as described below.

\subsubsection{Filtering}
The Nova filter accepts a scheduling hint --- in our case used to provide a reservation ID created by Climate. When an ID is provided, the filter uses the reservation API --- supplying an admin Keystone token --- to retrieve the list of hosts associated with the reservation (Step 4 in Figure~\ref{fig:reservation_architecture}). If no reservation ID is given, Nova still uses the reservation API to establish the list of hosts that have been reserved; only the hosts that are not in this list can be used to serve the request.
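
The sketch below illustrates the shape such a filter can take; the class and method names follow Nova's filter plugin interface of that period, while the Climate lookups are hypothetical stand-ins for the reservation API calls.

\begin{verbatim}
# Sketch of a reservation-aware Nova host filter (indicative plugin API).
from nova.scheduler import filters


def climate_hosts(reservation_id):
    """Stub for a Climate API lookup using an admin token (hypothetical)."""
    return {"node-1", "node-2"}


def climate_reserved_hosts():
    """Stub returning all currently reserved hosts (hypothetical)."""
    return {"node-3"}


class ClimateReservationFilter(filters.BaseHostFilter):
    """Accepts reserved hosts for their reservation, hides them otherwise."""

    def host_passes(self, host_state, filter_properties):
        hints = filter_properties.get('scheduler_hints') or {}
        reservation_id = hints.get('reservation')
        if reservation_id:
            return host_state.host in climate_hosts(reservation_id)
        # No hint: only hosts that are not reserved may serve the request.
        return host_state.host not in climate_reserved_hosts()
\end{verbatim}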

\subsubsection{Ranking}
We created two Nova weighers. The first weigher is ignored by reservation requests and ranks machines according to their free time until the next reservation. If handling a request for a non-reserved instance, the weigher tries to place the instance on the host that is available for the longest period. This helps minimise the chance of having to migrate the instance at a later time to vacate its host for a reservation. The second weigher, termed \ac{KWRanking}, ranks machines by their power efficiency (\textit{i.e.} FLOPS/Watt) and relies on:

\begin{itemize}
\item A software infrastructure called \ac{KWAPI}, built for monitoring the power consumed by the resources of a data centre and for interfacing with Ceilometer to provide power consumption data. Ceilometer is OpenStack's telemetry infrastructure used to monitor performance metrics\footnote{https://wiki.openstack.org/wiki/Ceilometer}.

\item A benchmark executed on the machines to determine their delivered performance per watt.
\end{itemize}

The goal of this weigher is to prioritise the use of the most power-efficient machines, and to create windows during which the least efficient resources can be powered off or placed in low power consumption modes. Climate provides an API that enables switching resources on and off, or putting them into standby mode. Choosing between placing a resource in standby mode and switching it off depends on the length of time during which it is expected to remain idle. As switching a resource back on to serve an impending request often takes time, means to estimate future workload are generally important.
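
A minimal sketch of the \ac{KWRanking} weigher's shape is shown below; it follows Nova's weigher plugin interface of that period, and the efficiency lookup is a hypothetical stand-in for the Benchmark Execution query described later.

\begin{verbatim}
# Sketch of a power-efficiency weigher (indicative Nova plugin API).
from nova.scheduler import weights


def flops_per_watt(host):
    """Stub for the Benchmark Execution query (hypothetical values)."""
    return {"node-1": 4.5e8, "node-2": 6.7e8}.get(host, 0.0)


class KWRankingWeigher(weights.BaseHostWeigher):
    """Ranks hosts by delivered FLOPS per watt; higher weight wins."""

    def _weigh_object(self, host_state, weight_properties):
        return flops_per_watt(host_state.host)
\end{verbatim}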

To determine the most efficient hosts, \ac{KWRanking} queries the \textbf{Benchmark Execution} module --- Step 5 of Figure~\ref{fig:reservation_architecture} --- which returns the hosts' FLOPS/Watt information. Benchmark Execution obtains the performance-per-watt information about hosts by triggering the execution of the benchmarks requested by the scheduler to measure host performance, and by gathering information on their power consumption from Ceilometer --- Steps 6 and 7, respectively.
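
As a sketch of the ranking computation, the snippet below combines benchmark results with mean power readings to produce a FLOPS/Watt score per host; both inputs are hypothetical placeholders for the Benchmark Execution and Ceilometer queries.

\begin{verbatim}
# Illustrative FLOPS/Watt ranking; the inputs stand in for benchmark
# results (Step 5) and Ceilometer power readings (Steps 6 and 7).
benchmark_gflops = {"node-1": 94.0, "node-2": 121.5}   # hypothetical
mean_power_watts = {"node-1": 210.0, "node-2": 180.0}  # hypothetical

efficiency = {
    host: (benchmark_gflops[host] * 1e9) / mean_power_watts[host]
    for host in benchmark_gflops
}

# Most power-efficient hosts first, as used to weigh candidates.
print(sorted(efficiency, key=efficiency.get, reverse=True))
\end{verbatim}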

% Detail the benchmarks here...

The information on the power consumed by hosts is provided to Ceilometer by \ac{KWAPI}, as depicted by Step 8. Although \ac{KWAPI} is described in detail in previous work \cite{Rossigneux:2014}, Section~\ref{sec:kwapi} presents an overview and describes how it fits into the reservation system for OpenStack.

% ----------------------------------------------------------------------------------------

\section{Energy-Consumption Monitoring Framework}
\label{sec:kwapi}

\ac{KWAPI} is a generic and flexible framework that interfaces with OpenStack to provide it with power consumption information collected from multiple heterogeneous probes. It is integrated with Ceilometer, OpenStack's component conceived to provide a framework for collecting a large range of metrics for metering purposes\footnote{https://wiki.openstack.org/wiki/Ceilometer}. The \ac{KWAPI} architecture, depicted in Figure~\ref{fig:architecture}, follows a publish/subscribe model based on a set of layers:

\begin{itemize}
\item \textbf{Drivers}: data producers responsible for measuring the power consumption of monitored resources and providing the collected data to consumers via a communication bus.
\item \textbf{Data Consumers} (or \textbf{Consumers} for short): components that subscribe to receive and process the measurement information.
\end{itemize}

\begin{figure}[!htb]
\centering
\includegraphics[width=0.95\linewidth]{figs/kwapi_architecture.pdf}
\caption{Overview of \ac{KWAPI}'s architecture.}
\label{fig:architecture}
\end{figure}

The communication between layers is handled by a bus. Data consumers can subscribe to receive information collected by drivers from multiple sites. Both drivers and consumers are easily extensible to support, respectively, several types of wattmeters (\textit{i.e.} energy consumption probes) and additional data processing services. A \ac{REST} API is designed as a data consumer to provide a programming interface for developers and system administrators. It interfaces with OpenStack by providing the information (\textit{i.e.} by polling monitored devices) required by a \textit{\ac{KWAPI} Pollster} to feed Ceilometer. The following sections provide more details on the main architecture components and their relationship with OpenStack Ceilometer.

\subsection{Driver Layer}

Drivers are threads initialised by a Driver Manager with a set of parameters loaded from a file compliant with the OpenStack configuration format. These parameters are used to query the meters (\textit{e.g.} IP address and port) and to determine the sensor ID to be used in the collected metrics. The measurements that a driver obtains are represented as \ac{JSON} dictionaries that maintain a small footprint and can be easily parsed. The size of the dictionaries varies depending on the number of fields set by the drivers (\textit{i.e.} whether message signing is enabled). Drivers can manage incidents themselves, but the manager also checks periodically whether all threads are active, restarting them if necessary. It is important to avoid losing measurements because the reported information is in W instead of kWh.
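
For illustration, a measurement emitted by a driver could look like the dictionary below; the field names are indicative of the kind of payload described above rather than \ac{KWAPI}'s exact wire format.

\begin{verbatim}
# Indicative shape of a driver measurement (not KWAPI's exact format).
measurement = {
    "probe_id": "lyon.sagittaire-12",  # sensor ID from the config file
    "w": 173.5,                        # instantaneous power in watts
    "timestamp": 1402321445.2,
    # Optional field, present only when message signing is enabled:
    "signature": "0f6dca4b...",
}
\end{verbatim}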

Wattmeters available in the market vary in terms of physical interconnection, communication protocols, packaging and the precision of the measurements they take. They are mostly packaged in multiple-outlet power strips called \acp{PDU} or \acp{ePDU}, and more recently in the \ac{IPMI} cards embedded in the computers themselves. Support for several types of wattmeter has been implemented, which drivers can use to interface with a wide range of equipment. In our work, we initially used \ac{IPMI} in Nova to shut down and turn on compute nodes, but nowadays we also use it to query a computer chassis remotely.
391 214

  
392
 
393
\subsection{Performance Metrics}
215
Although Ethernet is generally used to transport \ac{IPMI} or SNMP packets over IP, USB and RS-232 serial links are also common. Wattmeters that use Ethernet are generally connected to an administration network (isolated from the data centre main data network). Moreover, wattmeters may differ in the manner they operate; some equipments send measurements to a management node on a regularly basis (push mode), whereas others respond to queries (pull mode). Other characteristics that differ across wattmeters include: 

  
Table \ref{tab:wattmeters} shows the characteristics of the equipment we deployed and used with \ac{KWAPI} in our cloud infrastructure.
\subsection{Performance Metrics}

To evaluate the potential for energy savings from introducing support for resource reservations, we first consider the worst-case scenario and quantify the time resources remain idle (\textit{i.e.} $res_{idle}$) from the start of an evaluation $t_{start}$ to its end $t_{end}$, the submission time of the last request, which is given by:

  
\[ res_{idle} = \int_{t_{start}}^{t_{end}} (ct - cu) \,dt \]
\noindent where $ct$ and $cu$ are, respectively, the site machine capacity and the number of machines in use at time $t$. As we consider that compute nodes would ideally be switched off when idle, $res_{idle}$ is taken as an upper bound on the potential savings. The actual energy savings $e_{savings}$ correspond to the amount of time machines are in fact switched off during the interval from $t_{start}$ to $t_{end}$, \textit{i.e.}, $e_{savings} = \int_{t_{start}}^{t_{end}} c_{off} \,dt$, where $c_{off}$ is the number of powered-off machines.
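In the simulation, this integral is evaluated in discrete time; a minimal sketch of the computation, assuming a fixed sampling step, is:

\begin{verbatim}
# Discrete approximation of res_idle: sum of (capacity - in use)
# over fixed-size steps, yielding machine-seconds of idleness.
def idle_resource_time(in_use, capacity, step_s=60):
    return sum((capacity - used) * step_s for used in in_use)

# Example with the Lyon capacity of 195 machines.
idle = idle_resource_time([100, 150, 195], capacity=195)
\end{verbatim}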

  
\begin{table}
\centering
\caption{Wattmeter infrastructure}
\label{tab:wattmeters}
\begin{footnotesize}
\begin{tabular}{llcc}
\toprule
\multirow{2}{18mm}{\textbf{Device Name}} & \multirow{2}{30mm}{\textbf{Interface}} & \multirow{2}{12mm}{\centering{\textbf{Refresh Time (s)}}} & \multirow{2}{10mm}{\centering{\textbf{Precision (W)}}}  \\
& & & \\
\midrule
Dell iDrac6    & IPMI / Ethernet           & 5    & 7 \\
\midrule
Eaton          & Serial, SNMP via Ethernet & 5    & 1 \\
\midrule
OmegaWatt      & IrDA Serial               & 1    & 0.125 \\
\midrule
Schleifenbauer & SNMP via Ethernet         & 3    & 0.1 \\
\midrule
Watts Up?      & Proprietary via USB       & 1    & 0.1 \\
\midrule
ZEZ LMG450     & Serial                    & 0.05 & 0.01 \\
\bottomrule
\end{tabular}
\end{footnotesize}
\end{table}

  
Switching resources off, however, may lead to a scenario where they must be switched back on to serve an arriving request. Booting up resources takes time and delays making them available to users, especially under immediate reservations. Therefore, we assess the impact of switching resources off on the quality of service perceived by users by computing the aggregate delay $delay_{req}$ over the set of affected requests $R_{delay}$, which is given by:

  
% \[ delay_{req} = \sum_{r \in R_{delay}} {machines_r} \times \frac{{time\_boot_r}}{duration_r} \]
\[ delay_{req} = \sum_{r \in R_{delay}} \left( r_{dep\_end} - r_{start\_time} \right) \]

  
\noindent where $r_{dep\_start}$ and $r_{dep\_end}$ are, respectively, the time at which deployment of the requested nodes started and the time when the last machine became ready to be used, and $r_{start\_time}$ is the time when the request was supposed to start. As discussed earlier, machine deployment times are modelled using information collected from Kadeploy3 traces.
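Over a set of request records, the metric reduces to a simple sum; a sketch with hypothetical timestamps (in seconds) follows.

\begin{verbatim}
# Aggregate delay: deployment end minus intended start, summed
# over the affected requests (timestamps in seconds).
def aggregate_delay(requests):
    return sum(r["dep_end"] - r["start_time"] for r in requests)

delayed = [{"start_time": 0,   "dep_end": 510},
           {"start_time": 300, "dep_end": 780}]
total = aggregate_delay(delayed)  # 990 seconds
\end{verbatim}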
\subsection{Data Consumers}

A data consumer retrieves and processes the measurements taken by drivers and provided via the bus. Consumers expose the information to other services, including Ceilometer and visualisation tools. By using a system of prefixes, consumers can subscribe to all producers or to a subset of them. When receiving a message, a consumer verifies its signature, extracts the content and processes the data. By default, \ac{KWAPI} provides two data consumers, namely the \ac{KWAPI} REST API (used to interface with Ceilometer) and a visualisation consumer.

  
\subsubsection{REST API:}

  
The API consumer computes the number of kWh for each driver probe, adds a timestamp, and stores the last value in watts. If a driver has not provided measurements for a long time, the corresponding data is removed. The REST API allows an external system to retrieve the names of probes, measurements in W or kWh, and timestamps. The API is secured by OpenStack Keystone tokens\footnote{http://keystone.openstack.org}, whereby the consumer needs to verify the validity of a token before sending a response to the system.
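A client interaction might look as follows; the endpoint path and port are assumptions, and the Keystone token is a placeholder.

\begin{verbatim}
import requests

# Retrieve probe measurements from the KWAPI REST API consumer,
# authenticating with a Keystone token (X-Auth-Token header).
headers = {"X-Auth-Token": "<keystone-token>"}
resp = requests.get("http://kwapi.example.org:5000/v1/probes",
                    headers=headers)
resp.raise_for_status()
for probe, data in resp.json().items():
    print(probe, data)
\end{verbatim}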

  
\subsubsection{Visualisation:}
The visualisation consumer builds \ac{RRD} files from the received measurements and generates graphs that show the energy consumption over a given period, together with additional information such as the average electricity consumption, minimum and maximum watt values, the last value, the total energy and the cost in Euros. \ac{RRD} files are of fixed size and store several collections of metrics at different granularities. A web interface displays the generated graphs, and a cache mechanism triggers the creation of graphs during queries only if they are out of date. These visualisation resources offer quick feedback to administrators and users during the execution of tasks and applications.
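A sketch of how such a fixed-size \ac{RRD} could be created with the Python \texttt{rrdtool} bindings follows; the step, heartbeat and retention parameters are assumptions rather than the consumer's actual settings.

\begin{verbatim}
import rrdtool

# One RRD per probe: 5-second samples kept for a day, plus hourly
# averages kept for a year (all parameters are illustrative).
rrdtool.create("probe.rrd", "--step", "5",
               "DS:watts:GAUGE:10:0:U",     # gauge, 10 s heartbeat
               "RRA:AVERAGE:0.5:1:17280",   # 1 day of raw samples
               "RRA:AVERAGE:0.5:720:8760")  # 1 year of hourly means
rrdtool.update("probe.rrd", "N:142.5")      # 'N' = timestamp now
\end{verbatim}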

  
\subsection{Evaluation Results}

Figure \ref{fig:potential_savings} summarises the results for potential energy savings. As the scheduling strategy switches resources off almost immediately once it identifies that they have remained idle for a period and will continue to be idle within the considered time horizon, it is able to realise almost all of the potential savings under both cloud and reservation scenarios. This simple policy does not account for the possibly high costs of powering resources off and on, such as issues related to air conditioning and power supply. Even though the strategy is simple, the cloud and reservation scenarios present different results regarding quality of service.

  
\begin{figure}[htb]
\centering 
\includegraphics[width=0.95\linewidth]{figs/potential_saving.pdf} 
\caption{Potential savings and aggregate off periods.}
\label{fig:potential_savings}
\end{figure}
As shown in Figure~\ref{fig:request_delay}, the request delay is substantially reduced under the scenarios with resource reservations. This is because, with reservations, resources can be switched on before a reservation starts, and hence the system does not spend reserved time deploying the environment.

\begin{figure}[htb]
\centering 
\includegraphics[width=0.95\linewidth]{figs/request_delay.pdf} 
\caption{Aggregate request delay in resource/hour.}
\label{fig:request_delay}
\end{figure}

\subsection{Internal Communication Bus}
\ac{KWAPI} uses ZeroMQ \cite{HintjensZeroMQ:2013}, a fast brokerless messaging framework written in C++, where transmitters play the role of buffers. ZeroMQ supports a wide range of transports, including cross-thread communication, IPC, and TCP, and switching from one to another is straightforward. ZeroMQ also provides several design patterns, such as publish/subscribe and request/response. As mentioned earlier, in our publish/subscribe architecture drivers are publishers and data consumers are subscribers. If no data consumer is subscribed to receive data from a given driver, the latter does not send any information over the network.
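A minimal publish/subscribe sketch using the \texttt{pyzmq} bindings is shown below; the topic-prefix convention and the endpoints are illustrative.

\begin{verbatim}
import zmq

ctx = zmq.Context()

# Driver side: publish each measurement under a per-probe topic.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")
pub.send_string('lyon.node-1 {"w": 142.5}')  # dropped if no subscriber

# Consumer side: a prefix subscription selects a subset of probes.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "lyon.")
\end{verbatim}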

  
Moreover, one or more optional forwarders can be installed between drivers and data consumers to minimise network usage. Forwarders act as special data consumers that subscribe to receive information from a driver and multicast it to all regular data consumers subscribed to that information. Forwarders enable the design of complex topologies and the optimisation of network usage when handling data from multiple sites. They can also be used to bypass network isolation problems and to perform load balancing.
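With ZeroMQ, such a forwarder can be sketched as an XSUB/XPUB proxy; the endpoints below are placeholders.

\begin{verbatim}
import zmq

# Minimal forwarder: subscribes upstream (drivers) and re-publishes
# downstream (consumers), relaying subscriptions upstream as well.
ctx = zmq.Context()
frontend = ctx.socket(zmq.XSUB)
frontend.connect("tcp://driver-host:5556")
backend = ctx.socket(zmq.XPUB)
backend.bind("tcp://*:5557")
zmq.proxy(frontend, backend)  # blocks, forwarding in both directions
\end{verbatim}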

  
\subsection{Interface with Ceilometer}

  
We opted for integrating \ac{KWAPI} with an existing open source cloud platform to ease deployment and use. Leveraging the capabilities offered by OpenStack can help the adoption of a monitoring system and reduce its learning curve.

  
Ceilometer's central agent and a dedicated pollster (\textit{i.e.} the \ac{KWAPI} Pollster) are used to publish and store energy metrics in Ceilometer's database. The pollster queries the REST API data consumer and publishes cumulative (kWh) and gauge (W) counters that are not associated with a particular tenant, since a server can host multiple clients simultaneously.
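The shape of the published counters can be sketched as follows; this illustrates the cumulative/gauge pair rather than reproducing Ceilometer's actual sample API.

\begin{verbatim}
import datetime

# Counters a pollster might publish for one probe: a cumulative
# energy counter (kWh) and a gauge power counter (W), with no
# tenant attached (project_id is None).
def energy_counters(probe_id, kwh, watts):
    now = datetime.datetime.utcnow().isoformat()
    common = {"resource_id": probe_id, "project_id": None,
              "timestamp": now}
    return [dict(common, name="energy", type="cumulative",
                 unit="kWh", volume=kwh),
            dict(common, name="power", type="gauge",
                 unit="W", volume=watts)]
\end{verbatim}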

  
Depending on the number of monitored devices and the frequency at which measurements are taken, wattmeters can generate a large amount of data, thus demanding storage capacity for further processing and analysis. Management systems often store and pre-process data locally on the monitored nodes, but such an approach can increase CPU utilisation and hence influence the very power consumption being measured. In addition, resource managers may switch idle nodes off or put them in standby mode to save energy, which makes them unavailable for processing. Centralised storage, on the other hand, allows for faster data access and processing, but can generate more traffic, given that measurements need to be continuously transferred over the network to a central point.

  
Ceilometer uses its own central database, which we use here to store the energy consumption metrics. In this way, systems that interface with OpenStack's Ceilometer, including Nova, can easily retrieve the data. It is important to note that, even though Ceilometer provides the notion of a central repository for metrics, it also offers a database abstraction that enables the use of distributed systems such as Apache Hadoop HDFS, Apache Cassandra, and MongoDB.

  
The granularity at which measurements are taken and metrics are computed is another important factor, because user needs vary depending on what they wish to evaluate. Taking measurements at one-second intervals or shorter is common in several scenarios, which can be challenging in an infrastructure comprising hundreds or thousands of nodes and demands efficient and scalable mechanisms for transferring information on power consumption. Hence, in the next section we evaluate the throughput of \ac{KWAPI} under a few scenarios.

  

  
% Describe machine wake-up/shutdown here...

  
% ----------------------------------------------------------------------------------------

  

  
\section{Conclusion}
\label{sec:conclusion}

  
......

  
This research is supported by the French Fonds national pour la Soci\'{e}t\'{e} Num\'{e}rique (FSN) XLcloud project. Some experiments presented in this paper were carried out on the Grid'5000 experimental testbed, developed under the Inria ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

  
\bibliographystyle{wileyj}
%\balance
\bibliography{references}

  
