Our Journey on Data Center Efficiency
If you run a datacenter, operational efficiency matters to you. If you have not looked at it yet, I urge you to do so.
Here I will talk about our journey to an efficient datacenter.
"A well-defined problem is 50 percent solved"
We first looked at our operational expenditures (OpEx). We prepared a monthly OpEx report (in Excel) and saw that energy consumption accounts for roughly 1/3 of overall OpEx. (The rest is labor and maintenance for software and hardware IT components, each approx. 1/3 of the overall.)
PUE (Power Usage Effectiveness) is a well-established gauge of infrastructure efficiency, so we calculated it according to "The Green Grid" guidelines. We even published a dashboard for our datacenter metrics, built with the handsome Dashing framework.
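As a reminder of what the metric means, PUE is the total facility energy divided by the energy delivered to IT equipment. Here is a minimal sketch of that ratio in Python. The function name and the monthly readings are made up for illustration; they are not our real numbers.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy (same period, same unit)."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total facility energy cannot be below IT energy")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical monthly readings in kWh, for illustration only.
monthly_total_kwh = 180_000.0
monthly_it_kwh = 100_000.0
print(f"PUE = {pue(monthly_total_kwh, monthly_it_kwh):.2f}")
```

A value of 1.0 would mean every kWh goes to IT equipment; everything above that is overhead such as cooling and power distribution.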
We attached lots of probes to our energy analyzers. Basically, we used SNMP to collect the data (via small bash scripts), RRD graphs for visual reporting, and Nagios for checks and notifications.
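Our own checks were small bash scripts, so the following Python version is only a sketch of the idea, not what we ran. It maps a power reading to the standard Nagios plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and builds a status line with performance data. The function name, the thresholds and the `power_kw` label are assumptions for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_power(reading_kw, warn_kw, crit_kw):
    """Return (exit_code, status_line) for one power reading in kW.

    reading_kw is None when the probe returned nothing (hypothetical convention).
    """
    if reading_kw is None:
        return UNKNOWN, "UNKNOWN - no reading from energy analyzer"
    if reading_kw >= crit_kw:
        code = CRITICAL
    elif reading_kw >= warn_kw:
        code = WARNING
    else:
        code = OK
    # Text after "|" is performance data: label=value;warn;crit
    line = (
        f"{STATUS_NAMES[code]} - power {reading_kw:.1f} kW"
        f" | power_kw={reading_kw:.1f};{warn_kw};{crit_kw}"
    )
    return code, line


# Hypothetical reading and thresholds, for illustration only.
code, line = check_power(45.0, warn_kw=40.0, crit_kw=50.0)
print(line)
```

A real plugin would print the status line and then exit with the returned code, which is how Nagios decides whether to notify.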
We have now migrated to a measurable and reportable datacenter.
Now we have two important routes to take:
1- Infrastructure Efficiency
2- IT efficiency
We took both (you should too). In fact, I recommend Emerson's Energy Logic initiative here. Please look at it.
Generally, IT devices report temperature and power usage data. They have management ports, and you can retrieve this data through those interfaces. You may have already heard of iLO and iDRAC as names of such interfaces. They implement the IPMI specification. I am sure storage and network devices have similar standards (proprietary or not).
Via IPMI, we collected power and thermal data from our IT devices. Can you imagine what can be done with this data?
- You can see server utilization, because CPU energy usage changes with the load on the server. From the power consumption you can evaluate how busy a server is. This way we spotted ghost/idle servers in the datacenter. These are candidates for removal or consolidation/virtualization. We also identified the top energy consumers. These are candidates for renewal.
- You can see hot spots at the server inlet level. We identified hot servers and also saw the temperature variance across the system room. This lets us make cooling decisions for the room. For example, we may be overcooling the system room; if so, we can raise the set points of the cooling units. ASHRAE is an organization that publishes thermal guidelines; it gives 18-27 Celsius for server inlet temperatures. Be careful: room temperature is not what matters, inlet temperature is.
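The inlet check described above can be sketched in a few lines of Python. The 18-27 Celsius range is the one quoted from ASHRAE; the server names, readings and function names are hypothetical and only there to show the idea.

```python
from statistics import mean

# Inlet temperature range quoted above from the ASHRAE guidelines.
INLET_MIN_C = 18.0
INLET_MAX_C = 27.0


def classify_inlet(temp_c: float) -> str:
    """Classify one server inlet temperature against the 18-27 Celsius range."""
    if temp_c > INLET_MAX_C:
        return "above_range"  # hot spot candidate
    if temp_c < INLET_MIN_C:
        return "below_range"  # possible overcooling
    return "in_range"


def summarize(inlets: dict) -> dict:
    """Summarize inlet readings given as {server_name: temperature_celsius}."""
    temps = list(inlets.values())
    return {
        "mean_c": mean(temps),
        "spread_c": max(temps) - min(temps),  # variance across the room, as a simple range
        "above_range": sorted(n for n, t in inlets.items() if classify_inlet(t) == "above_range"),
        "below_range": sorted(n for n, t in inlets.items() if classify_inlet(t) == "below_range"),
    }


# Hypothetical readings, for illustration only.
readings = {"srv-a": 17.0, "srv-b": 22.5, "srv-c": 29.0}
print(summarize(readings))
```

Many servers in `below_range` hint that the set points can go up, while servers in `above_range` are the hot spots to fix first.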
Another thing is that, through our monitoring software, we identified underutilized servers (yes, besides IPMI, you can do this by monitoring CPU utilization). We reclaimed unused capacity, especially from virtualized environments.
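To give an idea of how such a list can be produced, here is a minimal Python sketch that flags servers whose CPU utilization stays low over a sampling period. The thresholds, server names and samples are assumptions for illustration; pick limits that fit your own workloads.

```python
from statistics import mean


def idle_candidates(cpu_samples: dict, mean_limit_pct: float = 5.0,
                    peak_limit_pct: float = 20.0) -> list:
    """Return servers whose mean and peak CPU utilization are both below the limits.

    cpu_samples maps a server name to a list of utilization percentages.
    The default limits are illustrative, not a recommendation.
    """
    candidates = []
    for name, samples in cpu_samples.items():
        if not samples:
            continue  # no data: do not guess whether the server is idle
        if mean(samples) < mean_limit_pct and max(samples) < peak_limit_pct:
            candidates.append(name)
    return sorted(candidates)


# Hypothetical samples, for illustration only.
samples = {
    "web-01": [35.0, 60.0, 42.0],
    "old-batch": [1.0, 2.0, 1.0, 3.0],
    "spiky": [1.0, 1.0, 90.0],
    "no-data": [],
}
print(idle_candidates(samples))
```

Checking the peak as well as the mean keeps servers with rare but heavy jobs off the list. A flagged server is only a candidate; confirm with its owner before removing or consolidating it.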
After monitoring inlet temperatures, we saw that we were overcooling our system rooms. ASHRAE allows up to 27 Celsius, so we raised the cooling unit set points for a single system room. According to our probes, if we did this for our whole datacenter, we would save $40,000 annually. This equates to preventing the CO2 emissions of 40 vehicles (because of lower energy consumption) and a 0.03 decrease in our PUE.
We also started a pilot infrastructure efficiency project. For this, we bought consultancy and learnt more.
We implemented 2 things in this context:
- Cold/hot air separation: we did this through CAC (Cold Aisle Containment), blanking panels, correctly located grids, and grommets around cables coming through the surface.
- Better monitoring: as ASHRAE says, we left room cooling and migrated to aisle cooling, and we now monitor only the aisles. Moreover, the cooling units' pull and push air temperatures are also important, because we want to see our cooling units' working patterns.
To make a long story short, I will show you our changes and their effect on PUE.
Let's comment on this graph.
We started with a 24 Celsius set point for the cooling units. Beware, this is not the inlet temperature. First, we turned off a cooling unit (just after the analysis). Then we implemented CAC and increased the set point to 26 Celsius. And lastly, we turned off another cooling unit.
We found that increasing the cooling units' set temperature by 1 Celsius gives us a 2% saving in our cooling consumption. The industry rule of thumb says that each 1 Celsius gives a 4% saving, so our real saving turned out lower.
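To see what this difference means in practice, here is a small Python sketch comparing the two rates for a 2 Celsius increase (24 to 26, as in our case). It assumes a simple linear model where each degree removes a fixed fraction of the baseline, and the baseline cooling energy is a made-up number for illustration.

```python
def cooling_after_increase(baseline_kwh: float, degrees: float,
                           saving_per_degree: float) -> float:
    """Estimate cooling energy after raising the set point.

    Assumes a linear model: each degree saves a fixed fraction of the baseline.
    """
    fraction = degrees * saving_per_degree
    if not 0 <= fraction < 1:
        raise ValueError("total saving fraction must be in [0, 1)")
    return baseline_kwh * (1 - fraction)


# Hypothetical annual cooling energy in kWh, for illustration only.
baseline_kwh = 100_000.0
degrees = 2.0  # set point raised from 24 to 26 Celsius

measured = cooling_after_increase(baseline_kwh, degrees, 0.02)       # our observed 2% per degree
rule_of_thumb = cooling_after_increase(baseline_kwh, degrees, 0.04)  # quoted 4% per degree

print(f"measured rate:      {baseline_kwh - measured:.0f} kWh saved")
print(f"rule-of-thumb rate: {baseline_kwh - rule_of_thumb:.0f} kWh saved")
```

With these rates, the rule of thumb promises twice the saving we actually measured, so it is worth measuring your own rate before putting numbers into a business case.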
Another thing is that CAC alone does not produce any saving! It just marks the start of the journey. It allows you to increase set points or recover from hot spots.
We also turned some cooling units off, because they were unnecessary with CAC cooling. We easily spotted the idle cooling units with the pull/push temperature sensors: they do not contribute to the cooling of the room.
You may wonder what effect a set point increase has on the system room's endurance when a cooling unit breaks down. How is availability affected?
I will share the Panduit study with you here. They saw that, with a hot/cold aisle arrangement of cabinets, the inlet temperature reached the critical level (35 Celsius) 4 minutes after a cooling unit breakdown. In the CAC case, it took 19 minutes to reach the critical level. So CAC definitely benefits room availability.
Why don't you share your journey, reflections and ideas here, so that everybody benefits?