Saturday 29 November 2014

Managing Your DataCenter Physical Infrastructure - Part 2

2.       Availability Management – Identifying availability and reliability requirements, comparing them against actual performance, and, where required, introducing improvements to meet and sustain quality of service.
a.       Why: Once availability targets are defined, the SLA should be monitored to analyze potential downtime caused by any individual component or by the entire system.
b.   Challenges: Metrics reporting, raising alarms, planned downtime, and continuous infrastructure improvement.
c.       Solution:
                                                              i.      A tool that reports uptime/downtime of the infrastructure or service, identifies the cause of each outage, and records its time stamp, duration, and the time taken to recover. It is a best practice to configure tools to provide proactive warnings.
                                                            ii.      A tool that does not require special training or an expert to read, e.g. for UPS or battery health, temperature or humidity, disk or power status, etc.
                                                          iii.      Planned downtime can lead to false alerts. Use maintenance mode on system units, e.g. VMware ESXi, storage array controllers, UPS, or blade servers. A common mistake is failing to roll back all the changes made to put a system into maintenance, so this needs to be done with caution. It is suggested to use a tool that raises an alert if any condition is left uncorrected after maintenance.
                                                           iv.      To make improvements, the very first step is to identify the need, along with the risk involved in making the change. FMEA techniques may not be known to everyone, so it is suggested to use a tool that performs risk assessment. You can also use health reports as your reference points for improvement. The corrective measures that mitigate these risks become your improvement plan. Note: Improvement should be continuous. Some examples of what to track: power consumption, disk-full status, performance reports, cluster loads, etc.
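
The uptime/downtime reporting in item i can be illustrated with a minimal sketch. It assumes a hypothetical outage log of (start, end, cause, planned) records; the sample entries and the function name `availability_report` are made up for illustration and do not come from any particular monitoring tool. Planned maintenance windows are excluded so they do not count against the SLA.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (start, end, cause, planned maintenance?)
OUTAGES = [
    (datetime(2014, 11, 3, 2, 0), datetime(2014, 11, 3, 2, 45), "UPS battery failure", False),
    (datetime(2014, 11, 15, 22, 0), datetime(2014, 11, 16, 0, 0), "ESXi patching", True),
    (datetime(2014, 11, 21, 9, 30), datetime(2014, 11, 21, 9, 45), "Disk failure", False),
]


def availability_report(period_start, period_end, outages):
    """Return (availability %, unplanned downtime, mean time to recover)."""
    period = period_end - period_start
    # Only unplanned outages count against availability.
    unplanned = [end - start for start, end, _cause, planned in outages if not planned]
    downtime = sum(unplanned, timedelta())
    availability = 100.0 * (1 - downtime / period)
    mttr = downtime / len(unplanned) if unplanned else timedelta()
    return availability, downtime, mttr


availability, downtime, mttr = availability_report(
    datetime(2014, 11, 1), datetime(2014, 12, 1), OUTAGES
)
print(f"Availability: {availability:.3f}%  downtime: {downtime}  MTTR: {mttr}")
```

With the sample log above, the two unplanned outages add up to 60 minutes over a 30-day month, which gives roughly 99.86% availability and a mean recovery time of 30 minutes. The result can then be compared with whatever target the SLA defines.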

3.       Capacity Management – Providing IT resources as and when required, at the right cost.
a.    Why: Current and future requirements keep changing and need to be monitored and addressed.
b.   Challenges: Asset management with ongoing changes (monitoring, recording, tracking), providing capacity as and when required, optimizing capacity for better ROI and easier management, and incremental scalability.
c.   Risk involved: Unplanned downtime if resources are over-utilized, e.g. power, CPU/RAM in ESXi, network bandwidth, etc.
d.     Solution:
                                                              i.      A tool that performs centralized monitoring of current usage and alerts on potential overload of resources, e.g. Power & Cooling Monitoring Systems for DataCenter by Emerson / APC / Schneider Electric, Network Bandwidth Monitor, vCOPs for VMware Clusters, Storage Performance Indicators, HP Insight Managers for HP Blade enclosures or Servers, Brocade Fabric Managers, etc.
                                                            ii.      Capacity requirements tend to be missed or not considered during implementation. Hence a tool is required for trend analysis that also alerts on threshold violations or overloads. This tool should be consulted before going ahead with future procurement or new deployments.
                                                          iii.      A poorly designed datacenter may require more resources (server, storage, network, space, power), which means high CAPEX, and hence cost more to operate, which means high OPEX. Requirements should be analyzed while designing the datacenter and again for each new deployment. It is good to apply Six Sigma DMADV techniques if possible.
                                                           iv.      Weekly/monthly/quarterly reviews of current capacity and usage trends also help forecast the scalability required in future. Ideally, while designing a new datacenter, every single component (server, storage, network, virtualization platform, space, power) is designed at such a scale that it can bear 30% incremental capacity (scalability) and remain operable for at least the next 3 years. Support contracts are considered in the same manner.
                                                             v.      For the datacenter itself, location, power (input, sockets), cooling, rack space, and cabling are the major capacity considerations. For the network, consider ports, bandwidth, VLANs, IPs, etc. For servers (physical or virtual), consider CPU/cores, RAM, clusters, etc. For storage, consider the type of connectivity (FC, NFS, iSCSI, FCoE, DAS), space required, IOPS, and backup, recovery, and redundancy options (RAID, snapshots, clones, replication).
                                                           vi.      Usually a Life Cycle or Capacity Manager tracks the inventory and usage of assets, which is a best practice as well. It is also suggested to visualize the impact on capacity of every new deployment.

                                                         vii.      Note that TCO and ROI need to be considered and are deciding factors when it comes to business decisions.
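
The trend analysis and threshold alerting from items ii and iv can be sketched as follows. This is only an illustration under stated assumptions: the monthly usage figures, the 100 TB capacity, and the 70% alert threshold are invented sample values, the function names are hypothetical, and a simple least-squares straight line stands in for whatever forecasting a real capacity tool would use. The `has_headroom` check is one possible reading of the 30% incremental-capacity guideline.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for equally spaced samples."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(samples, limit):
    """Periods after the latest sample until the trend line reaches `limit`.

    Returns None when usage is flat or falling.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    x_hit = (limit - intercept) / slope
    return max(0.0, x_hit - (len(samples) - 1))


def has_headroom(current, capacity, growth=0.30):
    """True if `capacity` can absorb `growth` (30% by default) on top of current usage."""
    return current * (1 + growth) <= capacity


# Hypothetical monthly storage usage in TB against a 100 TB array.
usage_tb = [40, 42, 44, 46, 48, 50]
capacity_tb = 100
alert_at_tb = 0.70 * capacity_tb  # assumed 70% alert threshold

print("Months until threshold:", periods_until(usage_tb, alert_at_tb))
print("30% headroom available:", has_headroom(usage_tb[-1], capacity_tb))
```

With these sample figures, usage grows by 2 TB a month, so the 70 TB threshold is reached about 10 months after the latest reading, and the array still has room for 30% growth. A forecast like this is the kind of input worth checking before future procurement or a new deployment.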

Continue reading:

Part 1: Managing Your DataCenter
Part 3: Managing Your DataCenter
