Title: Bringing Spatial Flexibility to Google's Fleet

Company: Google


Google Cloud wanted to improve spatial flexibility across its fleet. Low flexibility was hindering efficient resource allocation and increasing operational costs. The existing infrastructure could not adapt dynamically to changing resource requirements, which reduced the overall fungibility of resources.

My task was to design a Proof of Concept (PoC) addressing this issue, covering both a proactive approach (for all resources) and a reactive approach (for network bandwidth only). The goal was to enhance spatial flexibility by efficiently managing resource availability and requirements.

For the proactive approach, my focus was on forecasting and efficiently managing resource availability and requirements within Google's fleet. I studied and combined three existing systems: one that predicted resource availability across compute, storage, and network resources; one that anticipated the resource requirements of jobs in specific data centers; and one that tracked newly delivered resources not yet brought into service in a data center. Combining these forecasts made it possible to predict the date on which resource availability would fall below requirements. I then proposed a mechanism that notified relevant stakeholders and initiated the relocation of services to different data centers, ensuring optimal resource utilization and preventing potential disruptions.
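The shortfall prediction described above can be illustrated with a minimal sketch. It assumes each of the three systems yields a daily series for one resource type in one data center; the function name, the daily granularity, and the treatment of incoming capacity as a cumulative series are illustrative assumptions, not details of the actual PoC.

```python
from datetime import date, timedelta
from typing import Optional, Sequence


def first_shortfall_date(
    start: date,
    available: Sequence[float],  # forecast availability per day (hypothetical input)
    incoming: Sequence[float],   # cumulative new capacity in service by each day (assumed)
    required: Sequence[float],   # forecast requirement per day (hypothetical input)
) -> Optional[date]:
    """Return the first day on which availability plus incoming capacity
    falls below the requirement, or None if no shortfall is forecast."""
    for offset, (avail, extra, need) in enumerate(zip(available, incoming, required)):
        if avail + extra < need:
            return start + timedelta(days=offset)
    return None


shortfall = first_shortfall_date(
    start=date(2024, 1, 1),
    available=[100, 100, 100, 100],
    incoming=[0, 0, 10, 10],
    required=[90, 95, 115, 105],
)
```

A returned date would be the trigger point for notifying stakeholders; a result of None means no relocation is needed within the forecast horizon.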

In parallel, for the reactive approach, I proposed a method to facilitate spatial flexibility whenever a user initiates a service move request. First, I identified services that should be moved together, forming a move unit. Next, I calculated the aggregated network usage of the move unit and estimated how that usage would change after the move, broken down into inter-cluster and intra-cluster traffic. Using this information, I filtered candidate destinations by network bandwidth availability and other resource criteria, ensuring a streamlined and efficient relocation process.
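The reactive filtering step can be sketched roughly as follows. This is a simplified illustration, not the PoC's implementation: it assumes the move unit is already known, that every peer outside the unit stays in the source cluster (so all traffic crossing the unit boundary becomes inter-cluster after the move), and that each candidate cluster exposes a single spare-bandwidth figure. The `Flow` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Flow:
    """Observed traffic between two services (hypothetical record)."""
    src: str
    dst: str
    gbps: float


def inter_cluster_demand(move_unit: Set[str], flows: Iterable[Flow]) -> float:
    """Sum the traffic with exactly one endpoint inside the move unit.

    Flows entirely inside the unit stay intra-cluster after the move;
    flows crossing the unit boundary are assumed to become inter-cluster.
    """
    return sum(
        f.gbps for f in flows if (f.src in move_unit) != (f.dst in move_unit)
    )


def candidate_destinations(
    move_unit: Set[str],
    flows: Iterable[Flow],
    spare_gbps: Dict[str, float],  # spare inter-cluster bandwidth per cluster (assumed)
) -> List[str]:
    """Keep only clusters whose spare bandwidth covers the estimated demand."""
    demand = inter_cluster_demand(move_unit, flows)
    return sorted(c for c, spare in spare_gbps.items() if spare >= demand)


flows = [
    Flow("a", "b", 5.0),  # both in the unit: stays intra-cluster
    Flow("a", "c", 3.0),  # crosses the boundary
    Flow("d", "b", 2.0),  # crosses the boundary
    Flow("c", "d", 7.0),  # unrelated to the unit
]
candidates = candidate_destinations(
    {"a", "b"}, flows, {"x": 4.0, "y": 5.0, "z": 10.0}
)
```

In practice the bandwidth check would be one filter among several, with compute and storage criteria applied to the surviving candidates.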

The implemented Proof of Concept successfully addressed the challenges posed by the lack of spatial flexibility in Google Cloud's fleet. The proactive approach ensured optimal resource utilization by forecasting and managing resource availability, while the reactive approach streamlined user-initiated service relocations.