Terraforming infrastructure for HPC workloads on Google Cloud — Part 1

sam raza
4 min read · May 14, 2019
Photo by Robert Nix (Flickr)

HPC Infrastructure on Public Cloud — Background

As more enterprises migrate to the cloud, it has become clear that the cloud provides velocity of innovation and faster decision making.

According to one survey, more than 70% of enterprises already have one or more applications on the public cloud, with the rest planning to move some or most of their workloads there by 2021.

One of the main reasons enterprises adopt the public cloud is the ability to use innovative tooling and burst into additional compute, network and storage resources at relatively low cost and high velocity. This trend is even more prominent within the High Performance Computing (HPC) arms of organisations. For decision makers, developers and programme managers, being able to test workloads in development or testing environments means better analysis with quicker turnarounds. Being able to leverage cloud bursting for compute-intensive calculations has been the main driving force behind the push for public cloud usage: HPC workloads that now take days instead of weeks, thanks to on-demand compute power, have made it very attractive for teams with HPC needs to move some of their workloads to the public cloud.

Cloud has allowed teams to use the tools and methodologies it provides to turn infrastructure requests into a self-service operating model.

Gone are the days when a business unit had to raise a request for additional infrastructure for its HPC workloads and then wait for someone to pick up the request and build it, with the cost of installing new kit compounded by operational bottlenecks and complicated billing and costing exercises. The cloud's tools and methodologies turn those infrastructure requests into a self-service operating model.

Burst pattern and mesh networking topology

In this post I look at ways to build an HPC infrastructure model in the public cloud. HPC workloads can be implemented using a few patterns, depending on the actual goals, any specific technical requirements and the tools needed. There are already managed offerings that can help with some of these workloads (e.g. distributed batch processing), from Cloud Dataproc on GCP to Kubernetes. Here, however, we will focus on GCP's cloud-bursting pattern with a mesh network topology, using traditional VMs for batch workloads. This suits organisations that are heavily regulated and require fine-grained control over security and access policies, as well as over regional network traffic and storage (e.g. for GDPR). The benefit of these patterns is that together they are generic enough for anyone to adapt with the middleware of their liking, which allows for full control and flexibility. The patterns and methods mentioned here can be used across multiple clouds, but I will limit the scope of this post to Google Cloud Platform.

Google recommends the Shared VPC model (https://cloud.google.com/vpc/docs/shared-vpc) with host and service projects where centralised governance of security and access policies, as well as separation of billing and cost, is required. Not surprisingly, a lot of enterprises fall into this category where centralised governance is a must.
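To make this concrete, here is a minimal Terraform sketch of enabling Shared VPC on a host project and attaching a service project to it, using the google provider's Shared VPC resources. The project IDs are hypothetical placeholders, not values from the reference architecture.

# Minimal sketch: enable Shared VPC on the host project and attach a
# service project to it. Project IDs are hypothetical placeholders.

resource "google_compute_shared_vpc_host_project" "host" {
  project = "hpc-host-project"                    # hypothetical host project ID
}

resource "google_compute_shared_vpc_service_project" "service" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "hpc-service-project"         # hypothetical service project ID
}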

Reference Architecture

Now let's look at the reference architecture we will be using for this post. We are making a few assumptions. First of all, this architecture takes the Google-recommended hybrid-cloud approach, where the client has a Dedicated Interconnect or VPN into GCP and the host project is set up as the landing zone for the customer's on-premises data and GCP services, especially when GCP services are accessed over private IP addresses.

Secondly, this architecture caters for scenarios that require the cloud-bursting pattern mentioned earlier, so we use a mesh networking topology. We have a host project (on the left) and we share its network and subnet with the service project on the right. The service project also has another network with a larger CIDR range, so it can hold many more VMs; this subnet is essentially your compute layer doing the HPC work. This model allows you to run potentially thousands of VMs and cores, limited only by the quotas set by Google.
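As a preview of the Terraform we will flesh out in the next post, here is a minimal sketch of those two networks: a shared subnet in the host project and a larger subnet in the service project for the compute layer. The project IDs, names, region and CIDR ranges are illustrative assumptions only.

# Shared network and subnet in the host project (landing zone).
resource "google_compute_network" "host_vpc" {
  project                 = "hpc-host-project"    # hypothetical host project ID
  name                    = "shared-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "shared_subnet" {
  project       = "hpc-host-project"
  name          = "shared-subnet"
  region        = "europe-west2"                  # keep traffic regional, e.g. for GDPR
  network       = google_compute_network.host_vpc.self_link
  ip_cidr_range = "10.10.0.0/24"                  # small range for shared services
}

# Larger network and subnet in the service project for the HPC compute layer.
resource "google_compute_network" "compute_vpc" {
  project                 = "hpc-service-project" # hypothetical service project ID
  name                    = "compute-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "compute_subnet" {
  project       = "hpc-service-project"
  name          = "compute-subnet"
  region        = "europe-west2"
  network       = google_compute_network.compute_vpc.self_link
  ip_cidr_range = "10.20.0.0/16"                  # room for thousands of VMs
}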

You don’t need to have a VPN established to Google Cloud to follow along with this post series.

In the next post we will look at the base setup in GCP: project prerequisites, some skeleton Terraform code and how we organise it using best practices, provisioning infrastructure with it, optimising for local development, adapting it for a CI/CD pipeline, and the caveats to watch out for and how to avoid them.


