Platform Service Dashboards
The Problem
With the rapid adoption of the cloud, we see a lot of digital transformation projects that emphasise automation and fast delivery and aim to provide a landing zone for the wider organisation.
Most of these projects stagnate in terms of digital delivery 18 to 24 months after inception, with only a few products deployed to production.
This happens for a wide range of reasons, but one of the major ones is the Change Management process introduced for infrastructure changes.
Since the platform is a critical part of the infrastructure, stakeholders worry about potential downtime of their applications and its effect on their customers, regulatory compliance, reputation and so on. As a result, the number of releases by the platform engineering team typically drops from 4 to 5 releases per week to 1 release per month.
This has a wider impact on security, on the features delivered, and on retaining talented engineers who are used to fast delivery.
Proposal
To solve the above issue, what we need is a shift in thinking and delivery of platform services.
We need to deliver platform services with a Service Oriented approach where we have Dashboards to show the health status of each platform service.
We need to design services from the beginning which are resilient, highly available and monitored.
The concept of Service Oriented Platform Services looks like the following:
- All platform services should have status metrics
- The status should be based on a combination of deployment pipeline status, integration test status, the status of a sample application deployed on the platform, etc.
- A service can have smaller sub-services
- Each service should have an SLA defined
- End to end integration tests are used to test infrastructure reliability
- Sample applications are deployed on the platform and they have status metrics
- The Platform Engineering team should expose each service's status via a Dashboard
- The customers (application teams, stakeholders) should be able to see the status of the platform services and check if they are meeting the SLAs
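As a rough illustration of the points above, the following Python sketch combines the three signals mentioned (deployment pipeline, integration tests, sample application) into one service status and checks measured availability against an SLA target. The names `Status`, `ServiceSignals`, `combined_status`, `availability` and `meets_sla` are hypothetical, and both the "worst signal wins" rule and the availability formula are assumptions made for the example, not a prescribed policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Health levels, ordered so that a larger value means worse health."""
    OPERATIONAL = 0
    DEGRADED = 1
    OUTAGE = 2


@dataclass
class ServiceSignals:
    """Hypothetical inputs that feed one platform service's status."""
    pipeline: Status           # deployment pipeline status
    integration_tests: Status  # end-to-end integration test status
    sample_app: Status         # status of the sample application


def combined_status(signals: ServiceSignals) -> Status:
    """Report the worst of the individual signals (an assumed policy)."""
    return max(signals.pipeline, signals.integration_tests, signals.sample_app)


def availability(checks: list[Status]) -> float:
    """Fraction of recorded checks in which the service was not in OUTAGE."""
    if not checks:
        raise ValueError("no checks recorded")
    up = sum(1 for status in checks if status != Status.OUTAGE)
    return up / len(checks)


def meets_sla(checks: list[Status], target: float) -> bool:
    """True if measured availability reaches the SLA target (e.g. 0.99)."""
    return availability(checks) >= target
```

A dashboard could call `combined_status` for the current health indicator and `meets_sla` over a reporting window, so that customers see both the live state and whether the SLA is being met.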
Let's say the Platform Engineering team manages the following services using automation:
- Organisations
- CI/CD
- IAM
- Accounts or Projects or Subscriptions
- Networking
- Compute
The following diagram shows this concept for the above services: every service has a status metric, and the status of a service (the combined status of its sub-services) can be determined from top to bottom or from bottom to top.
We can use something like Atlassian's Statuspage to expose the platform dashboards.
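The roll-up of sub-service statuses can be sketched in Python as a small tree. This is a minimal sketch under stated assumptions: the `Service` class, the three status names and the "worst status wins" rule are all hypothetical, and the `DNS` and `Load balancers` sub-services are invented purely for illustration.

```python
from dataclasses import dataclass, field

# Assumed ordering, from healthiest to least healthy.
SEVERITY = {"operational": 0, "degraded": 1, "outage": 2}


@dataclass
class Service:
    name: str
    own_status: str = "operational"
    sub_services: list["Service"] = field(default_factory=list)

    def status(self) -> str:
        """Bottom to top: worst status among this service and its sub-services."""
        statuses = [self.own_status] + [s.status() for s in self.sub_services]
        return max(statuses, key=SEVERITY.__getitem__)

    def unhealthy_paths(self, prefix: str = "") -> list[str]:
        """Top to bottom: paths to services whose own status is not operational."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        found = [path] if self.own_status != "operational" else []
        for sub in self.sub_services:
            found.extend(sub.unhealthy_paths(path))
        return found


# Example tree; the sub-services under Networking are hypothetical.
platform = Service("Platform", sub_services=[
    Service("CI/CD"),
    Service("IAM"),
    Service("Networking", sub_services=[
        Service("DNS", own_status="degraded"),
        Service("Load balancers"),
    ]),
])

print(platform.status())           # degraded
print(platform.unhealthy_paths())  # ['Platform/Networking/DNS']
```

Reading bottom to top, a single degraded sub-service surfaces in the top-level status. Reading top to bottom, `unhealthy_paths` shows which sub-service is responsible.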
Conclusion
When the platform engineering team makes any change in a lower environment, they should get enough feedback over the course of 6 to 12 months to assess the impact the change has on application availability.
Hopefully, this gives stakeholders enough confidence to approve most changes automatically or within a few days.