Operational Stability & Uptime: Maintained 99.9% availability for core cloud services by proactively monitoring GCP and AWS environments and resolving infrastructure alerts swiftly.
Infrastructure Maintenance: Executed routine maintenance tasks, including patching, security updates, and resource resizing, across existing EC2/Compute Engine instances and managed databases.
Incident Response & Troubleshooting: Acted as a primary responder for infrastructure incidents, reducing Mean Time to Resolution (MTTR) through effective root cause analysis and documentation.
Pipeline Support: Maintained and optimized existing CI/CD pipelines (GitLab/Azure DevOps) to ensure smooth deployment of updates for engineering teams.
Cost & Resource Optimization: Identified and remediated idle or over-provisioned resources in AWS and GCP, contributing to lower monthly cloud costs.
IaC Execution: Implemented defined infrastructure changes using existing Infrastructure-as-Code (Terraform/CloudFormation) templates to support feature releases.
Monitoring & Alerting: Tuned dashboards and alert rules (e.g., GCP Cloud Logging, AWS CloudWatch, Prometheus, Grafana, and Loki) to reduce noise and ensure critical alerts were actionable.
Security Compliance: Assisted with security audits and implemented required policy changes (IAM adjustments, security group updates) as defined by the security team.
Serve as a key hands-on operator: maintain the health of cloud resources, troubleshoot production incidents, and execute infrastructure updates.
Automate routine operational tasks and maintenance jobs using scripting languages like Bash and Python.
Manage and troubleshoot running containers (Docker/Kubernetes) in a production environment, including hands-on work with managed solutions like GKE and EKS.
Read, understand, and modify existing Terraform configurations or CloudFormation templates to apply updates.
Set up and respond to alerts in tools like CloudWatch, Prometheus, or Datadog.
Troubleshoot failed builds or deployments within CI/CD pipelines (GitLab/GitHub Actions/Azure DevOps).
Communicate status updates clearly during incidents and work collaboratively with development teams to resolve blockers.
Cloud Proficiency (GCP & AWS): Solid hands-on experience with core services (Compute, Storage, Networking, IAM) in both GCP and AWS, focused on administration and maintenance rather than complex architecture.
Operational Scripting: Proficiency in scripting (Bash, Python) to automate routine operational tasks and maintenance jobs.
Container Operations: Competence in managing and troubleshooting running containers (Docker/Kubernetes) in a production environment, with experience in GKE and EKS.
Infrastructure as Code (IaC) Usage: Ability to read, understand, and modify existing Terraform configurations or CloudFormation templates to apply updates.
Monitoring & Observability: Experience setting up and responding to alerts in tools like CloudWatch, Prometheus, or Datadog.
CI/CD Familiarity: Understanding of how pipelines work (GitLab/GitHub Actions/Azure DevOps) and ability to troubleshoot failed builds or deployments.
Reliability-Focused: Prioritizes stability and uptime above all; is thorough and cautious when making changes to production environments.
Problem Solver: Enjoys digging into logs and metrics to find the root cause of an issue.
Collaborative Communication: Clearly communicates status updates during incidents and works well with development teams to resolve blockers.
Bias for Action: Takes ownership of operational tickets and drives them to resolution without needing constant supervision.
Eagerness to Learn: Actively seeks to deepen knowledge of cloud architecture and DevOps best practices.