L2- Observability/AIOps (5 to 8 yrs exp)

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping a close watch on capacity and performance. SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation.

As a Site Reliability Engineer with a focus on observability, you will build and operate next-generation observability platforms.

As an SRE with an observability focus, you will:
● Explore the complex IT estates of our clients to understand their observability/AIOps opportunities and identify areas to improve
● Collaborate to architect unified observability and AIOps strategies that employ leading AI technology
● Implement enterprise observability/AIOps technology and processes
● Amplify observability/AIOps outcomes by accelerating adoption across technology and business organizations

Responsibilities include:
● Architect observability solutions that address gaps and reduce organizational MTTD and MTTR
● Develop API-driven microservices that combine into large and complex platforms
● Plan and execute highly parallel distributed object storage transformations and migrations
● Maintain automated test suites using CI/CD tools
● Participate in collaborative projects with small software engineering teams
● Develop automation, processes, and tools designed to make our services simpler and more robust
● Participate in troubleshooting, capacity planning and analysis, and performance analysis activities
● Advise management on service onboarding strategies and execution

Critical Hiring Criteria

What we are looking for:
● Entrepreneurs who seek challenging problems to solve
● Creativity, initiative, and acute attention to detail
● Thirst for innovation and solving problems at lightning speed
● Passion for automating everything repetitive
● Obsession with software scalability and performance under high loads
● Love for using and contributing to open-source software

Please bring to the table:
● Experience in architecting complex IT solutions
● Understanding of observability dimensions (metrics, logs, traces)
● Excellent communication and stakeholder management skills
● Development experience and comfort working in multiple languages (Python, Java, Go, and Ruby a plus)
● Experience working in collaborative coding environments (peer review, continuous integration, etc.)
● 7+ years of application development
● Experience working in distributed remote teams across multiple time zones
● Experience in large-scale operations environments
● 7+ years of experience with Linux/Unix development or systems administration
● 3+ years of experience with networking systems and technologies
● Deep understanding of network performance and security
● Ability to identify tasks that require automation and to implement that automation
● Experience with configuration management tools (Puppet, Chef, SaltStack)
● Hands-on operational experience in a high-volume or critical production service environment: distributed systems, capacity planning, continuous deployment
● BA/BS in Computer Science preferred, or equivalent experience (advanced degrees preferred)

We have opportunities to work with and learn:
● Object Storage - MinIO/S3/etc.
● Data Collection - OpenTelemetry/Grafana Alloy/etc.
● Message Bus - Kafka/NSQ/etc.
● Scaling Databases - Druid/ClickHouse/Cassandra/etc.
● Relational database technologies at large scale - Timescale/Vitess/Postgres/etc.
● Scheduling & Orchestration - Kubernetes/OpenShift/Docker
● Cloud Platforms - AWS/Azure
Job Title
Observability/AIOps