
Senior Cloud Observability Engineer - Datadog

Artech Information Systems

Posted 1 week ago

Job Requirements

Washington, DC
Public Trust; Polygraph: Unspecified
Career Level not specified
Salary not specified

Job Description

Job Title: Senior Cloud Observability Engineer - Datadog
Location: Washington, D.C., 20549 (100% Onsite)
Duration: 6 Months

Salary Range: $58.00 - $60.00/Hour on W2 (Without Benefits).
Applicants must be willing to work on W2.

Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).

Primary Responsibilities:
Observability Platform Engineering:
  • Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
  • Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
  • Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C Trace Context propagation, and unified service tagging across the estate.
  • Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
  • Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.
Cloud and Container Monitoring Engineering:
  • Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
  • Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.
  • Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
  • Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.
  • Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.
Performance Engineering and Problem Solving:
  • Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
  • Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
  • Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
  • Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
  • Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.
Capacity, Reliability, and Continuous Improvement:
  • Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.
  • Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.
  • Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.
  • Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.
Required Qualifications:
Education:
  • Bachelor's degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).
Required Experience:
  • Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.
  • Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).
  • Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.
  • Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.

Job Category
IT - Database
Clearance Level
Public Trust