Software Engineer - Storage & Observability (Early Career)
Company: Together AI
Location: San Francisco
Posted on: April 1, 2026
Job Description:
About the Role

Together AI is building the AI Acceleration Cloud, an end-to-end
platform for the full generative AI lifecycle. Our AI Infrastructure
team is at the forefront of scaling the foundational systems that
power this platform. We are looking for an Early Career Software
Engineer to join our Storage and Observability team, where you will
help design and maintain robust distributed storage solutions and
develop comprehensive observability platforms. In this role, you will
work on the systems that provide critical insights into GPU
utilization and system performance, ensuring seamless data access for
the world's largest AI training and inference workloads.

Responsibilities

- Build and deploy scalable observability tools (metrics, logs,
  traces) using state-of-the-art open-source distributed telemetry,
  log search, and tracing systems
- Develop and implement infrastructure-as-code for stack deployment
  using Terraform, Ansible, and Helm
- Write clean, production-grade code in Go or Python to create custom
  Kubernetes (K8s) operators, tools, and automation
- Support the operation of high-performance distributed storage
  systems (such as Ceph, Weka/Vast) and Kubernetes-native storage
  operators
- Optimize storage systems for GPU clusters (10-50 GB/s per-node
  throughput) and scale storage infrastructure to support thousands
  of nodes
- Partner with senior engineers to enhance distributed tracing and
  optimize data paths for AI workloads

Minimum Qualifications

- Experience: 1–3 years of professional experience in Software
  Engineering or Cloud Operations with hyperscalers
- Cloud & Containers: Solid understanding of Docker and Kubernetes
  orchestration, as well as experience with cloud platforms like AWS,
  GCP, or Azure
- Tooling: Familiarity with infrastructure-as-code (Terraform or
  Helm) and version control (Git)
- Observability Fundamentals: Experience using Prometheus and Grafana
  for system monitoring
- Storage Systems: Experience with distributed storage systems such
  as WekaFS, Vast, Ceph, MinIO, GPFS, Lustre, etc.
- Problem Solving: Strong debugging skills and a passion for
  automation and operational excellence

Preferred Qualifications

- Experience monitoring AI/ML infrastructure, GPU clusters, and
  custom metrics for model performance and training pipelines
- Background in high-frequency, low-latency systems monitoring, chaos
  engineering, and reliability testing
- Contributions to open-source projects, preferably in the space of
  observability or storage
- Familiarity with security monitoring and compliance frameworks

About Together AI

Together AI is a research-driven artificial intelligence company. We
believe open and transparent AI systems will drive innovation and
create the best outcomes for society, and together we are on a
mission to significantly lower the cost of modern AI systems by
co-designing software, hardware, algorithms, and models. We have
contributed to leading open-source research, models, and datasets to
advance the frontier of AI, and our team has been behind
technological advancements such as FlashAttention, Hyena, FlexGen,
and RedPajama. We invite you to join a passionate group of
researchers on our journey to build the next generation of AI
infrastructure.

Compensation

We offer competitive compensation,
startup equity, health insurance, and other benefits, as well as
flexibility in terms of remote work. The US base salary range for
this full-time position is $165,000 - $200,000, plus equity and
benefits. Our salary ranges are determined by location, level, and
role. Individual compensation will be determined by experience,
skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer
equal employment opportunity to everyone regardless of race, color,
ancestry, religion, sex, national origin, sexual orientation, age,
citizenship, marital status, disability, gender identity, veteran
status, and more.

Please see our privacy policy at
https://www.together.ai/privacy