Site Reliability Engineer

Contractor · Remote · $40 - $70/hour

About this role

Job Summary: In this role, you'll apply your expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. No prior experience in AI is required; your domain knowledge is what matters.

Skills

Terminal-Native Problem Solving, Dynamic Infrastructure Recovery, Containerized Environment Mastery, Python

Key responsibilities

  • Lead the deployment, monitoring, and recovery of complex, containerized AI training environments using advanced terminal techniques.
  • Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes.
  • Orchestrate resilient system builds and infrastructure management, ensuring stability and optimal resource utilization.
  • Collaborate closely with engineering teams to refine CI/CD pipelines and automate routine operational tasks.
  • Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes.
  • Conduct rapid mid-execution replanning during error states and unforeseen runtime issues.
  • Document best practices and emergent solutions, and contribute to knowledge transfer across the team.

Required skills & qualifications

  • Demonstrated expert proficiency with terminal-based problem solving and complex system administration.
  • Mastery of dynamic infrastructure recovery and long-running operational process management.
  • Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration.
  • Strong Python skills, with the ability to script, automate, and debug real-world production systems.
  • Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, and C/C++.
  • Experience with build systems, package managers, databases, version control, and cryptography tools.
  • Adept at troubleshooting, documenting, and replanning in high-velocity technical environments.

Preferred qualifications

  • Background in machine learning operations or AI infrastructure.
  • Familiarity with ML frameworks and distributed computing.
  • Experience supporting multi-phase, high-intensity engineering projects.

How to apply

This role is posted on our partner platform, micro1. The application, interview, skill validation, and onboarding all happen on the micro1 posting. lehico is an independent site that surfaces these opportunities; we don't process applications or guarantee acceptance.