Job description:
1. Lead design and development of scalable AI/ML infrastructure and production systems
2. Architect end-to-end solutions for model training, deployment, and serving at scale
3. Drive technical excellence through code quality, system design, and engineering best practices
4. Mentor engineers and shape technical roadmap for AI-powered products
Responsibilities:
1. Architect distributed systems for large-scale model training and inference
2. Design high-performance APIs, data pipelines, and service architectures
3. Create technical specifications and architecture decision records (ADRs)
4. Develop scalable training infrastructure supporting distributed GPU/TPU workloads
5. Implement model serving systems with low-latency requirements (REST/gRPC/GraphQL)
6. Create automated pipelines for CI/CD, model validation, and A/B testing
7. Ensure system reliability, scalability, and observability (SLA/SLO management)
8. Implement monitoring, logging, and alerting for ML services in production
9. Optimize performance through profiling, caching, and resource optimization
10. Lead incident response and post-mortem processes for production issues
11. Establish coding standards, review processes, and testing strategies
12. Drive refactoring initiatives and technical debt reduction
13. Champion software engineering best practices within AI/ML teams
14. Contribute to internal libraries, frameworks, and developer tooling
15. Mentor junior and mid-level engineers through pair programming and code reviews
16. Lead cross-functional initiatives with product, research, and operations teams
17. Represent engineering in technical discussions with stakeholders
18. Contribute to hiring processes and team growth strategies
Requirements:
1. B.S./M.S./Ph.D. in Computer Science, Engineering, or equivalent experience
2. 5+ years of software engineering experience is preferable
3. Experience with specialized hardware: GPU clusters, TPUs, or edge AI accelerators
4. Background in compiler optimization or high-performance computing
5. Familiarity with MLOps platforms: Vertex AI, SageMaker, or Azure ML
6. Previous startup experience or 0-to-1 product development is preferable
Novos Technology Limited specializes in AI and intelligent manufacturing, focusing on AI visual detection, machine learning, AIoT, and intelligent solutions for various industries that run in the cloud or locally offline. We support industrial upgrading with cutting-edge technology, helping build efficient, low-cost, and intelligent production systems.
Our team is open and inclusive, and encourages technical innovation and sound engineering practice. We sincerely invite talented people from AI, software development, cloud computing, and other fields to join us in building a smart future.