MLOps Operations Engineer
Posted within 3 months

No experience limit

No degree required

10.0 hrs/day, 5 days/wk
HK $30K-45K/Month
Job Highlights
Highly competitive remuneration package
Experience in AI training platform/HPC operations preferred
Bachelor's degree in Computer Science, Communications, Electronics or related field required
Job Description
<p>The appointee will be required to work for one of the constituent research units - Research Institute for Generative AI (RIGAI) (to be established) under the PolyU Academy for Artificial Intelligence (PAAI). The appointee will be required to:</p><p>(a) oversee the daily operations, the monitoring and inspection of the Large Language Model (LLM) GPU cluster to ensure stable and reliable operation of training tasks;</p><p>(b) handle GPU node failures, IB network anomalies, CUDA/NCCL errors and Kubernetes scheduling failures, perform rapid troubleshooting, recovery and Root Cause Analysis (RCA);</p><p>(c) operate and optimise the Kubernetes GPU scheduling system, including node management, resource quotas, queuing policies, image management and task governance;</p><p>(d) build and maintain the large model training environment, including CUDA, PyTorch, base images and container runtime environments (Docker/Containerd);</p><p>(e) maintain the monitoring and logging pipelines for the training platform, including Prometheus/Grafana, DCGM, Node Exporter and NCCL metric collection;</p><p>(f) participate in training-performance troubleshooting, including low GPU utilisation, NCCL communication bottlenecks, IB network congestion and Pod/Container resource limitations;</p><p>(g) support the model team's daily tasks, including environment preparation, task troubleshooting, running automated training evaluation tasks and resolving data access anomalies;</p><p>(h) provide technical support for platform scaling, migration and version upgrades, and participate in resource utilisation analysis and capacity planning;</p><p>(i) write operational automation scripts (Python/Shell) and daily operational SOPs to improve efficiency and platform reliability; and</p><p>(j) perform any other duties as assigned by the Director and the Executive Director of PAAI or his/her delegates.</p><p><strong>Qualifications</strong></p><p>Applicants should:</p><p>(a) have a bachelor’s degree or above in 
Computer Science, Communications, Electronics or a related field;</p><p>(b) have at least five years of work experience in the MLOps field;</p><p>(c) be familiar with the LLM training process and have a basic understanding of the model training/evaluation pipeline;</p><p>(d) be proficient in Linux system administration and basic maintenance of GPU servers;</p><p>(e) have knowledge of the Kubernetes operation framework and the principles of GPU workload scheduling;</p><p>(f) be familiar with PyTorch, and have knowledge of NCCL communication issues during training and their troubleshooting methods;</p><p>(g) have knowledge of the basic principles of IB networking and methods for IB debugging (bandwidth, connectivity, fabric topology);</p><p>(h) have knowledge of the training environment, image building and container runtime environments such as Docker and Containerd;</p><p>(i) be proficient in Python or Shell and capable of developing automation scripts for operations;</p><p>(j) be fluent in both written and spoken English and Chinese;</p><p>(k) preferably be familiar with distributed storage systems (JuiceFS/GPFS/HDFS); and</p><p>(l) have good communication and collaboration skills, and a strong sense of responsibility.</p><p>Preference will be given to those with experience in AI training platform/HPC operations.</p><p><strong>Conditions of Service</strong></p><p>A highly competitive remuneration package will be offered. Initial appointment will be on a fixed-term gratuity-bearing contract. Re-engagement thereafter is subject to mutual agreement.</p>
IT Infrastructure
Cantonese
English
Mandarin
HR WU
PolyU Academy for Artificial Intelligence (PAAI), The Hong Kong Polytechnic University·HR
Company Overview


The Hong Kong Polytechnic University (PolyU) is a public applied research university located in Hung Hom, Kowloon, Hong Kong. Its predecessor, the Government Trade School, was established in 1937. After several stages of development, the institution gained university status in 1994 and became one of the eight universities funded by the University Grants Committee (UGC). PolyU is one of Hong Kong's top universities and is ranked among the world's top 100 universities in three global rankings: the QS World University Rankings, the Times Higher Education World University Rankings (THE), and U.S. News & World Report's Best Global Universities (U.S. News). The University is constituted under The Hong Kong Polytechnic University Ordinance (Cap. 1075 of the Laws of Hong Kong).
The PolyU Academy for Artificial Intelligence (PAAI) is part of The Hong Kong Polytechnic University and was established on April 1, 2025. Its inauguration ceremony was officiated by the Secretary for Innovation, Technology and Industry, Sun Dong, and the President of the University, Jin-Guang Teng. The newly established academy brings together the University's work in computer science, mathematics and data science, with the aim of strengthening international cooperation and helping to develop Hong Kong into an AI innovation hub.