Large Model Infrastructure/Operator Engineer
Updated within 3 months

No experience limit

Master's degree

9.0 hrs/day, 5 days/wk
HK $30K-60K/Month
<p>Job Responsibilities</p><ol><li>Build the underlying training and inference infrastructure for large models (large language models and multimodal models), including the GPU/CPU cluster runtime environment, the communication layer, and the integration and optimization of execution frameworks.</li><li>Design, develop, and optimize deep learning operators (kernels) using CUDA/ROCm, C++, Python, etc., to improve model training and inference performance.</li><li>Participate in performance tuning of distributed training and inference systems, including data parallelism, model parallelism, pipeline parallelism, and hybrid parallelism strategies.</li><li>Optimize computation graphs, memory usage, and communication efficiency to reduce GPU memory footprint and cross-node communication costs.</li><li>Perform performance analysis, low-level optimization, and custom extension of mainstream deep learning frameworks (such as PyTorch, TensorFlow, and JAX).</li><li>Collaborate closely with platform engineering, MLOps, and model development teams to integrate operator and system optimizations into the actual training and inference pipelines.</li><li>Participate in the technical evaluation, adaptation, and performance validation of new hardware platforms (such as next-generation GPUs, accelerators, and interconnect technologies).</li></ol>
IT Audit
Python (Programming Language)
English
Cantonese
Mandarin
Mr. Wu
InfiG.ai Limited·HR
Active within 3 days
ELEE
Recognized degree in Information Technology
5 years' relevant post-qualification experience
Knowledge in PL/SQL, Java/J2EE, XML, Ajax, JBOSS, HTML5