63256
IT Infra Analyst
Shanghai, SH, China
Full Time
IT Infrastructure Production Support Engineer
Position Summary:
The IT Infrastructure Production Support Engineer provides advanced technical support and troubleshooting for Asia region enterprise infrastructure with deep expertise in virtualization, data storage, and networking technologies. This critical escalation role requires strong diagnostic skills to rapidly identify issues across multiple technology domains and coordinate with specialized teams to ensure swift resolution and minimal business impact.
Key Responsibilities:
Production Support & Incident Management:
* Serve as primary escalation point for critical production incidents affecting virtualization, Windows/Linux OS, storage infrastructure, and enterprise networking
* Perform rapid root cause analysis across infrastructure layers to identify and isolate issues
* Coordinate incident response and engage specialized teams (network, security, compute, application) based on technical assessment
* Monitor infrastructure health using monitoring tools (SolarWinds, LiveNX, Nagios) and proactively identify potential issues
* Maintain incident documentation and contribute to post-incident reviews
* Participate in 24/7 on-call rotation for production support coverage
Technical Troubleshooting & Problem Resolution:
* Troubleshoot complex issues spanning operating systems, storage arrays, backup solutions, and cloud platforms
* Diagnose and resolve performance issues related to compute, storage, and network infrastructure
* Perform break-fix activities and system performance tuning
* Identify network-related issues and coordinate with network engineering teams for resolution
* Execute disaster recovery procedures and business continuity plans when required
Cross-Functional Collaboration:
* Partner with Enterprise Infrastructure Compute, Security, Network, and Application teams
* Effectively communicate technical issues to both technical and non-technical stakeholders
* Identify patterns in incidents and work with engineering teams to implement permanent solutions
* Collaborate with project teams during infrastructure changes to ensure smooth transitions to production
Documentation & Knowledge Management:
* Create and maintain comprehensive system documentation including troubleshooting procedures and runbooks
* Document incident resolution steps and contribute to knowledge base
* Develop automation scripts to streamline support activities
Basic Qualifications/Professional Skills:
* B.S. degree in computer science, information technology, or a computer-related discipline, or 5-7+ years of IT work experience in a multi-site global infrastructure environment
* Demonstrated career progression with proven troubleshooting and problem-solving abilities
* Fluent in English; Mandarin proficiency preferred
* Strong communication, collaboration, and interpersonal skills
* Self-motivated with keen attention to detail and excellent judgment under pressure
* Ability to manage multiple concurrent incidents in high-pressure situations
* Team player with customer-focused mindset
Technical Skills/Experience:
Virtualization and Operating Systems (Strong/Required):
* Proven experience with VMware in large-scale virtualized environments
* Experience with virtual machine troubleshooting and performance optimization
* Strong troubleshooting skills for Windows/Linux operating system issues
* Deep understanding of Red Hat and other Linux distributions (RHEL, CentOS, Oracle Linux, SUSE Linux)
* Experience with Red Hat Satellite and automation solutions such as Ansible or Puppet
* Proficiency in scripting languages including Shell, Ruby, and Perl for automation
Storage & Backup (Strong/Required):
* 5+ years of experience with enterprise storage and backup solutions
* Experience with multiple storage platforms including Dell/EMC, NetApp, and Pure
* Knowledge of image-level backups, array-based replication, and hypervisor-based replication
* Experience with storage configuration, volume management, and multipathing (LVM, MPIO, EMC PowerPath)
* Familiarity with SAN and NAS operations and monitoring tools
* Understanding of data lifecycle management and tiering strategies
Network Knowledge (Working Knowledge/Required):
* Strong understanding of network topology concepts and technologies
* Ability to identify network-related issues and determine appropriate escalation path
* Knowledge of core LAN/WAN network technologies
* Familiarity with Cisco networking technologies and basic troubleshooting
* Understanding of network security concepts and protocols
* Ability to work with network teams to diagnose connectivity and performance issues
* Knowledge of load balancers and network accelerators
Additional Technical Skills:
* Strong understanding of network and server security
* Experience with converged hardware platforms including Dell, HPE, and Cisco
* Experience with system monitoring tools and techniques
Required Attributes:
* Problem Solver - Uses rigorous logic and systematic methods to diagnose and resolve complex technical issues quickly
* Communication - Can effectively communicate across all levels of the organization including technical and non-technical people, both verbally and in writing
* Collaborative - Effective at working with cross-functional teams globally to resolve incidents
* Calm Under Pressure - Maintains composure and clear thinking during critical production incidents
* Customer-Focused - Committed to minimizing business impact and ensuring positive user experience
Preferred Certifications:
* ITIL Foundation
* Red Hat Certified Engineer (RHCE)
* VMware VCP
* Cisco CCNA
* AWS Certified Solutions Architect or Azure Administrator
Job Location: Shanghai
Based on Experience