Stefanini Group is looking for a skilled Middleware Engineer to join our Application Technical Management Operations (ATM) team within our Hybrid Cloud & Infrastructure Cluster. The role provides Level 3 (L3) expert support across a broad portfolio of middleware and application platforms hosting the customer's business applications. It is a key technical position focused on the day-to-day operations, stability, and continuous improvement of these application platforms.
The Middleware Engineer will act as the highest escalation point for complex incidents and problems, lead platform lifecycle activities (upgrades, patching, performance tuning), and contribute to platform improvement initiatives, including automation, monitoring, and operational standardization across the supported application landscape.
The ideal candidate combines strong middleware and application platform skills, methodical L3-level troubleshooting, and the ability to work in a structured operational environment (ITIL / managed services with SLA commitments). Hands-on experience with at least one of the critical applications listed below is required; the broader the exposure across middleware and enterprise application platforms, the better.
Availability for the on-call rotation is essential to ensure timely response and support for critical incidents. Participation in planned maintenance windows, which may occur during evenings or weekends, is also required.
The position is offered on an employment contract basis.
MAIN RESPONSIBILITIES:
Level 3 Operations & Technical Escalation (Core Responsibility)
- Act as the L3 escalation point for complex technical issues across the supported application portfolio, with particular focus on critical applications including:
  - middleware runtimes hosting the supported applications (Apache Tomcat, JBoss, IIS, WebLogic / WebSphere)
  - integrations with surrounding services (databases, AD, file shares, monitoring).
- Own and drive resolution of:
  - Major Incidents (P1/P2) with deep technical investigation and rapid recovery focus
  - recurring incidents through Problem Management (root cause analysis and permanent fixes).
- Lead deep troubleshooting activities:
  - middleware crashes, JVM out-of-memory, garbage collection issues
  - thread pool and connection pool exhaustion
  - application performance degradation and response time issues
  - integration failures between in-scope applications and dependent services
  - failed deployments and configuration drift.
- Provide clear technical updates during incidents, including:
  - impact assessment
  - recovery plan / workaround
  - risks and next steps.
Platform Lifecycle Management (Upgrades, Patching, Stability)
- Plan and execute lifecycle activities such as:
  - major version upgrades
  - middleware patching and security hardening
  - certificate management and renewal processes.
- Validate platform readiness before changes: compatibility, capacity, performance, known issues, vendor advisories.
- Maintain high availability and resilience:
  - HA configuration support (load balancing, clustering, session replication)
  - backup / restore strategy validation in coordination with the Backup team
  - disaster recovery readiness and operational runbooks.
- Ensure operational compliance with defined maintenance windows and change governance.
Performance, Capacity & Optimization
- Tune middleware and application platform performance:
  - JVM tuning, thread management, caching, database connection optimization
  - analysis of application-side query patterns and execution behavior.
- Conduct capacity planning and trend analysis based on monitoring data.
- Implement scaling and resource allocation strategies aligned with cloud and on-prem cost considerations.
- Support proactive problem management through performance baselining.
Standardization, Automation & Operational Improvement
- Develop and maintain operational documentation, including:
  - troubleshooting guides
  - standard operating procedures (SOPs)
  - build standards and reference configurations
  - operational runbooks for recurring tasks.
- Support automation initiatives using tools such as:
  - Terraform and Ansible (infrastructure as code, configuration management)
  - GitHub for version control and pipeline integration
  - scripting (PowerShell / Bash / Python) to reduce manual operations.
- Mentor L2 administrators and progressively shift standard activities to lower support levels.
- Proactively identify improvements to increase:
  - platform stability
  - recovery speed (MTTR)
  - repeatability and reduction of human error.
Monitoring, Observability & Performance Management
- Support and improve observability across the application platform, including:
  - middleware-level metrics (JVM heap, thread pools, connection pools, response times)
  - log management and log retention configuration.
- Define and refine alerting:
  - thresholds for availability, performance and resource usage
  - reduction of noise and false positives
  - integration with ServiceNow for ticket creation.
- Improve operational dashboards and platform health reporting.
Security & Compliance Support
- Apply and maintain the application platform security baseline aligned with the customer's Information Security Policy:
  - access control, hardening, encryption in transit, audit logging
  - middleware-level account and role management.
- Lead vulnerability remediation activities for middleware runtimes and dependent libraries (CVE handling).
- Provide configuration evidence and access logs to support security audits and compliance requests.
- Collaborate with the Security team on hardening initiatives and risk reviews.
Vendor & Stakeholder Coordination
- Engage vendor support for issues requiring escalation, and challenge vendor responses when necessary.
- Work with customer application owners and SMEs to align on technical decisions, designs and trade-offs.
- Coordinate with adjacent teams (Database, OS, Storage, Network, Security, DevOps) for end-to-end issue resolution.