Project Overview:

The client is a leading multi-brand technology solutions provider to business, government, education and healthcare customers in the United States, the United Kingdom and Canada. A Fortune 500 company and member of the S&P 500 Index, the client was founded in 1984 and employs approximately 10,000 coworkers. For the trailing twelve months ended September 30, 2020, the company generated net sales of over $18 billion.

The Site Reliability Engineer (SRE) supports the client's engineering platforms and services to ensure the operational stability and performance of one or more critical business services used by customers and employees. The SRE should apply sound engineering disciplines, combining tooling and systems to develop creative solutions to operational problems. The SRE prioritizes system and application reliability work using service level objectives (SLOs), which set performance targets for the supported applications, cloud resources, systems, or services, based on service level indicators (SLIs), which measure the service level actually provided to customers.

Recruiter: Ганна Удуденко
Responsibilities:
  • Work with other members of the assigned value stream to ensure that in-scope applications/platforms meet performance and stability requirements, including managing major incidents through to mitigation/resolution;
  • Improve monitoring capabilities to reduce outage frequency and duration;
  • Conduct post-incident reviews and seek improvements using the right tooling and tech stack;
  • Respond quickly to identify and resolve issues involving business applications;
  • Respond to voicemails, emails, and electronic tickets directed to the support team in a timely manner and follow up on issues until resolution is confirmed;
  • Provide outstanding customer service to all stakeholders and maintain open and collaborative communication;
  • Perform post-incident reviews of all major incidents and determine action items required to avoid similar issues/minimize downtime for future incidents;
  • Provide primary operational support and engineering for multiple large, distributed software applications;
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding;
  • Partner with software and DevOps engineering to ensure that assigned applications/platforms have the monitoring and metrics in place to accurately measure performance and stability;
  • Monitor and research communication messages;
  • Monitor, research, and respond to failed transactions;
  • Balance feature development speed and reliability with well-defined service level objectives;
  • Ensure that applications/platforms in the value stream are operationally ready for production, including an annual review of all SOPs/knowledge articles;
  • Optimize and participate in on-call rotations & investigations related to the platform's availability;
  • Support ServiceNow incidents, including (but not limited to) production systems, application support, code build support, access requests, and general cloud support across resources, monitoring, and subscription requests;
  • Technically manage complex and large-scale project efforts in development, maintenance and enhancement of business system applications;
  • Improve the reliability, availability and performance of highly scalable cloud systems through recovery automation;
  • Automate operational tasks (e.g., IIS restarts, data purging) and/or seek opportunities where automation can improve stability and uptime;
  • Identify issues that require more attention, and work to resolve issues based on an understanding of the business problem being solved;
  • Draw appropriate resources together in order to address technical issues;
  • Work after hours and on weekends as necessary for special installations, implementations, system maintenance and upgrades;
  • Take part in rotating pager support under a follow-the-sun approach (11:00 to 16:00 Kyiv time, 7 days per week);
  • Working hours: 2 AM to 11 AM CST.
Requirements:
  • 5 years of Site Reliability Engineering or related systems support experience;
  • 3 years of experience creating pipelines with Azure DevOps and Terraform, along with systems support experience;
  • 3 years of experience with the Azure stack, including AKS, Azure Function Apps, Logic Apps, Web Apps, and Mobile Apps;
  • Undergraduate degree in Computer Science or equivalent working experience;
  • Direct experience scripting in two of the following languages: Terraform, Java/.NET, Ruby, Python, PowerShell, Bash;
  • Direct experience with APM and infrastructure monitoring tools such as SCCM, Splunk, Dynatrace, and Datadog;
  • Direct experience with two of the following data management and monitoring tools: Datadog, Dynatrace, Prometheus, Splunk, ELK, SquaredUp.
Nice to have:
  • Ability to debug, optimize code, and automate routine tasks;
  • Familiarity with Windows and Linux operating systems, networking, and administration;
  • Experience working in an Agile Scrum environment;
  • Experience with CI/CD pipelines and tools such as Azure DevOps, GitHub, GitHub Actions, Jenkins, Git, or Jira;
  • Experience supporting and managing Azure PaaS and SaaS offerings.
