Ivan Kovalev

About Me

I am a Site Reliability Engineer (SRE) passionate about designing scalable, reliable, and efficient systems. I am mostly interested in self-hosting and bare-metal deployments. This is my personal website, where I’ll showcase my work, experiments, and thoughts over time.

Work Experience

Optimizely – Senior Site Reliability Engineer

At Optimizely, I oversee the management and deployment of critical services across multiple cloud environments, including both Google Cloud Platform (GCP) and Amazon Web Services (AWS). My responsibilities include managing the full lifecycle of our cloud databases and developing comprehensive Terraform modules that streamline provisioning, configuration, and monitoring.

By defining key metrics and implementing robust monitoring strategies, I ensure proactive detection of issues and maintain optimal performance levels. Among my most notable achievements was the seamless migration of a high-traffic CDN, handling over 80,000 requests per second, from one cloud provider to another without any downtime. This project exemplified my commitment to reliability, scalability, and smooth operational transitions.

ING – Site Reliability Engineer

During my time at ING, I played a key role in maintaining and optimizing a self-hosted machine learning platform running on Kubernetes. This involved ensuring the platform met defined Service Level Agreements (SLAs) and Service Level Objectives (SLOs), as well as establishing and refining our incident management processes.

Beyond the day-to-day operations, I promoted the use of open-source solutions and best practices, enabling the team to leverage a wider ecosystem of tools and technologies. Through these efforts, the ML platform operated more reliably, ensuring timely, accurate insights that supported ING’s data-driven decisions.

Booking.com – Site Reliability Engineer

At Booking.com, I maintained critical infrastructure responsible for internal authentication and authorization. By enhancing the reliability of these systems, I ensured seamless access management for thousands of internal users.

I also introduced automated integration tests using GitLab CI, improving code quality and facilitating smoother deployments. Furthermore, I implemented rate limiting for distributed applications in Kubernetes to manage high-volume traffic effectively. Together, these initiatives increased operational efficiency and minimized downtime.

VK – Site Reliability Engineer

I began my professional career at VK, where I focused on ensuring the reliability, scalability, and high availability of various large-scale services. My portfolio included the social network Moi Mir, as well as DonationAlerts, Boosty, and VK Pay.

My responsibilities covered the full lifecycle of the infrastructure: from installing operating systems on bare-metal servers to configuring them for production use. I was deeply involved in monitoring system health, troubleshooting performance bottlenecks, and taking part in on-call rotations to address critical incidents swiftly. During periods of high load, I ensured services remained stable and efficient.

One of my most significant achievements was the successful deployment of MySQL Orchestrator. This solution streamlined our database operations, making it possible to relocate servers between data centers without data inconsistency or downtime. As a result, we improved our system’s resilience against server outages and network failures, ensuring uninterrupted service for millions of users.

Public Projects

Fingerprint Driver for GPD Pocket 4

A libfprint driver for the FocalTech FT9362 fingerprint sensor (USB 2808:0752) used in the GPD Pocket 4.

Instead of the traditional NBIS-based minutiae matching approach, this driver uses a Siamese neural network for fingerprint comparison. This design choice was driven by the extremely small image size produced by the sensor, which makes reliable feature extraction impractical.

The neural-network-based approach proved to be robust in practice, achieving a low false-positive rate while maintaining acceptable matching accuracy.
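
The comparison step of a Siamese setup can be sketched as follows. This is a minimal illustration, not the driver's actual code: it assumes the network has already turned each fingerprint image into a fixed-length embedding vector, and the cosine metric, the function names, and the 0.9 threshold are all hypothetical choices made for the example.

```python
import math
from typing import Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction, so treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(
    probe: Sequence[float],
    enrolled: Iterable[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value, not taken from the driver
) -> bool:
    """Accept the probe if it is close enough to any enrolled template."""
    return any(cosine_similarity(probe, t) >= threshold for t in enrolled)
```

In a scheme like this, raising the threshold lowers the false-positive rate at the cost of rejecting more genuine scans, which is the trade-off described above.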

eBPF Audit

An eBPF-based auditing tool for monitoring open files and network connections with minimal runtime overhead.

The system tracks which processes open specific files and establish network connections, streaming this data to users in real time. The event loop runs in under 15 µs, ensuring negligible performance impact.

It supports two operating modes:
- Collection mode: records observed files and connections into a local SQLite database.
- Analysis mode: compares live system activity against the stored baseline and reports deviations.

This approach allows system administrators to detect behavioral anomalies and identify potential security issues before they escalate.
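
The two modes can be illustrated with a small sketch. It is not the tool's real code or schema: the `baseline` table, the `(process, kind, target)` event shape, and the function names are assumptions made for the example, and the events are passed in as plain tuples rather than read from eBPF.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical event shape: (process name, event kind, target path or address)
Event = Tuple[str, str, str]


def open_baseline(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the SQLite database that holds the baseline."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baseline ("
        " process TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " target TEXT NOT NULL,"
        " PRIMARY KEY (process, kind, target))"
    )
    return conn


def collect(conn: sqlite3.Connection, events: Iterable[Event]) -> None:
    """Collection mode: store each distinct observed event once."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO baseline VALUES (?, ?, ?)", events
        )


def analyze(conn: sqlite3.Connection, events: Iterable[Event]) -> List[Event]:
    """Analysis mode: return live events that are absent from the baseline."""
    deviations = []
    for event in events:
        row = conn.execute(
            "SELECT 1 FROM baseline"
            " WHERE process = ? AND kind = ? AND target = ?",
            event,
        ).fetchone()
        if row is None:
            deviations.append(event)
    return deviations
```

For example, after collecting `("nginx", "file", "/etc/nginx/nginx.conf")`, an analysis run that sees `("nginx", "file", "/etc/shadow")` would report the second event as a deviation.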

Nix Local Cache

A Nix derivation builder for generating a local binary cache with user-managed signing keys.

The system supports multiple builders and multi-architecture builds, enabling scalable and reproducible package compilation across different target platforms.

A key feature is support for building directly from remote flakes, allowing users to populate a fresh local cache without maintaining local derivations. This makes it particularly well-suited for over-the-air (OTA) NixOS updates, where end devices can fetch prebuilt artifacts instead of compiling locally.

The project exposes a simple HTTP API for scheduling builds and provides a lightweight web interface to monitor build status and failures.

Find Me Online