Thousands of AWS ECS applications stopped working due to a bug in RexRay-AWS interoperability

2023-06-22

A silent change to the AWS infrastructure caused thousands of ECS applications to become unusable. Applications that relied on EBS-backed volumes suddenly lost access to their storage. Read how we dealt with this issue and how you can prevent similar problems in your infrastructure.

Amazon Elastic Container Service (ECS) is a popular solution for container deployment in the AWS cloud. It is a great fit for stateless microservices, but running applications that need persistent volumes is harder.

It is possible to store data on EFS volumes, but this comes with speed limitations. To take advantage of faster EBS-backed storage, you have to rely on third-party plugins. One such plugin, RexRay, is recommended in AWS blog posts.

The RexRay plugin runs on the ECS cluster instances and allows EBS disks to be created, attached, and mounted as volumes for Docker containers.

The problem

The problem appears to have occurred earlier in the US regions, and users in the EU regions have been experiencing volume mounting issues since 21 June 2023. The first reports can be found in RexRay GitHub issue #1282. The plugin started to return:

docker: Error response from daemon: error while mounting volume '': 
VolumeDriver.Mount: docker-legacy: Mount: test: failed: problem with device discovery.

User bdellegrazie identified the root cause:

On some newer systems nvme id-ctrl /dev/nvme1n1 --raw-binary (used inside rexray) is returning different (unexpected) private metadata - this is why the volume isn’t found. Hopefully we can make it use ebsnvme-id instead (from here: https://github.com/amazonlinux/amazon-ec2-utils/blob/main/ebsnvme-id)

The problem was caused by an unannounced change in the AWS infrastructure. The RexRay plugin relies on the output of the nvme id-ctrl command to discover devices, and on newer systems that output has changed. The plugin has not been updated to handle the new output, and a fix seems unlikely soon because the project appears to be unmaintained.
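To make this kind of parsing concrete, here is a minimal sketch that reads EBS metadata from a saved Identify Controller blob. It is not RexRay's code. The offsets are assumptions based on the NVMe Identify Controller layout (serial number at bytes 4-23) and on the ebsnvme-id script (device name in the first 32 bytes of the vendor-specific area at offset 3072). All function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: extract EBS metadata from a saved NVMe Identify Controller blob.
# Offsets are assumptions (NVMe layout + ebsnvme-id script), not RexRay's source.

# id_ctrl_field FILE OFFSET LENGTH
# Print LENGTH bytes starting at OFFSET, with NUL bytes and trailing spaces removed.
id_ctrl_field() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | tr -d '\000' | sed 's/ *$//'
}

# ebs_volume_id FILE
# The serial number field is assumed to hold the volume ID without a hyphen,
# e.g. "vol0123456789abcdef0"; normalise it to "vol-0123456789abcdef0".
ebs_volume_id() {
    serial=$(id_ctrl_field "$1" 4 20)
    case "$serial" in
        vol-*) printf '%s\n' "$serial" ;;
        vol?*) printf 'vol-%s\n' "${serial#vol}" ;;
        *)     return 1 ;;   # not an EBS-style serial number
    esac
}

# ebs_device_name FILE
# The vendor-specific field is assumed to hold the attachment name, possibly
# prefixed with "/dev/"; accept both forms and print the bare name.
ebs_device_name() {
    name=$(id_ctrl_field "$1" 3072 32)
    [ -n "$name" ] || return 1
    printf '%s\n' "${name#/dev/}"
}
```

On a host with nvme-cli installed, the blob could be saved with nvme id-ctrl /dev/nvme1n1 --raw-binary > id-ctrl.bin (root is typically required) and the file passed to these functions. A parser that accepts only one form of these fields is fragile if the platform starts returning another.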

The impact

Thousands of ECS applications that relied on EBS-backed volumes stopped working. The problem was not limited to new deployments: applications that had been running for months or years failed as well.

Two hotfixes were available. One was to downgrade the instances to older families (r4/c4/m4), which do not use NVMe disks, but this meant sacrificing significant performance. The other was to wrap the nvme binary in a script that returned the old output.

The solution

Two users suggested a more permanent solution: use ebsnvme-id from the AWS ec2-utils package instead of nvme to get the correct device name from AWS. You can see the patch at https://github.com/joan-s-molas/rexray/commit/efe0dda26a5eb624608854a9638527915cd9871b.
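The idea behind that approach can be sketched in shell. This is an illustration, not the patch itself. It assumes that ebsnvme-id -v DEVICE prints a line ending in the volume ID (for example "Volume ID: vol-..."). The function name find_nvme_device and the EBSNVME_ID override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: find which NVMe block device backs a given EBS volume by asking
# ebsnvme-id instead of parsing `nvme id-ctrl` output directly.

# find_nvme_device VOLUME_ID DEVICE...
# Prints the first DEVICE whose reported volume ID equals VOLUME_ID.
# EBSNVME_ID may point at an alternative command (hypothetical override,
# handy for testing); it defaults to the ebsnvme-id tool on PATH.
find_nvme_device() {
    want=$1
    shift
    for dev in "$@"; do
        # Assumed output format: "Volume ID: vol-0123456789abcdef0"
        got=$("${EBSNVME_ID:-ebsnvme-id}" -v "$dev" 2>/dev/null) || continue
        got=${got##*: }   # keep only the text after the last ": "
        if [ "$got" = "$want" ]; then
            printf '%s\n' "$dev"
            return 0
        fi
    done
    return 1   # no device matched
}
```

A call such as find_nvme_device vol-0123456789abcdef0 /dev/nvme?n1 would print the matching device, or return a non-zero status if none of the listed devices reports that volume ID.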

Several users, including ourselves, have tested the patch and confirmed that it works. We have created a fork of the RexRay repository with the patch applied and published a new version of the plugin to Docker Hub.

To use the patched version, update your RexRay installation scripts to use the following image:

EBS_USELARGEDEVICERANGE=true \
docker plugin install umsp/rexray-ebs:nvme-fix \
 --alias rexray/ebs REXRAY_PREEMPT=true \
 EBS_REGION=${aws_region} --grant-all-permissions
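After installing the plugin, a quick smoke test can confirm that volumes actually mount, since the error above only shows up at mount time. The following is a minimal sketch under stated assumptions: the driver accepts --opt size=1, the busybox image can be pulled, and it is acceptable to create and delete a throwaway EBS volume. The function name smoke_test_volume and the DOCKER override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: create a throwaway volume, write and read a file through a
# container, then remove the volume.

# smoke_test_volume DRIVER VOLUME_NAME
# Returns 0 only if the volume could be created, mounted, written and read.
# DOCKER may point at an alternative command (hypothetical override for testing).
smoke_test_volume() {
    driver=$1
    name=$2
    docker=${DOCKER:-docker}

    "$docker" volume create --driver "$driver" --opt size=1 "$name" >/dev/null || return 1

    # Mount failures (such as "problem with device discovery") surface here.
    status=0
    out=$("$docker" run --rm -v "$name:/data" busybox \
        sh -c 'echo ok > /data/probe && cat /data/probe') || status=$?

    # Always try to clean up the throwaway volume.
    "$docker" volume rm "$name" >/dev/null 2>&1 || true

    [ "$status" -eq 0 ] && [ "$out" = "ok" ]
}
```

Running smoke_test_volume rexray/ebs rexray-smoke-test on a cluster instance, for example from a scheduled job, would flag this class of failure before an application restart does.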

The lesson

This problem highlights the importance of using well-maintained software. The RexRay plugin is recommended by AWS, but it does not appear to be actively maintained.

The issue is a reminder that even a legacy system can be affected by a change in the underlying infrastructure. It is important to keep your infrastructure up to date and to test it regularly.