
[Fargate/ECS] [Image caching]: provide image caching for Fargate. #696

Open · matthewcummings opened this issue Jan 14, 2020 · 146 comments
Labels: ECS (Amazon Elastic Container Service), Fargate (AWS Fargate), Work in Progress

matthewcummings commented Jan 14, 2020

EDIT: as @ronkorving mentioned, image caching is already available for EC2-backed ECS. I've updated this request to be specifically about Fargate.

What do you want us to build?
I've deployed scheduled Fargate tasks and been clobbered with high data transfer fees pulling down the image from ECR. Additionally, configuring a VPC endpoint for ECR is not for the faint of heart. The doc is a bit confusing.

It would be a big improvement if there were a resource (network/host) local to the instance where my containers run that could be used to load my Docker images.

Which service(s) is this request for?
Fargate and ECR.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I don't want to be charged for pulling a Docker image every time my scheduled Fargate task runs.
On that note, the VPC endpoint doc should be better too.

Are you currently working around this issue?
This was for a personal project; I instead deployed an EC2 instance running a cron job, which is not my preference. I would prefer using Docker and the ECS/Fargate ecosystem.

@matthewcummings matthewcummings added the Proposed Community submitted issue label Jan 14, 2020
@pavneeta pavneeta added ECS Amazon Elastic Container Service Fargate AWS Fargate labels Jan 14, 2020
@jtoberon jtoberon added the ECR Amazon Elastic Container Registry label Jan 15, 2020
@jtoberon

@matthewcummings can you clarify which doc you're talking about ("The doc is horrific")? Can you also clarify which regions your Fargate tasks and your ECR images are in?

matthewcummings commented Jan 15, 2020

@jtoberon can we have these kinds of things in every region? I generally use us-east-1 and us-west-2 these days.

matthewcummings commented Jan 15, 2020

It seems better now: https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html. It has been updated, from what I can see.

However, it still feels like a leaky abstraction. I'd argue that I shouldn't need to know/think about S3 here. Nowhere else in the ECS/EKS/ECR ecosystem do we really see mention of S3.

It would be great if the S3 details could be "abstracted away".

@jtoberon

Regarding regions, I'm really asking whether you're doing cross-region pulls.

You're right: this is a leaky abstraction. The client (e.g. docker) doesn't care, but from a networking perspective you need to poke a hole to S3 right now.

Regarding making all of this easier, we plan to build cross-region replication, and we plan to simplify the registry URL so that you don't have to think as much about which region you're pulling from. #140 has more details and some discussion.
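For anyone setting this up today, here is a rough sketch of the endpoints involved (not an official recipe; the VPC, subnet, security group, and route table IDs are placeholders). A Fargate task in a private subnet needs interface endpoints for the ECR API and the Docker registry side of ECR, plus an S3 gateway endpoint for the image layers that ECR stores in S3.

# Interface endpoint for the ECR API
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
  --security-group-ids sg-0123456789abcdef0

# Interface endpoint for the Docker registry side of ECR
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
  --security-group-ids sg-0123456789abcdef0

# Gateway endpoint for S3 -- this is the "hole to S3" mentioned above,
# because ECR stores image layers in S3
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0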

@matthewcummings

Ha ha, thanks. Excuse my snarkiness. . . I am not doing cross-region pulls right now but that is something I may need to do.
Thank you!

@matthewcummings

@jtoberon your call on whether this should be a separate request or folded into the other one.

ronkorving commented Jan 17, 2020

Wait, aren't you really asking for ECS_IMAGE_PULL_BEHAVIOR control?

This was added (it seems) to ECS EC2 in 2018:
https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-ecs-adds-options-to-speed-up-container-launch-times/

Agent config docs.

I get the impression Fargate does not give control over that, and does not have it set to prefer-cached or once. This is what we really need, isn't it?
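For reference, a minimal sketch of that agent setting as it works on an EC2 container instance (the cluster name is a placeholder); Fargate exposes no equivalent today:

# /etc/ecs/ecs.config on an EC2-backed ECS container instance
ECS_CLUSTER=my-cluster
# "prefer-cached" pulls only when no cached copy of the image exists;
# "once" never re-pulls an image that was already pulled on this instance.
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached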

@matthewcummings matthewcummings changed the title [Fargate/ECS] [Image caching]: provide image caching for ECS. [Fargate/ECS] [Image caching]: provide image caching for Fargate. Jan 18, 2020
@matthewcummings

@ronkorving yes, that's exactly what I've requested. I wasn't aware of the ECS/EC2 feature. . . thanks for pointing me to that. However, a Fargate option would be great. I'm going to update the request.

koxon commented Jan 24, 2020

Much needed indeed, this caching option for Fargate.

rametta commented Jan 30, 2020

I would like to upvote this feature too.
I'm using Fargate at work, and our images are ~1GB; it takes a very long time to start the task because it needs to re-download the image from ECR every time. If there were some way to cache the image, just like it's possible for ECS on EC2, that would be extremely beneficial.

@andrestone

How's this evolving?

There are many use cases where what you need is just a Lambda with unrestricted access to a kernel / filesystem. Having Fargate with cached / hot images perfectly fits this use case.

fitzn commented Feb 21, 2020

@jtoberon @samuelkarp I realize that this is a more involved feature to build than it was on ECS with EC2 since the instances are changing underneath across AWS accounts, but are you able to provide any timeline on if and when this image caching would be available in Fargate? Lambda eventually fixed this same cold start issue with the short-term cache. This request is for the direct analog in Fargate.

Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with PULL_BEHAVIOR.

We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the PENDING state before moving to the RUNNING state. ECR reports our container at just under 900MB. Both ECR and the ECS cluster are in the same region, us-east-1.

We have to make some investments in the area soon so I am trying to get a sense for how much we should invest into optimizing our current EC2-based setup because we absolutely want to move to Fargate as soon as this cold start issue is resolved. As always, thank you for your communication.

@Brother-Andy

I wish Fargate could have some sort of caching. Due to a lack of environment variables, my task just kept failing all weekend, and every restart meant a new image was downloaded from Docker Hub. In the end I faced horrible traffic usage, since Fargate had been deployed within a private VPC.
Of course there are endpoints (Fargate requires both ECR and S3 endpoints, as I understand it), but some sort of caching would still be a much cheaper and more predictable option.

pgarbe commented Mar 17, 2020

@Brother-Andy For this use-case, I built cdk-ecr-sync which syncs specific images from DockerHub to ECR. Doesn't solve the caching part but might reduce your bill.
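(Note for readers landing here later: ECR has since added pull-through cache rules, which can mirror an upstream public registry into your account without a custom sync job. A rough sketch, assuming the ECR Public upstream and the default repository naming; the account ID is a placeholder:)

# Create a rule that mirrors ECR Public into this account's private ECR
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix ecr-public \
  --upstream-registry-url public.ecr.aws

# The first pull populates the cache; later pulls stay inside ECR in-region
docker pull <account-id>.dkr.ecr.us-east-1.amazonaws.com/ecr-public/docker/library/alpine:latest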

pyx7b commented Apr 5, 2020

Ditto on the feature. We use containers to spin up cyber ranges for students. Usage can fluctuate from zero to thousands; Fargate is the best solution for ease of management, but the launch time is a challenge even with ECR. Caching is a much-needed feature.

narzero commented Apr 25, 2020

+1

1 similar comment
klatu201 commented May 5, 2020

+1

@rouralberto

Same here: I need to run multiple Fargate tasks cross-region, and it takes around a minute to pull the image. Once pulled, the task only takes 4 seconds to run. This completely stops us from using Fargate.

nmqanh commented May 29, 2020

We had the same problem; the Fargate task should take only 10 seconds to run, but it takes about a minute to pull the image :(

@congthang1

Is it possible to use an EFS file system to store the image and have the task just run that image? Or is that the same question of pulling from EFS to the VPS that stores the container?

amunhoz commented Jul 4, 2020

Azure is solving this problem on their platform:
https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/

@nakulpathak3

+1. We run a very large number of tasks with a ~1GB image. This would significantly speed up our deploys and would be a super helpful feature. We're considering moving to EC2 due to Fargate deployment slowness, and this is one of the factors.

MattBred commented Aug 5, 2020

Currently using the GitLab Runner Fargate driver, which is great, except for the spin-up time of ~1-2 minutes for our image (>1GB) because it has to pull it from ECR for every job. Not super great.

Would really like to see some sort of image caching.

benben commented Jul 18, 2023

The alternative for that perf today is for us to use ECS on EC2.

@gregtws Can you confirm that it is really that fast on EC2 instances for ECS? We currently run on Fargate, and it takes ~60s for our ~1GB images. From everything I've read so far, EC2-based ECS will not be much faster with scheduling, which is why I didn't consider it as a solution. If we could get from ~60s to below 10s, that would be awesome and a move to EC2 worthwhile.

fitzn commented Jul 18, 2023

@benben We run containers on ECS backed by EC2 instances. We use the setting to use cached container images on the EC2 instances. ECR reports our container image to be ~1.5GB. When we run ECS tasks with these containers, they start up in 1 second or so when cached. It takes 60 seconds or so to download an image fresh to the EC2 instance. So, we download the image as part of our start-up script for EC2 so that by the time the EC2 instance is added to the ECS cluster, the image is already cached.

benben commented Jul 18, 2023

@fitzn thank you for sharing! That sounds awesome. I was under the impression that most of the time is spent on AWS-internal scheduling 🔮, but if this is the case, I'll definitely move things to EC2. Would you be able to share the script that downloads the images on EC2 startup? Thanks again!

@g-arjones

Zero excitement about this release. Quite the opposite. I was very disappointed when I understood what it was... I was misled by the fact that the SOCI feature was posted in this issue, which is about caching (SOCI is something else entirely).

mfittko commented Jul 18, 2023

I guess there are even people out there who would pay for a "real" caching solution with Fargate. How about a third type of capacity provider that has access to EFS-backed file servers for the caching part? 🤔 It seems to me that there is a huge architectural issue with offering a real pull-through caching solution on Fargate. Maybe it would make things easier if the cache were only needed for a rather limited number of Fargate hosts.

gregtws commented Jul 18, 2023

The alternative for that perf today is for us to use ECS on EC2.

@gregtws Can you confirm that it is really that fast on EC2 instances for ECS? We currently run on Fargate, and it takes ~60s for our ~1GB images. From everything I've read so far, EC2-based ECS will not be much faster with scheduling, which is why I didn't consider it as a solution. If we could get from ~60s to below 10s, that would be awesome and a move to EC2 worthwhile.

Yes. I just checked some of our ECS on EC2 services, and I'm seeing sub-10s start-to-running when it hits a warm cache. The sampled container was ~1GB according to the ECR repo stats. The risk, of course, is that you randomly hit a cold cache, either because the image changed, it's a new instance (autoscaling), or the cache was purged (unusual).

In your particular use case, YMMV, since there are variables outside of your control, like whether you need to attach an ENI, your app's init time, etc. This is on a Nitro-based system, FWIW.

@bearrito

@fitzn @benben

We have a similar solution. When our EC2 instances start, we pull our base image, which is a chunky 20GB (AI/robotics libs are big). Different images are built on top of that, so different layers may or may not then get pulled in automatically by the runtime.

I can't share our code because of IP, but it boils down to something like the below inside cloud-init:

#cloud-config
runcmd:
  # give the Docker daemon a moment to come up, then pre-pull the base image
  - /usr/bin/sleep 10
  - /usr/bin/docker pull chunk/image

fitzn commented Jul 19, 2023

@benben Yeah it's something like this:

# Configure caching of images
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=ourcluster
ECS_IMAGE_PULL_BEHAVIOR=once
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
ECS_IMAGE_CLEANUP_INTERVAL=24h
EOF

# Get the ECR sign-in credentials
curl -o docker-credential-ecr-login https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.6.0/linux-amd64/docker-credential-ecr-login
chmod +x docker-credential-ecr-login
mv docker-credential-ecr-login /usr/bin/

# Tell Docker to use ECR login credentials
DockerDir="/home/ec2-user/.docker"
DockerConfig="${DockerDir}/config.json"

mkdir -p $DockerDir
echo "{ \"credsStore\": \"ecr-login\" } " > $DockerConfig
chmod a+rw $DockerConfig

# Download the image
# Optional: read "v1" from somewhere (or use the current date) to dynamically get a newer image version.
su -c "docker pull ourrepository.com/ourimage:v1" - ec2-user

benben commented Jul 20, 2023

Thank you very much @bearrito & @fitzn !

benben commented Jul 21, 2023

For everyone interested in SOCI: I added it to our GitHub Actions build pipeline and did not see any improvements.

Here are the measurements after deployment:

w/o SOCI: samples: 18	avg_start_time: 77.26s	 avg_pull_time: 41.86s
w/  SOCI: samples: 27	avg_start_time: 76.68s	 avg_pull_time: 43.57s

I had to give up our zstd compression, which was another recommendation from AWS, since it is not yet compatible with SOCI, and I had to jump through quite a few hoops to get there: soci is incompatible with the Docker version currently in GitHub Actions, so I had to re-pull everything into containerd.
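For anyone else wiring SOCI into CI, the containerd detour looks roughly like this (a sketch based on the soci-snapshotter docs; the image reference and auth handling are placeholders):

# The image must be in containerd's content store (not Docker's), which is
# why a docker-based CI pipeline needs the extra re-pull step.
IMAGE=<account-id>.dkr.ecr.us-east-1.amazonaws.com/myapp:v1
sudo ctr image pull --user "AWS:$(aws ecr get-login-password)" "$IMAGE"

# Build the SOCI index for the image's layers, then push it to ECR
# alongside the image so Fargate can lazy-load from it.
sudo soci create "$IMAGE"
sudo soci push --user "AWS:$(aws ecr get-login-password)" "$IMAGE"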

Kern-- commented Jul 27, 2023

@benben can you share some general details about the tasks that you tested with? The pull times look very similar with and without SOCI, which is unusual. It could be that SOCI isn't being used for some reason. (For example, SOCI won't be used if you have a logging container in your task that doesn't have a SOCI index; currently Fargate requires that all images in the task have a SOCI index in order to use SOCI.) Either way, more details would be helpful so we can investigate.

benben commented Aug 4, 2023

@Kern--

Thank you for that info! I must have missed that in the docs. We have another container running to route logs with FireLens. I added SOCI there too, and now it works.

samples: 688        avg_start_time: 25.81s	 avg_pull_time: 6.19s
samples: 1864       avg_start_time: 56.73s	 avg_pull_time: 30.48s

PS: Sorry to everyone else for being slightly offtopic here.

robb666 commented Sep 17, 2023

A caching feature is indeed needed in Fargate for me as well!

I have a Streamlit app with a YOLO model. After trimming dependencies and compressing via zstd, my Docker container is circa 690MB on ECR, and it takes about 55s to run a task on ECS Fargate.

seivan commented Sep 18, 2023

If CDK can abstract SOCI away in one line for private ECRs, I'm all for it as an intermediate solution 👍🏼

@asymfermion

We really have the same problem with Fargate. It costs us a lot of money, so it would be awesome if this could be done for Fargate.

wosiu commented Nov 20, 2023

For everyone here needing cross-region pulls to Fargate: please vote on ECR-to-ECR pull-through cache, #2208. That would address all use cases, not only Fargate.

vchettur commented Apr 6, 2024

This feature would be great. I just started working with Fargate and am developing an on-demand HLS streaming solution using GStreamer and a SaaS repository that stores the original mp3s and mp4s. The image pull from ECR is the biggest bottleneck; otherwise the performance is terrific. I did a multi-stage build, and SOCI helps a lot, but my image pull is still around 15 seconds every time, though that's a huge improvement over the 50+ seconds I was seeing.

@underclockeddev

big need

@SimonCatMew

drop some cache please

tmiklu commented Nov 22, 2024

Almost five years and still nothing. Big need.

fish-not-phish commented Dec 1, 2024

SOCI cut down my launch time by nearly 50%, but it still takes around 1 minute for the task to launch. There should be a better option for folks with large images. I have stripped down my image as much as possible.

BwL1289 commented Mar 8, 2025

Interested!

nikitahlushak commented Mar 10, 2025

@fish-not-phish Hi, can you please share some details of the stack you use?
I am very doubtful that we will benefit from SOCI when using a Golang stack with precompiled binaries and almost nothing else to load asynchronously.

@mikestef9 mikestef9 removed the ECR Amazon Elastic Container Registry label Mar 20, 2025