
[Fargate/ECS] [Image caching]: provide image caching for Fargate. #696

Open · matthewcummings opened this issue Jan 14, 2020 · 146 comments
Labels: ECS (Amazon Elastic Container Service), Fargate (AWS Fargate), Work in Progress

matthewcummings commented Jan 14, 2020

EDIT: as @ronkorving mentioned, image caching is already available for EC2-backed ECS. I've updated this request to be specifically about Fargate.

What do you want us to build?
I've deployed scheduled Fargate tasks and been clobbered with high data transfer fees pulling down the image from ECR. Additionally, configuring a VPC endpoint for ECR is not for the faint of heart. The doc is a bit confusing.

It would be a big improvement if there were a resource (network/host) local to the instance where my containers run that could be used to load my Docker images.

Which service(s) is this request for?
Fargate and ECR.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I don't want to be charged for pulling a Docker image every time my scheduled Fargate task runs.
On that note, the VPC endpoint doc should be better too.

Are you currently working around this issue?
This was for a personal project; I instead deployed an EC2 instance running a cron job, which is not my preference. I would prefer using Docker and the ECS/Fargate ecosystem.

@matthewcummings matthewcummings added the Proposed Community submitted issue label Jan 14, 2020
@pavneeta pavneeta added ECS Amazon Elastic Container Service Fargate AWS Fargate labels Jan 14, 2020
@jtoberon jtoberon added the ECR Amazon Elastic Container Registry label Jan 15, 2020
@jtoberon

@matthewcummings can you clarify which doc you're talking about ("The doc is horrific")? Can you also clarify which regions your Fargate tasks and your ECR images are in?

matthewcummings commented Jan 15, 2020

@jtoberon can we have these kinds of things in every region? I generally use us-east-1 and us-west-2 these days.

matthewcummings commented Jan 15, 2020

It seems better now: https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html. It has been updated, from what I can see.

However, it still feels like a leaky abstraction. I'd argue that I shouldn't need to know/think about S3 here. Nowhere else in the ECS/EKS/ECR ecosystem do we really see mention of S3.

It would be great if the S3 details could be "abstracted away".

@jtoberon

Regarding regions, I'm really asking whether you're doing cross-region pulls.

You're right: this is a leaky abstraction. The client (e.g. docker) doesn't care, but from a networking perspective you need to poke a hole to S3 right now.

Regarding making all of this easier, we plan to build cross-region replication, and we plan to simplify the registry URL so that you don't have to think as much about which region you're pulling from. #140 has more details and some discussion.
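For anyone setting this up today, here is a rough sketch of the endpoints involved (not an official recipe; the VPC, subnet, security group, and route table IDs are placeholders). A Fargate task in a private subnet needs interface endpoints for the ECR API and the Docker registry side of ECR, plus an S3 gateway endpoint for the image layers that ECR stores in S3.

# Interface endpoint for the ECR API
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
  --security-group-ids sg-0123456789abcdef0

# Interface endpoint for the Docker registry side of ECR
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
  --security-group-ids sg-0123456789abcdef0

# Gateway endpoint for S3 -- this is the "hole to S3" mentioned above,
# because ECR stores image layers in S3
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0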

@matthewcummings

Ha ha, thanks. Excuse my snarkiness. . . I am not doing cross-region pulls right now but that is something I may need to do.
Thank you!

@matthewcummings

@jtoberon your call on whether this should be a separate request or folded into the other one.

ronkorving commented Jan 17, 2020

Wait, aren't you really asking for ECS_IMAGE_PULL_BEHAVIOR control?

This was added (it seems) to ECS EC2 in 2018:
https://aws.amazon.com/about-aws/whats-new/2018/05/amazon-ecs-adds-options-to-speed-up-container-launch-times/

Agent config docs.

I get the impression Fargate does not give control over that, and does not have it set to prefer-cached or once. This is what we really need, isn't it?
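For reference, a minimal sketch of that agent setting as it works on an EC2 container instance (the cluster name is a placeholder); Fargate exposes no equivalent today:

# /etc/ecs/ecs.config on an EC2-backed ECS container instance
ECS_CLUSTER=my-cluster
# "prefer-cached" pulls only when no cached copy of the image exists;
# "once" never re-pulls an image that was already pulled on this instance.
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached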

@matthewcummings matthewcummings changed the title [Fargate/ECS] [Image caching]: provide image caching for ECS. [Fargate/ECS] [Image caching]: provide image caching for Fargate. Jan 18, 2020
@matthewcummings

@ronkorving yes, that's exactly what I've requested. I wasn't aware of the ECS/EC2 feature. . . thanks for pointing me to that. However, a Fargate option would be great. I'm going to update the request.

koxon commented Jan 24, 2020

Much needed indeed, this caching option for Fargate.

rametta commented Jan 30, 2020

I would like to upvote this feature too.
I'm using Fargate at work, and our images are ~1GB; it takes a very long time to start the task because it needs to re-download the image from ECR every time. If there were some way to cache the image, just like it's possible for ECS on EC2, that would be extremely beneficial.

@andrestone

How's this evolving?

There are many use cases where what you need is just a Lambda with unrestricted access to a kernel / filesystem. Having Fargate with cached / hot images perfectly fits this use case.

fitzn commented Feb 21, 2020

@jtoberon @samuelkarp I realize that this is a more involved feature to build than it was on ECS with EC2 since the instances are changing underneath across AWS accounts, but are you able to provide any timeline on if and when this image caching would be available in Fargate? Lambda eventually fixed this same cold start issue with the short-term cache. This request is for the direct analog in Fargate.

Our use case: we run containers on-demand when our customers initiate an action and connect them to the container that we spin up. So, it's a real-time use case. Right now, we run these containers on ECS with EC2 and the launch times are perfectly acceptable (~1-3 seconds) because we cache the image on the EC2 box with PULL_BEHAVIOR.

We'd really like to move to Fargate but our testing shows our Fargate containers spend ~70 seconds in the PENDING state before moving to the RUNNING state. ECR reports our container at just under 900MB. Both ECR and the ECS cluster are in the same region, us-east-1.

We have to make some investments in the area soon so I am trying to get a sense for how much we should invest into optimizing our current EC2-based setup because we absolutely want to move to Fargate as soon as this cold start issue is resolved. As always, thank you for your communication.

@Brother-Andy

I wish Fargate could have some sort of caching. Due to a lack of environment variables, my task just kept failing all weekend, and every restart meant a new image was downloaded from Docker Hub. In the end I faced horrible traffic usage, since Fargate had been deployed within a private VPC.
Of course there are endpoints (Fargate requires both ECR and S3 endpoints, as I understand it), but some sort of caching would still be a much cheaper and more predictable option.

pgarbe commented Mar 17, 2020

@Brother-Andy For this use-case, I built cdk-ecr-sync which syncs specific images from DockerHub to ECR. Doesn't solve the caching part but might reduce your bill.
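(Note for readers landing here later: ECR has since added pull-through cache rules, which can mirror an upstream public registry into your account without a custom sync job. A rough sketch, assuming the ECR Public upstream and the default repository naming; the account ID is a placeholder:)

# Create a rule that mirrors ECR Public into this account's private ECR
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix ecr-public \
  --upstream-registry-url public.ecr.aws

# The first pull populates the cache; later pulls stay inside ECR in-region
docker pull <account-id>.dkr.ecr.us-east-1.amazonaws.com/ecr-public/docker/library/alpine:latest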

pyx7b commented Apr 5, 2020

Ditto on the feature. We use containers to spin up cyber ranges for students. Usage can fluctuate from zero to thousands; Fargate is the best solution for ease of management, but the launch time is a challenge even with ECR. Caching is a much-needed feature.

narzero commented Apr 25, 2020

+1

1 similar comment
klatu201 commented May 5, 2020

+1

@rouralberto

Same here: I need to run multiple Fargate tasks cross-region, and it takes around a minute to pull the image. Once pulled, the task only takes 4 seconds to run. This completely stops us from using Fargate.

nmqanh commented May 29, 2020

We had the same problem; the Fargate task should take only 10 seconds to run, but it takes about a minute to pull the image :(

@congthang1

Is it possible to use an EFS file system to store the image and have the task just run that image? Or is that the same question of pulling from EFS to the VPS that stores the container?

amunhoz commented Jul 4, 2020

Azure is solving this problem on their platform:
https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/

@nakulpathak3

+1. We run a very large number of tasks with a ~1GB image. This would significantly speed up our deploys and would be a super helpful feature. We're considering moving to EC2 due to Fargate deployment slowness, and this is one of the factors.

MattBred commented Aug 5, 2020

Currently using the GitLab Runner Fargate driver, which is great, except for the spin-up time of ~1-2 minutes for our image (>1GB) because it has to pull it from ECR for every job. Not super great.

Would really like to see some sort of image caching.

benben commented Jul 18, 2023

The alternative for that perf today is for us to use ECS on EC2.

@gregtws Can you confirm that it is really that fast on EC2 instances for ECS? We currently run on Fargate, and it takes ~60s for our ~1GB images. From everything I've read so far, EC2-based ECS will not be much faster with scheduling, which is why I didn't consider it as a solution. If we could get from ~60s to below 10s, that would be awesome and a move to EC2 worthwhile.

fitzn commented Jul 18, 2023

@benben We run containers on ECS backed by EC2 instances. We use the setting to use cached container images on the EC2 instances. ECR reports our container image to be ~1.5GB. When we run ECS tasks with these containers, they start up in 1 second or so when cached. It takes 60 seconds or so to download an image fresh to the EC2 instance. So, we download the image as part of our start-up script for EC2 so that by the time the EC2 instance is added to the ECS cluster, the image is already cached.

benben commented Jul 18, 2023

@fitzn thank you for sharing! That sounds awesome. I was under the impression that most of the time is spent on AWS-internal scheduling 🔮, but if this is the case, I'll definitely move things to EC2. Would you be able to share the script that downloads the images on EC2 startup? Thanks again!

@g-arjones

Zero excitement about this release. Quite the opposite. I was very disappointed when I understood what it was... I was misled by the fact that the SOCI feature was posted in this issue, which is about caching (SOCI is something else entirely).

mfittko commented Jul 18, 2023

I guess there are even people out there who would pay for a "real" caching solution with Fargate. How about a third type of capacity provider that has access to EFS-backed file servers for the caching part? 🤔 It seems to me that there is a huge architectural issue with offering a real pull-through caching solution on Fargate. Maybe it would make things easier if the cache were only needed for a rather limited number of Fargate hosts.

gregtws commented Jul 18, 2023

The alternative for that perf today is for us to use ECS on EC2.

@gregtws Can you confirm that it is really that fast on EC2 instances for ECS? We currently run on Fargate, and it takes ~60s for our ~1GB images. From everything I've read so far, EC2-based ECS will not be much faster with scheduling, which is why I didn't consider it as a solution. If we could get from ~60s to below 10s, that would be awesome and a move to EC2 worthwhile.

Yes. I just checked some of our ECS on EC2 services, and I'm seeing sub-10s start-to-running when it hits a warm cache. The sampled container was ~1GB according to the ECR repo stats. The risk, of course, is that you randomly hit a cold cache, either because the image changed, it's a new instance (autoscaling), or the cache was purged (unusual).

In your particular use case, YMMV, since there are variables outside of your control, like whether you need to attach an ENI, your app's init time, etc. This is on a Nitro-based system, FWIW.

@bearrito

@fitzn @benben

We have a similar solution. When our EC2 instances start, we pull our base image, which is a chunky 20GB (AI/robotics libs are big). Different images are built on top of that, so different layers may or may not then get pulled in automatically by the runtime.

I can't share our code because of IP, but it boils down to something like the below inside cloud-init:

#cloud-config
runcmd:
  # give the Docker daemon a moment to come up, then pre-pull the base image
  - /usr/bin/sleep 10
  - /usr/bin/docker pull chunk/image

fitzn commented Jul 19, 2023

@benben Yeah it's something like this:

# Configure caching of images
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=ourcluster
ECS_IMAGE_PULL_BEHAVIOR=once
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
ECS_IMAGE_CLEANUP_INTERVAL=24h
EOF

# Get the ECR sign-in credentials
curl -o docker-credential-ecr-login https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.6.0/linux-amd64/docker-credential-ecr-login
chmod +x docker-credential-ecr-login
mv docker-credential-ecr-login /usr/bin/

# Tell Docker to use ECR login credentials
DockerDir="/home/ec2-user/.docker"
DockerConfig="${DockerDir}/config.json"

mkdir -p $DockerDir
echo "{ \"credsStore\": \"ecr-login\" } " > $DockerConfig
chmod a+rw $DockerConfig

# Download the image
# Optional: read "v1" from somewhere (or use the current date) to dynamically get a newer image version.
su -c "docker pull ourrepository.com/ourimage:v1" - ec2-user

benben commented Jul 20, 2023

Thank you very much @bearrito & @fitzn !

benben commented Jul 21, 2023

For everyone interested in SOCI: I added it to our GitHub Actions build pipeline and did not see any improvements.

Here are the measurements after deployment:

w/o SOCI: samples: 18	avg_start_time: 77.26s	 avg_pull_time: 41.86s
w/  SOCI: samples: 27	avg_start_time: 76.68s	 avg_pull_time: 43.57s

I had to give up our zstd compression, which was another recommendation from AWS, since it is not yet compatible with SOCI, and I had to jump through quite a few hoops to get there: soci is incompatible with the Docker version currently in GitHub Actions, so I had to re-pull everything into containerd.
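For anyone else wiring SOCI into CI, the containerd detour looks roughly like this (a sketch based on the soci-snapshotter docs; the image reference and auth handling are placeholders):

# The image must be in containerd's content store (not Docker's), which is
# why a docker-based CI pipeline needs the extra re-pull step.
IMAGE=<account-id>.dkr.ecr.us-east-1.amazonaws.com/myapp:v1
sudo ctr image pull --user "AWS:$(aws ecr get-login-password)" "$IMAGE"

# Build the SOCI index for the image's layers, then push it to ECR
# alongside the image so Fargate can lazy-load from it.
sudo soci create "$IMAGE"
sudo soci push --user "AWS:$(aws ecr get-login-password)" "$IMAGE"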

Kern-- commented Jul 27, 2023

@benben can you share some general details about the tasks that you tested with? The pull times look very similar with and without SOCI, which is unusual. It could be that SOCI isn't being used for some reason. (For example, SOCI won't be used if you have a logging container in your task that doesn't have a SOCI index; currently Fargate requires that all images in the task have a SOCI index in order to use SOCI.) Either way, more details would be helpful so we can investigate.

benben commented Aug 4, 2023

@Kern--

Thank you for that info! I must have missed that in the docs. We have another container running to route logs with FireLens. I added SOCI there too, and now it works.

samples: 688        avg_start_time: 25.81s	 avg_pull_time: 6.19s
samples: 1864       avg_start_time: 56.73s	 avg_pull_time: 30.48s

PS: Sorry to everyone else for being slightly offtopic here.

robb666 commented Sep 17, 2023

A caching feature is indeed needed in Fargate for me as well!

I have a Streamlit app with a YOLO model. After trimming dependencies and compressing via zstd, my Docker container is circa 690MB on ECR, and it takes about 55s to run a task on ECS Fargate.

seivan commented Sep 18, 2023

If CDK can abstract SOCI away in one line for private ECRs, I'm all for it as an intermediate solution 👍🏼

@asymfermion

We really have the same problem with Fargate. It costs us a lot of money, so it would be awesome if this could be done for Fargate.

wosiu commented Nov 20, 2023

For everyone here needing cross-region pulls to Fargate: please vote on ECR-to-ECR pull-through cache, #2208. That would address all use cases, not only Fargate.

vchettur commented Apr 6, 2024

This feature would be great. I just started working with Fargate and am developing an on-demand HLS streaming solution using GStreamer and a SaaS repository that stores the original mp3s and mp4s. The image pull from ECR is the biggest bottleneck; otherwise the performance is terrific. I did a multi-stage build, and SOCI helps a lot, but my image pull is still around 15 seconds every time, though that's a huge improvement over the 50+ seconds I was seeing.

@underclockeddev

big need

@SimonCatMew

drop some cache please

tmiklu commented Nov 22, 2024

Almost five years and still nothing. Big need.

fish-not-phish commented Dec 1, 2024

SOCI cut down my launch time by nearly 50%, but it still takes around 1 minute for the task to launch. There should be a better option for folks with large images. I have stripped down my image as much as possible.

BwL1289 commented Mar 8, 2025

Interested!

nikitahlushak commented Mar 10, 2025

@fish-not-phish Hi, can you please share some details of the stack you use?
I am very doubtful that we will benefit from SOCI when using a Golang stack with precompiled binaries and almost nothing else to load asynchronously.

@mikestef9 mikestef9 removed the ECR Amazon Elastic Container Registry label Mar 20, 2025