Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating an alert with Slack notifications errors out due to rate-limiting in large Slack workspaces #32480

Open
3 tasks done
Usiel opened this issue Mar 3, 2025 · 7 comments
Labels
alert-reports Namespace | Anything related to the Alert & Reports feature

Comments

@Usiel
Copy link
Contributor

Usiel commented Mar 3, 2025

Bug description

Preconditions: Be connected to a Slack workspace with thousands of channels and even more archived channels.

  1. Make sure Slack is connected (v1 or v2)
  2. Edit an alert and open the "Notification method" section
  3. Select Slack

Actual: The request to /api/v1/report/slack_channels/?... errors out with a rate-limit error from the Slack client.
Expected: User can select a channel/group without much delay.

Screenshots/recordings

Image

Superset version

4.1.1

Python version

3.11

Node version

Not applicable

Browser

Chrome

Additional context

Why?

From what I could see this is due to Slack's implementation of the conversation.list endpoint. When the connected workspace has lots of archived channels, then the cursor will return far less conversations than specified in the request (using limit=999). This means we end up doing a lot of requests to fully iterate all conversations, eventually running into the rate-limit.

I applied a diff to properly handle rate-limit errors. This allows the request to eventually finish (6 minutes for our workspace) but it's not a great user experience 😆 (It may still be useful to add the rate-limit handler to preempt other Slack API issues).

diff --git a/superset/utils/slack.py b/superset/utils/slack.py
index 468429fb60..5d0cbfe507 100644
--- a/superset/utils/slack.py
+++ b/superset/utils/slack.py
@@ -22,6 +22,7 @@ from typing import Optional
 from flask import current_app
 from slack_sdk import WebClient
 from slack_sdk.errors import SlackApiError
+from slack_sdk.http_retry.builtin_handlers import RateLimitErrorRetryHandler

 from superset import feature_flag_manager
 from superset.exceptions import SupersetException
@@ -43,7 +44,12 @@ def get_slack_client() -> WebClient:
     token: str = current_app.config["SLACK_API_TOKEN"]
     if callable(token):
         token = token()
-    return WebClient(token=token, proxy=current_app.config["SLACK_PROXY"])
+    client = WebClient(token=token, proxy=current_app.config["SLACK_PROXY"])
+
+    rate_limit_handler = RateLimitErrorRetryHandler(max_retry_count=2)
+    client.retry_handlers.append(rate_limit_handler)
+
+    return client


 def get_channels_with_search(

Options

  • Update the Slack conversations asynchronously with a configurable Celery task and then use the cached conversation list when a user makes it to the edit alert window? Make sure rate-limiting errors are handled.
  • Allow users to enter any string as recipient and map to a channel ID asynchronously?
  • ...?

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
Copy link

dosubot bot commented Mar 3, 2025

Hey there, @Usiel! I'm here to assist you with any bugs, questions, or contributions you have for Superset. Let's tackle this issue together!

To address the rate-limiting issue when retrieving Slack channels in large workspaces, you can consider the following approaches:

  1. Implement Rate-Limit Handling: As you've already done, adding a RateLimitErrorRetryHandler to the Slack client is a good start. This will help manage rate-limit errors by retrying requests a limited number of times. You can adjust the max_retry_count to balance between retry attempts and user experience.

  2. Asynchronous Retrieval with Caching: Implementing a Celery task to fetch Slack channels asynchronously and cache the results can significantly improve performance. This way, when a user accesses the alert configuration, the application can quickly retrieve the cached list of channels without hitting the Slack API repeatedly. Ensure that the cache is updated periodically to reflect any changes in the Slack workspace.

  3. User Input for Channel Selection: Allow users to manually enter a channel name or ID, which can then be validated and mapped to the correct channel asynchronously. This approach reduces the need to fetch all channels upfront and can be particularly useful in large workspaces.

  4. Optimize API Calls: Consider optimizing the API calls by reducing the number of requests. For example, you can fetch only active channels or limit the number of channels retrieved per request to avoid hitting the rate limit quickly.

These strategies can help mitigate the rate-limiting issue and improve the user experience when configuring Slack notifications in Superset.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other

@dosubot dosubot bot added the alert-reports Namespace | Anything related to the Alert & Reports feature label Mar 3, 2025
@sadpandajoe
Copy link
Member

@Usiel how many channels does your slack workspace have. Unfortunately some of this is limited by the slack api itself since they only allow 1k in their api response with pagination and from the internet, it sounds like it may be rate limited to about 20 requests per minute. I think we don't currently look at archived channels but if we do, we can probably just add a flag, but if we aren't doing that a quick fix might be to add the flag as true according to the doc: https://api.slack.com/methods/conversations.list

@Usiel
Copy link
Contributor Author

Usiel commented Mar 4, 2025

We have around 20k channels, half of them archived. Interestingly, just now when I wanted to do some more probing, iterating all conversations took just a few seconds and only 16 requests. I can see in the logs that yesterday iterating all channels took 6 minutes and more than 200 (successful) requests, each with a distinct cursor (so these were not retried due to connectivity or similar). We saw this issue on 2 different Superset deployments in different datacenters owned by different teams (but connecting to the same Slack workspace).

Assuming this was just a temporary issue with Slack's API, the more extensive changes listed under "Options" are probably not necessary. Imo, the improvement to Superset's slack.utils.get_slack_client using Slack's rate-limit handler may still be a good addition. I can see about opening a PR for that.

@Usiel
Copy link
Contributor Author

Usiel commented Mar 5, 2025

I had an hourly celery task running Superset's get_channels_with_search(...) running since yesterday. Runtimes are between 10-30 seconds, indicating this may have been just a temporary issue.

I'm OK to close this issue unless someone else encountered the same degradation.

I will see about contributing the Slack rate-limit handler improvement soon-ish.

Usiel added a commit to Usiel/superset that referenced this issue Mar 5, 2025

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
While investigating apache#32480 I noticed that Superset's slack client does not handle rate-limiting. Although the original issue may have been intermittent, I still believe it will be useful to handle rate-limiting to make Slack operations with Superset more robust.

The rate-limit is easily triggered by 2 users trying to change a Slack alert with a large Slack workspace (multiple `conversations.list` requests with rate-limit 20 per minute), or lots of reports firing at the same time.

The `RateLimitErrorRetryHandler` feature is available since slack_sdk>=3.9.0 (Superset requires at least 3.19.0).
Usiel added a commit to Usiel/superset that referenced this issue Mar 6, 2025

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
While investigating apache#32480 I noticed that Superset's slack client does not handle rate-limiting. Although the original issue may have been intermittent, I still believe it will be useful to handle rate-limiting to make Slack operations with Superset more robust.

The rate-limit is easily triggered by 2 users trying to change a Slack alert with a large Slack workspace (multiple `conversations.list` requests with rate-limit 20 per minute), or lots of reports firing at the same time.

The `RateLimitErrorRetryHandler` feature is available since slack_sdk>=3.9.0 (Superset requires at least 3.19.0).
@Usiel
Copy link
Contributor Author

Usiel commented Mar 6, 2025

I encountered the problem again and realized my test (hourly celery task) inadvertently only requested public channels. I can reliably reproduce the issue when requesting private and public channels. The bot is member of just a single private channel. Iterating the full list takes again around 5-7 minutes.

For further testing I ran the get_channels_with_search(...) code different parameters. When private channels are requested the full iteration becomes incredibly slow and needs to pause multiple times due to rate-limiting.

exclude_archived=False, types='None'                           
    => requests=<order of 10>, len(channels)=<order of 10k>
exclude_archived=True,  types='None'                           
    => requests=<order of 10>, len(channels)=<order of 10k>

exclude_archived=False, types='public_channel,private_channel' 
    => requests=<order of 100>, len(channels)=<order of 10k> + 1
exclude_archived=True,  types='public_channel,private_channel' 
    => requests=<order of 100>, len(channels)=<order of 10k> + 1

Imo, this is an issue, I just don't know whether this affects many other Superset users :) I doubt Slack will optimize their API.

Usiel added a commit to Usiel/superset that referenced this issue Mar 6, 2025

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
See apache#32480

This changes introduces a cached `get_channels` function for Slack conversations. Large workspaces require 5-10 minutes to retrieve all conversations.
@sadpandajoe
Copy link
Member

Potential fix for this might be: #32529

@Usiel
Copy link
Contributor Author

Usiel commented Mar 7, 2025

Potential fix for this might be: #32529

Thank you, yes that looks promising and fixes it pretty much (well, not for the first user - they have to wait a few minutes 😆). I think we'd be golden if we exposed the caching function as a celery task, so it can run be part of the beat schedule. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alert-reports Namespace | Anything related to the Alert & Reports feature
Projects
None yet
Development

No branches or pull requests

2 participants