Skip to content

Improved detection of timeout based vs direct cancellation of cancel scopes #591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 24, 2023

Conversation

agronholm
Copy link
Owner

Fixes #589.

@smurfix
Copy link
Collaborator

smurfix commented Jul 18, 2023

Unfortunately this patch doesn't fix anything on my system. Tested against anyio/fix-timeout-after-success and Trio/v0.22.2.

Test run:

Crash TRIO=False Owch 33 34
Crash TRIO=None Owch 7015 7016
Crash TRIO=True Owch 76 77
OK with TRIO=2

using this code:

import sys
import anyio
import trio
import random

T = 0.0001
N = int(0.1/T)

async def eat(r):
    a=0
    b=0
    while True:
        msg = None
        try:
            with (trio if TRIO==2 else anyio).fail_after(T):
                msg = await r.receive()
        except (trio.TooSlowError,TimeoutError):
            if msg is not None:
                raise RuntimeError(f"Owch {a} {b}")
            b += 1
        else:
            a += 1
        if a%N == 0 or b%N == 0:
            sys.stderr.write(f"  {a} {b}\r"); sys.stderr.flush()

async def main():
    s,r = (trio.open_memory_channel if TRIO else anyio.create_memory_object_stream)(10)
    async with (trio.open_nursery if TRIO else anyio.create_task_group)() as tg:
        tg.start_soon(eat,r)
        for _ in range(100*N):
            await (trio if TRIO else anyio).sleep(T*(random.random() * 0.7 + 0.9))
            await s.send(True)
        tg.cancel_scope.cancel()

for TRIO in (False,None,True,2):
    try:
        if TRIO:
            trio.run(main)
        else:
            anyio.run(main, backend="asyncio" if TRIO is None else "trio")
    except Exception as exc:
        if hasattr(exc,"exceptions"):
            exc = exc.exceptions[0]
        print(f"Crash TRIO={TRIO} {exc}")
    else:
        print(f"OK with TRIO={TRIO}")

@smurfix
Copy link
Collaborator

smurfix commented Jul 18, 2023

The test program does not crash (or at least not reliably enough …) if I increase T to 0.01 on my test machine, which suggests that if you tried my previous version and didn't see the problem you need to decrease T somewhat; after doing that, you might have to adjust the randomization so that both values increase at roughly the same speed. If the first value stays at zero then T is too low.

@agronholm
Copy link
Owner Author

I was not able to 100% fix this on trio because it's unable to tell the difference between timeout based cancellation vs manual cancellation + deadline exceeded later, unlike the asyncio backend where the timer expiring event is explicitly recorded and checked against. Are you able to make the test program fail with asyncio though? If so, that then would indicate a problem.

@agronholm
Copy link
Owner Author

I wrote that too soon. After reading the code, it looks like your new test snippet indeed tests against the asyncio backend too. I'll see if I can sort this out before heading to EuroPython.

@agronholm
Copy link
Owner Author

With the changes I pushed, I can no longer repro the problem with your script. Can you confirm?

@smurfix
Copy link
Collaborator

smurfix commented Jul 22, 2023

With the current changes (commit 95d1f97) there is no longer a problem with the Trio backend.

However, I found a different problem with asyncio. The inner test loop terminates way too early.

To reproduce:

  • replace the outer loop with for TRIO in (None,):
  • play with the value of T; on my system I can reliably trigger this with 0.00005.

The problem: The sleep call in the for _ in range(100*N) loop runs a few times, then its cancellation escapes and silently terminates the taskgroup.

Thus the code, changed to

    async with (trio.open_nursery if TRIO else anyio.create_task_group)() as tg:
        tg.start_soon(eat,r)
        for _ in range(100*N):
            try:
                await (trio if TRIO else anyio).sleep(T*(random.random() * 0.7 + 0.9))
            except BaseException as exc:
                raise RuntimeError("What?") from exc
            await s.send(True)
        tg.cancel_scope.cancel()


ends up emitting Crash TRIO=None What? which is kindof an unsatisfactory result. :-/

@agronholm
Copy link
Owner Author

Back to the drawing board I guess ☹️
I'll let you know once everything works again.

@agronholm
Copy link
Owner Author

I have played around with the code without being able to repro the problem. Would you mind pasting your entire script here with which you repro'd the new issue, so I can be sure I'm doing it right?

@smurfix
Copy link
Collaborator

smurfix commented Jul 22, 2023

Sure.

Python 3.11.2, standard Debian installation. The error shows up in a new venv, so it's unlikely to be related to any strange module I installed.

Removing the fail_after call in eat makes the error go away.

import sys
import anyio
import asyncio
import trio
import random
TRIO=None

T = 0.00005
N = int(0.1/T)

async def eat(r):
    a=0
    b=0
    while True:
        msg = None
        try:
            with (trio if TRIO==2 else anyio).fail_after(T):
                msg = await r.receive()
        except (trio.TooSlowError,TimeoutError):
            if msg is not None:
                raise RuntimeError(f"Owch {a} {b}")
            b += 1
        else:
            a += 1
        if a%N == 0 or b%N == 0:
            sys.stderr.write(f"  {a} {b}\r"); sys.stderr.flush()

async def main():
    s,r = (trio.open_memory_channel if TRIO else anyio.create_memory_object_stream)(10)
    async with (trio.open_nursery if TRIO else anyio.create_task_group)() as tg:
        tg.start_soon(eat,r)
        for _ in range(10*N):
            try:
                await (trio if TRIO else asyncio if TRIO is None else anyio).sleep(T*(random.random() * 0.7 + 0.9))
            except BaseException as exc:
                raise RuntimeError("What?") from exc
            await s.send(True)
        tg.cancel_scope.cancel()


for TRIO in (None,):
#for TRIO in (False,None,True,2):
    try:
        if TRIO:
            trio.run(main)
        else:
            anyio.run(main, backend="asyncio" if TRIO is None else "trio")
    except Exception as exc:
        if hasattr(exc,"exceptions"):
            exc = exc.exceptions[0]
        raise exc
    else:
        print(f"OK with TRIO={TRIO}")

$ PYTHONPATH=/src/anyio/src python3 /tmp/t.py 
Traceback (most recent call last):
  File "/tmp/t.py", line 34, in main
    await (trio if TRIO else asyncio if TRIO is None else anyio).sleep(T*(random.random() * 0.7 + 0.9))
  File "/usr/lib/python3.11/asyncio/tasks.py", line 639, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/t.py", line 51, in <module>
    raise exc
  File "/tmp/t.py", line 36, in main
    raise RuntimeError("What?") from exc
RuntimeError: What?

@agronholm
Copy link
Owner Author

Thanks, I was now able to repro the problem locally.

@agronholm
Copy link
Owner Author

What seems to be happening is that the task is cancelled by a cancel scope, but the cancellation is only delivered to the task after a new scope has been entered. The new scope is exited with CancelledError, and sees that it didn't cancel the task, so it lets the exception escape.

@agronholm
Copy link
Owner Author

I checked and it seems that the stdlib async with asyncio.timeout(...): suffers from the exact same issue, and for the exactly same reason.

@agronholm
Copy link
Owner Author

I went to talk to Łukasz Langa here at the EP2023 sprints and agreed that I should report this as an issue.

But given that this is a fundamental issue in asyncio, I don't think this should delay the 4.0 release any longer. Would you agree that this PR substantially improves the semantics even though it doesn't fully resolve all the issues? We can work on improving things in later releases.

@smurfix
Copy link
Collaborator

smurfix commented Jul 23, 2023

Ah. Frankly I already suspected that.

Sure, go ahead with 4.0. This issue is not a problem for me, I'm using Trio. ;-) and even if it was it's not anyio's fault and even if you can work around it in anyio, given that it affects non-anyio code, doing so would not really help much.

@agronholm
Copy link
Owner Author

One issue I noticed on py3.11 is that when this happens, the cancel scope doesn't uncancel the host task since that is currently only triggered when a CancelledError is caught by the CM. I think I'll at least fix that before merging.

@agronholm
Copy link
Owner Author

The tricky part is constructing a test to consistently repro this situation.

@agronholm
Copy link
Owner Author

@graingert and I both came to the conclusion that AnyIO's memory object streams are the fundamental problem here. Neither of us could reproduce the problem with asyncio queues.

@smurfix
Copy link
Collaborator

smurfix commented Jul 23, 2023

Are they? AFAICT the only async thing memory streams do is wait on (and set, and (probably most relevant here) cancel tasks waiting on) perfectly normal Events … no idea why that should be a "fundamental problem".

Thus the next step I'd consider would be to copy the relevant parts into your test program, maybe inline some of the code in question, and then boil that down to a somewhat-minimal reproducer.

@agronholm
Copy link
Owner Author

Memory object streams cheat a little bit as far as cancellation goes, see here. We suspected that this is the cause of your troubles. I'm working on an alternative solution.

@smurfix
Copy link
Collaborator

smurfix commented Jul 23, 2023

Owch. Yeah, I seem to have missed that one.

I agree that you really shouldn't do that.

@agronholm
Copy link
Owner Author

That code was added to fix #146.

@agronholm
Copy link
Owner Author

agronholm commented Jul 24, 2023

After a lengthy discussion on the Gitter Trio room, I decided to back out of deadline_reached and instead implement CancelScope.cancelled_caught from Trio. We discussed ways to improve the detection of cancellation reason/source, but those improvements will have to be deferred to a later feature release. The memory object stream receive bug still needs to be addressed before v4.0 final, but I may do that in a separate PR.

@agronholm agronholm merged commit 6f0f14d into master Jul 24, 2023
@agronholm agronholm deleted the fix-timeout-after-success branch July 24, 2023 10:11
@agronholm
Copy link
Owner Author

As I made some significant tweaks before merging, it might be prudent to test that master still works for you. I did add more tests to ensure that there are no regressions, but...you never know 😛

@smurfix
Copy link
Collaborator

smurfix commented Jul 24, 2023

Well, not master, but dont_cancel_finished_futures (21804f7) does pass. Thanks for all that work!

I'm going to run my dev tests with it but I really don't expect any more surprises. This bug didn't show up in dev tests either – it raised its ugly head because my home's temperature monitors died every week with no good reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timeout error after success
2 participants