race condition on slow disk #349

Open
cariaso opened this issue Jul 29, 2021 · 3 comments


cariaso commented Jul 29, 2021

https://github.com/magic-wormhole/magic-wormhole uses txtorcon. If I run

wormhole receive 3-some-code --tor --launch-tor

it calls into txtorcon.

However, in my current environment it quickly crashes, 100% of the time, with the message:

launching a new Tor process, this may take a while..
 Unhandled Error
 Traceback (most recent call last):
 Failure: twisted.internet.error.ConnectError: An error occurred while connecting: 2: No such file or directory.

However, if I add a time.sleep(0.5) after the reactor.spawnProcess call at txtorcon/controller.py line 360:

    transport = reactor.spawnProcess(
        process_protocol,
        tor_binary,
        args=args,
        env={'HOME': data_directory},
        path=data_directory if os.path.exists(data_directory) else None,  # XXX error if it doesn't exist?
    )

the problem goes away. (Smaller sleeps seem to work, but I've not measured the exact threshold.) I expect this is somehow related to the fact that I'm running off of networked storage.

Can anyone offer deeper insight into this, and perhaps a suitable solution?

cariaso added a commit to cariaso/txtorcon that referenced this issue Jul 29, 2021
meejah (Owner) commented Jul 29, 2021

Hmmm!

Very interesting ... from the error I assume this is an error while trying to connect to a unix-based control socket. By "networked storage" you mean NFS or ...? (I have no idea how unix-sockets might work on such storage ;) )

cariaso (Author) commented Aug 2, 2021

> I assume this is an error while trying to connect to a unix-based control socket.

yes

> By "networked storage" you mean NFS or ...?

An AWS EBS gp3 volume is mounted as the storage for a Docker container.

It may sound complicated, but it works surprisingly well. Across many applications, this is the first issue I've encountered.

My fork at https://github.com/cariaso/txtorcon/commits/main has been sufficient for my needs.

meejah (Owner) commented Aug 10, 2021

Obviously a delay isn't ever going to be the right thing (and, for Twisted code, time.sleep(...) is definitely not the right way to delay).

So, I think what's really going on here is this: when Tor is launched, it takes some amount of time until we can connect to the control socket. Currently, that is determined by watching Tor's logs (e.g. https://github.com/meejah/txtorcon/blob/main/txtorcon/controller.py#L1280 looks for the "Opening control ..." line).

I suspect what's happening is that on your "slow" disk, Tor is writing the control socket, printing that line to stdout, but the actual file hasn't been sync'd (or whatever) yet? So then immediately after that, txtorcon tries to connect, but there's no socket.

So I can think of two "more proper" fixes:

  • wait until the file actually exists (only if it's a unix-socket connection, etc.)
  • rework the Controller code in TorProcessProtocol a little so that, instead of parsing stdout lines and really hoping 🤞 that it's actually listening, we just attempt connects with a slight delay between them before "failing", e.g. connect up to 20 times, waiting 0.1s between each attempt.
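
The check behind the first option could look roughly like the sketch below. `unix_socket_exists` is a hypothetical helper, not an existing txtorcon function, and it only reports whether a socket file is present at the path. A bound socket that is not yet listening would still refuse connections, so this check alone may not close the race completely.

```python
import os
import stat


def unix_socket_exists(path):
    """Return True if `path` exists and is a unix-domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing file, unreadable parent directory, etc.
        return False
    return stat.S_ISSOCK(mode)
```

In Twisted code, a check like this would be polled from the reactor (for example with `twisted.internet.task.LoopingCall`) rather than in a blocking loop.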

The latter will make things more robust, but also might fail slightly more slowly in some cases (oh well). I like that the latter thing doesn't have any special-case code (e.g. "is it a unix-socket?", parse file, etc, etc).
