@last-genius
Collaborator

Since _SHELL_MIN_FD=100, osh fails to build some Alpine packages (#2335):

$ osh -c 'ulimit -n 64; echo hi >out'
[ -c flag ]:1: I/O error applying redirect: Invalid argument

(This is because osh attempts to save stdout's fd before redirection to a descriptor >=100, which ulimit -n 64 doesn't allow)
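
The failure mode can be reproduced outside of osh. The following is a minimal Python sketch, not osh code: the helper name `dup_above_limit` is made up for illustration, and it lowers only the soft limit (unlike a plain `ulimit -n 64`) so that the limit can be restored afterwards.

```python
import errno
import fcntl
import os
import resource


def dup_above_limit(min_fd=100, soft_limit=64):
    """Try to duplicate a descriptor to a number >= min_fd under a low fd limit.

    Returns the errno of the failure, or None if the duplication succeeded.
    (Hypothetical helper, for illustration only.)
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    src = os.open(os.devnull, os.O_RDONLY)
    try:
        # Roughly `ulimit -n 64`, but only the soft limit is lowered.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
        try:
            saved = fcntl.fcntl(src, fcntl.F_DUPFD, min_fd)
        except OSError as e:
            return e.errno
        os.close(saved)
        return None
    finally:
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
        os.close(src)


if __name__ == "__main__":
    # Asking for fd >= 100 while only 64 descriptors are allowed fails.
    print(errno.errorcode.get(dup_above_limit(100, 64)))
```

On Linux the failing call reports EINVAL, which matches the "Invalid argument" in the osh error message.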

Currently we can't drop _SHELL_MIN_FD lower, because scripts like this crash osh (users are allowed to use fds<=99):

$ cat test.osh
exec 25>out
echo hello>&25
cat out
$ bin/osh test.osh
hello
oils I/O error (main): Bad file descriptor

This happens due to osh mishandling "internal" file descriptors, in this case the script file itself, as can be seen from the following strace log:

openat(AT_FDCWD, "test.osh", O_RDONLY)  = 3 <<<<
fcntl(3, F_DUPFD, 25)                   = 25 <<<<
close(3)                                = 0
openat(AT_FDCWD, "out", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(25, F_DUPFD, 25)                  = 26 <<<<<
close(25)                               = 0
dup2(3, 25)                             = 25
close(3)                                = 0
close(26)                               = 0 <<<< it's closed immediately after being saved
fcntl(1, F_DUPFD, 25)                   = 26
close(1)                                = 0
dup2(25, 1)                             = 1
dup2(26, 1)                             = 1
close(26)                               = 0
openat(AT_FDCWD, "out", O_RDONLY)       = 3
close(3)                                = 0
hello
read(25, 0x55bfa557e240, 4096)          = -1 EBADF (Bad file descriptor) <<< still 25
oils I/O error (main): Bad file descriptor
close(25)                               = 0
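
For reference, the same sequence of calls can be replayed in a few lines of Python. This is only a sketch of the trace above, not osh's implementation; `reproduce_collision` is a hypothetical name, and the descriptor may land on a number other than 25 if 25 is already taken in the calling process.

```python
import errno
import fcntl
import os
import tempfile


def reproduce_collision(min_fd=25):
    """Replay the fd shuffle from the trace; return the errno of the final read."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "test.osh")
        with open(script, "w") as f:
            f.write("exec 25>out\n")

        # The shell opens the script and moves it out of the low fd range.
        low = os.open(script, os.O_RDONLY)
        script_fd = fcntl.fcntl(low, fcntl.F_DUPFD, min_fd)
        os.close(low)

        # The user redirect targets the very number the script landed on.
        out = os.open(os.path.join(tmp, "out"),
                      os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
        saved = fcntl.fcntl(script_fd, fcntl.F_DUPFD, min_fd)  # copy of the script
        os.close(script_fd)
        os.dup2(out, script_fd)  # script_fd now refers to "out", write-only
        os.close(out)
        os.close(saved)          # the last handle on the script is dropped

        try:
            os.read(script_fd, 4096)  # the line reader still uses the old number
            return None
        except OSError as e:
            return e.errno
        finally:
            os.close(script_fd)
```

The read fails with EBADF because the old number now refers to a file opened write-only, which is the same error osh reports.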

So, add some bookkeeping to FdState - it now tracks whether a file descriptor is persistent and invokes a callback when such persistent file descriptors are moved by osh.

There is currently only one usage of a persistent fd with a callback - FileLineReader for the script file, since that's what triggers the bug above. Additional users might need to be added as more bugs are discovered.

Drop _SHELL_MIN_FD to 25 for now.

With these changes, the above test.osh script works as intended, with internal fds being shifted away from user-requested fds. This potentially allows us to drop _SHELL_MIN_FD further in the future and untangle it from the limit on fds users can use in redirects.
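
To make the idea concrete, here is a minimal Python sketch of that kind of bookkeeping. It is not the code in this PR: `PersistentFds`, `RawLineReader`, `claim_for_user` and `replace_fd` are hypothetical names, and the reader is deliberately unbuffered so that no position has to be restored after a move.

```python
import fcntl
import os
import tempfile


class PersistentFds(object):
    """Hypothetical registry of internal fds that must survive user redirects."""

    def __init__(self, min_fd=10):
        self.min_fd = min_fd
        self.callbacks = {}  # fd -> function called with the new fd number

    def register(self, fd, on_move):
        self.callbacks[fd] = on_move

    def claim_for_user(self, user_fd):
        """Call before a redirect overwrites user_fd.

        Returns the new number of the internal fd, or -1 if user_fd was free.
        """
        on_move = self.callbacks.pop(user_fd, None)
        if on_move is None:
            return -1
        # user_fd is still open here, so F_DUPFD returns a different number.
        new_fd = fcntl.fcntl(user_fd, fcntl.F_DUPFD, self.min_fd)
        os.close(user_fd)
        self.callbacks[new_fd] = on_move
        on_move(new_fd)  # let the owner switch to the new descriptor
        return new_fd


class RawLineReader(object):
    """Unbuffered line reader: one read() call per byte."""

    def __init__(self, fd):
        self.fd = fd

    def replace_fd(self, new_fd):
        self.fd = new_fd

    def readline(self):
        chunks = []
        while True:
            b = os.read(self.fd, 1)
            if not b:
                break
            chunks.append(b)
            if b == b"\n":
                break
        return b"".join(chunks)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.osh")
        with open(path, "wb") as f:
            f.write(b"exec 25>out\necho hello>&25\n")
        low = os.open(path, os.O_RDONLY)
        fd = fcntl.fcntl(low, fcntl.F_DUPFD, 25)
        os.close(low)

        table = PersistentFds(min_fd=25)
        reader = RawLineReader(fd)
        table.register(fd, reader.replace_fd)

        first = reader.readline()
        moved_to = table.claim_for_user(fd)  # the script asks for the reader's fd
        second = reader.readline()           # keeps reading from the new number
        os.close(reader.fd)
        return fd, moved_to, first, second
```

The duplicated descriptor shares its file offset with the original, so the unbuffered reader simply continues on the next line after being moved.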

@last-genius last-genius force-pushed the asv/better-fd-shifting branch 5 times, most recently from 013a8cc to 445ed77 Compare October 4, 2025 16:53
@last-genius
Collaborator Author

Marked another internal file as persistent, fixing this issue:

$ cat source.osh
exec 25>out
echo hello>&25
cat out
$ bin/osh -c '. ./source.osh'
hello
[ -c flag ]:1: . builtin I/O error: Bad file descriptor

@last-genius last-genius force-pushed the asv/better-fd-shifting branch 10 times, most recently from 02102af to f5789b9 Compare October 6, 2025 20:11
@andychu
Contributor

andychu commented Oct 7, 2025

Thanks for working on this! My first comment is that it's great the test passes, and nothing else fails!

I'm trying to think of any other counterexamples - Let me look at the Open() calls

I think _SHELL_MIN_FD would no longer make as much sense as a name, but let's see

@andychu
Contributor

andychu commented Oct 7, 2025

OK yeah there are some other Open() calls

$ bin/osh --eval source.osh 
hello
oils I/O error (main): Bad file descriptor

It might be that EVERY fd with Open() is persistent, in which case we don't need the param

$ fgrep -n 'Open(' */*.py
builtin/func_hay.py:50:            f, _ = self.fd_state.Open(path)
builtin/meta_oils.py:199:            f, fd = self.fd_state.Open(fs_path, persistent=True)
core/main_loop.py:462:        f, _ = fd_state.Open(fs_path)
core/process.py:213:    def Open(self, path, persistent=False):
core/process.py:225:        f, fd = self._Open(path, 'r', fd_mode, persistent)
core/process.py:234:        f, _ = self._Open(path, 'w', fd_mode, persistent)
core/process.py:239:    def _Open(self, path, c_mode, fd_mode, persistent):
core/process.py:753:                f, _ = self.fd_state.Open(argv0_path)
core/process_test.py:303:    def testOpen(self):
core/shell.py:154:        f, _ = fd_state.Open(rc_path)
core/shell.py:1070:                f, fd = fd_state.Open(script_name, persistent=True)

In that case we could get rid of the param, and the values for self.persistent are always True


I want to look at some strace now too

And try other things other than source.osh


Sorry you had the hiccup with the tell() stuff ... It's a bit surprising that had to be added

I think it may have been due to the pre-existing smell I mentioned -- the bad cast (which only accidentally works)


def ReplaceFd(self, fd):
    # type: (mylib.LineReader) -> None
    tell = self.f.tell()
Contributor

Hm I actually don't see why this works? Why do we need tell() and seek() ?

Collaborator Author

Essentially - we want to restore the position we've read the previous file up to.

The position that F_DUPFD preserves doesn't cover this use case. It's a bit confusing: what gets preserved automatically is the kernel's file offset, which reflects how far Python has read ahead into its internal buffer, not how much we have actually consumed from the file. Without restoring our position explicitly, we lose most of the file, because as far as the OS is concerned we've already read it to the end:

>>> import posix
>>> import fcntl
>>> fd = posix.open('test.osh', posix.O_RDONLY, 0o666)
>>> f = posix.fdopen(fd)
>>> f.readline()
'# first line\n'
>>> f.readline()
'# second line\n'
>>> f.tell()
27
>>> new_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, 25)
>>> new_fd
25
>>> g = posix.fdopen(new_fd)
>>> g.tell()
3229
>>> g.readline()
'' <<< (should be '# third line', instead it's EOF)

(it's a bit different in the C++ implementation, but it's the same principle)
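
The same effect can be shown in a self-contained way. This sketch uses Python 3's `io` buffering rather than the Python 2 `posix.fdopen()` from the session above, so the details are an assumption on my part, but the principle is the same; `buffered_dup_demo` is a made-up name.

```python
import fcntl
import os
import tempfile


def buffered_dup_demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.osh")
        with open(path, "wb") as out:
            out.write(b"# first line\n# second line\n# third line\n")
        size = os.path.getsize(path)

        f = open(path, "rb")  # buffered reader
        f.readline()
        logical = f.tell()    # bytes the caller has consumed
        # Bytes the buffer has already pulled in from the kernel:
        kernel = os.lseek(f.fileno(), 0, os.SEEK_CUR)

        new_fd = fcntl.fcntl(f.fileno(), fcntl.F_DUPFD, 25)
        g = os.fdopen(new_fd, "rb")
        without_seek = g.readline()  # starts at the shared kernel offset
        g.seek(logical)              # restore the caller-visible position
        with_seek = g.readline()

        f.close()
        g.close()
        return size, logical, kernel, without_seek, with_seek
```

For a file this small the first buffered read consumes everything, so the kernel offset equals the file size while `tell()` still reports the end of the first line. A new reader on the duplicated fd sees EOF until it seeks back.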

Contributor

Ah OK, that is interesting, but based on experience, I'm a bit "scared" of doing this

Actually one thing that became apparent to me is that shells PREDATE libc, and they don't use libc I/O -- they use POSIX I/O

Oils is unusual in that it uses libc I/O (which has buffering), and I might want to change that

that is

  • libc: FILE* fwrite, fread, fseek, ftell
  • posix: read, write, lseek, fcntl

i.e. it might be a bit bad that we are mixing fcntl and fread/fwrite -- most shells don't do that

There can be buffering bugs
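
A small Python sketch of the difference, under the assumption that an unbuffered `io.FileIO` is a fair stand-in for plain POSIX read(): with no user-space buffer, the kernel offset always equals what the caller has consumed, so a duplicated descriptor can continue without any seek. The name `unbuffered_dup_demo` is made up.

```python
import os
import tempfile


def unbuffered_dup_demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.osh")
        with open(path, "wb") as out:
            out.write(b"# first line\n# second line\n")

        f = open(path, "rb", buffering=0)  # io.FileIO, no user-space buffer
        first = f.readline()               # reads one byte per read() call
        kernel = os.lseek(f.fileno(), 0, os.SEEK_CUR)

        new_fd = os.dup(f.fileno())        # shares the same file offset
        second = os.read(new_fd, len(b"# second line\n"))

        f.close()
        os.close(new_fd)
        return first, kernel, second
```

The cost is one system call per byte when reading lines, which is the usual trade-off against buffered I/O.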


That said, this is a great change to explore the problem

I'm glad you were able to fix the test so quickly, and also understand mycpp so quickly! (even with some unfortunate warts)


I think we should look into the busybox ash algorithm, as mentioned on Zulip

A funny thing is that it's basically the dash source concatenated into a big 13K line file

https://github.com/brgl/busybox/blob/master/shell/ash.c

(btw one thing I did once is just upload this whole file into Claude, and then ask it questions ... it does hallucinate and get confused of course, but I think it was net helpful at least a few times ... I have to be real skeptical)

@andychu
Contributor

andychu commented Oct 7, 2025

Hm to compare this PR against other shells, how about we change _SHELL_MIN_FD to 10:

$ bash -c 'ulimit -n 11; echo hi >out; cat out'
hi

$ dash -c 'ulimit -n 11; echo hi >out; cat out'
hi

$ bin/osh -c 'ulimit -n 11; echo hi >out; cat out'
[ -c flag ]:1: I/O error applying redirect: Invalid argument

Right now it still fails


And then I would like to compare the strace with what busybox ash does

I'm not sure how they compare right now, nor exactly what algorithm busybox uses !

@last-genius last-genius force-pushed the asv/better-fd-shifting branch from f5789b9 to 278d550 Compare October 7, 2025 08:52
@last-genius
Collaborator Author

> Hm to compare this PR against other shells, how about we change _SHELL_MIN_FD to 10:
>
> $ bash -c 'ulimit -n 11; echo hi >out; cat out'
> hi

I've dropped it from 25 to 10.

Bash actually handles ulimit < 10 as well:

$ strace -ff -e open,fcntl,openat,dup2,dup,close,read,lseek bash -c 'ulimit -n 9; echo hi > out; cat out'
openat(AT_FDCWD, "out", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_GETFD)                       = 0
fcntl(1, F_DUPFD, 10)                   = -1 EINVAL (Invalid argument)
fcntl(1, F_DUPFD, 10)                   = -1 EINVAL (Invalid argument)
fcntl(1, F_DUPFD, 0)                    = 4 <<< gives up DUP-ing above 10, and DUP-s anywhere
fcntl(1, F_GETFD)                       = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
dup2(3, 1)                              = 1

source:

  new_fd = fcntl (fd, F_DUPFD, (fdbase < SHELL_FD_BASE) ? SHELL_FD_BASE : fdbase+1);
  if (new_fd < 0)
    new_fd = fcntl (fd, F_DUPFD, SHELL_FD_BASE);
  if (new_fd < 0)
    {
      new_fd = fcntl (fd, F_DUPFD, 0);
      savefd_flag = 1;
    }
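
A rough Python rendering of that fallback, to show what it would look like on our side. This is a sketch, not bash's or osh's code: `save_fd` and `save_fd_under_limit` are hypothetical names, the value 10 for `SHELL_FD_BASE` is inferred from the strace above, and bash's `savefd_flag` bookkeeping is left out.

```python
import fcntl
import os
import resource

SHELL_FD_BASE = 10  # name from the bash excerpt; value inferred from the strace


def save_fd(fd, fdbase=0, shell_fd_base=SHELL_FD_BASE):
    """Duplicate fd so it can be restored later, preferring high numbers."""
    attempts = [
        shell_fd_base if fdbase < shell_fd_base else fdbase + 1,
        shell_fd_base,
        0,  # last resort: accept any free descriptor
    ]
    last_error = None
    for minimum in attempts:
        try:
            return fcntl.fcntl(fd, fcntl.F_DUPFD, minimum)
        except OSError as e:
            last_error = e
    raise last_error


def save_fd_under_limit(soft_limit=64, shell_fd_base=100):
    """Run save_fd() with a lowered soft fd limit; returns (source fd, saved fd)."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    src = os.open(os.devnull, os.O_RDONLY)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    try:
        saved = save_fd(src, shell_fd_base=shell_fd_base)
    finally:
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return src, saved
```

With a base above the limit, the first two attempts fail and the last one returns whatever low descriptor is free, which is the behaviour visible in the bash trace.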

@last-genius last-genius force-pushed the asv/better-fd-shifting branch from 278d550 to e3335e2 Compare October 7, 2025 13:09
@last-genius
Collaborator Author

> OK yeah there are some other Open() calls
>
> $ bin/osh --eval source.osh
> hello
> oils I/O error (main): Bad file descriptor
>
> It might be that EVERY fd with Open() is persistent, in which case we don't need the param
>
> In that case we could get rid of the param, and the values for self.persistent are always True

I've gone through all the uses of fd_state.Open(). Only one of them doesn't need to be persistent, so I've kept the parameter and changed the rest in separate commits, noting what each of them fixes. I'd rather we keep persistency as an explicit choice, since persistent users also need to provide a callback.

@last-genius
Collaborator Author

Added spec tests for the issues covered in this PR (--eval, source, rcfile)

@andychu
Contributor

andychu commented Oct 9, 2025

I wrote a script to compare the strace of different shells. Here is dash versus OSH on this branch for descriptor 8

left is dash, right is OSH

$ demo/fd-strace.sh side-by-side _tmp/fd-strace/8-{dash,osh}.txt
openat(AT_FDCWD, "demo/fd-number.sh openat(AT_FDCWD, "demo/fd-number.sh
fcntl(3, F_DUPFD, 10)               fcntl(3, F_DUPFD, 10)
close(3)                            fcntl(10, F_SETFD, FD_CLOEXEC)
fcntl(10, F_SETFD, FD_CLOEXEC)      close(3)
openat(AT_FDCWD, "_tmp/hello", O_WR fcntl(10, F_GETFL)
fcntl(8, F_DUPFD, 10)               openat(AT_FDCWD, "_tmp/hello", O_WR
dup2(3, 8)                          fcntl(3, F_GETFD)
close(3)                            fcntl(8, F_DUPFD, 10)
fcntl(1, F_DUPFD, 10)               dup2(3, 8)
close(1)                            close(3)
fcntl(11, F_SETFD, FD_CLOEXEC)      fcntl(8, F_GETFD)
dup2(8, 1)                          fcntl(1, F_DUPFD, 10)
dup2(11, 1)                         fcntl(11, F_SETFD, FD_CLOEXEC)
close(11)                           close(1)
using fd 8                          dup2(8, 1)
--- SIGCHLD {si_signo=SIGCHLD, si_c dup2(11, 1)
+++ exited with 0 +++               close(11)
                                    openat(AT_FDCWD, "_tmp/hello", O_RD
                                    close(3)
                                    using fd 8
                                    +++ exited with 0 +++

So it looks like we have 2 openat() calls now, which I don't think we should have

@andychu
Contributor

andychu commented Oct 9, 2025

and then here is busybox ash versus OSH on this branch for descriptor 10

$ demo/fd-strace.sh side-by-side _tmp/fd-strace/10-{ash,osh}.txt
openat(AT_FDCWD, "demo/fd-number.sh openat(AT_FDCWD, "demo/fd-number.sh
fcntl(3, F_DUPFD_CLOEXEC, 10)       fcntl(3, F_DUPFD, 10)
close(3)                            fcntl(10, F_SETFD, FD_CLOEXEC)
openat(AT_FDCWD, "_tmp/hello", O_WR close(3)
fcntl(10, F_DUPFD_CLOEXEC, 10)      fcntl(10, F_GETFL)
close(10)                           openat(AT_FDCWD, "_tmp/hello", O_WR
dup2(3, 10)                         fcntl(3, F_GETFD)
close(3)                            fcntl(10, F_DUPFD, 10)
fcntl(1, F_DUPFD_CLOEXEC, 11)       fcntl(11, F_SETFD, FD_CLOEXEC)
dup2(10, 1)                         fcntl(11, F_GETFL)
dup2(12, 1)                         lseek(10, 0, SEEK_CUR)
close(12)                           lseek(11, 0, SEEK_SET)
using fd 10                         close(10)
--- SIGCHLD {si_signo=SIGCHLD, si_c dup2(3, 10)
+++ exited with 0 +++               close(3)
                                    fcntl(10, F_GETFD)
                                    fcntl(1, F_DUPFD, 10)
                                    fcntl(12, F_SETFD, FD_CLOEXEC)
                                    close(1)
                                    dup2(10, 1)
                                    dup2(12, 1)
                                    close(12)
                                    openat(AT_FDCWD, "_tmp/hello", O_RD
                                    close(3)
                                    using fd 10
                                    lseek(10, -42, SEEK_CUR)

Now you can see the lseek() too, which busybox doesn't have


Let me look at the trap PR first, and then let's circle back to this one

I think that we have a lot more info than we did, and we are much closer!

But we should do something more like busybox

(a minor difference between dash/OSH and busybox ash is that busybox uses F_DUPFD_CLOEXEC -- not sure if that is a Linux thing or not)

@andychu
Contributor

andychu commented Oct 9, 2025

Aside: I noticed that the flag busybox uses is in POSIX, so OSH could use it too

https://linux.die.net/man/2/fcntl

F_DUPFD_CLOEXEC is specified in POSIX.1-2008. (To get this definition, define _POSIX_C_SOURCE with the value 200809L or greater, or _XOPEN_SOURCE with the value 700 or greater.)
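
For comparison, here are the two variants side by side as a Python sketch. The helper names are made up; `fcntl.F_DUPFD_CLOEXEC` is only defined by Python's `fcntl` module on platforms that provide the flag.

```python
import fcntl
import os


def dup_cloexec(fd, min_fd=10):
    """Duplicate fd to a number >= min_fd with close-on-exec set in one call."""
    return fcntl.fcntl(fd, fcntl.F_DUPFD_CLOEXEC, min_fd)


def dup_then_cloexec(fd, min_fd=10):
    """Two-step variant, as in the dash and OSH traces."""
    new_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, min_fd)
    fcntl.fcntl(new_fd, fcntl.F_SETFD, fcntl.FD_CLOEXEC)
    return new_fd


def has_cloexec(fd):
    """True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

Both end up with the flag set. The single call saves a system call per saved descriptor and leaves no window in which the new descriptor exists without the flag.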

@andychu
Contributor

andychu commented Oct 9, 2025

FWIW I submitted the script I used to generate those traces: f10f21e


And of course we can figure out what busybox is doing together -- that is still a bit of a mystery to me

I tried to get it to "break" by messing with its FD, but it doesn't

@andychu
Contributor

andychu commented Oct 12, 2025

BTW I realized this extra part of the OSH trace is not related to this PR

                                    close(12)
                                    openat(AT_FDCWD, "_tmp/hello", O_RD
                                    close(3)

This is related to the fact that cat _tmp/hello is internal in OSH

I was thrown off by that a bit, and changed the demo/fd-number.sh script to avoid it

Since _SHELL_MIN_FD=100, osh fails to build some Alpine packages:

    $ osh -c 'ulimit -n 64; echo hi >out'
    [ -c flag ]:1: I/O error applying redirect: Invalid argument

(This is because osh attempts to save stdout's fd before redirection to
a descriptor >=100, which 'ulimit -n 64' doesn't allow)

Currently we can't drop _SHELL_MIN_FD lower, because scripts like this
crash osh (users are allowed to use fds<=99):

    $ cat test.osh
    exec 10>out
    echo hello>&10
    cat out
    $ bin/osh test.osh
    hello
    oils I/O error (main): Bad file descriptor

This happens due to osh mishandling "internal" file descriptors, in this
case the script file itself, as can be seen from the following strace
log:

    openat(AT_FDCWD, "test.osh", O_RDONLY)  = 3 <<<<
    fcntl(3, F_DUPFD, 10)                   = 10 <<<<
    close(3)                                = 0
    openat(AT_FDCWD, "out", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
    fcntl(10, F_DUPFD, 10)                  = 11 <<<<<
    close(10)                               = 0
    dup2(3, 10)                             = 10
    close(3)                                = 0
    close(11)                               = 0 <<<< it's closed immediately after being saved
    fcntl(1, F_DUPFD, 10)                   = 11
    close(1)                                = 0
    dup2(10, 1)                             = 1
    dup2(11, 1)                             = 1
    close(11)                               = 0
    openat(AT_FDCWD, "out", O_RDONLY)       = 3
    close(3)                                = 0
    hello
    read(10, 0x55bfa557e240, 4096)          = -1 EBADF (Bad file descriptor) <<< still 10
    oils I/O error (main): Bad file descriptor
    close(10)                               = 0

So, add some bookkeeping to FdState - it now tracks whether a file
descriptor is persistent and invokes a callback when such persistent
file descriptors are moved by osh.

Add a callback to FileLineReader - it seek()-s in the new file to
restore the position in the current file and changes file descriptors.

There is currently only one usage of a persistent fd with a callback -
FileLineReader for the script file, since that's what triggers the bug
above. Additional users might need to be changed as more bugs are
discovered - these are currently ignoring the second return value from
fd_state.Open()

Drop _SHELL_MIN_FD to 10, justify the reasoning in a comment.

With these changes, the above test.osh script works as intended, with
internal fds being shifted away from user-requested fds. This
potentially allows us to drop _SHELL_MIN_FD further in the future and
untangle it from the limit on fds users can use in redirects.

Signed-off-by: Andrii Sultanov <[email protected]>
This fixes the following issue:

    $ cat source.osh
    exec 25>out
    echo hello>&25
    cat out
    $ bin/osh -c '. ./source.osh'
    hello
    [ -c flag ]:1: . builtin I/O error: Bad file descriptor

Signed-off-by: Andrii Sultanov <[email protected]>
This fixes the following issue:

    $ cat source.osh
    exec 10>out
    echo hello>&10
    cat out
    $ bin/osh --eval source.osh
    hello
    oils I/O error (main): Bad file descriptor

Signed-off-by: Andrii Sultanov <[email protected]>
This fixes the following issue:

    $ cat ~/.config/oils/oshrc
    exec 10>out
    echo hello>&10
    cat out
    $ bin/osh
    hello
    oils I/O error (main): Bad file descriptor

Signed-off-by: Andrii Sultanov <[email protected]>
@last-genius last-genius force-pushed the asv/better-fd-shifting branch from 91bc0f0 to 56df453 Compare October 13, 2025 15:01
Since this PR dropped _SHELL_MIN_FD below 64, the ulimit test started
passing.

Signed-off-by: Andrii Sultanov <[email protected]>
@last-genius last-genius force-pushed the asv/better-fd-shifting branch from 56df453 to 81b1dac Compare October 13, 2025 15:15
@last-genius
Collaborator Author

I've rebased on top of master to fix the merge conflict. Are there any remaining blockers here?
