-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make NativeUnixDirectory pure java now that direct IO is possible [LUCENE-8982] #10025
Comments
Uwe Schindler (@uschindler) (migrated from JIRA) We should rename the directory to something more generic. It's also supported on Windows! In addition, the hardcoded alignment of 512 should be determined on directory creation (reading it from FileStore.getBlockSize(). Code should also create FileChannel directly not via forbidden Java.io.File classes. That was just a workaround to bring the FileDescriptor into the game. |
Uwe Schindler (@uschindler) (migrated from JIRA) If you have a direct buffer, you can now also get an aligned slice. |
Uwe Schindler (@uschindler) (migrated from JIRA) And it's Java 10 where this was introduced. So it's fine for master. |
Uwe Schindler (@uschindler) (migrated from JIRA) Here is the full JDK patch including examples in test (how to align direct buffers and how to open channels): http://hg.openjdk.java.net/jdk10/master/rev/d72d7d55c765 |
Michael McCandless (@mikemccand) (migrated from JIRA)
+1,
+1, I think we can do something similar to
+1 |
Zach Chen (@zacharymorn) (migrated from JIRA) hi @mikemccand, I’m new to contributing to Lucene and have been searching around for small scope issues to take on. After reading the description and discussion of this issue, this seems to be a straightforward one (not sure about the new benchmarking part though). Is it ok that I take this one and open a PR ? |
Michael McCandless (@mikemccand) (migrated from JIRA) Yes please! Feel free to tackle this! I can help w/ benchmarking. |
Zach Chen (@zacharymorn) (migrated from JIRA) Thanks Michael! I've opened an initial PR with the proposed changes, but also have a few follow-up questions:
The PR also needs to include some more updates to comment to reflect the new way of doing direct IO. I can work on that once the changes are very much ready.
|
Mike Drob (@madrob) (migrated from JIRA) Sort of related - the class comments currently say to do |
Zach Chen (@zacharymorn) (migrated from JIRA) @madrob It does seems ant build-native-unix is no longer available, and running ./gradlew helpAnt doesn't help either. I'm not super familiar with cpp development, but to facilitate this I've done some research and after some trial and errors, come up with a separate PR to build the cpp code (as decoupling the two changes might help benchmarking) apache/lucene-solr#2068 . Could you please take a look and let me know if this is a good direction to take? Please note that in the PR, I created a new native module to host NativefPosixUtil.cpp, as there's compatibility issue between cpp-library and java-library Gradle plugins. For details, please see gradle/gradle-native#352 (comment) |
Dawid Weiss (@dweiss) (migrated from JIRA) Hi Zach. I don't think we've ported this particular build fragment from ant - thank you for contributing it. It can be a separate module but I think it shouldn't be forced upon everyone on default convention tasks (assemble, check) - I didn't check if it's currently the case. Native builds require extra setup and tools that some platforms may lack (Windows, Mac). If it's the case that these tasks run by default is then we'll need to add either an explicit property (-Pbuild.native=true?) or a dedicated task (native) and make the build conditional on whether it is explicitly defined. I can take a look if you have a problem with this, let me know. |
Mike Drob (@madrob) (migrated from JIRA) The new task appeared to build fine on my Mac, although I didn't test the resulting lib. |
Zach Chen (@zacharymorn) (migrated from JIRA) @dweiss Thanks for the suggestion and it makes sense! I've pushed an update to condition the native build / include on "-Pbuild.native=true" , could you please take a look and and let me know if it looks good?
|
ASF subversion and git services (migrated from JIRA) Commit ebc87a8 in lucene-solr's branch refs/heads/master from zacharymorn LUCENE-8982: Separate out native code to another module to allow cpp build with gradle (#2068)
|
ASF subversion and git services (migrated from JIRA) Commit ebc87a8 in lucene-solr's branch refs/heads/master from zacharymorn LUCENE-8982: Separate out native code to another module to allow cpp build with gradle (#2068)
|
Dawid Weiss (@dweiss) (migrated from JIRA) I've committed in the extracted native library part. I still think it'd be good to see if we can get the same performance with just plain java (and new direct flags). |
Dawid Weiss (@dweiss) (migrated from JIRA) Yeah... like I suspected - CI does complain about missing toolchains. * What went wrong:
Execution failed for task ':lucene:misc:native:compileDebugLinuxCpp'.
> No tool chain is available to build C++ for host operating system 'Linux' architecture 'x86-64':
- Tool chain 'visualCpp' (Visual Studio):
- Visual Studio is not available on this operating system.
- Tool chain 'gcc' (GNU GCC):
- Could not find C++ compiler 'g++' in system path.
- Tool chain 'clang' (Clang):
- Could not find C++ compiler 'clang++' in system path. I think we'll switch native builds to be optionally enabled (rather than disabled)? |
ASF subversion and git services (migrated from JIRA) Commit fd3ffd0 in lucene-solr's branch refs/heads/master from Dawid Weiss LUCENE-8982: make native builds disabled by default (CI complains). |
Uwe Schindler (@uschindler) (migrated from JIRA) Hi, My biggest problem is still: How to handle releases of binary artifacts? Do we really want to make this dependent of the release manager's local system? |
Dawid Weiss (@dweiss) (migrated from JIRA) gradle will pick up the toolchain suitable for the platform if it finds any - the CI just doesn't have it.
This is what I actually raised in the issue's comment too. I don't think we should include binaries in releases. We should strive to make it java-only (and if somebody really wants to, they can compile native modules locally, for a given platform). |
Uwe Schindler (@uschindler) (migrated from JIRA)
That's fine. My idea was: Can't we build the native dependencies, if the toolchain is there, but don't fail if not an just print a warning? |
Uwe Schindler (@uschindler) (migrated from JIRA) In short: change the default to "true" if toolchain is there. |
Dawid Weiss (@dweiss) (migrated from JIRA) You can set -Pbuild.native=true manually on those VMs you know the tools are available... I don't think duplicating whatever logic gradle uses to detect those toolschains automatically is worth the effort, to be honest. The logic is probably in gradle's sources somewhere. I don't know if it can be done easier than by copying from their code. |
Zach Chen (@zacharymorn) (migrated from JIRA) Sorry for the late response, just got off work and saw this. From the discussion it seems the assumption / reality here is that cpp toolchain may or may not be available in the VMs. However, since Lucene does have native code and scheduled build can discover any change that breaks the native-java integration early on (there was actually one commit before this that broke it), should the CI builds in general require cpp toolchain to be there in the VMs (and add them if they are missing) to execute the compilation and tests, but in gradle still have -Pbuild.native=false as default to not break builds for others in development, and have a few CI VMs with cpp toolchain intentionally left out to test for compatibility? |
Uwe Schindler (@uschindler) (migrated from JIRA) I am now merging the PR. Thanks Zach! |
ASF subversion and git services (migrated from JIRA) Commit a7747b6 in lucene-solr's branch refs/heads/master from zacharymorn LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory (#2052) LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory |
ASF subversion and git services (migrated from JIRA) Commit a7747b6 in lucene-solr's branch refs/heads/master from zacharymorn LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory (#2052) LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory |
Uwe Schindler (@uschindler) (migrated from JIRA) Thanks Zach! |
ASF subversion and git services (migrated from JIRA) Commit 4b508ae in lucene-solr's branch refs/heads/master from Uwe Schindler LUCENE-8982: Add a note to MIGRATE.md |
Zach Chen (@zacharymorn) (migrated from JIRA) No problem! Thanks Uwe, Michael and Dawid for all the help and guidance there! |
Adrien Grand (@jpountz) (migrated from JIRA) Closing after the 9.0.0 release |
Just curious.. are there any benchmarking results that can be shared with this enabled? |
Oh hello, sorry, no I never managed to do any benchmarking here. Did you? I'd still be curious about the results ... direct IO is an interesting low-level optimization (and which Linus notably is not a fan of! Not sure if his thinking has changed...) and it's not clear (to me) where it's actually helpful in Lucene. The original theory / use-case for this directory was to ensure merging segments would write the newly merged segment straight to the storage device, bypassing the OS's write cache, and leaving more free RAM to hold hot pages for searching, reducing page faults for searching while heavy merging is going on. At Amazon Product Search we use Lucene with near-real-time segment replication to efficiently distribute index updates to many replicas (to scale to high QPS) for each shard. Long ago, we tried fixing that segment replication copy to use direct IO, on the same theory that copying in many bytes for new segments might evict hot pages used for searching. But what we found is that direct IO caused even more page faults once the replica lit (switched over to them for searching) the new segments as the OS now had to page in the very cold bytes for the newly copied segments on the synchronous query hot path. We are now wondering if some sort of bandwidth cap/budget on merge policy or scheduler might be a better approach. This is just an anecdote from our production experience, not a full/clean A/B benchmark, but at least it's one data point :) |
NativeUnixDirectory
is aDirectory
implementation that uses direct IO to write newly merged segments. Direct IO bypasses the kernel's buffer cache and write cache, making merge writes "invisible" to the kernel, though the reads for merging the N segments are still going through the kernel.But today,
NativeUnixDirectory
uses a small JNI wrapper to access theO_DIRECT
flag toopen
... since JDK9 we can now pass that flag in pure java code, so we should now fixNativeUnixDirectory
to not use JNI anymore.We should also run some more realistic benchmarks seeing if this option really helps nodes that are doing concurrent indexing (merging) and searching.
Migrated from LUCENE-8982 by Michael McCandless (@mikemccand), resolved Jan 17 2021
Pull requests: apache/lucene-solr#2068
The text was updated successfully, but these errors were encountered: