
ISSUE 729 - Streaming Imports from ODA #739

Merged
merged 61 commits into from
Mar 18, 2025

Conversation

@sebjf (Contributor) commented Feb 12, 2025

This fixes #729 - streaming imports from ODA.

Overview

The main goal of this PR is to change how files are imported, using a pattern where nodes are offloaded to the database as soon as possible to keep the memory consumption down.

The PR makes a number of changes to achieve this for ODA, and prepares the other importers for the same switch. In addition, a number of performance optimisations have been made, along with updates to support the gcc-11 & C++20 toolchain.

Importer Re-write

The biggest change is that the ODA import process has been "inverted".

Previously all importers would set up layers in GeometryCollector. These layers were containers that would receive primitives from ODA as a view was vectorized. At the end the collection of primitives would be turned into a scene graph.

By building the graph post-hoc, the GeometryCollector could do things such as filter empty transforms, and apply the world offset with a global view of the scene - but with the streaming import, this global view no longer exists.

The way the ODA importers work now is that each takes explicit control over the container that ODA will write primitives into (called the 'draw context' object).

As ODA vectorizes a scene, it makes recursive calls to the draw()/doDraw() overrides. Within each implementation, the importer decides whether a new layer/transform should be created for that geometry, and if so updates the draw context. The draw context itself exists on that call's stack frame. When the stack unwinds back to that frame, the method checks if the context object has any geometry, and if so only then does it create the transformation & metadata nodes.

Global properties such as the world offsets are set beforehand in the File Processors, using APIs specific to each format to get scene bounds from whatever information is available in the file headers.

Each Data Processor builds the tree in a different way, so how the necessary information is stored with the context depends on the file format. Typically though, as the trees are quite simple, they are mostly just local variables in the frame that owns the context.

For performance, metadata is only collected when a node has geometry.

This behaviour is specific to RVT, DGN, and DWG, as these are the only importers that use the vectorization approach. NWD uses a more traditional tree traversal, where we can add our own parameters to the recursive method, and get geometry from a direct function call.

The inheritance of the ODA importers now looks like the following:

NWD -> RepoSceneBuilder
RVT -> GeometryCollector -> RepoSceneBuilder
DWG -> DataProcessor -> GeometryCollector -> RepoSceneBuilder
DGN -> DataProcessor -> GeometryCollector -> RepoSceneBuilder

This hierarchy better reflects how logic is actually shared between these types.

Optimisations and Upgrades

While these changes were being made, bouncer was tested with a number of recent files that were failing on production. Guided by the Visual Studio Profiler while importing these files, a number of changes were made to caching behaviour, the ODA APIs used, and control flow, in order to reduce import times and memory consumption.

Changes

This PR makes the following specific changes:

  1. Add a new type RepoSceneBuilder to the modelutility namespace. RepoSceneBuilder accepts Repo Nodes and graph operations (addParent), and uses them to populate a collection asynchronously via a worker thread. RepoSceneBuilder is intended to be used in place of the RepoScene constructor for streaming-enabled importers.
  2. Add a new type RepoMeshBuilder to the odaHelper namespace. RepoMeshBuilder receives faces (from an ODA tessellation object) and outputs MeshNodes. This is the mesh building part of the previous GeometryCollector.
  3. Completely re-write GeometryCollector. Instead of the nested dictionaries, the new type uses a stack of Context objects that is managed by the callers. Transformation and Metadata node management is separated from the context management, with the owner being responsible for connecting the two. Transformation and Metadata nodes are created immediately using RepoSceneBuilder.
  4. Scene offsets are now computed in the File Processors of each format. This is because ODA may initialise Data Processors multiple times for a given vectorisation, and computing bounds ahead of time is not assumed to be a cheap operation.
  5. The Revit importer can now use the ODA file unload feature. This is turned off by default though because it did not demonstrate any real memory benefits.
  6. A new RepoQuery, AddParent, has been introduced, which when run as an update adds a set of UUIDs to a document's parents array.[1]
  7. A new type, BulkWriteContext, has been added to the database namespace. This new type allows owners to make multiple insert and update calls, and have them automatically dispatched in bulk. The Mongo database handler has an implementation of this object. The abstract database handler has a new method to return such an object.
  8. The type alias repo_face_t has been replaced with an actual type, which mimics the API of the old one (a std::vector) but does not perform any of its own allocations. This is because the profiler was showing heap allocations to be a significant part of the hot path. Corresponding face types have been added for GeometryCollector and RepoSceneBuilder.
  9. The Revit importer now caches materials based on the ODA object handle, saving rebuilding the same material multiple times.
  10. A new colour type, repo_color3d_t, has been introduced. repo_material_t now uses repo_color3d_t instead of float vectors to store colours. This reduces heap allocations and also fixes the size of the colours in a repo_material_t, disambiguating how transparency is handled. Native support for this new type has been added to RepoBSONBuilder.
  11. repo_material_t::checksum() has been updated to use a std::hash of built-in primitives, instead of computing a CRC of the string representation, as profiling was showing the string operations to be a significant part of the hot path.
  12. VertexMap no longer performs vertex indexing, but simply maintains arrays. Vertex indexing can now be performed by MeshNode instances on themselves (removeDuplicateVertices()). This removes indexing from the hot path, and also makes it available to all importers. RepoSceneBuilder calls removeDuplicateVertices() on all Mesh Nodes in its worker thread.
  13. TransformReductionOptimizer has been decommissioned.
  14. IFCUtilsParser has been updated to absorb Transformation Nodes where possible, on import.
  15. The RepoQuery implementation has been updated to be easier to follow & enable proper abstractions through the use of visitors. Variants are used to declare the potential types for the visitors, as per the standard library. The use of variants to define specific types allows making different sets of Repo Queries that are supported by different database methods. The use of the visitor pattern puts the implementation in the database handler module, where it should be.
  16. All the structs in repo_structs.h have been moved into the repo namespace. This is because anonymous namespaces are unique between translation units (such as libraries), but the structs should be fungible (if a type is not used across translation units, it should probably not be in repo_structs.h...)
  17. Unused static toString methods have been removed from repo_structs.h.
  18. NWD sometimes missed tree entries. This is now fixed.
  19. This bug for the DGN processor has been fixed by ODA, so this snippet has been reinstated: https://account.opendesign.com/support/issue-tracking/DGN-2274
  20. RepoScene can now take a project and database name in a constructor, in order to represent just a pointer to a revisioned scene that already exists as a collection. This is the way it is intended to work with RepoSceneBuilder.
  21. RepoBSON::replaceBinaryWithReference() is now true to its name and deletes the BinMapping once the BSON has been updated.
  22. RepoNode has a new virtual member, getSize(), that is intended to return the total allocated memory owned by that node. This has overrides for TransformationNode and MeshNode.
  23. Revit importer now gets all user parameters using the getParameters method, instead of getParamsList, so only populated parameters are imported. This skips unpopulated parameters, such as shared project parameters, which would be ignored by the previous logic anyway.
  24. Scene project and database name are now set in ImportFromFile as some callers assume they are set.
  25. The tests have been given a new type, SceneUtils, for querying the scene graph as-imported for the purposes of the unit tests.
  26. A new unit test file, ut_repo_model_import_oda.cpp has been added to contain any ODA specific regression tests.
  27. The projectHasMetaNodesWithPaths & projectHasGeometryWithMetadata test functions have been removed (in favour of SceneUtils), and the tree validation performed for ODA types in the system tests have been moved into ut_repo_model_import_oda.cpp.
  28. The SRC exporter has been decommissioned.
  29. Any simple template declarators have been removed from constructor declarations, as per a change in the standard.
  30. UploadTestNWDProtected no longer tests whether the project exists, because RepoSceneBuilder will create the project beforehand (it would be up to the user, via io, to destroy the collection if they didn't want to attempt an upload again).
  31. RepoScene unit tests have been updated as RepoScene constructor no longer allows two root nodes.
  32. RepoLog has been moved into its own shared library, so the singleton can be shared between bouncers various static and dynamic dependencies. The RepoLog API and convenience preprocessor defines have been updated to expose a standard ostringstream, and hide the boost implementation inside the repo_log.cpp module. This is because the version of boost on Ubuntu 20.04 will not compile under C++20, and C++20 is required for the new threading behaviour.

Dependencies

This PR adds one third-party dependency:

  1. https://github.com/cameron314/readerwriterqueue

This is BSD licensed and is included as source (header-only).

Footnotes

[1] This is a minor abstraction leak, in the sense that database operations should not know what the parents array is. However, the alternative is that every instance stores the same string; given that we expect this operation to be used a lot, and the whole point of this ticket is memory performance, the leak was considered an acceptable trade-off.

Comments/Future Work

  1. I have had issues with generic server errors. So far these appear to be genuine Mongo errors, and there is nothing to be done client side; however, we should keep an eye on it.

  2. For this ticket, we should bear in mind that the survey and base point contribute to the bounds if visible, and can undermine the revision world offset.

  3. Mongo's bulk_write performance can be improved with unordered writes, but we'd need to ensure ourselves that all inserts took place before updates. So far it seems the performance is good enough without it.

  4. Regarding upgrading Revit files in-place, the following snippet has been tested and is successful. However, the implications are more nuanced than thought. The act of saving the file can consume a lot of memory (2x as much), so actually saving and reloading within a single process does not reduce the highwater mark, and would have to be run on a much bigger machine as a separate processing stage. We'd need to understand better the gains of loading upgraded files before deciding this is worth it.

if (pDb->latestFileVersion() > pDb->getOriginalFileVersion())
{
	auto filename = "D:/3drepo/3drepobouncer_ISSUE729/temp/" + svcs.getTempFileName() + ".rvt ...";
	repoInfo << "Saving upgraded Revit file to " << convertToStdString(filename);
	svcs.writeFile(filename, pDb, false);
	pDb = nullptr; // Set this to null first, to prompt the cleanup of the database, before we read in the new file.
	pDb = svcs.readFile(filename);
	repoInfo << "...done.";
}
  5. Two other opportunities for performance improvements, for which there is not enough evidence to justify the cost for now, include:
    a. Asynchronous file writing from BlobFilesHandler
    b. A third thread to perform the serialisation to BSONs.

  6. It is possible to turn on multithreaded rendering for Revit, in theory, but this doesn't have any effect in practice: https://forum.opendesign.com/showthread.php?23889-Inquiry-Regarding-Reading-Performance-Optimization-for-ODBM

  7. Our imports differ from Navis in a number of ways which we know of (by design), but may be picked up by the cBIM team. These are:
    a. Entities take the metadata of their parents, which is most noticeable for Element Ids. This is usually intuitive, but we can get a situation where the Element Ids for instanced groups are overridden, which is not what cBIM users would expect.
    b. Navis views are always imported shaded, but the file can specify realistic.
    c. The Navis importer ignores the Hidden state of objects.

Link dump

https://forum.opendesign.com/showthread.php?19803-Why-memory-is-so-different-in-REVIT
https://forum.opendesign.com/showthread.php?19004-Optimizing-load
https://forum.opendesign.com/showthread.php?19668-Performance-with-large-nwd-files
https://docs.opendesign.com/tbim/bimrv_unload.html
https://www.mongodb.com/docs/manual/core/aggregation-pipeline-limits/

sebjf added 30 commits January 7, 2025 14:59
@sebjf sebjf requested a review from carmenfan February 18, 2025 09:05
@carmenfan carmenfan self-assigned this Feb 18, 2025
@carmenfan carmenfan changed the base branch from master to staging February 18, 2025 14:36
@carmenfan carmenfan merged commit 64ab771 into staging Mar 18, 2025
7 of 8 checks passed
@carmenfan carmenfan deleted the ISSUE_729 branch March 18, 2025 13:46
@carmenfan carmenfan removed their assignment Mar 18, 2025
Successfully merging this pull request may close these issues.

ODA importer should stream into database instead of holding scene in memory