Conversation

jcobis
Collaborator

@jcobis jcobis commented Aug 15, 2025

Description

Converts the raw MongoDB schema analysis object into a flat, LLM-friendly format for the Mock Data Generator feature.

  • Nested documents are represented with dot notation (e.g., user.profile.name)
  • Uses bracket notation for arrays (e.g., users[], matrix[][])
  • Maintains field sample values

Notation examples:

  • Nested documents: user.profile.name (dot notation)
  • Array: users[] (bracket notation)
  • Nested arrays: matrix[][] (multiple brackets)
  • Nested array of documents fields: users[].name (brackets + dots)
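
A minimal sketch of the notation above, assuming a plain sample document as input rather than a mongodb-schema Schema. The names flattenPaths and leafTypeName are hypothetical and this is not the processSchema implementation from this PR; it only records leaf paths and inspects the first element of each array.

```typescript
// Hypothetical illustration only: maps each leaf of a sample document to a
// flat path using dot notation for documents and "[]" for arrays.
type FlatPaths = Record<string, string>;

function leafTypeName(value: unknown): string {
  if (value === null) return 'Null';
  if (typeof value === 'string') return 'String';
  if (typeof value === 'number') return 'Number';
  if (typeof value === 'boolean') return 'Boolean';
  return 'Unknown';
}

function flattenPaths(value: unknown, path: string, out: FlatPaths): FlatPaths {
  if (Array.isArray(value)) {
    // Arrays append "[]" to the path; only the first element is inspected,
    // and empty arrays contribute no entry in this sketch.
    if (value.length > 0) flattenPaths(value[0], `${path}[]`, out);
  } else if (value !== null && typeof value === 'object') {
    // Nested documents join keys with dots.
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      flattenPaths(child, path ? `${path}.${key}` : key, out);
    }
  } else {
    out[path] = leafTypeName(value);
  }
  return out;
}

const flat = flattenPaths(
  {
    user: { profile: { name: 'Ada' } },
    users: [{ name: 'Bo' }],
    matrix: [[1, 2], [3]],
  },
  '',
  {}
);
console.log(flat);
```

The resulting keys are user.profile.name, users[].name, and matrix[][], matching the four notation cases listed above.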

Checklist

  • New tests and/or benchmarks are included
  • Documentation is changed or added
  • If this change updates the UI, screenshots/videos are added and a design review is requested

Motivation and Context

The existing mongodb-schema structure is overly verbose for our feature. Its nested structures are difficult for LLMs to parse and do not meet the requirements of LLM structured outputs (e.g., no optional fields).

Types of changes

  • Backport Needed
  • Patch (non-breaking change which fixes an issue)
  • Minor (non-breaking change which adds functionality)
  • Major (fix or feature that would cause existing functionality to change)

@jcobis jcobis requested a review from a team as a code owner August 19, 2025 17:10
Contributor

@Copilot Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


const result = processSchema(schema);

expect(result).to.deep.equal({
Collaborator

@kpamaran kpamaran Aug 19, 2025

ah this is a good example of how the dot notation combined with the constraints we've placed simplifies the parsing and surface area for the LLM to write to

{
name: 'Array',
bsonType: 'Array',
path: ['cube'],
Collaborator

@kpamaran kpamaran Aug 19, 2025

To confirm my understanding: the path stays as ['cube'] because the named field/path captures arrays-within-arrays at all levels (until a document is reached)?

Collaborator Author

Yes, from my understanding


const result = processSchema(schema);

expect(result).to.deep.equal({
Collaborator

@kpamaran kpamaran Aug 19, 2025

good example here and below 👍🏼

fieldProbability?: number,
arraySampleValues?: unknown[]
): void {
if (type.name === 'Array' || type.bsonType === 'Array') {
Collaborator

@kpamaran kpamaran Aug 19, 2025

This should use type-predicate validators like isArraySchemaType in compass-schema/src/components/field.tsx.

Type guards give each branch enough type information to avoid casts like type as ArraySchemaType.

See https://www.typescriptlang.org/docs/handbook/2/narrowing.html#using-type-predicates
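
A small sketch of the type-predicate pattern suggested here, assuming simplified stand-in shapes. SketchType, SketchArrayType, isSketchArrayType, and describeType are hypothetical names, not the real mongodb-schema types or the isArraySchemaType helper.

```typescript
// Hypothetical, simplified shapes; the real schema types carry more fields.
interface SketchPrimitiveType {
  name: string;
  bsonType: string;
}

interface SketchArrayType {
  name: 'Array';
  bsonType: 'Array';
  types: SketchType[];
}

type SketchType = SketchPrimitiveType | SketchArrayType;

// Type predicate: a `true` result narrows `type` to SketchArrayType.
function isSketchArrayType(type: SketchType): type is SketchArrayType {
  return type.bsonType === 'Array';
}

function describeType(type: SketchType): string {
  if (isSketchArrayType(type)) {
    // `type.types` is accessible here without a cast.
    return `Array of ${type.types.map((inner) => describeType(inner)).join(' | ')}`;
  }
  return type.name;
}

const nested: SketchArrayType = {
  name: 'Array',
  bsonType: 'Array',
  types: [
    {
      name: 'Array',
      bsonType: 'Array',
      types: [{ name: 'Number', bsonType: 'Number' }],
    },
  ],
};

console.log(describeType(nested));
```

Inside the guarded branch the compiler already knows the value is the array variant, so the body never needs an `as` assertion.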


const arrayPath = `${currentPath}[]`;
const sampleValues =
arraySampleValues || getSampleValues(arrayType).slice(0, 3); // Limit full-context array sample values to 3
Collaborator

nit: instead of the magic 3, use a constant
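
One way this could look, as a sketch. MAX_ARRAY_SAMPLE_VALUES and limitSamples are hypothetical names, not necessarily what the PR ended up using.

```typescript
// Hypothetical constant name; it replaces the bare `3` in the slice call.
const MAX_ARRAY_SAMPLE_VALUES = 3;

// Returns at most MAX_ARRAY_SAMPLE_VALUES items without mutating the input.
function limitSamples<T>(samples: readonly T[]): T[] {
  return samples.slice(0, MAX_ARRAY_SAMPLE_VALUES);
}

console.log(limitSamples(['a', 'b', 'c', 'd', 'e']));
```

Naming the limit keeps the intent in one place and removes the need for the trailing comment on the slice call.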

Collaborator

@kpamaran kpamaran left a comment

had some minor feedback; lgtm overall

@@ -30,9 +29,16 @@ export type SchemaAnalysisErrorState = {
error: SchemaAnalysisError;
};

export interface FieldInfo {
type: string; // MongoDB type (eg. String, Double, Array, Document)
Collaborator

could type be a string union type?
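
A sketch of what a string union could look like here. The list of names is a hypothetical subset taken from the code comment above, not the full set of MongoDB types, and MongoTypeName, FieldInfoSketch, and isMongoTypeName are illustrative names only.

```typescript
// Hypothetical subset of type names, for illustration only.
const MONGO_TYPE_NAMES = ['String', 'Double', 'Array', 'Document'] as const;

// Union derived from the array: 'String' | 'Double' | 'Array' | 'Document'.
type MongoTypeName = (typeof MONGO_TYPE_NAMES)[number];

interface FieldInfoSketch {
  type: MongoTypeName;
}

// Runtime check for values that arrive as plain strings.
function isMongoTypeName(value: string): value is MongoTypeName {
  return (MONGO_TYPE_NAMES as readonly string[]).includes(value);
}

const field: FieldInfoSketch = { type: 'Double' };
// const bad: FieldInfoSketch = { type: 'double' }; // would not compile

console.log(field.type, isMongoTypeName('double'));
```

With the union in place, a misspelled type name is caught at compile time instead of flowing through as an arbitrary string.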

@@ -66,7 +66,8 @@
"react": "^17.0.2",
"react-redux": "^8.1.3",
"redux": "^4.2.1",
- "redux-thunk": "^2.4.2"
+ "redux-thunk": "^2.4.2",
+ "bson": "^6.10.1"
Collaborator

since bson is only used for its types, it should be a dev dependency so it doesn't get bundled to prod

Collaborator Author

@jcobis jcobis Aug 20, 2025

The dependency checker was still complaining even with using import type.

Collaborator Author

I tried adding 'bson' to the ignores in packages/compass-collection/.depcheckrc, but I'm not sure about it.

Collaborator Author


function isPrimitiveSchemaType(type: SchemaType): type is PrimitiveSchemaType {
return (
!isConstantSchemaType(type) &&
Collaborator

@kpamaran kpamaran Aug 20, 2025

[nit] null and undefined classify as primitives too

Collaborator Author

Collaborator

@kpamaran kpamaran Aug 20, 2025

weird, let's go with consistency then. I disagree with that source's typing but it may be working around some type issue

@jcobis jcobis requested a review from ncarbon August 21, 2025 17:39
Member

@Anemy Anemy left a comment

Left a comment on where this helper lands. Not really a blocker. Approving in case this blocks other work, but it would be nice to have this in the shared spot if possible and y'all think it's worth having there.

* Transforms a raw mongodb-schema Schema into a flat Record<string, FieldInfo>
* using dot notation for nested fields and bracket notation for arrays.
*/
export function processSchema(schema: Schema): Record<string, FieldInfo> {
Member

Should we add this to the https://github.com/mongodb-js/mongodb-schema package?
That way places like the VSCode extension, and other parts of Compass can use it more easily.

Looks like there's already a getSchemaPaths we have on the schema analysis that we're using in Compass' import-export already. It's similar but not the same.
https://github.com/mongodb-js/mongodb-schema/blob/6f22cb72aaaf97a0bbbb6ae92a4fdadc4290db67/src/schema-analyzer.ts#L640

In import-export:

Maybe we add some options to the getSchemaPaths or have it as a separate function on the SchemaAnalyzer?

Collaborator Author

That's a good idea. How does this sound: if/when this code could be re-used effectively, we can consider whether moving it to the https://github.com/mongodb-js/mongodb-schema package would be good. (Note also that we are releasing this feature as an experiment (A/B test) that we may iterate on)

@jcobis jcobis merged commit 97a35b4 into main Aug 21, 2025
57 of 59 checks passed
@jcobis jcobis deleted the CLOUDP-337090_v2 branch August 21, 2025 19:31
Labels
feat no release notes Fix or feature not for release notes
4 participants