@vbelouso
Collaborator

@vbelouso vbelouso commented Oct 28, 2025

Summary

This PR replaces the legacy regex-based Go segmenter with a native Tree-Sitter parser (GoSegmenterExtended), enabling syntax-aware extraction of Go functions, methods, anonymous functions, and types with deterministic and reproducible results.

Rationale

  • The previous segmenter was purely heuristic and text-based, which led to:
      • Incomplete parsing of modern Go syntax (e.g. generics, inline funcs, nested declarations).
      • Non-deterministic document order and inconsistent call-chain results.
      • Fragile regex logic that required frequent maintenance as Go evolved.
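
To illustrate the kind of fragility described above, here is a minimal sketch. The pattern `NAIVE_FUNC_RE` is hypothetical and is not the legacy segmenter's actual regex; it only shows how a text-based pattern written without generics in mind can silently skip a generic function.

```python
import re

# Hypothetical pre-generics pattern (NOT the legacy segmenter's real regex):
# "func", an optional receiver, a name, then an opening parenthesis.
NAIVE_FUNC_RE = re.compile(r"^func\s+(?:\([^)]*\)\s*)?(\w+)\s*\(", re.MULTILINE)

GO_SOURCE = """\
func Plain(x int) int { return x }

func (s *Server) Handle(req Request) error { return nil }

func Map[T any, U any](xs []T, f func(T) U) []U { return nil }
"""


def naive_function_names(source: str) -> list:
    """Return the function names the naive pattern manages to find."""
    return NAIVE_FUNC_RE.findall(source)


found = naive_function_names(GO_SOURCE)
print(found)  # the generic function Map is missing from the result
```

The type-parameter list `[T any, U any]` sits between the name and the opening parenthesis, so the pattern never matches that declaration. A grammar-based parser does not depend on such positional assumptions.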

Implementation highlights

  • Introduced GoSegmenterExtended, powered by the official Tree-Sitter Go grammar.
  • Integrated it into ExtendedLanguageParser as the default segmenter for .go files.
  • Added deterministic sorting for reproducible dependency-chain resolution.
  • Reworked tests to cover generics, pointer receivers, multiline methods, and anonymous functions.
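
The deterministic sorting mentioned above can be sketched as follows. The `Chunk` type and its fields are assumptions made for illustration; the PR description does not state which keys the real implementation sorts by.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for an extracted code chunk."""
    source: str      # file path the chunk came from
    start_byte: int  # offset of the chunk within that file
    text: str


def sort_chunks(chunks):
    """Order chunks by (file path, offset) so the result does not depend
    on the order in which files or syntax nodes happened to be visited."""
    return sorted(chunks, key=lambda c: (c.source, c.start_byte))


a = Chunk("pkg/b.go", 40, "func B2() {}")
b = Chunk("pkg/a.go", 10, "func A() {}")
c = Chunk("pkg/b.go", 0, "func B1() {}")

# Two different discovery orders produce the same final order.
first = sort_chunks([a, b, c])
second = sort_chunks([c, a, b])
print([ch.text for ch in first])
```

Sorting on a stable, content-derived key is one common way to make downstream results reproducible across runs.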

Architectural impact

The segmentation layer now uses a structured syntax tree (Tree-Sitter) instead of regex parsing.

Downstream modules such as ChainOfCallsRetriever and function analyzers still operate on text chunks, but now those chunks are syntactically well-formed and consistent across runs.

This refactor lays the foundation for future AST-based semantic analysis (e.g. variable/type inference, symbol resolution).

Benchmark

Tested on https://github.com/openshift/origin with 35001 Go files

| Metric | Legacy (Regex) | New (Tree-Sitter) | Δ |
|---|---|---|---|
| Total runtime | 145.11 s | 93.59 s | −35 % |
| Extracted chunks | 411 732 | 593 793 | +44 % coverage |
| Function calls (profiling) | 15.2 M | 9.8 M | −36 % overhead |
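
The Δ column is a relative change computed from the raw figures. A short check, using only the numbers from the table:

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures copied from the benchmark table.
runtime = percent_change(145.11, 93.59)
chunks = percent_change(411_732, 593_793)
calls = percent_change(15.2e6, 9.8e6)

for label, value in [("runtime", runtime), ("chunks", chunks), ("calls", calls)]:
    print(f"{label}: {value:+.1f} %")
```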

@vbelouso vbelouso self-assigned this Oct 28, 2025
@vbelouso vbelouso added the enhancement New feature or request label Oct 28, 2025
@vbelouso vbelouso force-pushed the go-extended-parser branch 2 times, most recently from 4b877d9 to a627169 Compare October 28, 2025 13:39
@vbelouso vbelouso requested a review from zvigrinberg October 28, 2025 13:40
@zvigrinberg
Collaborator

Hi @vbelouso, could you please rebase and resolve the conflicts before I start reviewing it?
Thanks.

…itter implementation

Signed-off-by: Vladimir Belousov <[email protected]>
@vbelouso
Collaborator Author

vbelouso commented Nov 2, 2025

Hi @vbelouso, could you please rebase and resolve the conflicts before I start reviewing it? Thanks.

Done

- return re.search("[A-Z][a-z0-9-]*", function_name)
+ return bool(re.search("[A-Z][a-z0-9-]*", function_name))

  def get_function_name(self, function: Document) -> str:
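
The change above wraps `re.search` in `bool(...)`. `re.search` returns a `re.Match` object or `None`, so the unwrapped version is only truthy or falsy rather than a real boolean. A minimal sketch of the difference (the helper names are hypothetical; the pattern is the one from the diff):

```python
import re

PATTERN = "[A-Z][a-z0-9-]*"  # pattern taken from the diff


def check_before(function_name: str):
    # Returns a re.Match object or None, not a bool.
    return re.search(PATTERN, function_name)


def check_after(function_name: str) -> bool:
    # Returns a strict True/False.
    return bool(re.search(PATTERN, function_name))


print(type(check_before("Println")).__name__)
print(check_after("Println"), check_after("println"))
```

Note that `re.search` scans the whole string, so a name such as `doThing` also matches because of the inner uppercase letter. Whether that is intended is not stated in the PR.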
Collaborator

@zvigrinberg zvigrinberg Nov 2, 2025

@vbelouso This is an example of something that is not working correctly (the example test below fails): get_function_name should return the name of the variable that holds the anonymous function.

@pytest.mark.asyncio
async def test_transitive_search_golang_generic():
    parser = GoLanguageFunctionsParser()
    doc1 = Document(page_content=("greet := func() { // Assigning anonymous function to a variable 'greet'\n"
                                  "		fmt.Println(\"Greetings from a variable-assigned anonymous function!\")\n"
                                  "	}"))
    name = parser.get_function_name(doc1)
    print(f"name_of_function={name}")
    assert name == "greet"

Your revised GoSegmenter with Tree-Sitter parses the anonymous function assigned to a variable correctly,
but instead of returning the name of the variable in this case, it returns :=, which is incorrect. Please check.
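
The expected behavior can be sketched without Tree-Sitter. The snippet below is a regex stand-in, not the PR's implementation, and `assigned_func_name` is a hypothetical helper. It only shows which identifier should come back for a function literal assigned with `:=` or `var ... =`.

```python
import re
from typing import Optional

# Matches `name := func(` and `var name = func(`; the identifier is group 1.
ASSIGNED_FUNC_RE = re.compile(
    r"^\s*(?:var\s+)?([A-Za-z_]\w*)\s*(?::=|=)\s*func\s*\("
)


def assigned_func_name(chunk: str) -> Optional[str]:
    """Return the variable an anonymous function is assigned to, or None."""
    match = ASSIGNED_FUNC_RE.match(chunk)
    return match.group(1) if match else None


chunk = (
    "greet := func() { // Assigning anonymous function to a variable 'greet'\n"
    '\t\tfmt.Println("Greetings from a variable-assigned anonymous function!")\n'
    "\t}"
)
print(assigned_func_name(chunk))
```

The point is that the name lives on the left-hand side of the assignment, so the operator token itself must never be returned.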

Collaborator Author

@zvigrinberg
Updated.
I also increased the number of test cases.
