Skip to content

Conversation

@NripeshN
Copy link
Owner

@NripeshN NripeshN commented Jan 20, 2026

Note

Engine/GPU: Integrates Apple Silicon GPU NNUE in Eval with a toggle; replaces legacy batch_ops/gpu_accumulator with gpu_accumulator_cache (Finny tables, sparse helpers); extends GPU backend API (hardware capabilities, batch sizing, shutdown); modernizes CUDA backend and implements comprehensive NNUE CUDA kernels.

Build/CMake: Updates GPU and MCTS sources (new parallel_hybrid_search, apple_silicon_mcts), compiles nnue.metal, links Accelerate on macOS, and cleans up CUDA target properties.

CI/Tournament: Tweaks workflow defaults (games/time control), enables cancel-in-progress, installs chess, adds Berserk clone/build and Stockfish opening book extraction, introduces MetalFish MCTS wrapper rename, verifies/packages new artifacts (including engines_config.json, book, Berserk), and sets exec perms.

Docs/Tools: Overhauls README with Lc0-derived MCTS/hybrid details, commands, structure, and tournament info; updates GPU benchmark to use NNUE manager; adds results/ to .gitignore.

Written by Cursor Bugbot for commit 993de57. This will update automatically on new commits. Configure here.

- Updated HybridSearchBridge to utilize the Stockfish engine for move verification and search operations, improving accuracy and performance.
- Added support for synchronous search in the Engine class, allowing for real-time move evaluations.
- Enhanced fallback mechanisms to ensure functionality when the engine is not available, maintaining robustness in MCTS operations.
- Updated documentation to reflect the integration of the Stockfish engine and its impact on search capabilities.
cursor[bot]

This comment was marked as outdated.

…-to-move

- Updated score calculations in verify_with_alphabeta to correctly negate scores based on the player's perspective (white vs. black).
- Added comments for clarity on the score negation logic, improving code readability and maintainability.
- Removed redundant updates to ab_nodes in HybridSearchBridge, clarifying the statistics handling.
…e verification

- Modified the initialize method to accept an Engine parameter, enabling the use of Stockfish for more accurate alpha-beta verification.
- Enhanced the verify_with_alphabeta function to utilize the engine for move evaluations, providing a fallback to NNUE when the engine is unavailable.
- Updated the create_enhanced_hybrid_search factory function to pass the engine instance, ensuring seamless integration.
- Improved comments and code structure for better readability and maintainability.
cursor[bot]

This comment was marked as outdated.

- Removed EnhancedHybridSearch and its associated files, streamlining the codebase.
- Added ParallelHybridSearch, which integrates MCTS and Alpha-Beta search for improved performance.
- Updated CMakeLists.txt to include new source files and removed references to the deleted enhanced hybrid search.
- Modified UCI commands to support the new parallel hybrid search functionality.
- Adjusted CI workflow and wrapper scripts to accommodate the changes in search commands and engine configurations.

These updates enhance the search capabilities of MetalFish, optimizing for both strategic and tactical play through parallel processing.
- Introduced new tests for ABSearchResult, ABSearchConfig, and ABSearcher to validate functionality and default configurations.
- Added TacticalAnalyzer and HybridSearchBridge tests to ensure proper behavior and initialization.
- Updated test_mcts.cpp to include these new tests, enhancing coverage for the MCTS module and ensuring robustness in search capabilities.
cursor[bot]

This comment was marked as outdated.

- Removed legacy GPU components including nnue_eval, batch_ops, and persistent_pipeline to streamline the codebase.
- Introduced new Apple Silicon-specific optimizations in MCTS, including unified memory handling and SIMD-accelerated computations.
- Updated CMakeLists.txt to reflect the removal of obsolete files and added new sources for Apple Silicon optimizations.
- Enhanced the MCTS algorithms with Lc0-inspired techniques, improving performance and efficiency in search operations.
- Adjusted CI workflows to accommodate the changes in GPU and MCTS implementations, ensuring compatibility and robustness.

These updates significantly enhance the performance of MetalFish on Apple Silicon, optimizing both memory usage and computational efficiency.
…lHybridSearch

- Added an initialization check in the start_search method to prevent search execution without proper setup, ensuring stability.
- Refined the decision-making logic in make_final_decision to better weigh confidence and score differences, improving move selection accuracy.
- Updated async evaluation submission to safely capture batch count, preventing potential use-after-free issues.

These changes enhance the robustness and effectiveness of the ParallelHybridSearch component, optimizing its performance in various scenarios.
cursor[bot]

This comment was marked as outdated.

- Introduced a custom deleter for posix_memalign-allocated memory to ensure proper deallocation using free() instead of delete[].
- Updated memory handling in the MCTS implementation to utilize unique_ptr with the new AlignedDeleter on Apple platforms, enhancing memory management and compatibility.

These changes improve memory safety and performance on Apple Silicon, aligning with recent optimizations in the codebase.
cursor[bot]

This comment was marked as outdated.

The FPU calculation was using -parent_q instead of parent_q, which inverted
the exploration behavior. This caused unvisited nodes to be over-punished in
winning positions and incorrectly preferred in losing positions.

Fixed by changing the formula from:
  fpu = -parent_q - reduction
to:
  fpu = parent_q - reduction

This fix was applied in:
- src/mcts/hybrid_search.cpp
- src/mcts/thread_safe_mcts.cpp
- src/mcts/lc0_mcts_core.h (ComputeFpu and ComputeFpuSimple)

The correct formula ensures unvisited nodes are slightly pessimistic compared
to visited children, as intended by the FPU reduction strategy.
@cursor
Copy link

cursor bot commented Jan 20, 2026

Bugbot Autofix resolved the bug found in the latest run.

  • ✅ Fixed: FPU formula has inverted sign causing wrong exploration
    • Changed FPU formula from -parent_q - reduction to parent_q - reduction in all four locations (hybrid_search.cpp, thread_safe_mcts.cpp, and lc0_mcts_core.h) to correctly implement pessimistic unvisited node values.

@NripeshN
Copy link
Owner Author

@cursor review

cursor[bot]

This comment was marked as outdated.

…max functions

- Bug 1: Use config_.cpuct_base and config_.cpuct_factor instead of hardcoded local constants in select_child_puct
- Bug 2: Handle temperature==0 as argmax in compute_softmax_simd to prevent division by zero
- Bug 3: Handle temperature==0 as argmax in expand_node policy softmax to prevent division by zero
@cursor
Copy link

cursor bot commented Jan 20, 2026

Bugbot Autofix resolved all 3 of the 3 bugs found in the latest run.

  • ✅ Fixed: Config parameters cpuct_base and cpuct_factor are ignored
    • Replaced hardcoded local constants with config_.cpuct_base and config_.cpuct_factor in select_child_puct function.
  • ✅ Fixed: Division by zero when temperature is zero in softmax
    • Added temperature==0 check to handle argmax case before division in compute_softmax_simd function.
  • ✅ Fixed: Division by zero in expand_node policy softmax
    • Added temperature==0 check to handle argmax case before division in expand_node policy softmax.

- Introduced a global cleanup function to ensure proper shutdown of GPU resources before program exit, preventing potential crashes during static destruction.
- Registered the cleanup function with atexit to guarantee execution at the end of the program.
- Implemented shutdown functions for MCTS components, GPU feature extractor, and GPU NNUE manager to streamline resource management.
- Updated the main function to explicitly handle cleanup and ensure all GPU operations are synchronized before destruction.

These changes improve the stability and reliability of the MetalFish engine, particularly in scenarios involving GPU resources.
…ce CPU backend shutdown handling

- Modified the GitHub Actions workflow to trigger on pull request events, allowing automatic updates of tournament results.
- Removed the manual input for PR number, streamlining the process by directly using the pull request context.
- Added no-op shutdown functions for the CPU backend to improve code clarity and maintainability, ensuring safe shutdown behavior in CPU fallback mode.

These changes enhance the automation of tournament result posting and improve the overall structure of the backend code.
cursor[bot]

This comment was marked as outdated.

…n MCTS

- Add nullptr check in AppleSiliconNodePool::allocate() to handle posix_memalign failures
- Add bounds checking when accessing GPU batch results to prevent out-of-bounds access
- Matches the defensive pattern used in hybrid_search.cpp
@cursor
Copy link

cursor bot commented Jan 21, 2026

Bugbot Autofix resolved 2 of the 2 bugs found in the latest run.

  • ✅ Fixed: Memory allocation failure causes undefined pointer arithmetic
    • Added nullptr check at the start of allocate() to return nullptr when posix_memalign fails in constructor, preventing undefined behavior from pointer arithmetic on null pointer.
  • ✅ Fixed: Missing bounds check on GPU batch results access
    • Added bounds checking before accessing gpu_batch.psqt_scores[i] and gpu_batch.positional_scores[i], matching the defensive pattern used in hybrid_search.cpp to handle partial GPU evaluation failures.

- Introduced PGN parsing capabilities to extract individual game results and details, improving the tournament result reporting.
- Enhanced output formatting for game results, including color-coded results and move counts.
- Updated match summary display to include final scores and detailed game outputs.
- Added new shell scripts for MetalFish engine wrappers to facilitate hybrid and multi-threaded MCTS execution.

These changes significantly improve the usability and clarity of tournament results, while also streamlining engine interactions.
- Integrated the Berserk engine into the Elo tournament workflow, allowing it to be cloned and built alongside other engines.
- Updated the tournament logic to include Berserk as a configurable engine, with an expected Elo rating of 3550.
- Enhanced the output to check for the presence of the Berserk binary and report its status during the tournament execution.

These changes improve the tournament's engine diversity and provide additional options for performance evaluation.
cursor[bot]

This comment was marked as outdated.

- Introduced a new JSON configuration file for engine settings, allowing for dynamic loading of engine parameters such as expected Elo ratings and options.
- Updated the `elo_tournament.py` script to load engine configurations from the new `engines_config.json`, enhancing flexibility and maintainability.
- Added support for multiple engines including MetalFish variants, Patricia, Berserk, and Lc0, with their respective configurations.
- Improved the tournament setup process by integrating the new configuration loading mechanism, ensuring a more streamlined and customizable tournament experience.

These changes significantly enhance the configurability of the Elo tournament, allowing for easier adjustments and additions of new engines in the future.
- 16 games × 20 min = 5h 20min (worst case)
- 8 openings × 2 (color swap) = 16 games
- Most games end earlier due to resignation/checkmate
- Convert UCI moves to Standard Algebraic Notation (e.g., Nf3 instead of g1-f3)
- Show live score [W-D-L] and estimated Elo difference in header
- Elo calculated from win rate using standard formula
- Replace manual UCI-to-SAN conversion with python-chess library
- Proper disambiguation (only when needed, not always)
- Includes check (+) and checkmate (#) symbols
- Handles all edge cases: castling, en passant, promotions
- Much cleaner and more reliable code
Major cleanup - removed 250+ lines of manual chess logic:
- Removed _fen_to_board, _can_piece_reach, _is_path_clear, _parse_move
- Removed manual castling, en passant, promotion handling
- Removed duplicate reset method
- Removed sync_board_to_visualizer helper

Now using python-chess internally:
- ChessBoardVisualizer.chess_board for all game state
- apply_move() accepts both SAN and UCI notation
- get_san() converts UCI to proper SAN notation
- _sync_board() syncs python-chess state to display array

Benefits:
- 100% correct move validation and handling
- Proper disambiguation, check/checkmate symbols
- Much simpler and more maintainable code
Install 'chess' package in all jobs that run elo_tournament.py:
- build job: pip3 install chess (alongside meson ninja)
- run-matches job: pip3 install chess
- aggregate job: pip3 install chess

This ensures the ChessBoardVisualizer works correctly in CI.
1. Simplified move display: '8... b5' instead of 'Move 16: 8... b5'
   - Removed redundant 'Move X:' prefix since the chess notation already shows the move number

2. Fixed header box alignment:
   - Calculate padding using visible character length, not including ANSI codes
   - Apply colors after calculating spacing to ensure proper alignment

The Elo showing ±0 is correct - it only updates after games finish.
- Parse 'info ... score cp X' and 'score mate X' from engine output
- Display eval in pawns format: +0.5, -1.2, M3, M-5
- Score is shown from White's perspective (negated for Black's moves)
- Format: [W-D-L] Eval: +0.5

The Elo difference is only meaningful after games complete,
but live eval shows the current position assessment.
Live eval was problematic:
- Would need a separate engine to evaluate (overhead)
- Using playing engine's eval is confusing (alternates between engines)

Now just shows clean [W-D-L] format: [0-0-0], [1-0-0], etc.
Both colors now use ♚♛♜♝♞♟ (filled pieces)
Differentiated by text color: cream for white, dark for black
Looks more consistent and visible than outlined pieces
cursor[bot]

This comment was marked as outdated.

Add defensive null check for positions_data() result before dereferencing.
This prevents potential crash if positions_buffer_ is uninitialized.
@cursor
Copy link

cursor bot commented Jan 21, 2026

Bugbot Autofix resolved the bug found in the latest run.

  • ✅ Fixed: Null pointer dereference in batch position addition
    • Added defensive null check for positions_data() result before dereferencing to prevent potential crash if positions_buffer_ is uninitialized.

The S3 bucket (berserk-networks.s3.amazonaws.com) returns Access Denied.
Now downloading from GitHub releases instead:
- Network: berserk-d43206fe90e4.nn (Berserk 13 release)
- URL: https://github.com/jhonnold/berserk/releases/download/13/
- Added file size verification (must be >1MB)
- Using 'make all' instead of 'make build' to skip makefile's download
cursor[bot]

This comment was marked as outdated.

…results file

- Add null check in hybrid_bridge() to prevent dereferencing after shutdown
- Delete accidentally committed test results file
- Add results/ directory to .gitignore
@cursor
Copy link

cursor bot commented Jan 21, 2026

Bugbot Autofix resolved 2 of the 2 bugs found in the latest run.

  • ✅ Fixed: Null dereference after shutdown in hybrid_bridge accessor
    • Added null check in hybrid_bridge() to throw exception when accessed after shutdown, preventing null pointer dereference.
  • ✅ Fixed: Test results file with empty data accidentally committed
    • Deleted the test results file and added results/ directory to .gitignore to prevent future generated output from being committed.

- Detect CI via CI or GITHUB_ACTIONS environment variables
- In CI: show simple text output (game start, results)
- Locally: show beautiful real-time chess board visualization
- This prevents crazy ANSI escape codes in GitHub Actions logs
- Display points (1 for win, 0.5 for draw) next to each engine name
- Format: 'Engine1 (2.5) vs (1.5) Engine2'
- Points update correctly regardless of which side each engine plays
- Much clearer than W-D-L when sides swap between games
- Updated GPU NNUE integration to support a unified implementation for Apple Silicon.
- Introduced new Metal shaders for NNUE evaluation, optimizing for unified memory.
- Replaced the previous GPU accumulator with a new cache system using Finny tables for efficient incremental updates.
- Enhanced batch processing capabilities for feature extraction and evaluation.
- Removed deprecated GPU accumulator files and updated related tests to reflect the new architecture.
- Added control options for enabling/disabling GPU NNUE in the engine configuration.
cursor[bot]

This comment was marked as outdated.

The GPU path was incorrectly skipping re-evaluation with the big network
when a position initially used the small network but produced an uncertain
evaluation (within 277 centipawns). The GPU now properly re-evaluates with
the big network, matching the CPU path behavior.
@cursor
Copy link

cursor bot commented Jan 21, 2026

Bugbot Autofix resolved the bug found in the latest run.

  • ✅ Fixed: GPU path skips big network re-evaluation for uncertain positions
    • Added GPU re-evaluation with big network to match CPU path behavior, ensuring consistent evaluation quality for uncertain positions.

@NripeshN
Copy link
Owner Author

bugbot run

cursor[bot]

This comment was marked as outdated.

- Introduced cleanup function for parallel hybrid search to ensure proper GPU resource management before shutdown.
- Updated hybrid search to clear callbacks to prevent crashes during cleanup.
- Added NNUE benchmarking functionality to compare CPU and GPU performance, including detailed output for evaluation times and consistency checks.
- Refactored parallel hybrid search initialization to reuse a persistent object, avoiding repeated construction/destruction issues.
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A Cloud Agent has been kicked off to fix the reported issue.

- Updated workflow_dispatch input defaults to match env fallbacks
- games_per_match: 20 -> 16
- time_control: 10+0.1 -> 600+0.1
- Ensures consistent tournament behavior between PR and manual triggers
@cursor
Copy link

cursor bot commented Jan 21, 2026

Bugbot Autofix resolved the bug found in the latest run.

  • ✅ Fixed: Inconsistent workflow defaults cause different tournament behaviors
    • Updated workflow_dispatch input defaults to match env fallback values, ensuring consistent tournament behavior (16 games, 600+0.1 time control) across PR and manual triggers.

- Removed deprecated hybrid search and batch evaluator files to streamline the codebase.
- Introduced ThreadSafeMCTS for enhanced stability in parallel search operations.
- Updated GPU backend to include new hardware detection APIs and improved memory management.
- Enhanced NNUE evaluation with better logging and error handling.
- Adjusted UCI options to enable hybrid search directly, eliminating the need for wrapper scripts.
- Improved test coverage for MCTS components and GPU functionality.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants