Add additional module metadata to the dataset #14

boomanaiden154 · 2024-03-04T22:12:14Z

As part of efforts to associate additional data with rows in the dataset (like inputs), we need a way to identify rows in the dataset across different versions. The plan to achieve this is to add the following columns to the dataset:

Add a top-level package_hash field to the build manifest to represent the project versioning information. #25
Rebuild the dataset with the new package_hash field.
Add a path, target name, or other unique ID for the specific module within the project.
Add project versioning information (like a commit SHA).
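
A minimal sketch of how these columns could be combined to identify a row across dataset versions. Everything here is an assumption for illustration: `ModuleKey`, `module_id`, `row_id`, and `find_row` are hypothetical names that do not exist in the project, and SHA-256 is just one possible choice of hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleKey:
    """Hypothetical identifier for one dataset row (all names are assumptions)."""

    package_hash: str  # project versioning information, e.g. a commit SHA
    module_id: str     # path or target name unique within the project

    def row_id(self) -> str:
        # Join with a NUL separator so ("ab", "c") and ("a", "bc") hash differently.
        raw = "\0".join([self.package_hash, self.module_id]).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()


def find_row(rows, key):
    """Return the first row whose metadata columns match the key, else None."""
    for row in rows:
        if (row.get("package_hash"), row.get("module_id")) == (
            key.package_hash,
            key.module_id,
        ):
            return row
    return None
```

With columns like these, additional data (such as inputs) could be keyed by `row_id()` and joined back to the matching row even after the dataset is rebuilt, as long as the project version and module identifier are unchanged.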

This commit adds a builder that can execute arbitrary commands for more custom apps that don't use a standard build system. This allows for more easily integrating projects like the Linux kernel and Chromium that aren't really standard like a typical CMake or Rust project.

ksandeep18 · 2025-01-19T01:59:13Z

Hi @boomanaiden154,

I’m interested in contributing to this issue and would like some clarification and guidance to ensure I approach the solution correctly. I have a few questions and requests:

Current Dataset Structure:
Could you share details or an example of the current dataset structure? This will help me understand how to integrate the package_hash and other metadata fields effectively.

Preferred Hashing Algorithm:
Is there a preferred hashing algorithm (e.g., SHA256) for generating the package_hash, or should I propose one based on common practices?

Version Control Details:
For adding project versioning information, should I directly extract details like the commit SHA and timestamp from a Git repository, or is there an existing mechanism for this in the project?

Module Identifiers:
Should the unique module identifiers be a combination of path and target_name, or are there other fields I should consider for uniqueness?

Resources:
Are there any existing scripts, tools, or documentation I should review before starting on this issue?

Once I have these details, I can proceed to propose a detailed implementation plan and start contributing. Looking forward to your response!

Best regards,
[sandeep]

boomanaiden154 · 2025-01-23T23:57:32Z

This isn't really a great starter issue.

We don't really have good testing infrastructure, so proper testing requires building the whole dataset, which in turn requires distributed computing/HPC resources. The specifications also aren't really fleshed out that well.

If you want something better to hack on, looking at good first issues in the LLVM monorepo (https://github.com/llvm/llvm-project) will probably be a better bet.

boomanaiden154 self-assigned this Mar 4, 2024
