Add additional module metadata to the dataset #14

Open
4 tasks
boomanaiden154 opened this issue Mar 4, 2024 · 2 comments

@boomanaiden154
Contributor

boomanaiden154 commented Mar 4, 2024

As part of efforts to associate additional data with rows in the dataset (like inputs), we need a way to identify rows in the dataset across different versions. The plan to achieve this is to add the following columns to the dataset:

@boomanaiden154 boomanaiden154 self-assigned this Mar 4, 2024
khoing0810 pushed a commit to khoing0810/llvm-ir-dataset-utils that referenced this issue May 2, 2024
This commit adds a builder that can execute arbitrary commands for more
custom apps that don't use a standard build system. This allows for more
easily integrating projects like the Linux kernel and Chromium that
aren't really standard like a typical CMake or Rust project.
@ksandeep18

Hi @boomanaiden154,

I’m interested in contributing to this issue and would like some clarification and guidance to make sure I approach the solution correctly. I have a few questions and requests:

Current Dataset Structure:
Could you share details or an example of the current dataset structure? This will help me understand how to integrate the package_hash and other metadata fields effectively.

Preferred Hashing Algorithm:
Is there a preferred hashing algorithm (e.g., SHA256) for generating the package_hash, or should I propose one based on common practices?

Version Control Details:
For adding project versioning information, should I directly extract details like the commit SHA and timestamp from a Git repository, or is there an existing mechanism for this in the project?

Module Identifiers:
Should the unique module identifiers be a combination of path and target_name, or are there other fields I should consider for uniqueness?

Resources:
Are there any existing scripts, tools, or documentation I should review before starting on this issue?

Once I have these details, I can proceed to propose a detailed implementation plan and start contributing. Looking forward to your response!

Best regards,
[sandeep]

@boomanaiden154
Contributor Author

This isn't really a great starter issue.

We don't really have good testing infrastructure, so proper testing requires building the whole dataset, which in turn requires distributed computing/HPC resources. The specifications also aren't really fleshed out that well.

If you want something better to hack on, looking at good first issues in the LLVM monorepo (https://github.com/llvm/llvm-project) will probably be a better bet.
