
Added dataset download files to write the dataset from the HF API to disk #37

Open · wants to merge 6 commits into main

Conversation

@andrewkallai commented Aug 10, 2024

Added dataset download files to write the dataset from the HF API to disk. Also added a bash script to create tar files from the IR files on disk.
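
A minimal sketch of the intended flow (the dataset name and the 'content' column here are illustrative assumptions, not necessarily what write_data_files.py actually uses):

```python
import os
from datasets import load_dataset

# Stream the dataset instead of materializing it, since the IR corpus is large.
dataset = load_dataset('llvm-ml/ComPile', split='train', streaming=True)
os.makedirs('bc_files', exist_ok=True)
for i, row in enumerate(dataset):
  # Each row is assumed to carry the raw bitcode bytes in a 'content' field.
  with open(f'bc_files/file{i + 1}.bc', 'wb') as bc_file:
    bc_file.write(row['content'])
```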


@boomanaiden154 (Contributor) left a comment:

Initial comments.

Commit: …de argparse functionality, elimination of global variables, and script execution layout.
@boomanaiden154 (Contributor) left a comment:

Are you able to add a README.md or more info into the docstring in write_data_files.py with an example of how to use this?

Also, please take a look at the CI failures.

@boomanaiden154 (Contributor) left a comment:

A couple of stylistic things.

```python
def get_args():
  """Function to return the provided storage argument for the script.

  Returns: argparse.Namespace
```
@boomanaiden154 (Contributor):

Nit: Can we have a type annotation for this?
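
For reference, a sketch of what the annotated signature could look like (the --storage flag is inferred from the surrounding diff; the script's actual arguments may differ):

```python
import argparse

def get_args() -> argparse.Namespace:
  """Function to return the provided storage argument for the script.

  Returns: argparse.Namespace
  """
  parser = argparse.ArgumentParser()
  # --storage is an assumption based on the rest of this diff.
  parser.add_argument('--storage', type=str, required=True,
                      help='Directory to write the dataset files into.')
  return parser.parse_args()
```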

```python
tarinfo = tarfile.TarInfo(name=f'bc_files/file{x[0]+1+start_index}.bc')
file_obj = BytesIO(x[1])
tarinfo.size = file_obj.getbuffer().nbytes
tarinfo.mtime = time()
```
@boomanaiden154 (Contributor):

Why do we need to set the time here?

@andrewkallai (Author):

Otherwise the tar entries are created with the TarInfo default mtime of 0, i.e. the Unix epoch, which displays as the year 1969 in time zones behind UTC, and tar complains about the implausibly old modification time.
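
To illustrate, a standalone sketch of the default behavior:

```python
import tarfile
from time import time

info = tarfile.TarInfo(name='bc_files/file1.bc')
print(info.mtime)   # 0 by default: the Unix epoch, shown as Dec 31 1969 behind UTC
info.mtime = time() # stamp the entry with the current time instead
```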

```python
tarinfo.mtime = time()
tar.addfile(tarinfo, fileobj=file_obj)

with parallel.parallel_backend('spark'):
```
@boomanaiden154 (Contributor):

Can you add a comment on the performance benefits of using the parallel backend?
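
For context, a sketch of how the 'spark' backend is typically registered and used; this assumes the joblibspark package and an available Spark installation, and write_tar_chunk is a stand-in for the PR's actual job:

```python
from joblib import Parallel, delayed, parallel_backend
from joblibspark import register_spark

def write_tar_chunk(index: int) -> int:
  """Stand-in for the per-chunk tar-writing job."""
  return index * index

# Makes 'spark' available as a joblib backend, so the jobs below run on
# Spark executors across the cluster rather than on one machine's cores.
register_spark()

with parallel_backend('spark'):
  results = Parallel()(delayed(write_tar_chunk)(i) for i in range(8))
```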

```python
end_index = file_indices[i]["end_index"]
dir_name = f'{storage}/{file_indices[i]["language"]}'
makedirs(dir_name, exist_ok=True)
thread = threading.Thread(
```
@boomanaiden154 (Contributor):

It would probably be more natural to use a ThreadPoolExecutor here, submit jobs, and get back futures?
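
A sketch of the suggested restructuring; write_language_files and the shape of file_indices are illustrative stand-ins for the PR's actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from os import makedirs

def write_language_files(storage: str, spec: dict) -> None:
  """Stand-in for the per-language writer launched per thread in the PR."""
  dir_name = f'{storage}/{spec["language"]}'
  makedirs(dir_name, exist_ok=True)

storage = '/tmp/dataset'
file_indices = [{'language': 'c', 'start_index': 0, 'end_index': 10}]

with ThreadPoolExecutor() as executor:
  futures = [executor.submit(write_language_files, storage, spec)
             for spec in file_indices]
  for future in as_completed(futures):
    future.result()  # re-raises any exception from the worker thread
```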
