py2dataset analyzes source code to generate structured datasets describing code content and behavior. It extracts information from Python files and converts it into JSON-formatted datasets that can help you understand Python software or train AI systems. The analysis is designed to be flexible and extensible.
py2dataset performs the following functions:
- Obtain questions from the `py2dataset_questions.json` file or from the questions filename provided on the command line
- Obtain a file listing of all Python files in the given local directory or GitHub repository
- Parse each Python file using the Abstract Syntax Tree (AST) and the visitor design pattern to build a representation of the code structure, functions, classes, methods, and variables (see the sketch after this list)
- Generate information for the output datasets as instruction-input-output triplets, using code snippets as context to answer the questions about the Python file characteristics
- (Optional) Invoke a language model to generate responses to questions about the purpose of each file, function, class, method, and variable
- Output `<filename>.instruct.json` datasets and a `<filename>.details.yaml` file containing the AST analysis to the local `./dataset` directory or to the output directory specified on the command line
- Create an image of the code call graph for the functions and classes within each Python file and save it in the same output directory
- Combine the datasets for all individual Python files to produce a final `instruct.json` in the same output directory
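A minimal sketch of the AST parsing step, using Python's `ast.NodeVisitor` (the class below is illustrative, not py2dataset's actual visitor):

```python
import ast

class CodeVisitor(ast.NodeVisitor):
    """Collect function and class names from a parsed Python module."""

    def __init__(self):
        self.functions = []
        self.classes = []

    def visit_FunctionDef(self, node):
        # Record the function name and its argument names
        self.functions.append((node.name, [a.arg for a in node.args.args]))
        self.generic_visit(node)  # keep walking into nested definitions

    def visit_ClassDef(self, node):
        self.classes.append(node.name)
        self.generic_visit(node)

visitor = CodeVisitor()
with open("example.py") as f:
    visitor.visit(ast.parse(f.read()))
print(visitor.functions, visitor.classes)
```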
Clone the repository and install dependencies:
```bash
git clone https://github.com/jeffmeloy/py2dataset.git
cd py2dataset
pip install -r requirements.txt
```
If using CUDA then:
```bash
pip install ctransformers[cuda]
```
Example usage:
```bash
python py2dataset.py <optional arguments>
```
Optional arguments:
- `--start <starting directory>`: Starting directory or GitHub repository for Python files. Default: current working directory.
- `--output_dir <output directory>`: Directory to write the output files. Default: `./dataset/`.
- `--questions_pathname <question file pathname>`: Path and filename of the questions file. Default: `./py2dataset_questions.json`.
- `--model_config_pathname <model configuration file pathname>`: Path and filename of the model configuration file. Default: `./py2dataset_model_config.yaml`.
- `--use_llm`: Use an LLM to generate JSON answers. Default: False.
- `--quiet`: Limit logging output. Default: False.
- `--single_process`: Use a single process to process Python files. Default: False.
- `--detailed`: Include detailed analysis. Default: False.
- `--html`: Generate HTML output. Default: False.
- `--I`: Interactive mode. Default: False.
- `--skip_regen`: Skip regeneration of existing `instruct.json` files. Default: False.
- `--help`: Display the main() docstring.
The following questions are answered by parsing the AST:
- Dependencies of Python file: '{filename}'?
- Call code graph of Python file: '{filename}'?
- Functions defined in Python file: '{filename}'?
- Classes defined in Python file: '{filename}'?
- Inputs to function: '{function_name}' in Python file: '{filename}'?
- Docstring of function: '{function_name}' in Python file: '{filename}'?
- Calls made in function: '{function_name}' in Python file: '{filename}'?
- Variables defined in function: '{function_name}' in Python file: '{filename}'?
- Returns from function: '{function_name}' in Python file: '{filename}'?
- Methods defined in class: '{class_name}' in Python file: '{filename}'?
- Docstring of class: '{class_name}' in Python file: '{filename}'?
- Attributes of class: '{class_name}' in Python file: '{filename}'?
- Variables defined in class: '{class_name}' in Python file: '{filename}'?
- Inheritance of class: '{class_name}' in Python file: '{filename}'?
- Inputs to method: '{method_name}' in Python file: '{filename}'?
- Docstring of method: '{method_name}' in Python file: '{filename}'?
- Calls made in method: '{method_name}' in Python file: '{filename}'?
- Returns from method: '{method_name}' in Python file: '{filename}'?
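For example, the inputs-to-function question above yields an instruction-input-output triplet along these lines (shown as a Python dict; the values are illustrative, not actual py2dataset output):

```python
example_entry = {
    "instruction": "Inputs to function: 'main' in Python file: 'py2dataset.py'?",
    "input": "<code snippet for the function, used as context>",
    "output": "Inputs: start, output_dir, questions_pathname",
}
```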
If `--use_llm`, the dataset includes the LLM response to the `file_purpose` question in the `--questions_pathname` file:
I) Describe the Purpose and Processing Approach for Python file: `{filename}`; II) Define detailed Requirements, API Signatures, and Logic for all Functions and Class Methods; III) Explain the purpose of the inputs, variables, call, and returns in the code.
If `--use_llm` and `--detailed`, the dataset also includes the purpose and significance of each code object.
- `py2dataset.py` - Main script
- `get_params.py` - Validates parameter path and filename arguments, returns questions and model
- `get_python_file_details.py` - Extracts details from Python files using the AST
- `get_code_graph.py` - Obtains the call code graph
- `get_python_datasets.py` - Generates question-answer pairs and instructions
- `save_output.py` - Saves the output datasets, optional graph images, and .html files
- `py2dataset_questions.json` - Standard questions for Python files, functions, classes
- `py2dataset_model_config.yaml` - Configuration for the language model
Currently configured to use ctransformers with the default configuration defined in `py2dataset_model_config.yaml`:
```yaml
# context is the Python source code
system_prompt: "Lang: English. Output Format: unformatted, outline. Task: Create detailed software documentation for publication using this entire code module Context:\n'{context}'\n"
# query is the "text" value from py2dataset_questions.json for the "id":"file_purpose" question; code_objects are the code objects obtained from the AST
instruction_prompt: "Analyze Context considering these objects:\n'{code_objects}'\n to comply with this instruction:\n'{query}'\n"
# System / User / Assistant
prompt_template: "SYSTEM: {system_prompt} USER: {instruction_prompt} ASSISTANT:"
inference_model:
  model_import_path: "ctransformers.AutoModelForCausalLM"
  model_inference_function: "from_pretrained"
  model_params:
    # MODEL PATH - adjust for model location and type (remote or local)
    model_path: "jeffmeloy/WestLake-7B-v2.Q8_0.gguf"
    local_files_only: false
    #model_path: "models/WestLake-7B-v2.Q8_0.gguf"
    #local_files_only: true
    model_type: "mistral"
    ## MODEL CONFIGURATION PARAMETERS - adjust for compute resources
    #lib: "avx2"
    threads: 16
    batch_size: 512
    context_length: 40000
    max_new_tokens: 20000
    gpu_layers: 100
    reset: true
```
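A rough sketch of how this configuration might drive inference with ctransformers (the config loading and prompt assembly below are illustrative, not py2dataset's exact code):

```python
import yaml
from ctransformers import AutoModelForCausalLM

with open("py2dataset_model_config.yaml") as f:
    config = yaml.safe_load(f)

params = config["inference_model"]["model_params"]
llm = AutoModelForCausalLM.from_pretrained(
    params["model_path"],
    model_type=params["model_type"],
    gpu_layers=params.get("gpu_layers", 0),
    context_length=params.get("context_length", 2048),
)

# Assemble the prompt from the configured templates
prompt = config["prompt_template"].format(
    system_prompt=config["system_prompt"].format(context="<python source code>"),
    instruction_prompt=config["instruction_prompt"].format(
        code_objects="<code objects from the AST>",
        query="<file_purpose question text>",
    ),
)
print(llm(prompt, max_new_tokens=params["max_new_tokens"]))
```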
For each Python file assessed, the script saves the following to the output directory:
- `<filename>.details.yaml` - Entire breakdown of all Python file details and an organized summary of its code objects
- `<filename>.instruct.json` - Instructions JSON file
- `<filename>.entire_code_graph.png` - Code relationship graph
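A minimal sketch of how such a call-graph image could be produced with networkx and matplotlib (the edges below are illustrative, not py2dataset's exact code):

```python
import networkx as nx
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt

# Illustrative call relationships extracted from the AST
graph = nx.DiGraph()
graph.add_edges_from([("main", "get_params"), ("main", "get_python_datasets")])

nx.draw_networkx(graph, arrows=True, node_color="lightblue")
plt.savefig("example.entire_code_graph.png")
```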
The script then creates composite datasets by combining the files above and saves the following to the output directory:
- `instruct.json` - instruction/input/output dataset
- `document_code.json` - document/code dataset, if `--use_llm`
- `sharegpt.json` - system/human/gpt dataset, if `--use_llm`
- `code_datails.yaml` - YAML-formatted file, if `--use_llm`
- corresponding .html files for each of the .json files, if `--html`
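The combination step amounts to concatenating the per-file datasets; a minimal sketch (illustrative, not py2dataset's exact code):

```python
import json
from pathlib import Path

# Gather every per-file instruct dataset under the output directory
combined = []
for path in sorted(Path("./dataset").rglob("*.instruct.json")):
    combined.extend(json.loads(path.read_text()))

Path("./dataset/instruct.json").write_text(json.dumps(combined, indent=2))
```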
The sharegpt.json file contains a list of conversations. Each turn in a conversation is a dictionary with a "from" field, which denotes the role of that turn, and a "value" field, which contains the actual text. Here is an example of a sharegpt.json entry for a Python code file:
```json
{
    "conversations": [
        {
            "from": "system",
            "value": "Use the provided documentation to output the corresponding Python code."
        },
        {
            "from": "human",
            "value": "Create Python code based on this documentation: <py2dataset documentation>"
        },
        {
            "from": "gpt",
            "value": <python code file listing>
        }
    ],
    "nbytes": <size of conversation in bytes>,
    "source": <source code path and filename>
}
```
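A short sketch of consuming this file, assuming the top level is a JSON list of such entries:

```python
import json

with open("./dataset/sharegpt.json") as f:
    dataset = json.load(f)

for entry in dataset:
    for turn in entry["conversations"]:
        print(f'{turn["from"]}: {turn["value"][:60]}')
    print(f'source: {entry["source"]} ({entry["nbytes"]} bytes)')
```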
If an output directory is not specified, the files are saved in a ./dataset directory within the current working directory; the directory is created if it does not exist.
The ./example_datasets directory contains the py2dataset output generated on itself.
```bash
> python .\py2dataset.py --start ..\ --use_llm --detailed
```
- Python >= 3.10
- networkx library for defining code graphs
- ctransformers library for large language model support
- PyYAML library for configuration and output files
- matplotlib library for saving code graphs
- GitPython library for getting Python modules from GitHub