Commit 7555624

fix: Resolve XML validation error in OutputGitRepoXML function (#16)
* fix: Resolve XML validation error in OutputGitRepoXML function
  - Fixed XML generation to properly handle special characters and CDATA sections
  - Added protection against premature CDATA termination by escaping "]]>" sequences
  - Improved XML formatting with consistent indentation and structure
  - Simplified token placeholder replacement without breaking formatting
* feat: Add .gptinclude functionality for selective file inclusion
  This commit adds support for a .gptinclude file, which allows users to explicitly specify which files should be included in the repository export. The feature complements the existing .gptignore functionality:
  - When both .gptinclude and .gptignore exist, files are first filtered by the include patterns, then any matching ignore patterns are excluded
  - Added new command-line flag: -I/--include to specify a custom path to the .gptinclude file
  - Default behavior looks for .gptinclude in repository root
  - Added comprehensive tests for the new functionality
  - Updated README.md with documentation and examples
  With this change, users gain more fine-grained control over which parts of their repositories are processed by git2gpt, making it easier to focus on specific areas when working with AI language models.
* fix: Properly handle CDATA sections in XML output
  This commit fixes an issue where the XML export would fail with "unexpected EOF in CDATA section" errors when file content contained the CDATA end marker sequence ']]>'. The fix implements a proper CDATA handling strategy that:
  - Detects all occurrences of ']]>' in file content
  - Splits the content around these markers
  - Creates properly nested CDATA sections to preserve the original content
  - Ensures all XML output is well-formed regardless of source content
  This approach maintains the efficiency of CDATA for storing large code blocks while ensuring compatibility with all possible file content.
Fixes the XML validation error that would occur when processing files containing CDATA end marker sequences.
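The splitting strategy described in the commit message can be illustrated with a short sketch. CDATA sections cannot nest in XML, so the usual technique is to end one section after `]]` and open a new one before `>`, which yields consecutive sections. The helper names `wrapCDATA` and `roundTrip` below are hypothetical and are not taken from the commit; the actual `OutputGitRepoXML` code may do this differently.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// wrapCDATA is a hypothetical helper (not the commit's code). It wraps content
// in CDATA and rewrites every "]]>" so that "]]" closes one section and ">"
// begins the next one.
func wrapCDATA(content string) string {
	escaped := strings.ReplaceAll(content, "]]>", "]]]]><![CDATA[>")
	return "<![CDATA[" + escaped + "]]>"
}

// roundTrip embeds the wrapped content in a <file> element and parses it back
// with encoding/xml, returning the decoded character data.
func roundTrip(content string) (string, error) {
	doc := "<file>" + wrapCDATA(content) + "</file>"
	var out struct {
		Text string `xml:",chardata"`
	}
	if err := xml.Unmarshal([]byte(doc), &out); err != nil {
		return "", err
	}
	return out.Text, nil
}

func main() {
	src := "if a[b[0]]>1 { x() }" // contains the CDATA end marker "]]>"
	got, err := roundTrip(src)
	fmt.Println(got == src, err) // prints: true <nil>
}
```

Parsing the result back, as `roundTrip` does, is a cheap way to confirm that the output stays well-formed for any input.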
1 parent 4734528 commit 7555624

5 files changed: +341 -132 lines changed

.gptinclude

+1
@@ -0,0 +1 @@
+prompt/
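
The commit message states the order in which the two pattern files apply: include patterns first, then ignore patterns. A minimal sketch of that precedence follows. It assumes a deliberately simplified matcher (`dir/` and `dir/**` as directory prefixes, `**` as match-all, everything else via `path.Match` on the full path or base name). The names `matches`, `matchesAny`, and `keep` are hypothetical; git2gpt's real matching logic is not shown on this page and may differ.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matches reports whether a slash-separated relative path matches one pattern.
// Illustrative only: it supports "**", "dir/**", "dir/", and path.Match globs.
func matches(pattern, rel string) bool {
	if pattern == "**" {
		return true
	}
	if strings.HasSuffix(pattern, "/**") {
		return strings.HasPrefix(rel, strings.TrimSuffix(pattern, "**"))
	}
	if strings.HasSuffix(pattern, "/") {
		return strings.HasPrefix(rel, pattern)
	}
	if ok, _ := path.Match(pattern, rel); ok {
		return true
	}
	ok, _ := path.Match(pattern, path.Base(rel))
	return ok
}

// matchesAny reports whether rel matches at least one of the patterns.
func matchesAny(patterns []string, rel string) bool {
	for _, p := range patterns {
		if matches(p, rel) {
			return true
		}
	}
	return false
}

// keep applies include patterns first, then ignore patterns.
// An empty include list is treated as "include everything".
func keep(include, ignore []string, rel string) bool {
	if len(include) > 0 && !matchesAny(include, rel) {
		return false
	}
	return !matchesAny(ignore, rel)
}

func main() {
	include := []string{"src/**", "docs/**"}
	ignore := []string{"src/lib/**"}
	files := []string{"file1.txt", "src/main.go", "src/lib/util.go", "docs/README.md"}
	for _, f := range files {
		fmt.Printf("%-16s keep=%v\n", f, keep(include, ignore, f))
	}
}
```

The inputs in `main` mirror the "Include src and docs, but ignore lib directory" case from the test file in this commit, so only `src/main.go` and `docs/README.md` are kept.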

README.md

+39 -3
@@ -24,18 +24,54 @@ To use the git2gpt utility, run the following command:
 git2gpt [flags] /path/to/git/repository
 ```
 
-### Ignoring Files
+### Including and Ignoring Files
 
-By default, your `.git` directory and your `.gitignore` files are ignored. Any files in your `.gitignore` are also skipped. If you want to change this behavior, you should add a `.gptignore` file to your repository. The `.gptignore` file should contain a list of files and directories to ignore, one per line. The `.gptignore` file should be in the same directory as your `.gitignore` file. Please note that this overwrites the default ignore list, so you should include the default ignore list in your `.gptignore` file if you want to keep it.
+By default, your `.git` directory and your `.gitignore` files are ignored. Any files in your `.gitignore` are also skipped. You can customize the files to include or ignore in several ways:
 
-### Flags
+### Including Only Specific Files (.gptinclude)
+
+Add a `.gptinclude` file to your repository to specify which files should be included in the output. Each line in the file should contain a glob pattern of files or directories to include. If a `.gptinclude` file is present, only files that match these patterns will be included.
+
+Example `.gptinclude` file:
+```
+# Include only these file types
+*.go
+*.js
+*.html
+*.css
+
+# Include specific directories
+src/**
+docs/api/**
+```
+
+### Ignoring Specific Files (.gptignore)
+
+Add a `.gptignore` file to your repository to specify which files should be ignored. This works similar to `.gitignore`, but is specific to git2gpt. The `.gptignore` file should contain a list of files and directories to ignore, one per line.
+
+Example `.gptignore` file:
+```
+# Ignore these file types
+*.log
+*.tmp
+*.bak
+
+# Ignore specific directories
+node_modules/**
+build/**
+```
+
+**Note**: When both `.gptinclude` and `.gptignore` files exist, git2gpt will first include files matching the `.gptinclude` patterns, and then exclude any of those files that also match `.gptignore` patterns.
+
+## Command Line Options
 
 * `-p`, `--preamble`: Path to a text file containing a preamble to include at the beginning of the output file.
 * `-o`, `--output`: Path to the output file. If not specified, will print to standard output.
 * `-e`, `--estimate`: Estimate the tokens of the output file. If not specified, does not estimate.
 * `-j`, `--json`: Output to JSON rather than plain text. Use with `-o` to specify the output file.
 * `-x`, `--xml`: Output to XML rather than plain text. Use with `-o` to specify the output file.
 * `-i`, `--ignore`: Path to the `.gptignore` file. If not specified, will look for a `.gptignore` file in the same directory as the `.gitignore` file.
+* `-I`, `--include`: Path to the `.gptinclude` file. If not specified, will look for a `.gptinclude` file in the repository root.
 * `-g`, `--ignore-gitignore`: Ignore the `.gitignore` file.
 * `-s`, `--scrub-comments`: Remove comments from the output file to save tokens.

cmd/root.go

+5 -31
@@ -1,49 +1,40 @@
 package cmd
-
 import (
 	"fmt"
 	"os"
-
 	"github.com/chand1012/git2gpt/prompt"
 	"github.com/spf13/cobra"
 )
-
 var repoPath string
 var preambleFile string
 var outputFile string
 var estimateTokens bool
 var ignoreFilePath string
+var includeFilePath string // New: Add variable for include file path
 var ignoreGitignore bool
 var outputJSON bool
 var outputXML bool
 var debug bool
 var scrubComments bool
-
 var rootCmd = &cobra.Command{
 	Use:   "git2gpt [flags] /path/to/git/repository [/path/to/another/repository ...]",
 	Short: "git2gpt is a utility to convert one or more Git repositories to a text file for input into an LLM",
 	Args:  cobra.MinimumNArgs(1),
 	Run: func(cmd *cobra.Command, args []string) {
-		// Create a combined repository to hold all files
 		combinedRepo := &prompt.GitRepo{
 			Files: []prompt.GitFile{},
 		}
-
-		// Process each repository path
 		for _, path := range args {
 			repoPath = path
 			ignoreList := prompt.GenerateIgnoreList(repoPath, ignoreFilePath, !ignoreGitignore)
-			repo, err := prompt.ProcessGitRepo(repoPath, ignoreList)
+			includeList := prompt.GenerateIncludeList(repoPath, includeFilePath) // New: Generate include list
+			repo, err := prompt.ProcessGitRepo(repoPath, includeList, ignoreList) // Modified: Pass includeList
 			if err != nil {
 				fmt.Printf("Error processing %s: %s\n", repoPath, err)
 				os.Exit(1)
 			}
-
-			// Add files from this repo to the combined repo
 			combinedRepo.Files = append(combinedRepo.Files, repo.Files...)
 		}
-
-		// Update the file count
 		combinedRepo.FileCount = len(combinedRepo.Files)
 		if outputJSON {
 			output, err := prompt.MarshalRepo(combinedRepo, scrubComments)
@@ -52,7 +43,6 @@ var rootCmd = &cobra.Command{
 				os.Exit(1)
 			}
 			if outputFile != "" {
-				// if output file exists, throw error
 				if _, err := os.Stat(outputFile); err == nil {
 					fmt.Printf("Error: output file %s already exists\n", outputFile)
 					os.Exit(1)
@@ -75,15 +65,11 @@ var rootCmd = &cobra.Command{
 				fmt.Printf("Error: %s\n", err)
 				os.Exit(1)
 			}
-
-			// Validate the XML output
 			if err := prompt.ValidateXML(output); err != nil {
 				fmt.Printf("Error: %s\n", err)
 				os.Exit(1)
 			}
-
 			if outputFile != "" {
-				// if output file exists, throw error
 				if _, err := os.Stat(outputFile); err == nil {
 					fmt.Printf("Error: output file %s already exists\n", outputFile)
 					os.Exit(1)
@@ -106,7 +92,6 @@ var rootCmd = &cobra.Command{
 				os.Exit(1)
 			}
 			if outputFile != "" {
-				// if output file exists, throw error
 				if _, err := os.Stat(outputFile); err == nil {
 					fmt.Printf("Error: output file %s already exists\n", outputFile)
 					os.Exit(1)
@@ -126,33 +111,22 @@ var rootCmd = &cobra.Command{
 		}
 	},
 }
-
 func init() {
 	rootCmd.Flags().StringVarP(&preambleFile, "preamble", "p", "", "path to preamble text file")
-	// output to file flag. Should be a string
 	rootCmd.Flags().StringVarP(&outputFile, "output", "o", "", "path to output file")
-	// estimate tokens. Should be a bool
 	rootCmd.Flags().BoolVarP(&estimateTokens, "estimate", "e", false, "estimate the number of tokens in the output")
-	// ignore file path. Should be a string
 	rootCmd.Flags().StringVarP(&ignoreFilePath, "ignore", "i", "", "path to .gptignore file")
-	// ignore gitignore. Should be a bool
+	rootCmd.Flags().StringVarP(&includeFilePath, "include", "I", "", "path to .gptinclude file") // New: Add flag for include file
 	rootCmd.Flags().BoolVarP(&ignoreGitignore, "ignore-gitignore", "g", false, "ignore .gitignore file")
-	// output JSON. Should be a bool
 	rootCmd.Flags().BoolVarP(&outputJSON, "json", "j", false, "output JSON")
-	// output XML. Should be a bool
 	rootCmd.Flags().BoolVarP(&outputXML, "xml", "x", false, "output XML")
-	// debug. Should be a bool
 	rootCmd.Flags().BoolVarP(&debug, "debug", "d", false, "debug mode. Do not output to standard output")
-	// scrub comments. Should be a bool
 	rootCmd.Flags().BoolVarP(&scrubComments, "scrub-comments", "s", false, "scrub comments from the output. Decreases token count")
-
-	// Update the example usage to show multiple paths
 	rootCmd.Example = " git2gpt /path/to/repo1 /path/to/repo2\n git2gpt -o output.txt /path/to/repo1 /path/to/repo2"
 }
-
 func Execute() {
 	if err := rootCmd.Execute(); err != nil {
 		fmt.Println(err)
 		os.Exit(1)
 	}
-}
+}

prompt/gptinclude_test.go

+140
@@ -0,0 +1,140 @@
+package prompt
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+func TestGptIncludeAndIgnore(t *testing.T) {
+	// Create a temporary directory structure for testing
+	tempDir, err := os.MkdirTemp("", "git2gpt-test")
+	if err != nil {
+		t.Fatalf("Failed to create temp directory: %v", err)
+	}
+	defer os.RemoveAll(tempDir)
+
+	// Create test files
+	testFiles := []struct {
+		path     string
+		contents string
+	}{
+		{"file1.txt", "Content of file1"},
+		{"file2.txt", "Content of file2"},
+		{"file3.txt", "Content of file3"},
+		{"src/main.go", "package main\nfunc main() {}"},
+		{"src/lib/util.go", "package lib\nfunc Util() {}"},
+		{"docs/README.md", "# Documentation"},
+	}
+
+	for _, tf := range testFiles {
+		fullPath := filepath.Join(tempDir, tf.path)
+		// Create directory if it doesn't exist
+		dir := filepath.Dir(fullPath)
+		if err := os.MkdirAll(dir, 0755); err != nil {
+			t.Fatalf("Failed to create directory %s: %v", dir, err)
+		}
+		// Write the file
+		if err := os.WriteFile(fullPath, []byte(tf.contents), 0644); err != nil {
+			t.Fatalf("Failed to write file %s: %v", fullPath, err)
+		}
+	}
+
+	// Test cases
+	testCases := []struct {
+		name            string
+		includeContent  string
+		ignoreContent   string
+		expectedFiles   []string
+		unexpectedFiles []string
+	}{
+		{
+			name:            "Only include src directory",
+			includeContent:  "src/**",
+			ignoreContent:   "",
+			expectedFiles:   []string{"src/main.go", "src/lib/util.go"},
+			unexpectedFiles: []string{"file1.txt", "file2.txt", "file3.txt", "docs/README.md"},
+		},
+		{
+			name:            "Include all, but ignore .txt files",
+			includeContent:  "**",
+			ignoreContent:   "*.txt",
+			expectedFiles:   []string{"src/main.go", "src/lib/util.go", "docs/README.md"},
+			unexpectedFiles: []string{"file1.txt", "file2.txt", "file3.txt"},
+		},
+		{
+			name:            "Include src and docs, but ignore lib directory",
+			includeContent:  "src/**\ndocs/**",
+			ignoreContent:   "src/lib/**",
+			expectedFiles:   []string{"src/main.go", "docs/README.md"},
+			unexpectedFiles: []string{"file1.txt", "file2.txt", "file3.txt", "src/lib/util.go"},
+		},
+		{
+			name:            "No include file (should include all), ignore .txt files",
+			includeContent:  "",
+			ignoreContent:   "*.txt",
+			expectedFiles:   []string{"src/main.go", "src/lib/util.go", "docs/README.md"},
+			unexpectedFiles: []string{"file1.txt", "file2.txt", "file3.txt"},
+		},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			// Create .gptinclude file if needed
+			includeFilePath := filepath.Join(tempDir, ".gptinclude")
+			if tc.includeContent != "" {
+				if err := os.WriteFile(includeFilePath, []byte(tc.includeContent), 0644); err != nil {
+					t.Fatalf("Failed to write .gptinclude file: %v", err)
+				}
+			} else {
+				// Ensure no .gptinclude file exists
+				os.Remove(includeFilePath)
+			}
+
+			// Create .gptignore file if needed
+			ignoreFilePath := filepath.Join(tempDir, ".gptignore")
+			if tc.ignoreContent != "" {
+				if err := os.WriteFile(ignoreFilePath, []byte(tc.ignoreContent), 0644); err != nil {
+					t.Fatalf("Failed to write .gptignore file: %v", err)
+				}
+			} else {
+				// Ensure no .gptignore file exists
+				os.Remove(ignoreFilePath)
+			}
+
+			// Generate include and ignore lists
+			includeList := GenerateIncludeList(tempDir, "")
+			ignoreList := GenerateIgnoreList(tempDir, "", false)
+
+			// Process the repository
+			repo, err := ProcessGitRepo(tempDir, includeList, ignoreList)
+			if err != nil {
+				t.Fatalf("Failed to process repository: %v", err)
+			}
+
+			// Check if expected files are included
+			for _, expectedFile := range tc.expectedFiles {
+				found := false
+				for _, file := range repo.Files {
+					if file.Path == expectedFile {
+						found = true
+						break
+					}
+				}
+				if !found {
+					t.Errorf("Expected file %s to be included, but it wasn't", expectedFile)
+				}
+			}
+
+			// Check if unexpected files are excluded
+			for _, unexpectedFile := range tc.unexpectedFiles {
+				for _, file := range repo.Files {
+					if file.Path == unexpectedFile {
+						t.Errorf("File %s should have been excluded, but it was included", unexpectedFile)
+						break
+					}
+				}
+			}
+		})
+	}
+}
