Skip to content

Fix #14021 (Better path matching) #7645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: main
Choose a base branch
from
Open

Fix #14021 (Better path matching) #7645

wants to merge 33 commits into from

Conversation

glankk
Copy link
Collaborator

@glankk glankk commented Jul 6, 2025

No description provided.

@glankk glankk force-pushed the 12821 branch 3 times, most recently from f29f42d to 572dedf Compare July 6, 2025 07:00
@firewave
Copy link
Collaborator

firewave commented Jul 6, 2025

See #7540. I had some issues with the behavior which I have not outlined in that PR yet because the code contains a bug which made the tests all pass although that should not have.

Also std::regex has abysmal performance and no code should ever use it. Thus we should not rely on it's interface. See #6211 a wrapper for our existing regex. That is complete and working but has a memory leak. I think the leak should be ignored for now since it is intermediate code anyways. I had a local branch which replaces the implement with std::regex as another intermediate step to see if our wrapper wi8ll work with other implementation.

@@ -157,7 +157,7 @@ def test_gui_project_loads_relative_vs_solution_2(tmp_path):
def test_gui_project_loads_relative_vs_solution_with_exclude(tmp_path):
proj_dir = tmp_path / 'proj2'
shutil.copytree(__proj_dir, proj_dir)
create_gui_project_file(os.path.join(tmp_path, 'test.cppcheck'), root_path='proj2', import_project='proj2/proj2.sln', exclude_paths=['b'])
create_gui_project_file(os.path.join(tmp_path, 'test.cppcheck'), root_path='proj2', import_project='proj2/proj2.sln', exclude_paths=['b/'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really dislike that we need to rely on some options ending with / so they are treated as a path. That also complicated things while working on my PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think the path matching is currently trying to do too much which makes the syntax weird and difficult to implement. Ideally I would like it to make it simpler but I've been hesitant to do so because I don't know what might break because of it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I would like it to make it simpler but I've been hesitant to do so because I don't know what might break because of it.

I dislike it also.. but it is good imho that you try to limit the effects for now.. it's already a big PR anyway.

@@ -450,7 +450,7 @@ class TestImportProject : public TestFixture {
project.fileSettings = {std::move(fs1), std::move(fs2)};

project.ignorePaths({"*foo", "bar*"});
ASSERT_EQUALS(2, project.fileSettings.size());
ASSERT_EQUALS(1, project.fileSettings.size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this match less? That obviously looks like a bug.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because * was changed to match any number of characters, including zero. I don't really have a strong opinion on that, but that's how globs usually behave.

if (caseSensitive)
mRegex = std::regex(regex_string, std::regex_constants::extended);
else
mRegex = std::regex(regex_string, std::regex_constants::extended | std::regex_constants::icase);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to print the regular expression and its respective input with the --debug-ignore option.

@firewave
Copy link
Collaborator

firewave commented Jul 6, 2025

See also https://trac.cppcheck.net/ticket/12268. And if you work on a ticket please assign yourself to it in Trac.

@glankk
Copy link
Collaborator Author

glankk commented Jul 7, 2025

See #7540. I had some issues with the behavior which I have not outlined in that PR yet because the code contains a bug which made the tests all pass although that should not have.

Also std::regex has abysmal performance and no code should ever use it. Thus we should not rely on it's interface. See #6211 a wrapper for our existing regex. That is complete and working but has a memory leak. I think the leak should be ignored for now since it is intermediate code anyways. I had a local branch which replaces the implement with std::regex as another intermediate step to see if our wrapper wi8ll work with other implementation.

I only noticed you were also working on PathMatch after I made this, sorry about that.

I agree the standard regex is slow and I'd rather use pcre, but I didn't want to make it a hard dependency just for this. For the common use case of PathMatch, which is matching maybe a few hundred files or so against a handful of patterns, the performance penalty of using std::regex instead is on the order of milliseconds per run. Having a backend-agnostic regex wrapper sounds like a good idea though.

@firewave
Copy link
Collaborator

firewave commented Jul 7, 2025

I only noticed you were also working on PathMatch after I made this, sorry about that.

No problem. I started working on it thinking it would a low-hanging fruit which wasn't the case. And I forgot to tag it with the ticket. And the one you worked on was no properly triaged.

I agree the standard regex is slow and I'd rather use pcre, but I didn't want to make it a hard dependency just for this.

Right, PCRE is currently not a hard dependency. So PCRE2 might not be a good replacement.

For the common use case of PathMatch, which is matching maybe a few hundred files or so against a handful of patterns, the performance penalty of using std::regex instead is on the order of milliseconds per run.

That might depend on the scenario. If there might be thousands of files it could be slow. It has been ages since I tried using std::regex but it was it beyond unacceptable I doubt it was improved upon (it seems like it was completely abandoned and nobody has the balls to suggest to remove it from the standard). PCRE feels like a nop.

Having a backend-agnostic regex wrapper sounds like a good idea though.

I will try to get the regex refactoring in. The first step is awkward but with some help we might get it into better shape fast.

@firewave
Copy link
Collaborator

firewave commented Jul 7, 2025

With a glob implementation inline in PathMatch we now have two different implementations. The other is matchglob() in utils.cpp. We should only have a single one.

That is probably where my performance concerns might have come from since the latter is also used outside of path handling stuff.

@glankk
Copy link
Collaborator Author

glankk commented Jul 8, 2025

With a glob implementation inline in PathMatch we now have two different implementations. The other is matchglob() in utils.cpp. We should only have a single one.

That is probably where my performance concerns might have come from since the latter is also used outside of path handling stuff.

I mostly agree with this, for the three places where matchglob is currently used for something other than paths it makes sense to keep it that way, all other uses of matchglob should be changed to use PathMatch instead.

@firewave
Copy link
Collaborator

firewave commented Jul 8, 2025

I mostly agree with this, for the three places where matchglob is currently used for something other than paths it makes sense to keep it that way, all other uses of matchglob should be changed to use PathMatch instead.

I agree since it makes things more explicit. But I would still prefer if we only have a single implementation of the glob logic.

if (Path::isAbsolute(path))
p = Path::simplifyPath(path);
else
p = Path::simplifyPath(Path::getCurrentPath() + "/" + path);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling it would be good to create a member mCurrentPath that is used instead of the function call. It will be good for unit testing purposes. And avoids repeated getcwd system calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PathMatch already has mWorkingDirectories which also uses CWD. That should be configured from the outside. The implementation should best be filesystem-agnostic (not taking the flawed case-sensitivity handling into mind right now).

@glankk glankk force-pushed the 12821 branch 2 times, most recently from 8819409 to 3348484 Compare July 11, 2025 12:39
@glankk glankk marked this pull request as ready for review July 11, 2025 16:01
@glankk
Copy link
Collaborator Author

glankk commented Jul 11, 2025

I've updated this to use PathMatch everywhere a pattern is matched against a file, so that file matching is consistent. While doing so I found that there are places where the performance of PathMatch is actually quite important (especially in suppressions), so I've opted to use a hand-written matcher instead of regex.

Another noteworthy change is that directories can now be matched without a trailing '/'. This has the awkward side-effect that 'foo.cpp' matches 'foo.cpp/bar.cpp' when 'foo.cpp' happens to be a directory, and 'foo/' matches 'foo' even if foo is not a directory (enforcing directories for such patterns would require a file system lookup), but I still think it's preferable.

Regarding matchglob, I think it makes sense to have a more general wildcard implementation in matchglob, and a separate implementation in PathMatch because I'm not convinced that the rules for path globs should be the same as for matching general wildcards in error id's and such. Path globs need to match on path component boundaries, and '*' and '?' do not match path separators. For matching path separators as well, '**' is used instead.

These changes fix #12821, #12268, and #13997. Possibly related: #13983.

lib/path.cpp Outdated
{
const std::string& nativePath = toNativeSeparators(path);
return c == '/' || c == '\\';
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm techically in linux, backslash is not a path separator. you can create a directory or file with backslash in the name.

case '*': {
bool slash = false;
++s;
if (*s == '*') {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a change of logic that will probably mean that lots of users will have to adapt their configurations. We have documented that a single star matches path separators in some places I believe so it would make sense to look through documentation .. But overtime I have the feeling it is better to use **.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would make sense to look through documentation

I tried to quickly locate some such documentation and did not see it.

But adding a note about it in the releasenotes.txt wouldn't hurt at least.

lib/pathmatch.h Outdated
*/
explicit PathMatch(std::vector<std::string> paths, bool caseSensitive = true);
enum class Mode : std::uint8_t {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ehm.. Writing a enum is better. But not sure about the names here. I think I'd prefer to not abbreviate. I don't have a clear idea about what names I would prefer instead of Mode,icase,scase.

static bool isRelativePattern(const std::string &pattern)
{
if (pattern.empty() || pattern[0] != '.')
return false;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spontanously I feel that file1.c should be a "relative pattern"?

@@ -157,7 +157,7 @@ def test_gui_project_loads_relative_vs_solution_2(tmp_path):
def test_gui_project_loads_relative_vs_solution_with_exclude(tmp_path):
proj_dir = tmp_path / 'proj2'
shutil.copytree(__proj_dir, proj_dir)
create_gui_project_file(os.path.join(tmp_path, 'test.cppcheck'), root_path='proj2', import_project='proj2/proj2.sln', exclude_paths=['b'])
create_gui_project_file(os.path.join(tmp_path, 'test.cppcheck'), root_path='proj2', import_project='proj2/proj2.sln', exclude_paths=['b/'])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I would like it to make it simpler but I've been hesitant to do so because I don't know what might break because of it.

I dislike it also.. but it is good imho that you try to limit the effects for now.. it's already a big PR anyway.

@@ -3345,7 +3345,7 @@ class TestCmdlineParser : public TestFixture {
const char * const argv[] = {"cppcheck", "-isrc\\file.cpp", "src/file.cpp"};
ASSERT(!fillSettingsFromArgs(argv));
ASSERT_EQUALS(1, parser->getIgnoredPaths().size());
ASSERT_EQUALS("src/file.cpp", parser->getIgnoredPaths()[0]);
ASSERT_EQUALS("src\\file.cpp", parser->getIgnoredPaths()[0]);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imho it is preferable to always use / as path separator internally. we don't have to ensure that backslash is handled properly everywhere then. would it cause problems to keep this behavior?

@danmar
Copy link
Owner

danmar commented Jul 15, 2025

The ticket and PR is called "GUI: exclude folders in imported project"

I think this PR does so much more. Either I think we should create a new ticket that better describes the work in this PR. Or we should rename the ticket+PR..

Copy link

@glankk
Copy link
Collaborator Author

glankk commented Jul 15, 2025

Here's an overview of the current state of this PR. Since the scope has grown beyond the original ticket I think it'd be appropriate to open a new ticket.

Summary of changes:

  • PathMatch changes, notably:
    • Now supports globs. The syntax is specific to path-matching and distinct from matchglob which has been generalized and is only used for e.g. suppression wildcards.
    • Directories don't require trailing slash
    • Better support for relative paths with an explicit syntax
  • The following now use PathMatch instead of matchglob:
    • File filters (with --file-filter=)
    • Ignored paths for projects (with -i <dir> and <exclude>)
    • Suppression filenames
  • The following already used PathMatch and will be affected:
    • Ignored paths for CLI (with -i <dir>)
    • GUI exclude list

New PathMatch behavior:

All match patterns and paths are canonicalized during matching.
/foo/bar/./.. => /foo
/foo/../bar/./.// => /bar
Trailing slashes are removed and thus have no special meaning.

Absolute patterns are matched from the start until a path component boundary in the canonicalized path.
Example:

	/foo/
matches
	/foo
	^~~~ OK
	/foo/
	^~~~ OK
	/foo/bar
	^~~~ OK
but not
	/foobar
	^~~~ No match, only partial

Relative patterns must be explicit by having their first path component be . or ... They are translated to absolute patterns and then treated as above. This is changed from the previous behavior, where "relative" names could match starting at any path component (the previous syntax is still supported though, described further down).
Example:

	../test.c
In a project file located at
	/home/foo/test.cppcheck
is translated to
	/home/test.c
and matches
	/home/test.c
	^~~~~~~~~~~~ OK
	/home/test.c/test.h
	^~~~~~~~~~~~ OK
but not
	/home/test.cpp
	^~~~~~~~~~~~ No match, only partial
	/home/project/test.cpp
	^~~~~~ No match, only partial

Other patterns, not explicitly absolute or relative, are matched between any two path component boundaries in the canonicalized path. This is working as before, except matching the middle of a path previously required a trailing slash. Also, the matching is now always done on the canonical absolute path of the specified path name, where previously a match could depend on whether a filename was given as relative or absolute (i.e. /foo/ would match /foo/bar.c, but not ./bar.c even if the current directory is /foo).
Example:

	foo.h
matches
	/foo.h
	 ^~~~~ OK
	/home/foo.h
	      ^~~~~ OK
	/home/foo.h/bar.c
	      ^~~~~ OK	
but not
	/foo.hpp
	 ^~~~~ No match, only partial
	/home/foo.hpp
	      ^~~~~ No match, only partial
	/home/foo.hpp/bar.cpp
	      ^~~~~ No match, only partial

Patterns can use globs:
** Matches zero or more of any character, including path separators.
* Matches zero or more of any character, not including path separators.
? Matches exactly one character, not including path separators.
Examples:

	foo.*
matches
	/foo./bar
	 ^~~~ OK
	/home/foo.c
	      ^~~~~ OK
	/home/foo.cppcheck/bar
	      ^~~~~~~~~~~~ OK

	/home/cppcheck/**.c
matches
	/home/cppcheck/src/foo.c
	^~~~~~~~~~~~~~~~~~~~~~~~ OK
	/home/cppcheck/.c
	^~~~~~~~~~~~~~~~~ OK (empty glob)

	/home/cppcheck/**/*.c
matches
	/home/cppcheck/src/bar.c
	^~~~~~~~~~~~~~~~~~~~~~~~ OK
	/home/cppcheck/src/foo/bar.c
	^~~~~~~~~~~~~~~~~~~~~~~~ OK
but not
	/home/cppcheck/.c
	^~~~~~~~~~~~~~~ No match, only partial

@glankk glankk changed the title Fix #12821 (GUI: exclude folders in imported project) Fix #14021 (Better path matching) Jul 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants