Use file content heuristics to decide file reader. #1962

Dimi1010 · 2025-09-12T09:40:58Z

This PR adds heuristics based on the file content, which are more robust than deciding based on the file extension.

The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.

Two new functions, createReader and tryCreateReader, have been added due to changes in the public API of the factory.
The functions differ in the error handling scheme, as createReader throws and tryCreateReader returns nullptr on error.

Method behaviour in error scenarios:

| Scenario | `getReader` | `createReader` | `tryCreateReader` |
|---|---|---|---|
| File not found | N/A | Throws exception | Returns `nullptr` |
| Unsupported format | Returns `PcapFileReaderDevice` | Throws exception | Returns `nullptr` |
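The two error handling schemes can be pictured with a minimal sketch. This is not the library's implementation: `ReaderSketch`, `PcapReaderSketch`, `createReaderSketch` and `tryCreateReaderSketch` are hypothetical names, the `std::unique_ptr` return type is an assumption, and only a single pcap signature is checked to keep the example short.

```cpp
#include <cstring>
#include <exception>
#include <fstream>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the library's reader device types.
struct ReaderSketch
{
    virtual ~ReaderSketch() = default;
};
struct PcapReaderSketch : ReaderSketch
{
};

// Throwing variant: every failure surfaces as an exception.
std::unique_ptr<ReaderSketch> createReaderSketch(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
    {
        throw std::runtime_error("Cannot open file: " + path);
    }

    unsigned char magic[4] = {};
    file.read(reinterpret_cast<char*>(magic), sizeof(magic));

    // Assumption for brevity: accept only a microsecond pcap file
    // whose magic number is stored in little-endian byte order.
    static const unsigned char pcapMagic[4] = { 0xd4, 0xc3, 0xb2, 0xa1 };
    if (file.gcount() != 4 || std::memcmp(magic, pcapMagic, sizeof(magic)) != 0)
    {
        throw std::runtime_error("Unsupported file format: " + path);
    }
    return std::unique_ptr<ReaderSketch>(new PcapReaderSketch());
}

// Non-throwing variant: same logic, with failures mapped to nullptr.
std::unique_ptr<ReaderSketch> tryCreateReaderSketch(const std::string& path)
{
    try
    {
        return createReaderSketch(path);
    }
    catch (const std::exception&)
    {
        return nullptr;
    }
}
```

Callers that treat a missing or unrecognized file as a normal outcome can use the non-throwing variant and check for `nullptr`, while callers that consider it a hard error can let the exception propagate.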

codecov · 2025-09-12T09:59:59Z

Codecov Report

❌ Patch coverage is 89.20188% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.43%. Comparing base (e227b75) to head (af12d2f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| Pcap++/src/PcapFileDevice.cpp | 90.20% | 12 Missing and 2 partials ⚠️ |
| Tests/Pcap++Test/Tests/FileTests.cpp | 85.48% | 5 Missing and 4 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1962      +/-   ##
==========================================
+ Coverage   83.41%   83.43%   +0.01%     
==========================================
  Files         311      311              
  Lines       55002    55197     +195     
  Branches    12098    12145      +47     
==========================================
+ Hits        45878    46051     +173     
- Misses       7889     7894       +5     
- Partials     1235     1252      +17

| Flag | Coverage Δ | |
|---|---|---|
| alpine320 | `75.91% <76.10%> (-0.01%)` | ⬇️ |
| fedora42 | `75.86% <76.31%> (+<0.01%)` | ⬆️ |
| macos-13 | `81.56% <80.33%> (-0.01%)` | ⬇️ |
| macos-14 | `81.56% <80.33%> (-0.01%)` | ⬇️ |
| macos-15 | `81.58% <81.92%> (-0.01%)` | ⬇️ |
| mingw32 | `70.66% <80.95%> (+0.06%)` | ⬆️ |
| mingw64 | `70.63% <80.95%> (+0.18%)` | ⬆️ |
| npcap | `?` | |
| rhel94 | `75.89% <76.31%> (+0.01%)` | ⬆️ |
| ubuntu2004 | `60.16% <59.49%> (-0.01%)` | ⬇️ |
| ubuntu2004-zstd | `60.26% <56.57%> (+0.01%)` | ⬆️ |
| ubuntu2204 | `75.83% <76.31%> (+0.02%)` | ⬆️ |
| ubuntu2204-icpx | `60.62% <61.53%> (-0.01%)` | ⬇️ |
| ubuntu2404 | `75.90% <76.10%> (+<0.01%)` | ⬆️ |
| ubuntu2404-arm64 | `75.59% <76.10%> (+0.02%)` | ⬆️ |
| unittest | `83.43% <89.20%> (+0.01%)` | ⬆️ |
| windows-2022 | `85.44% <87.61%> (+0.16%)` | ⬆️ |
| windows-2025 | `85.47% <87.71%> (+0.12%)` | ⬆️ |
| winpcap | `85.47% <87.71%> (-0.08%)` | ⬇️ |
| xdp | `53.35% <0.00%> (-0.21%)` | ⬇️ |

seladb · 2025-09-15T08:08:40Z

Tests/Pcap++Test/Tests/FileTests.cpp

-	PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader));
-	PTF_ASSERT_TRUE(genericReader->open());
+	// ------- IFileReaderDevice::createReader() Factory
+	// TODO: Move to a separate unit test.


We should add the following to get more coverage:

Open a snoop file

Open a file that is not any of the options

Open pcap files with different magic numbers

Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

3d713ab adds the following tests:

Pcap, PcapNG, Zst file with correct content + extension

Pcap, PcapNG file with correct content + wrong extension

Bogus content file with correct extension (pcap, pcapng, zst)

Bogus content file with wrong extension (txt)

Haven't found a snoop file to add. Do we have any?

> Open pcap files with different magic numbers

Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that an invalid format and fail as with regular bogus data.

> Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

Pending on #1962 (comment).

seladb · 2025-09-21T08:10:16Z

@Dimi1010 some CI tests fail...

seladb · 2025-10-03T07:45:47Z

Tests/Pcap++Test/Tests/FileTests.cpp

 	}
 };

+PTF_TEST_CASE(TestIFileReaderDeviceFactory_Pcap_MicroPrecision)


In addition to testing a real pcap file, maybe we can add synthetic files that have a different magic number to test all options?
We don't have to put them in PcapExample/file_heuristics; instead we can create vectors with the content (`std::vector<uint8_t>`) and save them to temp files.

Hmm, what would be the purpose? Just to test that it returns nullptr?
Doesn't TestIFileReaderDeviceFactory_Invalid already handle that?

I mean the other possible magic numbers of a valid pcap file. Since it's not easy to find such pcap files, we can generate synthetic files that are not actually valid, but will look valid for the sake of the test
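A minimal sketch of how such a synthetic file could be produced for a test, assuming plain standard-library file I/O. `makeSpoofedPcapHeader` and `ScopedFile` are hypothetical helpers, not part of the test suite. The buffer only mimics the 24-byte pcap global header: the magic number is followed by zeroes, so it is not a valid capture.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Builds a 24-byte buffer (the size of a pcap global header) whose first
// four bytes are `magic`, written most significant byte first. All other
// fields are left as zero, so this only looks like a pcap file.
std::vector<std::uint8_t> makeSpoofedPcapHeader(std::uint32_t magic)
{
    std::vector<std::uint8_t> bytes(24, 0);
    bytes[0] = static_cast<std::uint8_t>((magic >> 24) & 0xffu);
    bytes[1] = static_cast<std::uint8_t>((magic >> 16) & 0xffu);
    bytes[2] = static_cast<std::uint8_t>((magic >> 8) & 0xffu);
    bytes[3] = static_cast<std::uint8_t>(magic & 0xffu);
    return bytes;
}

// Writes the content to `path` on construction and deletes the file on
// destruction, so a test does not leave temporary files behind.
class ScopedFile
{
public:
    ScopedFile(std::string path, const std::vector<std::uint8_t>& content) : m_path(std::move(path))
    {
        std::ofstream out(m_path, std::ios::binary | std::ios::trunc);
        out.write(reinterpret_cast<const char*>(content.data()), static_cast<std::streamsize>(content.size()));
        if (!out)
        {
            throw std::runtime_error("Failed to write " + m_path);
        }
    }

    ~ScopedFile()
    {
        std::remove(m_path.c_str());
    }

    ScopedFile(const ScopedFile&) = delete;
    ScopedFile& operator=(const ScopedFile&) = delete;

    const std::string& path() const
    {
        return m_path;
    }

private:
    std::string m_path;
};
```

A test could then construct `ScopedFile file("spoofed.tmp", makeSpoofedPcapHeader(0xa1b2cd34u));` and pass `file.path()` to the factory under test. The byte-swapped variant is obtained by passing the reversed value, `0x34cdb2a1u`.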

So, you want a spoofed pcap sample for just these:

// Libpcap 0.9.1 and later support reading a modified pcap format that contains an extended header.
// Format reference: https://wiki.wireshark.org/Development/LibpcapFileFormat#modified-pcap
0xa1'b2'cd'34,  // Alexey Kuznetzov's modified libpcap format
0x34'cd'b2'a1   // Alexey Kuznetzov's modified libpcap format (byte-swapped)

or for the byte swapped versions of micro and nano too?

I'd suggest we have spoofed pcap samples for all options

Honestly, if we really want to do a unit test on every magic number, this would be easier to do by exposing CaptureFileFormatDetector in the header under internal and unit testing on the passed std::istream content directly than to have a spoofed pcap file.

Is it so hard to create those spoofed pcap files just for the tests? 🤔

Tbf, no. I can have them done.

My point is that the scenario would essentially test the content detection system rather than the factory function that creates the devices, because the devices would be "invalid". If the tests go through the factory function, they can't go on to open the device, etc.

Doing it directly on the detection system would remove the requirement for external files, as it operates on streams.

Of course, that comes with the tradeoff of exposing the detection system in the headers, as it needs to be referenceable.

Updated pcap file detection to return the precise Pcap format instead of just `true` / `false`. Updated format detection to always return the detected format. The previous responsibility for unsupported zstd archive files has been passed up the call stack to the factory function `createReader`.
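To illustrate what returning a precise format rather than a boolean might look like, here is a small sketch under stated assumptions: `PcapVariant`, `byteSwap32` and `classifyPcapMagic` are hypothetical names, and the three magic numbers are the standard microsecond, nanosecond and modified pcap values.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical result type: which pcap flavour a magic number denotes.
enum class PcapVariant { NotPcap, Microseconds, Nanoseconds, Modified };

// Reverses the byte order of a 32-bit value.
constexpr std::uint32_t byteSwap32(std::uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8) | ((v & 0xff000000u) >> 24);
}

// Classifies a magic number, accepting it in either byte order.
inline PcapVariant classifyPcapMagic(std::uint32_t magic)
{
    for (std::uint32_t candidate : { magic, byteSwap32(magic) })
    {
        switch (candidate)
        {
        case 0xa1b2c3d4u:
            return PcapVariant::Microseconds;
        case 0xa1b23c4du:
            return PcapVariant::Nanoseconds;
        case 0xa1b2cd34u:
            return PcapVariant::Modified;
        default:
            break;
        }
    }
    return PcapVariant::NotPcap;
}
```

With a result like this, the caller can decide what to do with each variant, for example rejecting a flavour that the backend cannot read, instead of the detector making that decision itself.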

seladb · 2025-10-09T07:10:16Z

Tests/Pcap++Test/Tests/FileTests.cpp

+
+PTF_TEST_CASE(TestReaderFactory_Snoop)
+{
+	constexpr const char* SNOOP_FILE_PATH = EXAMPLE_SOLARIS_SNOOP;


nit: this variable is not needed, we can just use EXAMPLE_SOLARIS_SNOOP

I would prefer to keep it, so there is a level of detachment from the global macro in case it needs to be changed later.

seladb · 2025-10-09T07:12:06Z

Tests/Pcap++Test/Tests/FileTests.cpp

Let's add createReader() to the tests?

Added tests for the scenarios that throw. The success branches should be covered by tryCreate 🤔 ?

af12d2f

I guess we can add one or two success cases, but not necessary

Dimi1010 added 4 commits September 12, 2025 12:03

Added heuristics file content detector that determines the content based on the magic number.

02de760

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-file-selection

d2b6339

Moved stream checkpoint outside format detector as it is not directly tied to it.

685dd9f

Added a new factory function createReader that uses the new heuristics detection method.

40dee69

Dimi1010 added the enhancement label Sep 12, 2025

Add <algorithm> include.

f1e3e18

Dimi1010 added 2 commits September 12, 2025 13:17

Added unit tests.

8da1790

Deprecated old factory function.

3ad51e2

Dimi1010 added the API deprecation label Sep 12, 2025

Dimi1010 added 3 commits September 12, 2025 14:08

Add byte-swapped zstd magic number.

15c2000

Lint

17af8d4

Move enum closer to first usage.

46418ec

Dimi1010 marked this pull request as ready for review September 12, 2025 11:36

Dimi1010 requested a review from seladb as a code owner September 12, 2025 11:36

Dimi1010 requested review from clementperon, tigercosmos and egecetin September 12, 2025 11:36

tigercosmos approved these changes Sep 12, 2025

seladb reviewed Sep 15, 2025

Dimi1010 added 4 commits September 15, 2025 15:45

Added unit tests for file reader device factory.

3d713ab

Revert indentation.

a2391ec

Fixed StreamCheckpoint to restore original stream state.

ea328d7

Merge branch 'dev' into feature/heuristic-file-selection

db86c3e

Dimi1010 commented Sep 19, 2025

Dimi1010 added 3 commits September 20, 2025 12:59

Merge branch 'dev' into feature/heuristic-file-selection

4aed9bd

Moved isStreamSeekable helper to inside CaptureFileFormatDetector.

a83ae2b

Move it out if it needs to be reused somewhere.

Added pcap magic number for Alexey Kuznetzov's modified pcap format.

916e872

Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-file-selection

022529f

Dimi1010 requested a review from seladb October 2, 2025 17:16

seladb reviewed Oct 3, 2025

Dimi1010 added 15 commits October 3, 2025 11:22

Centralized PTF test name width under a macro.

4f52f59

Add Pcap++Test header files to test sources for IDE tooling.

88ebfff

Fixed test output formatting.

41fe188

Lint

c8ae4f8

Typo fix.

c7cab2b

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-file-selection

6d55077

Shortened test names.

682eeac

Simplified invalid file test.

07804da

Simplified ZST tests.

9c4fc08

Added snoop test.

d975157

Marked checkSupport functions as constexpr to enable compile time optimizations and branch pruning.

96a61b2

Exclude json from pre-commit cppcheck as it is slow due to many define branches.

55a6b7a

Lint

3ab14e7

Fix runtime side effects inside constexpr function.

5dd9a30

Dimi1010 commented Oct 7, 2025

Dimi1010 added 6 commits October 8, 2025 00:03

Added a secondary factory function to separate mixed error handling methods.

45ad769

Revert deprecation message, as doxygen is unhappy.

d24a9ad

Update tests.

f5ff879

Update deprecation warning to point to the function closer to the signature.

2c1b2c4

Catch general exception instead of runtime error.

8d1ed1d

Shortened deprecation message due to pre-commit warnings when it is on 1 line and doxygen errors when it is on 2 lines.

0ea2da9

seladb reviewed Oct 9, 2025

Dimi1010 added 2 commits October 9, 2025 10:48

Fix braces.

c209c90

Simplify test.

8d77aa0

Dimi1010 commented Oct 9, 2025

Added tests for createReader failures.

af12d2f

Dimi1010 mentioned this pull request Oct 9, 2025

Added clang-format rule to insert braces for multiline control blocks. #1990

Draft
