Find-WebLinks

Find-WebLinks is a PowerShell command-line tool for extracting web links from either a single web page or a text file containing many source URLs.

It is built for link discovery, archive preparation, and download-list building, with support for deduplication, filtering, blacklists, resume-safe handling of long-running URL jobs, logging, failed-URL tracking, optional parallel processing, and maintenance of large text lists.

The script does not require a browser, Selenium, Playwright, ChromeDriver, or external PowerShell modules. It downloads the raw HTTP response and extracts links from common places such as HTML attributes, raw text, script blocks, JSON-like content, CSS url(...) references, noscript blocks, and embedded URL patterns.

It is a raw-response extraction tool, not a browser. It does not execute JavaScript or render web pages.


Current version

Latest version: 1.6.1

Version 1.6.1 is a maintenance reliability release. It keeps the same command-line behaviour as 1.6.0, but improves cleanup of the temporary files created during deduplication and sorting.

Use this version if you use -DeduplicateFiles, -DeduplicateWhen, -SortOutput, -SortWhen, or standalone maintenance commands such as -Command Deduplicate, -Command Sort, or -Command Maintain.


Requirements

  • Windows PowerShell 5.1 or PowerShell 7+.
  • PowerShell 7+ is required only when using parallel processing with -ThrottleLimit greater than 1.
  • No external PowerShell modules required.
  • No browser required.

Main capabilities

Find-WebLinks can:

  • Scan one URL.
  • Scan many URLs from a text file.
  • Extract links from raw HTTP responses.
  • Match links using one wildcard pattern or multiple wildcard patterns.
  • Use Any or All matching logic for include patterns.
  • Exclude links using one or more wildcard patterns.
  • Use Any or All matching logic for exclusion patterns.
  • Write matching links to a plain text output file.
  • Append to an existing output file or create a fresh output file.
  • Avoid writing duplicate links already present in the output file.
  • Optionally keep duplicate matches found within the same page.
  • Preserve or ignore URL fragments during deduplication.
  • Use one or more exact-URL blacklist files.
  • Apply blacklists to input URLs, output links, or both.
  • Resume interrupted file-mode runs using a progress file.
  • Detect changed run settings before resuming.
  • Retry failed requests.
  • Honour HTTP and meta-refresh redirect limits.
  • Optionally fetch a page twice and keep the larger response.
  • Use a custom User-Agent.
  • Use an HTTP proxy.
  • Log per-URL processing statistics to CSV.
  • Save failed source URLs to a separate tab-separated file.
  • Use independent append/new modes for output, CSV log, and failed URL files.
  • Process URL lists sequentially or in parallel.
  • Deduplicate and sort files before or after a scraping run.
  • Clean failed or stale maintenance temporary files created by deduplication and sorting.
  • Run standalone maintenance commands without fetching URLs.
  • Protect against dangerous file collisions.
  • Warn when failure rates are high.
  • Expose operational limits as command-line parameters instead of hardcoded values.
  • Show built-in help with -Help or -h.
  • Start a guided interactive command builder with -InteractiveHelp or -Interactive.
  • When started without parameters, ask whether to show help, open the guided command builder, or exit.

Help and guided command builder

Find-WebLinks includes two help modes.

Show normal help

.\Find-WebLinks.ps1 -Help

Short alias:

.\Find-WebLinks.ps1 -h

Start the guided command builder

.\Find-WebLinks.ps1 -InteractiveHelp

Alias:

.\Find-WebLinks.ps1 -Interactive

The guided command builder asks questions and then prints the PowerShell command you should run. It does not fetch URLs, write files, deduplicate files, sort files, or execute the generated command.

It can build commands for:

  • normal scraping runs;
  • single-URL source mode;
  • file-of-URLs source mode;
  • wildcard include patterns;
  • wildcard exclude patterns;
  • Any / All matching behaviour;
  • output file and output mode;
  • resume mode and progress files;
  • blacklist files and blacklist scope;
  • CSV logging;
  • failed URL tracking;
  • retry, timeout, proxy, redirect, and User-Agent settings;
  • duplicate handling;
  • sorting and deduplication during a run;
  • operational safety limits;
  • standalone maintenance commands.

Behaviour when started without parameters

If you run the script without any parameters:

.\Find-WebLinks.ps1

it asks what you want to do:

Show help
Interactive command builder
Exit

Choose Show help to print the normal usage help. Choose Interactive command builder to answer questions and generate a command string.


Basic usage

.\Find-WebLinks.ps1 "PAGE_OR_FILE" "WHAT_TO_FIND" "OUTPUT_FILE" [Append|New] [Url|File]

Search one web page:

.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*sport*" "bbc-links.txt" New Url

Search many pages from a file:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched-links.txt" Append File

The * wildcard means “anything”.

Examples:

*news*          matches links containing news
*download*      matches links containing download
*bbc*weather*   matches links containing bbc, then weather later in the link
*               matches everything

Source modes

Find-WebLinks has two source modes.

SourceType   Meaning
Url          Source is a single web page URL.
File         Source is a text file containing URLs, one per line.

Single URL mode:

.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*sport*" "links.txt" New Url

File mode:

.\Find-WebLinks.ps1 "urls.txt" "*sport*" "links.txt" New File

In File mode, results are written after each processed page, so long runs keep useful partial output even if interrupted.

Source files may contain blank lines and comments. Blank lines are ignored. Lines starting with # are ignored.
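
For example, a source file could look like this:

# news sections
https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport

# added later
https://www.bbc.co.uk/weather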


Output modes

The main output file supports two modes.

Mode     Meaning
Append   Add new results to the end of the existing file. This is the default.
New      Create or overwrite the output file before writing results.

Create a fresh output file:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" New File

Append to an existing file:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File

Search patterns

You can use a single positional search pattern:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File

You can also provide multiple search patterns with -SearchPatterns.

Match links containing news, sport, or weather:

.\Find-WebLinks.ps1 "https://www.bbc.co.uk" "*news*" "out.txt" -SearchPatterns "*sport*","*weather*"

By default, -SearchMode Any is used. That means a link is accepted if it matches any search pattern.

Match links that contain both news and 2026:

.\Find-WebLinks.ps1 "https://www.bbc.co.uk" "*news*" "out.txt" -SearchPatterns "*2026*" -SearchMode All

You can also use -SearchPatterns without the positional SearchPattern:

.\Find-WebLinks.ps1 "https://www.bbc.co.uk" -SearchPatterns "*news*","*sport*" -OutputFile "out.txt"

Search mode

SearchMode   Meaning
Any          A link is accepted when it matches at least one search pattern. This is the default.
All          A link is accepted only when it matches every search pattern.

Excluding unwanted links

Use -ExcludePattern or -ExcludePatterns to remove links you do not want from the matched output.

Save links containing download or game, but exclude links containing demo or trailer:

.\Find-WebLinks.ps1 "urls.txt" -SearchPatterns "*download*","*game*" -ExcludePatterns "*demo*","*trailer*" -OutputFile "matched.txt" Append File

Save links containing both amiga and lha, but exclude anything containing beta:

.\Find-WebLinks.ps1 "urls.txt" -SearchPatterns "*amiga*","*lha*" -SearchMode All -ExcludePattern "*beta*" -OutputFile "matched.txt" Append File

Exclude only when all exclude patterns match the same link:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -ExcludePatterns "*demo*","*trial*" -ExcludeMode All

Exclude mode

ExcludeMode   Meaning
Any           A link is excluded when it matches at least one exclude pattern. This is the default.
All           A link is excluded only when it matches every exclude pattern.

Exclusion counts are included in the CSV log.


Resume interrupted runs

File-mode runs can be resumed with -Resume.

When running in File mode, the script writes completed source URLs to a progress file. If a run is interrupted, run the same command again with -Resume to skip source URLs that were already processed.

First run:

.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matched-links.txt" Append File -LogCsv "run-log.csv" -FailedUrlFile "failed-urls.txt"

Resume the same run:

.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matched-links.txt" Append File -LogCsv "run-log.csv" -FailedUrlFile "failed-urls.txt" -Resume

By default, the progress file is:

<OutputFile>.progress

For example:

matched-links.txt.progress

You can set the progress file manually:

.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matched-links.txt" Append File -ProgressFile "my-run.progress" -Resume

Important resume behaviour:

  • -Resume only applies to SourceType File.
  • If a progress file exists and you do not use -Resume, the script refuses to start. This helps prevent accidental mixing of old and new runs.
  • -Resume forces Mode, LogMode, and FailedUrlMode to Append to prevent data loss.
  • Failed source URLs are also marked as processed. They are written to -FailedUrlFile if supplied.
  • The progress file includes a run signature so the script can detect changed search, exclude, output, blacklist, duplicate, and related settings.

Logging and failed URL tracking

CSV log

Use -LogCsv to write per-URL processing statistics.

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -LogCsv "run-log.csv"

The CSV log contains:

Timestamp,SourceUrl,Status,Extracted,Matched,Excluded,Blacklisted,Duplicates,Written,Error

The script automatically creates the CSV header for new or empty files. If an existing CSV has a different header, the script warns that columns may be misaligned.
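
Because the log is plain CSV, you can inspect it with standard PowerShell cmdlets. A minimal sketch, assuming the run-log.csv name from the example above and the column names listed in the header:

# Load the per-URL log and list source URLs that produced no matched links
$log = Import-Csv .\run-log.csv
$log | Where-Object { $_.Matched -eq '0' } |
    Select-Object Timestamp, SourceUrl, Status, Error |
    Format-Table -AutoSize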

Failed URL file

Use -FailedUrlFile to save source URLs that failed to load.

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -FailedUrlFile "failed.txt"

The failed URL file is tab-separated and contains:

SourceUrl    Error
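
The first column can be fed back into a new run. A minimal sketch, assuming the failed.txt name from the example above and a hypothetical retry-urls.txt output file:

# Keep only the SourceUrl field (first tab-separated value) of lines that look like URLs
Get-Content .\failed.txt |
    ForEach-Object { ($_ -split "`t")[0] } |
    Where-Object { $_ -match '^https?://' } |
    Set-Content .\retry-urls.txt

.\Find-WebLinks.ps1 "retry-urls.txt" "*news*" "matched-links.txt" Append File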

Independent file modes

The main output file, CSV log, and failed URL file can each use their own mode.

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched.txt" Append File -LogCsv "run-log.csv" -LogMode New -FailedUrlFile "failed.txt" -FailedUrlMode New
Option Default Meaning
Mode Append Controls the main output file.
LogMode Append Controls the CSV log file.
FailedUrlMode Append Controls the failed URL file.

Blacklist support

Use -BlacklistFile to exclude exact URLs.

.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "blocked.txt"

A blacklist file contains one URL per line:

https://example.com/unwanted-page
https://example.com/another-page

Blank lines are ignored. Lines starting with # are ignored.

Blacklist matching is exact after normalisation. A blacklist entry such as:

https://facebook.com

will not automatically block:

https://facebook.com/some/page
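
To drop every link under a domain rather than one exact URL, a wildcard exclude pattern is usually the better fit:

.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -ExcludePattern "*facebook.com*"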

Blacklist scope

Use -BlacklistScope to control where the blacklist applies.

BlacklistScope   Meaning
Input            Skip matching source URLs before fetching them.
Output           Remove matching extracted links from the final output.
Both             Apply both behaviours. This is the default.

Apply the blacklist only to source URLs:

.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "blocked.txt" -BlacklistScope Input

Apply the blacklist only to extracted output links:

.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "blocked.txt" -BlacklistScope Output

Use multiple blacklist files:

.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "ads.txt","tracking.txt"

Duplicate handling

By default, the script avoids writing duplicate links already present in the output file or already written during the current run.

Option            Default   Meaning
-NoDuplicates     $true     Skip links already written or already present in the output file.
-KeepDuplicates   off       Keep repeated matches found within the same page.
-KeepFragments    off       Preserve URL fragments such as #section during deduplication. Useful for some single-page apps.

Disable duplicate protection:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -NoDuplicates:$false

Keep repeated matches from the same page:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -KeepDuplicates

Preserve URL fragments:

.\Find-WebLinks.ps1 "urls.txt" "*" "matched.txt" Append File -KeepFragments

Retry, timeout, and fetch behaviour

Default network behaviour:

Option                  Default          Meaning
-RetryCount             3                Number of retry attempts per URL.
-WaitSeconds            30               Seconds to wait between retries for the same URL.
-TimeoutSeconds         120              HTTP timeout per request attempt.
-DelaySeconds           5                Seconds to wait between different URLs in File mode.
-SecondFetch            $true            Fetch each URL twice and keep the larger response.
-SecondFetchWait        5                Seconds to wait before the second fetch.
-MaxRedirects           10               Maximum HTTP and meta-refresh redirects.
-MaxRetryAfterSeconds   300              Maximum server Retry-After wait honoured. 0 means ignore.
-UserAgent              Chrome-like UA   Custom User-Agent string.
-Proxy                  none             HTTP proxy URL.
-ConnectionLimit        100              .NET HTTP connection limit.

Increase retries:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -RetryCount 5 -WaitSeconds 60

Fetch each page only once:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -SecondFetch:$false

Use a custom User-Agent:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -UserAgent "MyLinkScanner/1.0"

Use an HTTP proxy:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -Proxy "http://proxy:8080"

Parallel processing

Use -ThrottleLimit to process multiple source URLs in parallel.

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -ThrottleLimit 8

Important behaviour:

  • Parallel mode requires PowerShell 7 or later.
  • -ThrottleLimit greater than 1 is only useful in SourceType File mode.
  • Worker runspaces fetch pages and extract links.
  • The parent process handles filtering, writing, logging, and progress centrally to reduce file-lock races.
  • The default is 1, which means sequential processing.

Maintenance during a normal run

Find-WebLinks can deduplicate and sort involved files before or after a scraping run.

Option              Default     Meaning
-DeduplicateWhen    None        Deduplicate involved files at Start, End, or Both.
-SortWhen           None        Sort involved files at Start, End, or Both.
-SortDirection      Ascending   Sort order for maintenance sorting.
-DeduplicateFiles   off         Legacy switch. Maps to start deduplication if -DeduplicateWhen is not set.
-SortOutput         $false      Legacy switch. Sorts output after the run, preserving older behaviour.

Deduplicate before scraping and sort at the end:

.\Find-WebLinks.ps1 ".\urls.txt" "*zip*" ".\matches.txt" Append File -DeduplicateWhen Start -SortWhen End

This is useful when working with input, output, or blacklist files that may already contain repeated entries.


Standalone maintenance commands

Use -Command for maintenance-only mode. No URLs are fetched.

Deduplicate one or more files:

.\Find-WebLinks.ps1 -Command Deduplicate -Files .\a.txt,.\b.txt

Sort one or more files:

.\Find-WebLinks.ps1 -Command Sort -Files .\a.txt,.\b.txt -SortDirection Descending

Deduplicate and/or sort using Maintain:

.\Find-WebLinks.ps1 -Command Maintain -Files .\a.txt,.\b.txt -DeduplicateWhen Start -SortWhen End

In standalone maintenance mode, Start, End, and Both collapse to a single maintenance pass because there is no scraping phase between them.

-Files also has the alias -MaintenanceFiles.


Large-file maintenance safety limit

In-memory maintenance operations such as deduplication and sorting are protected by a default 1 GB limit.

Default:

-MaintenanceLargeFileLimitMB 1024

This avoids accidentally loading very large files into memory.
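
If you are not sure how large a file is, you can check its size first with standard PowerShell; a minimal sketch, assuming a hypothetical list.txt:

# Size in MB of the file you plan to deduplicate or sort
[math]::Round((Get-Item .\list.txt).Length / 1MB, 1)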

Disable the limit for a controlled run:

.\Find-WebLinks.ps1 -Command Deduplicate -Files .\huge.txt -MaintenanceLargeFileLimitMB 0

Or explicitly ignore the limit:

.\Find-WebLinks.ps1 -Command Deduplicate -Files .\huge.txt -IgnoreMaintenanceLargeFileLimit

Use this carefully. Sorting or deduplicating very large files can consume a lot of RAM.


Operational limit overrides

Find-WebLinks exposes operational limits as command-line options.

Option                             Default   Meaning
-MaintenanceLargeFileLimitMB       1024      Maximum MB for in-memory dedup/sort. 0 means no limit.
-IgnoreMaintenanceLargeFileLimit   off       Allow dedup/sort above the maintenance size limit.
-MaxPageContentMB                  50        Maximum page body size to parse. 0 means no limit.
-RegexTimeoutSeconds               10        Regex match timeout. 0 means no timeout.
-MaxUrlLength                      8192      Maximum URL/key length before truncation. 0 means no limit.
-MaxRedirects                      10        Maximum HTTP/meta-refresh redirects.
-MaxRetryAfterSeconds              300       Maximum server Retry-After wait honoured. 0 means ignore.
-ConnectionLimit                   100       .NET HTTP connection limit.
-FileWriteRetryCount               5         Append retry attempts for output, log, failed, and progress files.
-FileWriteRetryDelayMinMs          50        Minimum delay between append retries.
-FileWriteRetryDelayMaxMs          300       Maximum delay between append retries.
-FileMoveRetryCount                5         Replace retry attempts after dedup/sort temporary file write.
-FileMoveRetryDelayMs              300       Delay between dedup/sort replace retries.
-HighFailureRatePercent            50        Warn when file-mode failures reach this percentage. 0 disables the warning.
-AllowExtremeOperationalValues     off       Allow values above typo guardrails.

Allow larger pages:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -MaxPageContentMB 250

Disable regex timeout for a controlled local test:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -RegexTimeoutSeconds 0

Increase file-write retry behaviour:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -FileWriteRetryCount 10 -FileWriteRetryDelayMinMs 100 -FileWriteRetryDelayMaxMs 1000

Typo guardrails for extreme values

Many numeric parameters accept very large values so advanced users can intentionally override limits.

A value like this is probably a typo:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -RetryCount 100000

By default, values above normal guardrails are rejected.

To intentionally allow them, add:

-AllowExtremeOperationalValues

Intentional extreme run:

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -RetryCount 100000 -AllowExtremeOperationalValues

Normal guardrails include:

Parameter Normal guardrail
RetryCount 100
WaitSeconds 86400
TimeoutSeconds 86400
DelaySeconds 86400
SecondFetchWait 86400
ThrottleLimit 64
MaxPageContentMB 1024
RegexTimeoutSeconds 3600
MaxUrlLength 1048576
MaxRedirects 100
MaxRetryAfterSeconds 86400
FileWriteRetryCount 100
FileWriteRetryDelayMinMs 86400000
FileWriteRetryDelayMaxMs 86400000
FileMoveRetryCount 100
FileMoveRetryDelayMs 86400000
ConnectionLimit 10000

These are typo guardrails, not hard technical ceilings.


File collision protection

Find-WebLinks refuses to run when important files would collide with each other. It checks dangerous combinations involving:

  • Source file.
  • Output file.
  • CSV log file.
  • Failed URL file.
  • Progress file.
  • Blacklist files.

Examples of refused combinations:

  • Source file is the same as output file.
  • Output file is the same as blacklist file.
  • Log CSV is the same as output file.
  • Failed URL file is the same as source file.
  • Progress file is the same as output, source, log, failed URL, or blacklist file.

This is intentional. It prevents accidental data loss.
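
For example, a run like this is refused because the source file and the output file are the same:

.\Find-WebLinks.ps1 "urls.txt" "*" "urls.txt" Append File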


Common examples

Show help

.\Find-WebLinks.ps1 -Help

Build a command interactively

.\Find-WebLinks.ps1 -InteractiveHelp

Search one page and create a new output file

.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*sport*" "bbc-links.txt" New Url

Search one page and append to an existing file

.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*politics*" "bbc-links.txt" Append Url

Search many pages from a text file

Create urls.txt:

https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport
https://www.bbc.co.uk/weather

Run:

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" New File

Search multiple patterns and exclude unwanted results

.\Find-WebLinks.ps1 "urls.txt" -SearchPatterns "*download*","*game*" -ExcludePatterns "*demo*","*trailer*" -OutputFile "matched.txt" Append File

Resume a long run

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -LogCsv "run-log.csv" -FailedUrlFile "failed.txt" -Resume

Run with fresh logs but appended output

.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched.txt" Append File -LogCsv "run-log.csv" -LogMode New -FailedUrlFile "failed.txt" -FailedUrlMode New

Parallel run with PowerShell 7+

.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -ThrottleLimit 8

Deduplicate before scraping and sort at the end

.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matches.txt" Append File -DeduplicateWhen Start -SortWhen End

Deduplicate a huge file and override the 1 GB safety limit

.\Find-WebLinks.ps1 -Command Deduplicate -Files .\huge.txt -MaintenanceLargeFileLimitMB 0

Full options reference

Option Default Description
-Help / -h off Show built-in help and exit.
-InteractiveHelp / -Interactive off Start the guided command builder and exit.
Source none URL or file path to process.
SearchPattern none Main wildcard pattern. Optional if -SearchPatterns is used.
-SearchPatterns none One or more wildcard patterns.
-SearchMode Any Any = match any pattern. All = match every pattern.
-ExcludePattern none Main wildcard exclusion pattern.
-ExcludePatterns none One or more wildcard exclusion patterns.
-ExcludeMode Any Any = exclude if any exclusion pattern matches. All = exclude only if all match.
OutputFile / -OutputFile none File where matched links are saved.
Mode Append Append or New for the main output file.
SourceType Url Url or File.
-RetryCount 3 Number of retry attempts per URL.
-WaitSeconds 30 Seconds between retries.
-TimeoutSeconds 120 HTTP timeout per request.
-DelaySeconds 5 Delay between URLs in file mode.
-SecondFetch $true Fetch each URL twice and keep the larger response.
-SecondFetchWait 5 Seconds before the second fetch.
-KeepDuplicates off Keep repeated matches found within the same page.
-NoDuplicates $true Skip links already written or already in the output file.
-BlacklistFile none One or more exact-URL blacklist files.
-BlacklistScope Both Apply blacklist to Input, Output, or Both.
-ThrottleLimit 1 Number of URLs to process in parallel. Requires PowerShell 7+ when greater than 1.
-Resume off Resume a previous file-mode run using the progress file.
-ProgressFile <OutputFile>.progress Progress file for resume mode.
-DeduplicateFiles off Legacy deduplication switch.
-KeepFragments off Preserve URL fragments during deduplication.
-Proxy none HTTP proxy URL.
-SortOutput $false Legacy end-of-run output sorting switch.
-Command Run Run, Deduplicate, Sort, or Maintain.
-Files / -MaintenanceFiles none Files for standalone maintenance commands.
-SortDirection Ascending Ascending or Descending.
-DeduplicateWhen None None, Start, End, or Both.
-SortWhen None None, Start, End, or Both.
-UserAgent Chrome-like UA Custom User-Agent header.
-LogCsv none CSV file for per-URL processing statistics.
-FailedUrlFile none Tab-separated file for failed source URLs and errors.
-LogMode Append Append or New for the CSV log.
-FailedUrlMode Append Append or New for the failed URL file.
-MaintenanceLargeFileLimitMB 1024 Max MB for in-memory maintenance. 0 means no limit.
-IgnoreMaintenanceLargeFileLimit off Ignore the maintenance large-file safety limit.
-MaxPageContentMB 50 Maximum page body size to parse. 0 means no limit.
-RegexTimeoutSeconds 10 Regex timeout. 0 means no timeout.
-MaxUrlLength 8192 Maximum URL/key length before truncation. 0 means no limit.
-MaxRedirects 10 Maximum HTTP/meta-refresh redirects.
-MaxRetryAfterSeconds 300 Max server Retry-After wait honoured. 0 means ignore.
-FileWriteRetryCount 5 Retry count for appending output/log/progress lines.
-FileWriteRetryDelayMinMs 50 Minimum delay between append retries.
-FileWriteRetryDelayMaxMs 300 Maximum delay between append retries.
-FileMoveRetryCount 5 Retry count for replacing files after maintenance.
-FileMoveRetryDelayMs 300 Delay between file replace retries.
-ConnectionLimit 100 .NET HTTP connection limit.
-AllowExtremeOperationalValues off Allow values above normal typo guardrails.
-HighFailureRatePercent 50 Warn when file-mode failures reach this percent. 0 disables.

Output files

Depending on the options used, the script may create:

matched-links.txt           Matched links
run-log.csv                 Per-URL processing log
failed.txt                  Failed source URLs and errors
matched-links.txt.progress  Resume progress file

You can open .txt files with any text editor. You can open .csv files with Excel, LibreOffice, Numbers, or similar tools.


Troubleshooting

PowerShell says scripts are disabled

Run this once:

Set-ExecutionPolicy -Scope CurrentUser RemoteSigned

Then run the script again.

I do not know which command to run

Use the guided command builder:

.\Find-WebLinks.ps1 -InteractiveHelp

It will ask questions and print a command string. It will not run the command automatically.

A progress file already exists

This usually means a previous file-mode run was interrupted.

Use:

-Resume

or delete the progress file if you want to start fresh.
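
With the default progress file name from the earlier examples, starting fresh might look like this:

Remove-Item .\matched-links.txt.progress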

You can also specify a different progress file:

-ProgressFile "another-run.progress"

No links are found

Possible causes:

  • The page does not contain matching links.
  • The search pattern is too specific.
  • The links are generated by JavaScript after the page loads.
  • The website blocked the request.
  • The page requires login or cookies.
  • The links were excluded by -ExcludePattern or -ExcludePatterns.
  • The links were removed by the blacklist.
  • The links were skipped as duplicates.

Try a broader search:

.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*" "all-links.txt" New Url -LogCsv "run-log.csv"

The script finds fewer links than a browser

That is expected on modern sites that rely on JavaScript. Find-WebLinks does not execute JavaScript, click buttons, scroll pages, accept cookie banners, or wait for React, Vue, Angular, or other client-side frameworks to build the page.

Maintenance skipped a huge file

Maintenance operations such as deduplication and sorting are protected by a default 1 GB safety limit.

Override it with:

-MaintenanceLargeFileLimitMB 0

or:

-IgnoreMaintenanceLargeFileLimit

Temporary maintenance files are visible

During deduplication and sorting, the script writes temporary files beside the file being maintained. Their names look like:

download-now.txt.<PID>.dedup.tmp
download-now.txt.<PID>.sort.tmp

Version 1.6.1 cleans these files automatically when a maintenance write or replace operation fails. It also removes stale maintenance temporary files older than 60 minutes before running a new maintenance pass.

If a very old .dedup.tmp or .sort.tmp file remains after a crash, power loss, or manual termination, it is safe to delete it manually once the script is no longer running.
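
A minimal sketch for finding such leftovers with standard PowerShell, assuming they sit next to your working files:

# List leftover maintenance temporary files in the current folder
Get-ChildItem -Path . -File |
    Where-Object { $_.Name -match '\.(dedup|sort)\.tmp$' }

# Preview deletion first, then drop -WhatIf once you are sure the script is not running
Get-ChildItem -Path . -File |
    Where-Object { $_.Name -match '\.(dedup|sort)\.tmp$' } |
    Remove-Item -WhatIf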

Parallel mode fails

Parallel mode requires PowerShell 7+.

Check your version:

$PSVersionTable.PSVersion

If you are on Windows PowerShell 5.1, use the default sequential mode or install PowerShell 7+.


Release notes

1.6.1

Maintenance reliability release.

Fixed

  • Fixed a cleanup issue where temporary deduplication files such as download-now.txt.<PID>.dedup.tmp could remain after a failed or interrupted maintenance operation.
  • Fixed the equivalent cleanup path for temporary sorting files such as download-now.txt.<PID>.sort.tmp.
  • Fixed failed deduplication and sorting writes so their temporary files are removed instead of being left beside the original file.
  • Fixed failed replacement/move operations so temporary maintenance files are cleaned up when the final file replace does not complete.

Improved

  • Added automatic cleanup of stale maintenance temporary files before running a new deduplication or sorting pass.
  • Added safe cleanup handling for temporary maintenance files without making cleanup failure crash the whole run.
  • Improved writer disposal safety in the sorting path.
  • Improved maintenance-phase resilience when a run is interrupted, cancelled, or fails part-way through.

Changed

  • Updated script version from 1.6.0 to 1.6.1.
  • No command-line parameter changes.
  • No change to link extraction, matching, blacklist, resume, logging, failed-URL tracking, or interactive-help behaviour.

Limitations

Find-WebLinks is a best-effort raw-response link extraction tool. It does not:

  • Execute JavaScript.
  • Render pages.
  • Use a real browser engine.
  • Click buttons.
  • Accept cookie banners.
  • Log into websites.
  • Scroll pages.
  • Wait for client-side frameworks to populate links.
  • Bypass access controls.

If a link only appears after browser-side JavaScript runs, this script may not see it.


Responsible use

Use this tool responsibly. Respect website terms of service, robots.txt guidance where applicable, rate limits, copyright restrictions, and access controls.

Do not use it to overload websites or collect data you are not allowed to access.


License

This project is released under The Unlicense / public domain terms, as stated in the script header.

About

PowerShell link extractor for single pages or URL lists. Filters extracted links with wildcard patterns, supports retries, second fetch, deduplication, exact blacklists, CSV logging, and failed URL tracking. No browser required; uses Invoke-WebRequest, so it does not execute JavaScript.
