Find-WebLinks is a PowerShell command-line tool for extracting web links from either a single web page or a text file containing many source URLs.
It is built for link discovery, archive preparation, download-list building, deduplication, filtering, blacklist handling, long-running URL jobs, resume-safe processing, logging, failed-URL tracking, optional parallel processing, and maintenance of large text lists.
The script does not require a browser, Selenium, Playwright, ChromeDriver, or external PowerShell modules. It downloads the raw HTTP response and extracts links from common places such as HTML attributes, raw text, script blocks, JSON-like content, CSS url(...) references, noscript blocks, and embedded URL patterns.
It is a raw-response extraction tool, not a browser. It does not execute JavaScript or render web pages.
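To illustrate the general technique (a minimal sketch, not the script's actual extraction code), raw-response extraction amounts to fetching the body and pulling URL-like strings out of it:

```powershell
# Sketch of raw-response link extraction - illustrative, not the script's code.
# Fetch the raw body without rendering it, then pull URL-like strings out with a regex.
$response = Invoke-WebRequest -Uri "https://example.com" -UseBasicParsing
$pattern  = 'https?://[^\s"''<>)]+'   # crude URL pattern, for illustration only
[regex]::Matches($response.Content, $pattern) |
    ForEach-Object { $_.Value } |
    Sort-Object -Unique
```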
Latest version: 1.6.1
Version 1.6.1 is a maintenance reliability release. It keeps the same command-line behaviour as 1.6.0, but improves cleanup of the temporary files created during deduplication and sorting.
Use this version if you use -DeduplicateFiles, -DeduplicateWhen, -SortOutput, -SortWhen, or standalone maintenance commands such as -Command Deduplicate, -Command Sort, or -Command Maintain.
- Windows PowerShell 5.1 or PowerShell 7+.
- PowerShell 7+ is required only when using parallel processing with `-ThrottleLimit` greater than `1`.
- No external PowerShell modules required.
- No browser required.
Find-WebLinks can:
- Scan one URL.
- Scan many URLs from a text file.
- Extract links from raw HTTP responses.
- Match links using one wildcard pattern or multiple wildcard patterns.
- Use `Any` or `All` matching logic for include patterns.
- Exclude links using one or more wildcard patterns.
- Use `Any` or `All` matching logic for exclusion patterns.
- Write matching links to a plain text output file.
- Append to an existing output file or create a fresh output file.
- Avoid writing duplicate links already present in the output file.
- Optionally keep duplicate matches found within the same page.
- Preserve or ignore URL fragments during deduplication.
- Use one or more exact-URL blacklist files.
- Apply blacklists to input URLs, output links, or both.
- Resume interrupted file-mode runs using a progress file.
- Detect changed run settings before resuming.
- Retry failed requests.
- Honour HTTP and meta-refresh redirect limits.
- Optionally fetch a page twice and keep the larger response.
- Use a custom User-Agent.
- Use an HTTP proxy.
- Log per-URL processing statistics to CSV.
- Save failed source URLs to a separate tab-separated file.
- Use independent append/new modes for output, CSV log, and failed URL files.
- Process URL lists sequentially or in parallel.
- Deduplicate and sort files before or after a scraping run.
- Clean failed or stale maintenance temporary files created by deduplication and sorting.
- Run standalone maintenance commands without fetching URLs.
- Protect against dangerous file collisions.
- Warn when failure rates are high.
- Expose operational limits as command-line parameters instead of hardcoded values.
- Show built-in help with `-Help` or `-h`.
- Start a guided interactive command builder with `-InteractiveHelp` or `-Interactive`.
- When started without parameters, ask whether to show help, open the guided command builder, or exit.
Find-WebLinks includes two help modes.
Show the built-in usage help:

```
.\Find-WebLinks.ps1 -Help
```

Short alias:

```
.\Find-WebLinks.ps1 -h
```

Start the guided command builder:

```
.\Find-WebLinks.ps1 -InteractiveHelp
```

Alias:

```
.\Find-WebLinks.ps1 -Interactive
```

The guided command builder asks questions and then prints the PowerShell command you should run. It does not fetch URLs, write files, deduplicate files, sort files, or execute the generated command.
It can build commands for:
- normal scraping runs;
- single-URL source mode;
- file-of-URLs source mode;
- wildcard include patterns;
- wildcard exclude patterns;
- `Any`/`All` matching behaviour;
- output file and output mode;
- resume mode and progress files;
- blacklist files and blacklist scope;
- CSV logging;
- failed URL tracking;
- retry, timeout, proxy, redirect, and User-Agent settings;
- duplicate handling;
- sorting and deduplication during a run;
- operational safety limits;
- standalone maintenance commands.
If you run the script without any parameters:
```
.\Find-WebLinks.ps1
```

it asks what you want to do:

- Show help
- Interactive command builder
- Exit
Choose Show help to print the normal usage help. Choose Interactive command builder to answer questions and generate a command string.
.\Find-WebLinks.ps1 "PAGE_OR_FILE" "WHAT_TO_FIND" "OUTPUT_FILE" [Append|New] [Url|File]Search one web page:
.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*sport*" "bbc-links.txt" New UrlSearch many pages from a file:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched-links.txt" Append FileThe * wildcard means “anything”.
Examples:
- `*news*` matches links containing `news`
- `*download*` matches links containing `download`
- `*bbc*weather*` matches links containing `bbc`, then `weather` later in the link
- `*` matches everything
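If you want to sanity-check a pattern before a run, PowerShell's `-like` operator uses the same kind of `*` wildcard, so you can test locally:

```powershell
# Quick local test of wildcard matching with PowerShell's -like operator.
"https://www.bbc.co.uk/news/uk-politics" -like "*news*"         # True
"https://www.bbc.co.uk/sport/football"   -like "*bbc*weather*"  # False
"https://www.bbc.co.uk/weather/2643743"  -like "*bbc*weather*"  # True
```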
Find-WebLinks has two source modes.
| SourceType | Meaning |
|---|---|
| `Url` | Source is a single web page URL. |
| `File` | Source is a text file containing URLs, one per line. |
Single URL mode:
.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*sport*" "links.txt" New UrlFile mode:
.\Find-WebLinks.ps1 "urls.txt" "*sport*" "links.txt" New FileIn File mode, results are written after each processed page, so long runs keep useful partial output even if interrupted.
Source files may contain blank lines and comments. Blank lines are ignored. Lines starting with `#` are ignored.
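A sketch of how such a file can be filtered (illustrative; the script handles this internally):

```powershell
# Illustrative filter: keep trimmed, non-blank lines that do not start with '#'.
$urls = Get-Content ".\urls.txt" |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ -ne "" -and -not $_.StartsWith("#") }
```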
The main output file supports two modes.
| Mode | Meaning |
|---|---|
| `Append` | Add new results to the end of the existing file. This is the default. |
| `New` | Create or overwrite the output file before writing results. |
Create a fresh output file:
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" New FileAppend to an existing file:
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append FileYou can use a single positional search pattern:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append FileYou can also provide multiple search patterns with -SearchPatterns.
Match links containing news, sport, or weather:
.\Find-WebLinks.ps1 "https://www.bbc.co.uk" "*news*" "out.txt" -SearchPatterns "*sport*","*weather*"By default, -SearchMode Any is used. That means a link is accepted if it matches any search pattern.
Match links that contain both news and 2026:
.\Find-WebLinks.ps1 "https://www.bbc.co.uk" "*news*" "out.txt" -SearchPatterns "*2026*" -SearchMode AllYou can also use -SearchPatterns without the positional SearchPattern:
.\Find-WebLinks.ps1 "https://www.bbc.co.uk" -SearchPatterns "*news*","*sport*" -OutputFile "out.txt"| SearchMode | Meaning |
|---|---|
Any |
A link is accepted when it matches at least one search pattern. This is the default. |
All |
A link is accepted only when it matches every search pattern. |
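Conceptually, the two modes combine per-pattern wildcard results like this (a sketch, not the script's internal code):

```powershell
# Sketch of Any/All include logic over wildcard patterns.
function Test-IncludeMatch {
    param([string]$Link, [string[]]$Patterns, [string]$Mode = "Any")
    $hits = @($Patterns | Where-Object { $Link -like $_ })
    if ($Mode -eq "All") { return $hits.Count -eq $Patterns.Count }
    return $hits.Count -gt 0
}

Test-IncludeMatch "https://www.bbc.co.uk/news/2026" @("*news*","*2026*") "All"  # True
Test-IncludeMatch "https://www.bbc.co.uk/sport"     @("*news*","*2026*") "Any"  # False
```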
Use -ExcludePattern or -ExcludePatterns to remove links you do not want from the matched output.
Save links containing `download` or `game`, but exclude links containing `demo` or `trailer`:

```
.\Find-WebLinks.ps1 "urls.txt" -SearchPatterns "*download*","*game*" -ExcludePatterns "*demo*","*trailer*" -OutputFile "matched.txt" Append File
```

Save links containing both `amiga` and `lha`, but exclude anything containing `beta`:

```
.\Find-WebLinks.ps1 "urls.txt" -SearchPatterns "*amiga*","*lha*" -SearchMode All -ExcludePattern "*beta*" -OutputFile "matched.txt" Append File
```

Exclude only when all exclude patterns match the same link:

```
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -ExcludePatterns "*demo*","*trial*" -ExcludeMode All
```

| ExcludeMode | Meaning |
|---|---|
| `Any` | A link is excluded when it matches at least one exclude pattern. This is the default. |
| `All` | A link is excluded only when it matches every exclude pattern. |
Exclusion counts are included in the CSV log.
File-mode runs can be resumed with -Resume.
When running in File mode, the script writes completed source URLs to a progress file. If a run is interrupted, run the same command again with -Resume to skip source URLs that were already processed.
First run:
.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matched-links.txt" Append File -LogCsv "run-log.csv" -FailedUrlFile "failed-urls.txt"Resume the same run:
.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matched-links.txt" Append File -LogCsv "run-log.csv" -FailedUrlFile "failed-urls.txt" -ResumeBy default, the progress file is:
```
<OutputFile>.progress
```

For example:

```
matched-links.txt.progress
```
You can set the progress file manually:
.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matched-links.txt" Append File -ProgressFile "my-run.progress" -ResumeImportant resume behaviour:
-Resumeonly applies toSourceType File.- If a progress file exists and you do not use
-Resume, the script refuses to start. This helps prevent accidental mixing of old and new runs. -ResumeforcesMode,LogMode, andFailedUrlModetoAppendto prevent data loss.- Failed source URLs are also marked as processed. They are written to
-FailedUrlFileif supplied. - The progress file includes a run signature so the script can detect changed search, exclude, output, blacklist, duplicate, and related settings.
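The signature format itself is internal to the script, but the idea is a fingerprint of the run settings, along these lines (hypothetical illustration):

```powershell
# Hypothetical illustration: hash the joined run settings so any change is detectable.
$settings = "*zip*|Any||Any|matched-links.txt|Append"   # example joined settings
$sha256   = [System.Security.Cryptography.SHA256]::Create()
$bytes    = [System.Text.Encoding]::UTF8.GetBytes($settings)
($sha256.ComputeHash($bytes) | ForEach-Object { $_.ToString("x2") }) -join ""
```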
Use -LogCsv to write per-URL processing statistics.
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -LogCsv "run-log.csv"The CSV log contains:
Timestamp,SourceUrl,Status,Extracted,Matched,Excluded,Blacklisted,Duplicates,Written,ErrorThe script automatically creates the CSV header for new or empty files. If an existing CSV has a different header, the script warns that columns may be misaligned.
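An example log line (all values hypothetical; the exact timestamp format may differ):

```
2026-01-15 14:03:22,https://example.com/page,OK,412,37,5,0,12,20,
```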
Use -FailedUrlFile to save source URLs that failed to load.
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -FailedUrlFile "failed.txt"The failed URL file is tab-separated and contains:
SourceUrl Error
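For example (a hypothetical entry):

```
https://example.com/broken	The remote server returned an error: (404) Not Found.
```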
The main output file, CSV log, and failed URL file can each use their own mode.
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched.txt" Append File -LogCsv "run-log.csv" -LogMode New -FailedUrlFile "failed.txt" -FailedUrlMode New| Option | Default | Meaning |
|---|---|---|
Mode |
Append |
Controls the main output file. |
LogMode |
Append |
Controls the CSV log file. |
FailedUrlMode |
Append |
Controls the failed URL file. |
Use -BlacklistFile to exclude exact URLs.
.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "blocked.txt"A blacklist file contains one URL per line:
https://example.com/unwanted-page
https://example.com/another-page
Blank lines are ignored. Lines starting with # are ignored.
Blacklist matching is exact after normalisation. A blacklist entry such as:

```
https://facebook.com
```

will not automatically block:

```
https://facebook.com/some/page
```
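In other words, blacklist matching behaves like an exact set lookup after normalisation, roughly like this sketch (the script's exact normalisation rules may differ):

```powershell
# Sketch of exact blacklist lookup - normalisation details may differ in the script.
$blacklist = [System.Collections.Generic.HashSet[string]]::new(
    [System.StringComparer]::OrdinalIgnoreCase)
Get-Content ".\blocked.txt" |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ -ne "" -and -not $_.StartsWith("#") } |
    ForEach-Object { [void]$blacklist.Add($_.TrimEnd("/")) }

$blacklist.Contains("https://facebook.com")            # True if listed
$blacklist.Contains("https://facebook.com/some/page")  # False unless listed itself
```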
Use -BlacklistScope to control where the blacklist applies.
| BlacklistScope | Meaning |
|---|---|
| `Input` | Skip matching source URLs before fetching them. |
| `Output` | Remove matching extracted links from the final output. |
| `Both` | Apply both behaviours. This is the default. |
Apply the blacklist only to source URLs:
.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "blocked.txt" -BlacklistScope InputApply the blacklist only to extracted output links:
.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "blocked.txt" -BlacklistScope OutputUse multiple blacklist files:
.\Find-WebLinks.ps1 "urls.txt" "*" "out.txt" Append File -BlacklistFile "ads.txt","tracking.txt"By default, the script avoids writing duplicate links already present in the output file or already written during the current run.
| Option | Default | Meaning |
|---|---|---|
| `-NoDuplicates` | `$true` | Skip links already written or already present in the output file. |
| `-KeepDuplicates` | off | Keep repeated matches found within the same page. |
| `-KeepFragments` | off | Preserve URL fragments such as `#section` during deduplication. Useful for some single-page apps. |
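Fragment handling simply changes the deduplication key. Conceptually, with a hypothetical `Get-DedupKey` helper (a sketch, not the script's code):

```powershell
# Sketch: build a dedup key, stripping '#fragment' unless -KeepFragments is set.
function Get-DedupKey {
    param([string]$Url, [switch]$KeepFragments)
    if ($KeepFragments) { return $Url }
    return ($Url -split '#', 2)[0]
}

Get-DedupKey "https://example.com/page#section2"                  # https://example.com/page
Get-DedupKey "https://example.com/page#section2" -KeepFragments   # URL kept as-is
```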
Disable duplicate protection:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -NoDuplicates:$falseKeep repeated matches from the same page:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -KeepDuplicatesPreserve URL fragments:
.\Find-WebLinks.ps1 "urls.txt" "*" "matched.txt" Append File -KeepFragmentsDefault network behaviour:
| Option | Default | Meaning |
|---|---|---|
| `-RetryCount` | `3` | Number of retry attempts per URL. |
| `-WaitSeconds` | `30` | Seconds to wait between retries for the same URL. |
| `-TimeoutSeconds` | `120` | HTTP timeout per request attempt. |
| `-DelaySeconds` | `5` | Seconds to wait between different URLs in File mode. |
| `-SecondFetch` | `$true` | Fetch each URL twice and keep the larger response. |
| `-SecondFetchWait` | `5` | Seconds to wait before the second fetch. |
| `-MaxRedirects` | `10` | Maximum HTTP and meta-refresh redirects. |
| `-MaxRetryAfterSeconds` | `300` | Maximum server Retry-After wait honoured. `0` means ignore. |
| `-UserAgent` | Chrome-like UA | Custom User-Agent string. |
| `-Proxy` | none | HTTP proxy URL. |
| `-ConnectionLimit` | `100` | .NET HTTP connection limit. |
Increase retries:
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -RetryCount 5 -WaitSeconds 60Fetch each page only once:
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -SecondFetch:$falseUse a custom User-Agent:
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -UserAgent "MyLinkScanner/1.0"Use an HTTP proxy:
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" Append File -Proxy "http://proxy:8080"Use -ThrottleLimit to process multiple source URLs in parallel.
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -ThrottleLimit 8Important behaviour:
- Parallel mode requires PowerShell 7 or later.
- `-ThrottleLimit` greater than `1` is only useful in `SourceType File` mode.
- Worker runspaces fetch pages and extract links.
- The parent process handles filtering, writing, logging, and progress centrally to reduce file-lock races.
- The default is `1`, which means sequential processing.
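On PowerShell 7+, this worker/parent split can be sketched with `ForEach-Object -Parallel` (illustrative only; the script's implementation is more involved):

```powershell
# PowerShell 7+ sketch: workers fetch pages in parallel, the parent collects results.
$urls = Get-Content ".\urls.txt" | Where-Object { $_ -and -not $_.StartsWith("#") }
$results = $urls | ForEach-Object -Parallel {
    try {
        $body = (Invoke-WebRequest -Uri $_ -UseBasicParsing).Content
        [pscustomobject]@{ Url = $_; Bytes = $body.Length; Error = $null }
    } catch {
        [pscustomobject]@{ Url = $_; Bytes = 0; Error = $_.Exception.Message }
    }
} -ThrottleLimit 8
# Filtering and writing stay in the parent, so only one process touches the output file.
$results | Where-Object { -not $_.Error }
```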
Find-WebLinks can deduplicate and sort involved files before or after a scraping run.
| Option | Default | Meaning |
|---|---|---|
| `-DeduplicateWhen` | `None` | Deduplicate involved files at `Start`, `End`, or `Both`. |
| `-SortWhen` | `None` | Sort involved files at `Start`, `End`, or `Both`. |
| `-SortDirection` | `Ascending` | Sort order for maintenance sorting. |
| `-DeduplicateFiles` | off | Legacy switch. Maps to start deduplication if `-DeduplicateWhen` is not set. |
| `-SortOutput` | `$false` | Legacy switch. Sorts output after the run, preserving older behaviour. |
Deduplicate before scraping and sort at the end:
.\Find-WebLinks.ps1 ".\urls.txt" "*zip*" ".\matches.txt" Append File -DeduplicateWhen Start -SortWhen EndThis is useful when working with input, output, or blacklist files that may already contain repeated entries.
Use -Command for maintenance-only mode. No URLs are fetched.
Deduplicate one or more files:
```
.\Find-WebLinks.ps1 -Command Deduplicate -Files .\a.txt,.\b.txt
```

Sort one or more files:

```
.\Find-WebLinks.ps1 -Command Sort -Files .\a.txt,.\b.txt -SortDirection Descending
```

Deduplicate and/or sort using `Maintain`:

```
.\Find-WebLinks.ps1 -Command Maintain -Files .\a.txt,.\b.txt -DeduplicateWhen Start -SortWhen End
```

In standalone maintenance mode, `Start`, `End`, and `Both` collapse to a single maintenance pass because there is no scraping phase between them.

`-Files` also has the alias `-MaintenanceFiles`.
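In effect, a standalone deduplication pass is equivalent to rewriting the file with first occurrences kept, roughly like this sketch:

```powershell
# Sketch of an order-preserving dedup pass using a temporary file, then a replace.
$file = ".\a.txt"
$tmp  = "$file.tmp"
Get-Content $file | Select-Object -Unique | Set-Content $tmp
Move-Item -Force -Path $tmp -Destination $file
```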
In-memory maintenance operations such as deduplication and sorting are protected by a default 1 GB limit.
Default:

```
-MaintenanceLargeFileLimitMB 1024
```
This avoids accidentally loading very large files into memory.
Disable the limit for a controlled run:
```
.\Find-WebLinks.ps1 -Command Deduplicate -Files .\huge.txt -MaintenanceLargeFileLimitMB 0
```

Or explicitly ignore the limit:

```
.\Find-WebLinks.ps1 -Command Deduplicate -Files .\huge.txt -IgnoreMaintenanceLargeFileLimit
```

Use this carefully. Sorting or deduplicating very large files can consume a lot of RAM.
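The guard itself amounts to a size check before loading, roughly:

```powershell
# Sketch of the large-file guard: refuse in-memory maintenance above the limit.
$limitMB = 1024
$sizeMB  = (Get-Item ".\huge.txt").Length / 1MB
if ($limitMB -gt 0 -and $sizeMB -gt $limitMB) {
    throw "File is $([math]::Round($sizeMB)) MB; the maintenance limit is $limitMB MB."
}
```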
Find-WebLinks exposes operational limits as command-line options.
| Option | Default | Meaning |
|---|---|---|
| `-MaintenanceLargeFileLimitMB` | `1024` | Maximum MB for in-memory dedup/sort. `0` means no limit. |
| `-IgnoreMaintenanceLargeFileLimit` | off | Allow dedup/sort above the maintenance size limit. |
| `-MaxPageContentMB` | `50` | Maximum page body size to parse. `0` means no limit. |
| `-RegexTimeoutSeconds` | `10` | Regex match timeout. `0` means no timeout. |
| `-MaxUrlLength` | `8192` | Maximum URL/key length before truncation. `0` means no limit. |
| `-MaxRedirects` | `10` | Maximum HTTP/meta-refresh redirects. |
| `-MaxRetryAfterSeconds` | `300` | Maximum server Retry-After wait honoured. `0` means ignore. |
| `-ConnectionLimit` | `100` | .NET HTTP connection limit. |
| `-FileWriteRetryCount` | `5` | Append retry attempts for output, log, failed, and progress files. |
| `-FileWriteRetryDelayMinMs` | `50` | Minimum delay between append retries. |
| `-FileWriteRetryDelayMaxMs` | `300` | Maximum delay between append retries. |
| `-FileMoveRetryCount` | `5` | Replace retry attempts after dedup/sort temporary file write. |
| `-FileMoveRetryDelayMs` | `300` | Delay between dedup/sort replace retries. |
| `-HighFailureRatePercent` | `50` | Warn when file-mode failures reach this percentage. `0` disables the warning. |
| `-AllowExtremeOperationalValues` | off | Allow values above typo guardrails. |
Allow larger pages:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -MaxPageContentMB 250Disable regex timeout for a controlled local test:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -RegexTimeoutSeconds 0Increase file-write retry behaviour:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -FileWriteRetryCount 10 -FileWriteRetryDelayMinMs 100 -FileWriteRetryDelayMaxMs 1000Many numeric parameters accept very large values so advanced users can intentionally override limits.
This is probably a typo:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -RetryCount 100000By default, values above normal guardrails are rejected.
To intentionally allow them, add:
-AllowExtremeOperationalValuesIntentional extreme run:
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -RetryCount 100000 -AllowExtremeOperationalValuesNormal guardrails include:
| Parameter | Normal guardrail |
|---|---|
| `RetryCount` | `100` |
| `WaitSeconds` | `86400` |
| `TimeoutSeconds` | `86400` |
| `DelaySeconds` | `86400` |
| `SecondFetchWait` | `86400` |
| `ThrottleLimit` | `64` |
| `MaxPageContentMB` | `1024` |
| `RegexTimeoutSeconds` | `3600` |
| `MaxUrlLength` | `1048576` |
| `MaxRedirects` | `100` |
| `MaxRetryAfterSeconds` | `86400` |
| `FileWriteRetryCount` | `100` |
| `FileWriteRetryDelayMinMs` | `86400000` |
| `FileWriteRetryDelayMaxMs` | `86400000` |
| `FileMoveRetryCount` | `100` |
| `FileMoveRetryDelayMs` | `86400000` |
| `ConnectionLimit` | `10000` |
These are typo guardrails, not hard technical ceilings.
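Conceptually, each guardrail is a simple range check (a sketch, not the script's code):

```powershell
# Sketch of a typo guardrail: reject extreme values unless explicitly allowed.
function Test-Guardrail {
    param([int]$RetryCount = 3, [switch]$AllowExtremeOperationalValues)
    $guardrail = 100
    if ($RetryCount -gt $guardrail -and -not $AllowExtremeOperationalValues) {
        throw "RetryCount $RetryCount exceeds the typo guardrail of $guardrail; add -AllowExtremeOperationalValues to override."
    }
}
```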
Find-WebLinks refuses to run when important files would collide with each other. It checks dangerous combinations involving:
- Source file.
- Output file.
- CSV log file.
- Failed URL file.
- Progress file.
- Blacklist files.
Examples of refused combinations:
- Source file is the same as output file.
- Output file is the same as blacklist file.
- Log CSV is the same as output file.
- Failed URL file is the same as source file.
- Progress file is the same as output, source, log, failed URL, or blacklist file.
This is intentional. It prevents accidental data loss.
Show help:

```
.\Find-WebLinks.ps1 -Help
```

Start the guided command builder:

```
.\Find-WebLinks.ps1 -InteractiveHelp
```

Scan a single page:

```
.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*sport*" "bbc-links.txt" New Url
```

Append more results to the same file:

```
.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*politics*" "bbc-links.txt" Append Url
```

Create urls.txt:

```
https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport
https://www.bbc.co.uk/weather
```

Run:

```
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched-links.txt" New File
```

Include and exclude patterns:

```
.\Find-WebLinks.ps1 "urls.txt" -SearchPatterns "*download*","*game*" -ExcludePatterns "*demo*","*trailer*" -OutputFile "matched.txt" Append File
```

Resume with logging and failed-URL tracking:

```
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -LogCsv "run-log.csv" -FailedUrlFile "failed.txt" -Resume
```

Independent file modes:

```
.\Find-WebLinks.ps1 "urls.txt" "*news*" "matched.txt" Append File -LogCsv "run-log.csv" -LogMode New -FailedUrlFile "failed.txt" -FailedUrlMode New
```

Parallel processing:

```
.\Find-WebLinks.ps1 "urls.txt" "*download*" "matched.txt" Append File -ThrottleLimit 8
```

Maintenance during a run:

```
.\Find-WebLinks.ps1 "urls.txt" "*zip*" "matches.txt" Append File -DeduplicateWhen Start -SortWhen End
```

Standalone maintenance on a large file:

```
.\Find-WebLinks.ps1 -Command Deduplicate -Files .\huge.txt -MaintenanceLargeFileLimitMB 0
```

| Option | Default | Description |
|---|---|---|
| `-Help` / `-h` | off | Show built-in help and exit. |
| `-InteractiveHelp` / `-Interactive` | off | Start the guided command builder and exit. |
| `Source` | none | URL or file path to process. |
| `SearchPattern` | none | Main wildcard pattern. Optional if `-SearchPatterns` is used. |
| `-SearchPatterns` | none | One or more wildcard patterns. |
| `-SearchMode` | `Any` | `Any` = match any pattern. `All` = match every pattern. |
| `-ExcludePattern` | none | Main wildcard exclusion pattern. |
| `-ExcludePatterns` | none | One or more wildcard exclusion patterns. |
| `-ExcludeMode` | `Any` | `Any` = exclude if any exclusion pattern matches. `All` = exclude only if all match. |
| `OutputFile` / `-OutputFile` | none | File where matched links are saved. |
| `Mode` | `Append` | `Append` or `New` for the main output file. |
| `SourceType` | `Url` | `Url` or `File`. |
| `-RetryCount` | `3` | Number of retry attempts per URL. |
| `-WaitSeconds` | `30` | Seconds between retries. |
| `-TimeoutSeconds` | `120` | HTTP timeout per request. |
| `-DelaySeconds` | `5` | Delay between URLs in file mode. |
| `-SecondFetch` | `$true` | Fetch each URL twice and keep the larger response. |
| `-SecondFetchWait` | `5` | Seconds before the second fetch. |
| `-KeepDuplicates` | off | Keep repeated matches found within the same page. |
| `-NoDuplicates` | `$true` | Skip links already written or already in the output file. |
| `-BlacklistFile` | none | One or more exact-URL blacklist files. |
| `-BlacklistScope` | `Both` | Apply blacklist to `Input`, `Output`, or `Both`. |
| `-ThrottleLimit` | `1` | Number of URLs to process in parallel. Requires PowerShell 7+ when greater than 1. |
| `-Resume` | off | Resume a previous file-mode run using the progress file. |
| `-ProgressFile` | `<OutputFile>.progress` | Progress file for resume mode. |
| `-DeduplicateFiles` | off | Legacy deduplication switch. |
| `-KeepFragments` | off | Preserve URL fragments during deduplication. |
| `-Proxy` | none | HTTP proxy URL. |
| `-SortOutput` | `$false` | Legacy end-of-run output sorting switch. |
| `-Command` | `Run` | `Run`, `Deduplicate`, `Sort`, or `Maintain`. |
| `-Files` / `-MaintenanceFiles` | none | Files for standalone maintenance commands. |
| `-SortDirection` | `Ascending` | `Ascending` or `Descending`. |
| `-DeduplicateWhen` | `None` | `None`, `Start`, `End`, or `Both`. |
| `-SortWhen` | `None` | `None`, `Start`, `End`, or `Both`. |
| `-UserAgent` | Chrome-like UA | Custom User-Agent header. |
| `-LogCsv` | none | CSV file for per-URL processing statistics. |
| `-FailedUrlFile` | none | Tab-separated file for failed source URLs and errors. |
| `-LogMode` | `Append` | `Append` or `New` for the CSV log. |
| `-FailedUrlMode` | `Append` | `Append` or `New` for the failed URL file. |
| `-MaintenanceLargeFileLimitMB` | `1024` | Max MB for in-memory maintenance. `0` means no limit. |
| `-IgnoreMaintenanceLargeFileLimit` | off | Ignore the maintenance large-file safety limit. |
| `-MaxPageContentMB` | `50` | Maximum page body size to parse. `0` means no limit. |
| `-RegexTimeoutSeconds` | `10` | Regex timeout. `0` means no timeout. |
| `-MaxUrlLength` | `8192` | Maximum URL/key length before truncation. `0` means no limit. |
| `-MaxRedirects` | `10` | Maximum HTTP/meta-refresh redirects. |
| `-MaxRetryAfterSeconds` | `300` | Max server Retry-After wait honoured. `0` means ignore. |
| `-FileWriteRetryCount` | `5` | Retry count for appending output/log/progress lines. |
| `-FileWriteRetryDelayMinMs` | `50` | Minimum delay between append retries. |
| `-FileWriteRetryDelayMaxMs` | `300` | Maximum delay between append retries. |
| `-FileMoveRetryCount` | `5` | Retry count for replacing files after maintenance. |
| `-FileMoveRetryDelayMs` | `300` | Delay between file replace retries. |
| `-ConnectionLimit` | `100` | .NET HTTP connection limit. |
| `-AllowExtremeOperationalValues` | off | Allow values above normal typo guardrails. |
| `-HighFailureRatePercent` | `50` | Warn when file-mode failures reach this percent. `0` disables. |
Depending on the options used, the script may create:
```
matched-links.txt           Matched links
run-log.csv                 Per-URL processing log
failed.txt                  Failed source URLs and errors
matched-links.txt.progress  Resume progress file
```
You can open .txt files with any text editor. You can open .csv files with Excel, LibreOffice, Numbers, or similar tools.
Run this once:
```
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
```

Then run the script again.
Use the guided command builder:
```
.\Find-WebLinks.ps1 -InteractiveHelp
```

It will ask questions and print a command string. It will not run the command automatically.
This usually means a previous file-mode run was interrupted.
Use:

```
-Resume
```

or delete the progress file if you want to start fresh.
You can also specify a different progress file:
-ProgressFile "another-run.progress"Possible causes:
- The page does not contain matching links.
- The search pattern is too specific.
- The links are generated by JavaScript after the page loads.
- The website blocked the request.
- The page requires login or cookies.
- The links were excluded by `-ExcludePattern` or `-ExcludePatterns`.
- The links were removed by the blacklist.
- The links were skipped as duplicates.
Try a broader search:
.\Find-WebLinks.ps1 "https://www.bbc.co.uk/news" "*" "all-links.txt" New Url -LogCsv "run-log.csv"That is expected on modern sites that rely on JavaScript. Find-WebLinks does not execute JavaScript, click buttons, scroll pages, accept cookie banners, or wait for React, Vue, Angular, or other client-side frameworks to build the page.
Maintenance operations such as deduplication and sorting are protected by a default 1 GB safety limit.
Override it with:
```
-MaintenanceLargeFileLimitMB 0
```

or:

```
-IgnoreMaintenanceLargeFileLimit
```

During deduplication and sorting, the script writes temporary files beside the file being maintained. Their names look like:
```
download-now.txt.<PID>.dedup.tmp
download-now.txt.<PID>.sort.tmp
```
Version 1.6.1 cleans these files automatically when a maintenance write or replace operation fails. It also removes stale maintenance temporary files older than 60 minutes before running a new maintenance pass.
If a very old .dedup.tmp or .sort.tmp file remains after a crash, power loss, or manual termination, it is safe to delete it manually once the script is no longer running.
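If you want to sweep stale leftovers yourself, something like this works once you have confirmed the script is not running:

```powershell
# Manual sweep of maintenance temp files older than 60 minutes.
Get-ChildItem -Path . -File |
    Where-Object { $_.Name -match '\.(dedup|sort)\.tmp$' -and
                   $_.LastWriteTime -lt (Get-Date).AddMinutes(-60) } |
    Remove-Item -Verbose
```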
Parallel mode requires PowerShell 7+.
Check your version:
```
$PSVersionTable.PSVersion
```

If you are on Windows PowerShell 5.1, use the default sequential mode or install PowerShell 7+.
Maintenance reliability release.
- Fixed a cleanup issue where temporary deduplication files such as `download-now.txt.<PID>.dedup.tmp` could remain after a failed or interrupted maintenance operation.
- Fixed the equivalent cleanup path for temporary sorting files such as `download-now.txt.<PID>.sort.tmp`.
- Fixed failed deduplication and sorting writes so their temporary files are removed instead of being left beside the original file.
- Fixed failed replacement/move operations so temporary maintenance files are cleaned up when the final file replace does not complete.
- Added automatic cleanup of stale maintenance temporary files before running a new deduplication or sorting pass.
- Added safe cleanup handling for temporary maintenance files, so a cleanup failure cannot crash the whole run.
- Improved writer disposal safety in the sorting path.
- Improved maintenance-phase resilience when a run is interrupted, cancelled, or fails part-way through.
- Updated script version from `1.6.0` to `1.6.1`.
- No command-line parameter changes.
- No change to link extraction, matching, blacklist, resume, logging, failed-URL tracking, or interactive-help behaviour.
Find-WebLinks is a best-effort raw-response link extraction tool. It does not:
- Execute JavaScript.
- Render pages.
- Use a real browser engine.
- Click buttons.
- Accept cookie banners.
- Log into websites.
- Scroll pages.
- Wait for client-side frameworks to populate links.
- Bypass access controls.
If a link only appears after browser-side JavaScript runs, this script may not see it.
Use this tool responsibly. Respect website terms of service, robots.txt guidance where applicable, rate limits, copyright restrictions, and access controls.
Do not use it to overload websites or collect data you are not allowed to access.
This project is released under The Unlicense / public domain terms, as stated in the script header.