@gabearro gabearro commented Sep 1, 2025

Changes:

  • Makes the proxy-matching regex more flexible so that it handles commonly found URI formats.
  • Deduplicates scraped and tested proxies by their exit IP.

Deduplication is keyed on the exit IP because proxy servers with different IPs often exit through the same server.

I improved the regex because the scraper/checker was failing to identify my proxies.

@monosans monosans requested a review from Copilot September 1, 2025 17:21
Copilot

This comment was marked as outdated.

@gabearro
Author

gabearro commented Sep 2, 2025

@monosans could you trigger this workflow please

@gabearro
Author

gabearro commented Sep 2, 2025

@monosans do you understand what is not passing in the newly failed rustfmt? It doesn't seem like there are actually syntax errors unless I'm blind

@monosans monosans requested a review from Copilot September 6, 2025 08:05

@Copilot Copilot AI left a comment

Pull Request Overview

This PR improves proxy discovery and deduplication by enhancing the regex pattern for proxy detection and implementing deduplication based on exit IP addresses. The changes address issues with proxy identification and prevent duplicate proxies that use different endpoints but share the same exit server.

  • Enhanced regex pattern to support more flexible proxy URI formats with expanded character sets for usernames/passwords
  • Added deduplication logic to remove proxies that share the same exit IP address within the same protocol
  • Minor code organization improvements

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File            Description
src/parsers.rs  Updated proxy regex pattern to support more characters in credentials and removed trailing whitespace
src/output.rs   Added exit IP-based deduplication logic and reorganized imports

pub static PROXY_REGEX: LazyLock<fancy_regex::Regex> = LazyLock::new(|| {
-    let pattern = r"(?:^|[^0-9A-Za-z])(?:(?P<protocol>https?|socks[45]):\/\/)?(?:(?P<username>[0-9A-Za-z]{1,64}):(?P<password>[0-9A-Za-z]{1,64})@)?(?P<host>[A-Za-z][\-\.A-Za-z]{0,251}[A-Za-z]|[A-Za-z]|(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3}):(?P<port>[0-9]|[1-9][0-9]{1,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])(?=[^0-9A-Za-z]|$)";
+    let pattern = r"(?:^|[^0-9A-Za-z])(?:(?P<protocol>https?|socks[45]):\/\/)?(?:(?P<username>[0-9A-Za-z._~\-]{1,256}):(?P<password>[0-9A-Za-z._~\-]{1,256})@)?(?P<host>[A-Za-z][\-\.A-Za-z]{0,251}[A-Za-z]|[A-Za-z]|(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3}):(?P<port>[0-9]|[1-9][0-9]{1,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])(?=[^0-9A-Za-z]|$)";

Copilot AI Sep 6, 2025

The regex escapes the hyphen in the username and password character classes and also places it last. Either measure on its own keeps the hyphen from being read as a range operator, so the escape is redundant: [0-9A-Za-z._~\-] can be simplified to [0-9A-Za-z._~-].
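
For reference, the widened credential class adds ".", "_", "~" and "-" to the ASCII letters and digits, which together form the unreserved character set of RFC 3986, and raises the length limit from 64 to 256. The following is a minimal std-only sketch of what that class accepts when applied to a whole string. The function names is_credential_char and is_valid_credential are hypothetical and do not appear in the PR; the real pattern is unanchored and runs inside fancy_regex, so this only approximates the credential part.

```rust
/// Mirrors the character class `[0-9A-Za-z._~\-]` from the new pattern.
/// Hypothetical helper, not part of the PR.
fn is_credential_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '.' | '_' | '~' | '-')
}

/// Mirrors `[0-9A-Za-z._~\-]{1,256}` applied to an entire string.
/// Every accepted character is ASCII, so the byte length equals the
/// character count whenever the second check passes.
fn is_valid_credential(s: &str) -> bool {
    (1..=256).contains(&s.len()) && s.chars().all(is_credential_char)
}
```

Under this sketch, "user-name_1.2~x" is accepted, while an empty string, "p@ss", or a 257-character string is rejected. The old class would also have rejected "user-name_1.2~x" because of the punctuation.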

Comment on lines +79 to +80
let mut seen: std::collections::HashSet<(ProxyType, String)> =
    std::collections::HashSet::new();

Copilot AI Sep 6, 2025

[nitpick] The deduplication logic spells out std::collections::HashSet in full, whereas the rest of the codebase imports HashMap through a project alias. For consistency, consider importing HashSet at the top of the file or adding a HashSet alias that matches the existing pattern.

Comment on lines +82 to +90
for p in proxies {
    if let Some(ip) = &p.exit_ip {
        let key = (p.protocol, ip.clone());
        if !seen.insert(key) {
            continue;
        }
    }
    deduped.push(p);
}

Copilot AI Sep 6, 2025

The deduplication logic clones the IP string for each proxy when creating the key. Consider using a reference to avoid unnecessary string allocations: change the HashSet type to HashSet<(ProxyType, &str)> and use &ip instead of ip.clone() in the key.
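
One caveat with the borrowed-key suggestion: if proxies yields owned values, as deduped.push(p) suggests, a set holding &str borrowed from p would still be alive when p is moved into deduped, and the borrow checker would likely reject that. A possible way around it is a two-pass version: decide what to keep while only borrowing, then consume the vector. The sketch below assumes hypothetical ProxyType and Proxy definitions, since the real ones are not shown in this thread, and the function name dedup_by_exit_ip is also made up for illustration.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for the project's types; the real definitions
// are not shown in this thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ProxyType {
    Http,
    Socks4,
    Socks5,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Proxy {
    pub protocol: ProxyType,
    pub host: String,
    pub port: u16,
    pub exit_ip: Option<String>,
}

/// Keeps the first proxy seen for each (protocol, exit IP) pair.
/// Proxies without a known exit IP are always kept, as in the PR's loop.
pub fn dedup_by_exit_ip(proxies: Vec<Proxy>) -> Vec<Proxy> {
    // Pass 1: only borrow the exit IPs and record a keep/drop decision
    // per proxy. `seen` is dropped at the end of this block, which ends
    // every borrow of `proxies`.
    let keep: Vec<bool> = {
        let mut seen: HashSet<(ProxyType, &str)> = HashSet::new();
        let mut keep = Vec::with_capacity(proxies.len());
        for p in &proxies {
            keep.push(match p.exit_ip.as_deref() {
                // `insert` returns false if the key was already present.
                Some(ip) => seen.insert((p.protocol, ip)),
                None => true,
            });
        }
        keep
    };

    // Pass 2: consume the vector and keep only the flagged entries,
    // preserving the original order.
    proxies
        .into_iter()
        .zip(keep)
        .filter_map(|(p, k)| k.then_some(p))
        .collect()
}
```

This trades one String clone per proxy for a Vec<bool> of the same length. Whether that is worth the extra pass depends on how many proxies are processed; for small lists the original ip.clone() version is simpler and probably fast enough.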

@monosans
Owner

monosans commented Sep 6, 2025

@monosans do you understand what is not passing in the newly failed rustfmt? It doesn't seem like there are actually syntax errors unless I'm blind

The imports are not sorted, and one comment line is longer than 80 characters.
