Better regex for finding proxies + deduplication of proxies based on exit ip #765
@monosans could you trigger this workflow, please?
@monosans do you understand what is failing in the new rustfmt check? It doesn't seem like there are any actual syntax errors, unless I'm blind.
Pull Request Overview
This PR improves proxy discovery and deduplication by enhancing the regex pattern for proxy detection and implementing deduplication based on exit IP addresses. The changes address issues with proxy identification and prevent duplicate proxies that use different endpoints but share the same exit server.
- Enhanced regex pattern to support more flexible proxy URI formats with expanded character sets for usernames/passwords
- Added deduplication logic to remove proxies that share the same exit IP address within the same protocol
- Minor code organization improvements
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
src/parsers.rs | Updated proxy regex pattern to support more characters in credentials and removed trailing whitespace |
src/output.rs | Added exit IP-based deduplication logic and reorganized imports |
  pub static PROXY_REGEX: LazyLock<fancy_regex::Regex> = LazyLock::new(|| {
-     let pattern = r"(?:^|[^0-9A-Za-z])(?:(?P<protocol>https?|socks[45]):\/\/)?(?:(?P<username>[0-9A-Za-z]{1,64}):(?P<password>[0-9A-Za-z]{1,64})@)?(?P<host>[A-Za-z][\-\.A-Za-z]{0,251}[A-Za-z]|[A-Za-z]|(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3}):(?P<port>[0-9]|[1-9][0-9]{1,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])(?=[^0-9A-Za-z]|$)";
+     let pattern = r"(?:^|[^0-9A-Za-z])(?:(?P<protocol>https?|socks[45]):\/\/)?(?:(?P<username>[0-9A-Za-z._~\-]{1,256}):(?P<password>[0-9A-Za-z._~\-]{1,256})@)?(?P<host>[A-Za-z][\-\.A-Za-z]{0,251}[A-Za-z]|[A-Za-z]|(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3}):(?P<port>[0-9]|[1-9][0-9]{1,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])(?=[^0-9A-Za-z]|$)";
Copilot
AI
Sep 6, 2025
The regex allows hyphens in usernames and passwords without escaping them in the character class. In regex character classes, hyphens should be escaped or placed at the beginning/end to avoid being interpreted as a range operator. Consider changing [0-9A-Za-z._~\-] to [0-9A-Za-z._~-] or [0-9A-Za-z._~\-] where the hyphen is properly positioned.
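For reference, here is a small std-only sketch of what the credential class `[0-9A-Za-z._~\-]{1,256}` from the new pattern accepts, with the hyphen treated as a literal character. The function name `is_allowed_credential` is hypothetical and not part of the PR; it only mirrors the character class and length bounds, not the rest of the proxy pattern.

```rust
/// Hypothetical helper (not from the PR): returns true if `s` consists only of
/// characters in `[0-9A-Za-z._~\-]` and is between 1 and 256 characters long.
pub fn is_allowed_credential(s: &str) -> bool {
    // All accepted characters are ASCII, so the byte length equals the
    // character count whenever the character check below passes.
    let len_ok = (1..=256).contains(&s.len());
    len_ok
        && s.bytes().all(|b| {
            // ASCII letters and digits, plus the four literal punctuation characters.
            b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'~' | b'-')
        })
}
```

With this reading, a credential such as `user-name_1.0~` is accepted, while an empty string or one containing `@` is not.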
Copilot uses AI. Check for mistakes.
  let mut seen: std::collections::HashSet<(ProxyType, String)> =
      std::collections::HashSet::new();
Copilot
AI
Sep 6, 2025
[nitpick] The deduplication logic uses std::collections::HashSet directly instead of the project's HashMap alias. For consistency with the existing codebase that imports HashMap, consider using std::collections::HashSet consistently or adding a HashSet alias to match the pattern.
  for p in proxies {
      if let Some(ip) = &p.exit_ip {
          let key = (p.protocol, ip.clone());
          if !seen.insert(key) {
              continue;
          }
      }
      deduped.push(p);
  }
Copilot
AI
Sep 6, 2025
The deduplication logic clones the IP string for each proxy when creating the key. Consider using a reference to avoid unnecessary string allocations: change the HashSet type to HashSet<(ProxyType, &str)> and use &ip instead of ip.clone() in the key.
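To make the deduplication under discussion easier to follow, here is a minimal self-contained sketch of it. The `ProxyType` and `Proxy` definitions and the function name `dedup_by_exit_ip` are hypothetical stand-ins, since the project's real types are not shown in this diff. The sketch keeps the cloned `String` key: the loop moves each proxy into the output vector, so a `&str` key borrowed from that same proxy would have to outlive the move.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the project's protocol enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ProxyType {
    Http,
    Socks4,
    Socks5,
}

/// Hypothetical stand-in for the project's proxy struct.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Proxy {
    pub protocol: ProxyType,
    pub host: String,
    pub port: u16,
    pub exit_ip: Option<String>,
}

/// Keeps the first proxy seen for each (protocol, exit IP) pair.
/// Proxies without a known exit IP are always kept.
pub fn dedup_by_exit_ip(proxies: Vec<Proxy>) -> Vec<Proxy> {
    let mut seen: HashSet<(ProxyType, String)> = HashSet::new();
    let mut deduped = Vec::with_capacity(proxies.len());
    for p in proxies {
        if let Some(ip) = &p.exit_ip {
            // `insert` returns false when the key was already present.
            if !seen.insert((p.protocol, ip.clone())) {
                continue;
            }
        }
        deduped.push(p);
    }
    deduped
}
```

Two HTTP proxies with the same exit IP collapse to the first one, while a SOCKS5 proxy with that exit IP is kept because the key includes the protocol.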
The imports are not sorted and the comment is longer than 80 characters in one line.
Changes:
- Added deduplication based on the exit IP, because proxy servers often have different IPs but actually exit through the same server.
- Improved the regex, because I was having trouble getting the scraper/checker to identify my proxies.