
C URL Library : a set of C functions to manipulate URL strings and achieve URL canonicalization as described by Google in its Safe Browsing developer's guide


kirabou/Url-Canonicalization


URL CANONICALIZATION AND UTILITIES FUNCTIONS

This is a set of C functions to manipulate URL strings (in an RFC 3986 compliant
way) and achieve URL canonicalization as described by Google in its Safe
Browsing developer's guide. See :

https://developers.google.com/safe-browsing/developers_guide_v3#Canonicalization

It includes the following functions (see url.h for full documentation) :

- url_RemoveTabCRLF() : removes leading and trailing spaces, as well as tab
  (0x09), CR (0x0d), and LF (0x0a) characters from the URL.

- url_RemoveFragment() : removes the fragment part of a URL.

- url_RemoveQuery() : removes the query part of a URL.

- url_Unescape() : percent-decodes a URL, calling itself until there is no
  more percent-decoding to be done.

- url_Normalize() : applies URL normalization rules as described in Google Safe
  Browsing Developer's Guide.

- url_Escape() : percent-encodes a URL. Reserved characters (as defined by
  RFC 3986, that is any of "!*'();:@&=+$,/?#[]") are not encoded.

- url_EscapeIncludingReservedChars() : percent-encodes a string, including
  reserved characters.

- url_Encode() : encodes a string to be compliant with 
  application/x-www-form-urlencoded format. Same as 
  url_EscapeIncludingReservedChars() but also replaces spaces with '+'.

- url_Canonicalize() : canonicalizes a URL by successively calling
  url_Normalize(), then url_Escape() to get a percent-encoded URL.

- url_CanonicalizeWithFullEscape() : canonicalizes a URL (same as
  url_Canonicalize()) but the returned string is fully percent-encoded,
  including the reserved characters.

- url_ParseNextKeyValuePair() : allows parsing of a "key=value&key=value&..."
  string.

- url_Split() : splits a URL into its scheme, link and query parts, as defined
  by RFC 3986.

- url_GetHostname() : returns the hostname part extracted from a URL.

- url_GetBase() : returns the base part of a URL.

- url_MakeAbsolute() : turns a relative URL into an absolute URL.
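
The repeated percent-decoding and reserved-character-preserving
percent-encoding described in the list above can be sketched as follows. This
is only an illustration of the technique, not the code from url.c: the names
sketch_unescape_once(), sketch_unescape() and sketch_escape() are hypothetical,
their signatures are assumptions and do not reflect the declarations in url.h,
and leaving "%00" undecoded is a simplification chosen here to keep the C
string intact.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of url.h): value of a hex digit, or -1. */
int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* One in-place decoding pass. Returns the number of %XX sequences decoded.
 * "%00" is left as-is (assumption of this sketch) so the result stays a
 * valid C string. */
int sketch_unescape_once(char *s)
{
    char *out = s;
    int decoded = 0;

    while (*s) {
        int hi, lo;
        if (s[0] == '%'
            && (hi = hex_value((unsigned char)s[1])) >= 0
            && (lo = hex_value((unsigned char)s[2])) >= 0
            && (hi * 16 + lo) != 0) {
            *out++ = (char)(hi * 16 + lo);
            s += 3;
            decoded++;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
    return decoded;
}

/* Decode repeatedly until a pass finds nothing left to decode. Each
 * successful pass shortens the string, so the loop terminates. */
void sketch_unescape(char *s)
{
    while (sketch_unescape_once(s) > 0)
        ;
}

/* Percent-encode src into dst (capacity dst_size, including the final NUL).
 * Letters, digits, "-._~" and the reserved characters are copied unchanged;
 * every other byte becomes %XX. Returns 0 on success, -1 if dst is too small. */
int sketch_escape(const char *src, char *dst, size_t dst_size)
{
    static const char keep_set[] = "-._~" "!*'();:@&=+$,/?#[]";
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    if (dst_size == 0) return -1;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        int keep = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z') || strchr(keep_set, c) != NULL;
        if (keep) {
            if (n + 1 >= dst_size) return -1;
            dst[n++] = (char)c;
        } else {
            if (n + 3 >= dst_size) return -1;
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 15];
        }
    }
    dst[n] = '\0';
    return 0;
}
```

With these sketch functions, calling sketch_unescape() on
"http://host/%25257Euser/a%20b" takes three decoding passes and leaves
"http://host/~user/a b"; passing that result to sketch_escape() then yields
"http://host/~user/a%20b".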


All these functions are intended to be thread safe. They were tested with
Valgrind to find and fix memory leaks.

test_url.c implements the tests provided by Google in its documentation to
help validate a canonicalization implementation. 

test_url.c also provides examples of basic use of the provided functions.

To compile : gcc -std=c99 test_url.c url.c -o test_url
To run tests : ./test_url
