This tool parses an HTML site and returns a list of CSS and JS files found on external Content Delivery Networks (CDNs). In addition, <img> tags are scanned for further domains, which are collected in a file hostnames.txt.
The module is installed by running:
$ pip install -f https://dist.pubsys.hbz-nrw.de hbz.cdntools
Update to a new version with:
$ pip install -f https://dist.pubsys.hbz-nrw.de -U hbz.cdntools
Under Ubuntu 14.04, pip does not work for unknown reasons, but easy_install does:
$ easy_install -f https://dist.pubsys.hbz-nrw.de -U hbz.cdntools
Dependencies like the Requests and BeautifulSoup libraries are installed automatically.
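After installation the cdnparse command should be available on the PATH; a quick check is to print the version with the --version flag documented below:

$ cdnparse --version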
Usage:
$ cdnparse -h
usage: cdnparse [-h] [-a] [-k] [-n] [-c COOKIES] [-u USERAGENT] [-l LOGFILE]
                [--version]
                url

CDN gathering

positional arguments:
  url                   URL of website

optional arguments:
  -h, --help            show this help message and exit
  -a, --all             include also local css/js
  -k, --keep            keep the downloaded HTML file
  -n, --no-check-certificate
                        do not validate SSL certificates
  -c COOKIES, --cookies COOKIES
                        Cookiestring
  -u USERAGENT, --useragent USERAGENT
                        User Agent (default: Google Chrome)
  -l LOGFILE, --logfile LOGFILE
                        Name of the logfile (default: cdnparse.log)
  --version             show program's version number and exit
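A typical call needs only the URL of the website. The following run is purely illustrative (URL and results are invented for this example); cdnparse prints the found URLs one per line, which is the format the wrapper script below relies on:

$ cdnparse http://www.example.org/
https://code.jquery.com/jquery-1.11.0.min.js
https://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/css/bootstrap.min.css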
Together with wpull, the output of the cdnparse command can be used to harvest a complete website, including external JavaScript and stylesheets from CDN servers. The following bash script wraps wpull and cdnparse together. In a first step a WARC file is created that contains only the JS and CSS files. Note that wpull is called non-recursively: each wpull call fetches just a single file, but appends it to the same WARC file.
In the second step the actual website is crawled recursively and collected in the WARC file from the first step.
#!/bin/bash
SITE=$1
# Host part of the URL, e.g. "www.example.org" from "http://www.example.org/path"
HOSTNAME=`echo "$SITE" | cut -d"/" -f3`
TIMESTAMP=`date +"%Y%m%d%H%M%S"`
NAME="$HOSTNAME-$TIMESTAMP"
# Adjust these paths if the tools live in a virtualenv, e.g.:
# CDNPARSE=/opt/regal/python3/bin/cdnparse
CDNPARSE=cdnparse
# WPULL=/opt/regal/python3/bin/wpull
WPULL=wpull
# All CSS/JS URLs found on the site, one per line (-a includes local files too)
cdns=`$CDNPARSE -a "$SITE"`
echo "##### Gathering CSS and JS #####"
for cdn in $cdns
do
  $WPULL --warc-file $NAME \
      --no-check-certificate \
      --no-robots \
      --delete-after \
      --tries=5 \
      --waitretry=20 \
      --random-wait \
      --strip-session-id \
      --warc-append \
      --database $NAME.db \
      "$cdn"
done
echo "##### Gathering site #####"
$WPULL --warc-file $NAME \
--recursive \
--tries=5 \
--waitretry=20 \
--random-wait \
--link-extractors=javascript,html,css \
--escaped-fragment \
--strip-session-id \
--no-host-directories \
--page-requisites \
--no-parent \
--database $NAME.db \
--no-check-certificate \
--no-directories \
--delete-after \
--convert-links \
--span-hosts \
--hostnames="$HOSTNAME" \
$SITE
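Assuming the script is saved as harvest.sh (the name is arbitrary) and made executable, a crawl is started with the site URL as the only argument:

$ chmod +x harvest.sh
$ ./harvest.sh http://www.example.org/

Since cdnparse also writes the additional domains found in <img> tags to hostnames.txt, a possible extension, sketched here untested, is to merge those domains into the value passed to --hostnames. This assumes hostnames.txt contains one domain per line and that wpull's --hostnames option accepts a comma-separated list:

# Untested sketch: extend the --hostnames value with the domains that
# cdnparse collected in hostnames.txt (assumed: one domain per line)
if [ -f hostnames.txt ]; then
    EXTRA=`paste -sd, hostnames.txt`
    HOSTLIST="$HOSTNAME,$EXTRA"
else
    HOSTLIST="$HOSTNAME"
fi
# ...then pass --hostnames="$HOSTLIST" to the second wpull call instead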