hbz.cdntools

This tool parses an HTML page and returns a list of the CSS and JS files it loads from external Content Delivery Networks (CDNs). In addition, <img> tags are scanned for further domains, which are collected in a file hostnames.txt.

The module is installed by running:

$ pip install -f https://dist.pubsys.hbz-nrw.de hbz.cdntools

Update to a new version with:

$ pip install -f https://dist.pubsys.hbz-nrw.de -U hbz.cdntools

Under Ubuntu 14.04, pip does not work for unknown reasons, but easy_install does:

$ easy_install -f https://dist.pubsys.hbz-nrw.de -U hbz.cdntools

Dependencies such as the Requests and BeautifulSoup libraries are installed automatically.
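
The following is a minimal sketch of the kind of scan cdnparse performs, built on those same two libraries. It is illustrative only, not the tool's actual implementation, and the URL is a placeholder:

# cdn_sketch.py -- illustrative only, not the actual cdnparse implementation
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

url = "https://www.example.org/"   # placeholder URL
own_host = urlparse(url).hostname

soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

cdn_files = []     # external CSS/JS, as cdnparse prints them
img_hosts = set()  # further domains from <img> tags, as in hostnames.txt

# stylesheets (<link rel="stylesheet" href=...>) and scripts (<script src=...>)
for tag in soup.find_all(["link", "script"]):
    if tag.name == "link" and "stylesheet" not in (tag.get("rel") or []):
        continue
    src = tag.get("href") or tag.get("src")
    host = urlparse(src or "").hostname
    if host and host != own_host:  # relative URLs (host is None) are local files
        cdn_files.append(src)

# <img> tags contribute only their hostnames
for img in soup.find_all("img"):
    host = urlparse(img.get("src") or "").hostname
    if host and host != own_host:
        img_hosts.add(host)

print("\n".join(cdn_files))
with open("hostnames.txt", "w") as out:
    out.write("\n".join(sorted(img_hosts)) + "\n")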

Usage:

$ cdnparse -h

    usage: cdnparse [-h] [-a] [-k] [-n] [-c COOKIES] [-u USERAGENT] [-l LOGFILE]
                    [--version]
                    url

    CDN gathering

    positional arguments:
      url                   URL of website

    optional arguments:
      -h, --help            show this help message and exit
      -a, --all             include also local css/js
      -k, --keep            keep the downloaded HTML file
      -n, --no-check-certificate
                            do not validate SSL certificates
      -c COOKIES, --cookies COOKIES
                            Cookiestring
      -u USERAGENT, --useragent USERAGENT
                            User Agent (default: Google Chrome)
      -l LOGFILE, --logfile LOGFILE
                            Name of the logfile (default: cdnparse.log)
      --version             show program's version number and exit
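
For example, to scan a site without validating SSL certificates and to keep the downloaded HTML file for inspection (the URL is a placeholder):

$ cdnparse -k -n https://www.example.org/

The found CDN URLs are printed to standard output, further <img> domains are collected in hostnames.txt, and log messages are written to cdnparse.log.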

Together with wpull, the output of the cdnparse command can be used to harvest a complete website, including the external JavaScript and stylesheets from CDN servers. The following bash script wraps wpull and cdnparse together. In a first step, a WARC is created containing only the JS and CSS files. Note that wpull is called non-recursively here: each wpull call fetches just one single file, but all of them are appended to the same WARC file.

In the second step, the actual website is recursively crawled and collected in the WARC from the first step.

#!/bin/bash

SITE=$1
HOSTNAME=$(echo "$SITE" | cut -d"/" -f3)   # host part of the URL
TIMESTAMP=$(date +"%Y%m%d%H%M%S")
NAME="$HOSTNAME-$TIMESTAMP"                # basename for the WARC and the database

# CDNPARSE=/opt/regal/python3/bin/cdnparse
CDNPARSE=cdnparse
# WPULL=/opt/regal/python3/bin/wpull
WPULL=wpull

# collect the CSS/JS URLs found by cdnparse (-a includes local ones as well)
cdns=$($CDNPARSE -a "$SITE")

echo "##### Gathering CSS and JS #####"
for cdn in $cdns
do
   $WPULL --warc-file $NAME \
          --no-check-certificate \
          --no-robots \
          --delete-after \
          --tries=5 \
          --waitretry=20 \
          --random-wait \
          --strip-session-id \
          --warc-append \
          --database $NAME.db \
          "$cdn"
done

echo "##### Gathering site #####"

$WPULL --warc-file $NAME \
       --recursive \
       --tries=5 \
       --waitretry=20 \
       --random-wait \
       --link-extractors=javascript,html,css \
       --escaped-fragment \
       --strip-session-id \
       --no-host-directories \
       --page-requisites \
       --no-parent \
       --database $NAME.db \
       --no-check-certificate \
       --no-directories \
       --delete-after \
       --convert-links  \
       --span-hosts \
       --hostnames="$HOSTNAME" \
       "$SITE"
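
Assuming the script is saved as harvest.sh (the file name is arbitrary) and made executable, a harvest is started with:

$ ./harvest.sh https://www.example.org/

Both steps then write into a single WARC named after the host and the timestamp, with the crawl state kept in the matching $NAME.db file.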
