This project provides a hardware-accelerated URL extraction system. The NetFPGA reference router has been modified to identify HTTP packets containing URLs and send a copy of each to the host system. Software running on the host extracts URLs and search terms, stores them in a database, and displays them through a graphical user interface.
- Status :
- Version :
- Authors :
- NetFPGA base source :
- Install the URL Extraction Project.
The URL extraction system consists of two main components: hardware and software. The hardware component is an extended NetFPGA IPv4 reference router that identifies packets containing an HTTP GET request and sends a copy to the host. The software component is composed of three parts: the URL Extractor, a database, and a graphical user interface. The URL Extractor parses HTTP GET packets, extracts the contained URLs and search terms, and stores them in the database. The GUI queries the database for the top occurring URLs and search terms and displays them on-screen. A system diagram is shown below.
 
Image adapted from [1]
The packet life cycle of the URL extraction system can be explained in the following six-step sequence:
- A packet enters the NetFPGA through one of the gigabit Ethernet ports and is put in a MAC RxQ.
- It then traverses the User Data Path, which processes the packet to determine the output port, and places the packet in the TxQ corresponding to the output port. The User Data Path duplicates HTTP GET packets, sending a copy up to the host by placing it into the CPU TxQ, and forwarding the other along its normal path.
- The packet in the MAC TxQ is sent out onto the Ethernet, whereas the packet in the CPU TxQ is transferred across the PCI bus to the NetFPGA kernel driver.
- The URL Extractor software then receives the packet by reading from a socket bound to the NetFPGA software interface (nf2c0).
- The URL Extractor parses the HTTP GET packet and extracts the contained URL, storing it into the database. The URL is then checked for embedded search engine terms, and if found, they are also extracted and stored into the database.
- Finally, the GUI queries the database for the top occurring URLs and search terms, displaying them on-screen.
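Step 5 of the sequence above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual urlx implementation: the function names, the `SEARCH_PARAMS` table, and the choice of the `q` query parameter for Google are assumptions made for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical table mapping a search-engine host to the query
# parameter that carries the search terms.
SEARCH_PARAMS = {"www.google.com": "q"}


def extract_url(payload: bytes):
    """Return host + request target of an HTTP GET payload, or None."""
    lines = payload.decode("latin-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) < 2 or parts[0] != "GET":
        return None  # not a GET request line
    target = parts[1]
    host = ""
    for line in lines[1:]:
        if line == "":
            break  # blank line marks the end of the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    return host + target


def extract_search_terms(url: str):
    """Return the search terms embedded in url, or None if there are none."""
    split = urlsplit("//" + url)  # the "//" prefix makes urlsplit see a host
    param = SEARCH_PARAMS.get(split.hostname or "")
    if param is None:
        return None
    values = parse_qs(split.query).get(param)
    return values[0] if values else None
```

In this sketch the URL is extracted first and then inspected for search terms, mirroring the order described in step 5. Storing the two results is left out because the database schema is not shown here.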
Below is a screenshot of the output produced by the URL Extractor.
 
Below is a screenshot of the GUI. The left pane displays the top occurring URLs in the database. The right pane displays the top occurring Google search terms (some Asian characters distort the alignment of the search term counts).
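The "top occurring" lists in the GUI amount to a grouped count ordered by frequency. The sketch below shows one way such a query could look. It uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained, and the table name `urls` and its single `url` column are hypothetical; the real schema lives in url_tbl.sql and is not reproduced here.

```python
import sqlite3


def top_urls(conn, limit=10):
    """Return (url, hits) rows for the most frequently stored URLs."""
    return conn.execute(
        "SELECT url, COUNT(*) AS hits FROM urls "
        "GROUP BY url ORDER BY hits DESC, url ASC LIMIT ?",
        (limit,),
    ).fetchall()


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO urls (url) VALUES (?)",
    [
        ("a.example/",), ("b.example/x",), ("a.example/",),
        ("c.example/",), ("b.example/x",), ("a.example/",),
    ],
)
print(top_urls(conn, limit=2))
```

The secondary sort on `url` only makes the order of ties deterministic; it is a choice made for this example.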
 
The regression tests verify the functionality of the hardware component of the URL extraction system. To run the tests, the machine must be cabled as described in the Run Regression Tests section of the Guide.
After connecting the cables, ensure dhclient is not running. Then execute the following command to run the regression tests.
nf2_regress_test.pl --project url_extraction
The URL extraction router includes all of the reference router's regression tests, plus three new tests (below). The definitions of the reference router regression tests can be found on the Router Tests wiki page.
- Name :
- Initialize netfpga hardware (same as test_packet_forwarding)
- Send 20 Unix GET packets from eth1 to eth2 and nf2c0.
- Send 20 Unix GET packets from eth2 to eth1 and nf2c0.
- Check the number of forwarded packets register and verify the value is correct.
- Location
projects/url_extraction/regress/test_get_unix
- Output
SUCCESS!
- Name :
- Initialize netfpga hardware (same as test_packet_forwarding)
- Send 20 Windows GET packets from eth1 to eth2 and nf2c0.
- Send 20 Windows GET packets from eth2 to eth1 and nf2c0.
- Check the number of forwarded packets register and verify the value is correct.
- Location
projects/url_extraction/regress/test_get_win
- Output
SUCCESS!
- Name :
- Initialize netfpga hardware (same as test_packet_forwarding)
- Send 20 packets from eth1 to eth2 with an ip_len < MIN_LEN, and proto = TCP.
- Send 20 packets from eth2 to eth1 with an ip_len < MIN_LEN, and proto = TCP.
- Send 20 packets from eth1 to eth2 with an ip_len < MIN_LEN, and proto != TCP.
- Send 20 packets from eth2 to eth1 with an ip_len < MIN_LEN, and proto != TCP.
- Send 20 packets from eth1 to eth2 with an ip_len > MIN_LEN, and proto != TCP.
- Send 20 packets from eth2 to eth1 with an ip_len > MIN_LEN, and proto != TCP.
- Send 20 packets from eth1 to eth2 with an ip_len > MIN_LEN, proto = TCP, and dst port != HTTP.
- Send 20 packets from eth2 to eth1 with an ip_len > MIN_LEN, proto = TCP, and dst port != HTTP.
- Check the number of forwarded packets register and verify the value is correct.
- Location
projects/url_extraction/regress/test_get_nondup
- Output
SUCCESS!
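Taken together, the three tests above suggest the conditions under which the hardware duplicates a packet to the host. The following Python sketch expresses that decision as a predicate. It is inferred from the test descriptions only, not taken from the hardware source: the value of `MIN_LEN`, the behaviour when ip_len equals MIN_LEN, and the check that the payload starts with "GET" are all assumptions.

```python
IPPROTO_TCP = 6  # IANA protocol number for TCP
HTTP_PORT = 80   # well-known HTTP port
MIN_LEN = 40     # hypothetical threshold; the real MIN_LEN is not given here


def should_duplicate(ip_len: int, proto: int, dst_port: int, payload: bytes) -> bool:
    """Decide whether a packet would be copied to the host (sketch)."""
    if ip_len <= MIN_LEN:
        return False  # too short (equality case is an assumption)
    if proto != IPPROTO_TCP:
        return False  # not TCP
    if dst_port != HTTP_PORT:
        return False  # TCP, but not addressed to the HTTP port
    return payload.startswith(b"GET")  # assumed request-method check
```

Each early return corresponds to one family of packets in test_get_nondup that must be forwarded without a copy being sent to nf2c0.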
- Install packages required by software components:
yum install mysql-server mysql-devel gtk2-devel
- Start the MySQL server:
service mysqld start
- Set a password for the root database user:
mysqladmin -u root password netfpga
mysqladmin -u root --password reload
- Create the database:
mysqladmin -u root -p create db
- Create the database tables:
cd projects/url_extraction/sw/db
mysql -u root -p db < search_term_table.sql
mysql -u root -p db < url_tbl.sql
- Compile the URL Extractor from the source:
cd projects/url_extraction/sw/urlx
make
- Compile the GUI from the source:
cd projects/url_extraction/sw/gui
make
- Ensure that the NetFPGA kernel driver is loaded and that the CPCI has been reprogrammed.
- Download the URL extraction bitfile:
nf2_download url_extraction.bit
There are two main ways to configure the router:
- Using SCONE. Note: This has not been thoroughly tested. Connecting hosts on port MAC-0 may have unexpected effects, since SCONE will receive extra, unexpected GET packets. However, it has been tested and works in the testbed topology below.
- Statically configure all networking information using the CLI or Java GUI. Adjacent nodes will also require a static ARP entry for the router.
The URL Extractor is started by running the urlx binary. The interface_name argument specifies the network interface to receive GET packets from, e.g. nf2c0.
 cd projects/url_extraction/sw/urlx
 ./urlx
    Usage: ./urlx interface_name
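To illustrate what receiving GET packets from an interface such as nf2c0 involves, the Python sketch below opens a raw packet socket bound to a named interface and locates the TCP payload inside a received Ethernet frame. It is an assumption that urlx works this way; the function names are hypothetical, and the socket part is Linux-only and needs root privileges.

```python
import socket
import struct

ETH_P_ALL = 0x0003    # from <linux/if_ether.h>: receive every protocol
ETH_P_IP = 0x0800     # EtherType for IPv4
ETH_HEADER_LEN = 14
IPPROTO_TCP = 6


def open_capture_socket(ifname: str) -> socket.socket:
    """Open a raw packet socket bound to ifname (Linux only, needs root)."""
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.bind((ifname, 0))
    return sock


def tcp_payload(frame: bytes):
    """Return the TCP payload of an Ethernet/IPv4/TCP frame, or None."""
    if len(frame) < ETH_HEADER_LEN + 20:
        return None  # too short to hold an IPv4 header
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != ETH_P_IP:
        return None
    ip = frame[ETH_HEADER_LEN:]
    ihl = (ip[0] & 0x0F) * 4                     # IPv4 header length in bytes
    if ip[9] != IPPROTO_TCP:
        return None
    total_len = struct.unpack("!H", ip[2:4])[0]  # excludes Ethernet padding
    tcp = ip[ihl:total_len]
    if len(tcp) < 20:
        return None  # too short to hold a TCP header
    data_offset = (tcp[12] >> 4) * 4             # TCP header length in bytes
    return tcp[data_offset:]


# Example use (not executed here, since it needs root and a real interface):
#   sock = open_capture_socket("nf2c0")
#   payload = tcp_payload(sock.recv(65535))
```

Both header lengths are read from the packet rather than hard-coded, because IP and TCP options change the offset at which the GET request starts.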
The GUI is started by running the gui binary:
cd projects/url_extraction/sw/gui
./gui
The URL extraction system can be tested using the topology below. The NetFPGA interfaces use IP addresses 192.168.x.1, where 'x' is the interface number (starting at 1). Connect the PC to the 2nd NetFPGA port and the NAT router to the 3rd NetFPGA port.
 
On the host system :
- Run SCONE. The cpuhw and rtable files have been provided for this topology (projects/url_extraction/sw/scone). The NetFPGA has a default route through the NAT router.
rtable: 0.0.0.0 192.168.3.2 0.0.0.0 eth2
- Run the URL Extractor:
./urlx nf2c0
- Run the GUI:
./gui
- View accessed URLs and search terms.
- On the PC, set the default route to 192.168.2.1.
- Start web browsing.
- The NAT router can be a PC running iptables or a home router/gateway. Here is a sample shell script to enable NAT, where eth1 is connected to the LAN and eth0 has the public IP:
#!/bin/sh
# Let the kernel forward packets between interfaces.
echo "Enabling IP forwarding..."
echo 1 > /proc/sys/net/ipv4/ip_forward
# Remove any existing rules from the filter table.
echo "Flushing iptables..."
iptables -F
# Forward LAN traffic and masquerade it behind the public address.
echo "Adding iptables rules..."
iptables -A FORWARD -i eth1 -j ACCEPT
iptables -A FORWARD -o eth1 -j ACCEPT
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
[1] J.W. Lockwood, J. Naous, G. Gibb. (2008, Aug) Building Gigabit-rate Routers with the NetFPGA: NICTA Tutorial at UNSW. Sydney, Australia.