A concurrent web crawler written in Go.
```
go get github.com/rapidclock/web-octopus/octopus
go get github.com/rapidclock/web-octopus/adapter
```
- Depth-limited crawling
- User-specified valid protocols
- User-buildable adapters that the crawler feeds its output to
- Duplicate filtering (default, non-customizable)
- Filtering of URLs that fail a HEAD request (default, non-customizable)
- User-specifiable maximum timeout between two successive URL requests
- Maximum number of links to be crawled
```go
package main

import (
	"github.com/rapidclock/web-octopus/adapter"
	"github.com/rapidclock/web-octopus/octopus"
)

func main() {
	// Print crawled links to standard output.
	opAdapter := &adapter.StdOpAdapter{}

	// Start from the defaults and override individual options.
	options := octopus.GetDefaultCrawlOptions()
	options.MaxCrawlDepth = 3
	options.TimeToQuit = 10
	options.CrawlRatePerSec = 5
	options.CrawlBurstLimitPerSec = 8
	options.OpAdapter = opAdapter

	crawler := octopus.New(options)
	crawler.SetupSystem()
	crawler.BeginCrawling("https://www.example.com")
}
```
Customizations can be made by supplying the crawler an instance of `CrawlOptions`. The basic structure is shown below, with a brief explanation of each option.
```go
type CrawlOptions struct {
	MaxCrawlDepth         int64         // Max depth of the crawl; 0 is the initial link.
	MaxCrawledUrls        int64         // Max number of links to be crawled in total.
	StayWithinBaseHost    bool          // [Not-Implemented-Yet]
	CrawlRatePerSec       int64         // Max rate at which requests can be made (req/sec).
	CrawlBurstLimitPerSec int64         // Max burst capacity (should be at least the crawl rate).
	RespectRobots         bool          // [Not-Implemented-Yet]
	IncludeBody           bool          // Include the response body (contents of the web page) in the crawl result.
	OpAdapter             OutputAdapter // A user-defined crawl output handler (see the next section).
	ValidProtocols        []string      // Valid protocols to crawl (http, https, ftp, etc.).
	TimeToQuit            int64         // Timeout (seconds) between two successive requests before the crawler quits.
}
```
A default instance of `CrawlOptions` can be obtained by calling `octopus.GetDefaultCrawlOptions()`. This can be further customized by overriding individual properties.

NOTE: If rate-limiting is not required, simply leave both `CrawlRatePerSec` and `CrawlBurstLimitPerSec` unset in the `CrawlOptions`.
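For example, here is a short sketch of customizing the defaults using only the fields documented above; the rate-limiting fields are left unset, so no rate-limiting is applied:

```go
options := octopus.GetDefaultCrawlOptions()
options.MaxCrawledUrls = 100                       // stop after 100 links in total
options.ValidProtocols = []string{"http", "https"} // skip other URL schemes
options.IncludeBody = true                         // deliver page contents with each result
// CrawlRatePerSec and CrawlBurstLimitPerSec are not set: no rate-limiting.
```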
An Output Adapter is the final destination of a request the crawler has processed. The crawler's output is fed to it, according to the customizations made through the `CrawlOptions` attached to the crawler before starting it. The `OutputAdapter` is a Go interface that has to be implemented by your (user-defined) processor.
```go
type OutputAdapter interface {
	Consume() *NodeChSet
}
```
The user has to implement the `Consume()` method, which returns a pointer to a `NodeChSet` (described below). The crawler uses the returned channel to send the crawl output, and the user can then start listening for it.
Note: If you choose to implement a custom `OutputAdapter`, REMEMBER to listen for the output on another goroutine; otherwise you might block the crawler from running. At the very least, begin the crawling on another goroutine before you begin processing output.
The structure of the `NodeChSet` is given below.
```go
type NodeChSet struct {
	NodeCh chan<- *Node // channel on which the crawler sends crawl results
	*StdChannels
}

type StdChannels struct {
	QuitCh chan<- int // signals the listener to stop
}

type Node struct {
	*NodeInfo
	Body io.ReadCloser // page contents (see the IncludeBody option)
}

type NodeInfo struct {
	ParentUrlString string // URL of the page on which this link was found
	UrlString       string // the crawled URL
	Depth           int64  // crawl depth; 0 is the initial link
}
```
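Because `Node.Body` is an `io.ReadCloser`, a consumer that enables `IncludeBody` should drain and close it. A minimal sketch (the `readNodeBody` helper is hypothetical, not part of the library):

```go
import (
	"io"

	"github.com/rapidclock/web-octopus/octopus"
)

// readNodeBody is a hypothetical consumer-side helper that reads the page
// contents of a crawl result and closes the underlying reader.
func readNodeBody(node *octopus.Node) ([]byte, error) {
	if node.Body == nil { // no body delivered (e.g. IncludeBody was false)
		return nil, nil
	}
	defer node.Body.Close()
	return io.ReadAll(node.Body)
}
```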
You can use the utility function `MakeDefaultNodeChSet()` to get a `NodeChSet` built for you. This also returns the node and quit channels. Example given below:

```go
var opNodeChSet *NodeChSet
var nodeCh chan *Node
var quitCh chan int
// The explicit declarations above just demonstrate the types; in practice
// you can use Go's short variable declaration (:=) and let the types be inferred.
opNodeChSet, nodeCh, quitCh = MakeDefaultNodeChSet()
```
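Putting this together, a minimal custom adapter might look like the sketch below. The `CountingAdapter` name is hypothetical, and the sketch assumes `MakeDefaultNodeChSet` is exported from the `octopus` package; note that all processing happens on a separate goroutine, as the note above requires:

```go
package main

import (
	"fmt"

	"github.com/rapidclock/web-octopus/octopus"
)

// CountingAdapter is a hypothetical adapter that numbers and prints crawled links.
type CountingAdapter struct{}

func (c *CountingAdapter) Consume() *octopus.NodeChSet {
	// Assumes MakeDefaultNodeChSet is exported from the octopus package.
	chSet, nodeCh, quitCh := octopus.MakeDefaultNodeChSet()
	go func() { // listen off the crawler's goroutine so it is never blocked
		count := 0
		for {
			select {
			case node := <-nodeCh:
				count++
				fmt.Printf("#%d: %s\n", count, node.UrlString)
			case <-quitCh:
				return
			}
		}
	}()
	return chSet
}
```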
The user should supply the custom `OutputAdapter` to the crawler through the `OpAdapter` field of the `CrawlOptions`.
We supply two default adapters for you to try out. They are not meant to be feature-rich, but you can still use them; their primary purpose is to demonstrate how to build and use an `OutputAdapter`.
- `adapter.StdOpAdapter`: Writes the crawled output (only links, not the body) to the standard output.
- `adapter.FileWriterAdapter`: Writes the crawled output (only links, not the body) to a supplied file.
We have supplied the implementation of `adapter.StdOpAdapter` below to give a rough idea of what goes into building your own adapter.
```go
package adapter

import (
	"fmt"

	oct "github.com/rapidclock/web-octopus/octopus"
)

// StdOpAdapter is an output adapter that just prints the output onto the
// screen.
//
// Sample output format is:
// LinkNum - Depth - Url
type StdOpAdapter struct{}

func (s *StdOpAdapter) Consume() *oct.NodeChSet {
	listenCh := make(chan *oct.Node)
	quitCh := make(chan int, 1)
	listenChSet := &oct.NodeChSet{
		NodeCh: listenCh,
		StdChannels: &oct.StdChannels{
			QuitCh: quitCh,
		},
	}
	go func() {
		i := 1
		for {
			select {
			case output := <-listenCh:
				fmt.Printf("%d - %d - %s\n", i, output.Depth, output.UrlString)
				i++
			case <-quitCh:
				return
			}
		}
	}()
	return listenChSet
}
```
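One design detail worth noting in the snippet above: `quitCh` is created with a buffer of one, so whoever signals shutdown can send on it without blocking, even if the listening goroutine happens to be busy in the other `select` case at that moment.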