Multithreading #26
Comments
This is possible, but difficult. The main problem is a race condition between the url/page callbacks and the requesting of pages: a callback could modify the filtering rules while another thread is requesting a page that is suddenly no longer wanted. The second problem is that Spidr currently uses persistent HTTP connections, so I'm unsure how much multi-threading would improve performance. We've been looking at alternative HTTP libraries, but they all have various pros and cons.
Thanks for the quick response. I don't know too much about multi-threading; maybe some number of persistent HTTP connections could be opened? Either way, it seems like a difficult task to achieve.
If you decide to go with it, I'd give Celluloid a look. Alas, it is Ruby 1.9-only due to its use of fibers. But it's a pretty nice library.
I'm considering switching to net-http-persistent and a thread pool for requests, with mutexes around adding filters.
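The "mutexes around adding filters" part could look something like this minimal sketch. This is not Spidr's actual API; `FilterSet`, `ignore`, and `wanted?` are hypothetical names used only to illustrate guarding a shared rule list so callbacks on the main thread can add rules while worker threads check URLs:

```ruby
# Minimal sketch (not Spidr's actual internals): a filter list shared
# between the main thread and request threads, guarded by a Mutex.
class FilterSet
  def initialize
    @mutex   = Mutex.new
    @ignored = []
  end

  # Called from url/page callbacks on the main thread.
  def ignore(pattern)
    @mutex.synchronize { @ignored << pattern }
  end

  # Called from worker threads before requesting a URL.
  def wanted?(url)
    @mutex.synchronize { @ignored.none? { |pattern| pattern.match?(url) } }
  end
end

filters = FilterSet.new
filters.ignore(%r{/logout})

puts filters.wanted?('http://example.com/page')    # => true
puts filters.wanted?('http://example.com/logout')  # => false
```

The lock is held only for the duration of each rule check or insertion, so contention between callbacks and workers stays small.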
+1, this seems to be the best spider/crawling library out there, and this would be a great feature.
What happened with this request? |
I don't have the time currently to work on such a large feature. |
been a year, any chance you have time to work on such a feature now? :-) |
I've written a crawler or N in my career, and if you don't make it multi-threaded from the start, it is damn hard to retrofit. Now, that said, I think the overall goal here is throughput rather than threads. If the discovered URLs can be surfaced to an overall queue (Redis or SQS), that would change the equation: rather than threads, you simply run more instances (or containers) of Spidr and let the queue handle distribution of work across N copies. Thoughts?
A distributed Spidr is a little out of scope, or at least further down the road. Multi-threading here is mainly to address blocking I/O while waiting on responses to come back from the HTTP sessions. Luckily, net-http-persistent is already thread-aware. We'd just need to replace the spidering loop with a producer/consumer thread pool. Each thread would have its own session cache via net-http-persistent, would dequeue URLs, and enqueue the responses/Pages. All additional logic with headers and parsing HTML would still be done in the main thread, to avoid additional Mutex complexity. There are probably other pieces of work and locking issues hidden in the details.
+1, a producer/consumer for the requests would be awesome! I really like the interface of your library, by the way.
I don't understand "a producer/consumer for the requests"...
I mean a producer/consumer pattern where a pool of worker threads that do the requesting is connected to the main thread with queues, like an assembly line. The main thread puts every request it wants resolved into a queue; any worker thread can pick a task from that queue, perform the request, and put the result into a finished-responses queue that is read by the main thread. This way the main thread does no requesting (i.e. blocking activity) itself, which will lead to a speedup.
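The pattern described above can be sketched with Ruby's thread-safe `Queue` from the standard library. This is only an illustration, not Spidr's internals: the `fetch` method here is a stub standing in for a blocking HTTP request (in a real implementation each worker would hold its own net-http-persistent session and call it instead):

```ruby
# Sketch of the producer/consumer pattern: the main thread enqueues
# URLs, a pool of workers dequeues them, performs the (stubbed)
# request, and pushes results onto a queue the main thread drains.
POOL_SIZE = 4

urls      = Queue.new
responses = Queue.new

# Stub for a blocking HTTP request. A real worker would use its own
# persistent HTTP session here.
def fetch(url)
  "response for #{url}"
end

workers = POOL_SIZE.times.map do
  Thread.new do
    # Queue#pop blocks until work arrives; a nil sentinel shuts
    # the worker down.
    while (url = urls.pop)
      responses << [url, fetch(url)]
    end
  end
end

paths = %w[/a /b /c /d /e /f]
paths.each { |path| urls << "http://example.com#{path}" }
POOL_SIZE.times { urls << nil } # one shutdown sentinel per worker

# Main thread consumes results; parsing and filter updates would
# happen here, keeping Mutex complexity out of the workers.
results = paths.size.times.map { responses.pop }
workers.each(&:join)

puts results.size # => 6
```

Because only the stubbed `fetch` runs in the workers, all shared-state logic stays in the main thread, which is exactly the split proposed earlier in this thread.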
Hi there,
I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x supports it?
I had a look through the source but couldn't find where the spidr gem makes its http requests.
Maybe something like Typhoeus could be used? (http://rubygems.org/gems/typhoeus)
Thanks,
Ryan