Multithreading #26
Comments
This is possible, but difficult. The main problem is a race condition between the url/page callbacks and the requesting of pages: a callback could modify the filtering rules while another thread is requesting a page that is suddenly no longer wanted. The second problem is that Spidr currently uses persistent HTTP connections, so I'm unsure how much multi-threading would improve performance. We've been looking at alternative HTTP libraries, but they all have various pros and cons.
Thanks for the quick response. I don't know too much about multi-threading; maybe some number of persistent HTTP connections could be opened? Either way, it seems like a difficult task to achieve.
If you decide to go with it, I'd give Celluloid a look. Alas, it is Ruby 1.9-only due to its use of fibers. But it's a pretty nice library.
I'm considering switching to net-http-persistent and a thread pool for requests, with mutexes around adding filters.
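The "mutexes around adding filters" part could look something like this minimal sketch. This is not Spidr's actual API; `FilterSet`, `ignore`, and `wanted?` are hypothetical names used only to illustrate guarding a shared rule list so callbacks on the main thread can add rules while worker threads check URLs:

```ruby
# Minimal sketch (not Spidr's actual internals): a filter list shared
# between the main thread and request threads, guarded by a Mutex.
class FilterSet
  def initialize
    @mutex   = Mutex.new
    @ignored = []
  end

  # Called from url/page callbacks on the main thread.
  def ignore(pattern)
    @mutex.synchronize { @ignored << pattern }
  end

  # Called from worker threads before requesting a URL.
  def wanted?(url)
    @mutex.synchronize { @ignored.none? { |pattern| pattern.match?(url) } }
  end
end

filters = FilterSet.new
filters.ignore(%r{/logout})

puts filters.wanted?('http://example.com/page')    # => true
puts filters.wanted?('http://example.com/logout')  # => false
```

The lock is held only for the duration of each rule check or insertion, so contention between callbacks and workers stays small.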
+1, this seems to be the best spider/crawling library out there, and this would be a great feature.
What happened with this request? |
I don't have the time currently to work on such a large feature. |
been a year, any chance you have time to work on such a feature now? :-) |
I've written a crawler or N in my career, and if you don't make it multi-threaded from the start, it is damn hard to retrofit. Now, that said, I think the overall goal here is throughput rather than threads. If the discovered URLs can be surfaced to an overall queue (Redis or SQS), that would change the equation: rather than threads, you simply run more instances (or containers) of Spidr and let the queue handle distribution of work across N copies. Thoughts?
A distributed Spidr is a little out of scope, or at least further down the road. Multi-threading here is mainly to address blocking I/O while waiting on responses to come back from the HTTP sessions. Luckily, net-http-persistent is already thread-aware. We'd just need to replace the spidering loop with a producer/consumer thread pool. Each thread would have its own session cache via net-http-persistent, would dequeue URLs, and enqueue the responses/Pages. All additional logic with headers and parsing HTML would still be done in the main thread, to avoid additional Mutex complexity. There are probably other pieces of work and locking issues hidden in the details.
+1, a producer/consumer for the requests would be awesome! I really like the interface of your library, by the way.
I don't understand "a producer/consumer for the requests"...
I mean a producer/consumer pattern where a pool of worker threads that do the requesting is connected to the main thread with queues, like an assembly line. The main thread puts every request it wants resolved into a queue; any worker thread can pick a task from that queue, perform the request, and put the result into a finished-responses queue that is read by the main thread. This way the main thread does no requesting (i.e. blocking activity) itself, which will lead to a speedup.
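The pattern described above can be sketched with Ruby's thread-safe `Queue` from the standard library. This is only an illustration, not Spidr's internals: the `fetch` method here is a stub standing in for a blocking HTTP request (in a real implementation each worker would hold its own net-http-persistent session and call it instead):

```ruby
# Sketch of the producer/consumer pattern: the main thread enqueues
# URLs, a pool of workers dequeues them, performs the (stubbed)
# request, and pushes results onto a queue the main thread drains.
POOL_SIZE = 4

urls      = Queue.new
responses = Queue.new

# Stub for a blocking HTTP request. A real worker would use its own
# persistent HTTP session here.
def fetch(url)
  "response for #{url}"
end

workers = POOL_SIZE.times.map do
  Thread.new do
    # Queue#pop blocks until work arrives; a nil sentinel shuts
    # the worker down.
    while (url = urls.pop)
      responses << [url, fetch(url)]
    end
  end
end

paths = %w[/a /b /c /d /e /f]
paths.each { |path| urls << "http://example.com#{path}" }
POOL_SIZE.times { urls << nil } # one shutdown sentinel per worker

# Main thread consumes results; parsing and filter updates would
# happen here, keeping Mutex complexity out of the workers.
results = paths.size.times.map { responses.pop }
workers.each(&:join)

puts results.size # => 6
```

Because only the stubbed `fetch` runs in the workers, all shared-state logic stays in the main thread, which is exactly the split proposed earlier in this thread.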
Hi there,
I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x supports it?
I had a look through the source but couldn't find where the spidr gem makes its http requests.
Maybe something like Typhoeus could be used? (http://rubygems.org/gems/typhoeus)
Thanks,
Ryan