Use address instead of ISBN to query book price #11

Open
rimbi opened this issue Oct 10, 2011 · 11 comments

Comments

@rimbi
Member

rimbi commented Oct 10, 2011

This has a few advantages over the ISBN search:

  1. Extensions are no longer prone to errors caused by changes in how a page displays the ISBN.
  2. No more special treatment for different web sites.
  3. The web address is a potential source of collected data, which may be used for data extraction experiments.
  4. By using the address we can flag pages that have not been crawled yet and run a special crawl session for those pages.
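
To illustrate item 4, here is a minimal sketch of flagging uncrawled pages when an extension asks about a URL. The table names `listings` and `pending_crawl`, the function `lookup_or_flag`, and the example URLs are all hypothetical; the thread does not describe the actual kitapsever schema.

```python
import sqlite3


def lookup_or_flag(conn, url):
    """Return the ISBN stored for url, or flag the url for a later crawl.

    Assumed schema (not from the thread): listings(url, isbn, price)
    and pending_crawl(url).
    """
    row = conn.execute(
        "SELECT isbn FROM listings WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Unknown page: remember it once so a special crawl session can visit it.
    conn.execute("INSERT OR IGNORE INTO pending_crawl (url) VALUES (?)", (url,))
    return None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (url TEXT PRIMARY KEY, isbn TEXT NOT NULL, price REAL NOT NULL)"
)
conn.execute("CREATE TABLE pending_crawl (url TEXT PRIMARY KEY)")
conn.execute(
    "INSERT INTO listings VALUES (?, ?, ?)",
    ("https://idefix.example/book/1", "9780000000001", 25.0),
)

known = lookup_or_flag(conn, "https://idefix.example/book/1")
unknown = lookup_or_flag(conn, "https://idefix.example/book/2")
lookup_or_flag(conn, "https://idefix.example/book/2")  # second hit adds no duplicate
pending = [r[0] for r in conn.execute("SELECT url FROM pending_crawl")]
```

A non-book page would also land in `pending_crawl` under this sketch, so the crawl session would have to discard pages where no ISBN is found.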
@sardok
Member

sardok commented Oct 10, 2011

Good idea, but I have the following questions:

  • You need to get the title and price, so you still have to parse the web page content. Yes, ISBN changes won't affect you, but the rest will.
  • I don't understand how the plugin is going to work with this structure. Consider this: from the plugin's point of view, you get a book's information somehow (from parsing the page content or the web URL) and want to query about this book.
  • What are you going to use as the key to represent the book in the database (we use ISBN, as you know)? If you use the web URL, you need to do parsing on the server to recognise the book, then do the matching and send the result to the web browser.

@rimbi
Member Author

rimbi commented Oct 10, 2011

  • Even now we don't parse the title and price within the extensions.
  • No big difference: currently the extensions use the ISBN; in the future they are supposed to use the URL. The rest is the same.
  • I didn't understand what you meant in this item. Sounds like you missed something?

@sardok
Member

sardok commented Oct 11, 2011

Haaa, okay okay, I got it. But it turns out that we will need to
implement a URL filtering mechanism, which we should have done before.

@rimbi
Member Author

rimbi commented Oct 11, 2011

Sorry, I don't understand the need for URL filtering.

@sardok
Member

sardok commented Oct 11, 2011

How are you going to differentiate a book's URL from non-book URLs?

@sardok
Member

sardok commented Oct 11, 2011

Also, the question above that you didn't understand was this:
you want to search the kitapsever database for a book available on idefix.
If the information about that book is stored in the database as just a URL,
how are you going to match the same book on other web sites like pandora?

I am missing something, I think.

@rimbi
Member Author

rimbi commented Oct 11, 2011

A joined query: URL --> ISBN --> All URLS ordered by price
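
As a rough sketch of what such a joined query could look like, assuming a single hypothetical `listings(url, isbn, price)` table filled by the crawler (the real kitapsever schema is not shown in this thread, and `cheapest_offers` and the example URLs are made up for illustration):

```python
import sqlite3


def cheapest_offers(conn, url):
    """Resolve url -> ISBN, then return every listing of that ISBN, cheapest first."""
    return conn.execute(
        """
        SELECT other.url, other.price
        FROM listings AS seen
        JOIN listings AS other ON other.isbn = seen.isbn
        WHERE seen.url = ?
        ORDER BY other.price ASC
        """,
        (url,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (url TEXT PRIMARY KEY, isbn TEXT NOT NULL, price REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        ("https://idefix.example/book/1", "9780000000001", 25.0),
        ("https://pandora.example/book/a", "9780000000001", 22.5),
        ("https://pandora.example/book/b", "9780000000002", 10.0),
    ],
)

offers = cheapest_offers(conn, "https://idefix.example/book/1")
```

The self-join keeps the ISBN as the key that ties the sites together; the URL only replaces the ISBN as the value the extension sends. A URL that is not in the table simply yields an empty result.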

@rimbi
Member Author

rimbi commented Oct 11, 2011

We won't differentiate them on the client side. I guess the ratio of non-book
pages to book pages will be small enough to be acceptable.

@sardok
Member

sardok commented Oct 11, 2011

Cemo, I really don't understand. You said that there will be NO parsing
at all, am I right? No ISBN parsing, no title parsing, nothing; just send
the web URL to the database.
But here you are looking up a specific ISBN from the URLs. How are
you going to do that?

@rimbi
Member Author

rimbi commented Oct 11, 2011

I knew you missed the point :)

We'll continue to extract the ISBN from the pages and put it in the database
together with the URL as usual. Nothing changes in the crawler at this point.

What we'll change is the way the extensions make their queries. They'll use the URL
instead of the ISBN to query the book, and that's possible since we already have
the URLs in the database.

@sardok
Member

sardok commented Oct 11, 2011

All right, all right. I got it now.

2 participants