Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider embedding resources in responses instead of IDs #1

Open
mosheduminer opened this issue Nov 3, 2020 · 6 comments
Open

Consider embedding resources in responses instead of IDs #1

mosheduminer opened this issue Nov 3, 2020 · 6 comments

Comments

@mosheduminer
Copy link

Hi, I noticed the parabible project, and I like it. 👍

I read through the wiki here, in particular the query section. It does seem rather difficult to handle without POST requests, though I imagine you could do base64 encoding of the JSON if you don't mind it not being human readable, it's better than URL encodings, at least. (some discussion here https://softwareengineering.stackexchange.com/a/353105)

However, I thought to comment on the response schema. It appears that you intend to return IDs, based upon which you can make further calls. I suggest the possibility of embedding words, instead of just the IDs. This has the advantage of requiring fewer API calls. Of course, this would make the response much larger, but if you paginate (which I saw you mentioned in the user interface section here, it shouldn't be too much of an issue.

Keep up the good work! 👍

@jcuenod
Copy link
Member

jcuenod commented Nov 6, 2020

Hi @mosheduminer, thanks so much for the feedback! I really appreciate the interaction!

The point of this process is to get exactly the kind of feedback you're giving me so I want to respond but please push back/let me know your thoughts.

1. Base64 encoding

I had considered doing something like this. It's (mostly) not impossible to use GET requests. The way I see it, the problem is:

  • GET requests have a 2000 character limit but I don't know what the limit to the size of these searches is (it's conceivable in my mind that this turns into a problem and I'm not sure how to mitigate when it rears its head if I don't go in this direction at the outset). I also want to avoid premature optimisation kinds of mistakes, though (which, honestly, riddle my projects).
  • I could be wrong but I believe base64 was making queries longer.
  • I think simple queries are probably in the order of about 200-300 characters which is well below that limit so maybe I should just not worry about this problem.

2. Returning Words instead of IDs

I think I'm a bit confused by this suggestion so let me outline the way I'm understanding you. Correct me if I'm missing it.

I'm imagining a query for phrases with בהו. Hitting /query/new returns an object with a result field that looks like this (I'm making up IDs):

{
    "text_ids": [2],
    "matching_words": [
        [125]
    ],
    "word_ids": [124, 125]
}

At first I thought your suggestion was to put words in the word_ids field. The reason that I wouldn't want to do that is that the point of this array is that I know which words were matching in an actual verse text that looks like this:

<word id=123>היתה</word> <word id=124>תהו</word> <word id=125>ובהו</word>.

The point, though, is that I can query for the text with id=2 and get this (3 word) verse and then the client can figure out how to highlight the matching words (ID 125) and the matching phrase (which includes ID 124). So I want to be returning IDs in this case.

But now I'm thinking you mean instead of text_ids I should return the actual verse text. I had considered this (and I agree, fewer hits to the API is preferable). So here's what discouraged me from going with that approach:

  • In short, it makes for a lighter response payload which is significant when searches have lots of results. If the result set is large, the biggest chunk of data in the result is going to be this verse text and returning a ton of ids is going to be a lot lighter (plus the backend can respond faster with results so the client can start communicating to the user that the query has been successful). I had considered sending partial results (like the first 20) or something and that just seemed like effort not worth the expense but let me know if you think that's an idea worth pursuing.
  • In addition, text_ids are versioned. That is, you can't just request text_id=2. Rather, you need to request text_id=2&version=bhs to get the Hebrew verse. I am hoping to get a bunch of different parallel versions in here so parabible can't return every version available when it's not set. The alternative would be to optionally return text if the versions variable is set but the previous problem is still unresolved, in my mind.

Now that I think about this, I did want to give some way of sorting these results by canonical order (which is a versioned concept) and I'm realising the API doesn't have a way of doing that as it stands so I need to address that...

Sorry for all the writing. I truly welcome your feedback!

@mosheduminer
Copy link
Author

GET requests have a 2000 character limit but I don't know what the limit to the size of these searches is (it's conceivable in my mind that this turns into a problem and I'm not sure how to mitigate when it rears its head if I don't go in this direction at the outset).

I could be wrong but I believe base64 was making queries longer.

Yes, true. I believe Base64 lengthens the url by 33%. But it's worse using raw JSON, due to URL encoding. And even if you don't url encode, when sharing a link, the url is automatically URL encoded. This: encodeURIComponent("[{}]") results in this: "%5B%7B%7D%5D". So, at least for sharing purposes, raw json is much worse.

If, however, it is quite conceivable that queries will be longer than the limit, then it could be an issue.

But now I'm thinking you mean instead of text_ids I should return the actual verse text.

Yes, this is what I meant 🙂.

The alternative would be to optionally return text if the versions variable is set but the previous problem is still unresolved, in my mind.

Yes, even if you didn't have the version problem, it might even still be preferable to have an boolean embed parameter (or an /embed URL extension) to choose whether to return the text itself.

I had considered sending partial results (like the first 20) or something and that just seemed like effort not worth the expense but let me know if you think that's an idea worth pursuing.

I completely understand the problem. I suggested it because I understood you wanted to do exactly this (as per the link in my first comment). However, reading it again, it appears you only intended to paginate client side, but not at the API level.

There is a downside to paginating at the API level, however. Because it means that the database query is run every time, even though only some results are returned. So it may actually be preferable to not paginate at the API level.

sorting these results by canonical order (which is a versioned concept)

Different versions in fact have different orders?

@jcuenod
Copy link
Member

jcuenod commented Nov 30, 2020

Okay, I've been letting this percolate for a while and I have a new idea:

Instead of dumping the whole query into json, what about encoding the terms individually? I'm imagining queries that look something like this: /query?search_type=collocation&filter[]=genesis&filter[]=exodus&syntax_range=clause&terms[]=base64encodedthing&terms[]=base64encodedthing&terms[]=base64encodedthing

That is:

/query
?search_type=collocation

&filter[]=genesis
&filter[]=exodus

&syntax_range=clause

&terms[]=base64encodedthing
&terms[]=base64encodedthing
&terms[]=base64encodedthing

I'm still thinking about the other comments regarding sending text content.

Different versions in fact have different orders?

It's unusual but it happens (IIRC, there are few places this happens with the LXX in the Psalms).

Thanks so much for the interaction! I deeply appreciate the feedback.

@mosheduminer
Copy link
Author

Instead of dumping the whole query into json, what about encoding the terms individually?

So, the terms parameter will still be an array holding complex objects, but it would be base64 encoded.

I think the url might look a little less like garbage with this technique, and you also gain a little bit by not encoding everything in base64.

By the way, regarding URL character limits, it appears that most modern browsers do just fine many thousands of characters, according to this SO answer (note that the new edge browser, being chromium based, probably has the same limits as chrome). However, you might still have problems with URL length server side, but even then, it appears you can safely ise 4000 characters with apache server, for example.

@jcuenod
Copy link
Member

jcuenod commented Nov 30, 2020

So, the terms parameter will still be an array holding complex objects, but it would be base64 encoded.

Yes.

By the way, regarding URL character limits ...

Interesting. Thanks! I'm definitely coming round to this approach...

@jcuenod
Copy link
Member

jcuenod commented Jan 4, 2021

@mosheduminer Just as a follow up, I have started experimenting with base64 encoding the params. This approach is quite appealing the more I think about it. Thanks very much for pushing me in this direction!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants