Custom Pagination #26

Open
@Swader

Description

Diffbot's pagination handling is buggy at best. It often fails to recognize multi-page articles and does not merge them. What's more, it tops out at 20 pages, so anything beyond that is ignored.

The feature suggestion for the client is as follows:

Add a new method to the Article API: paginateBy. This method takes 2 arguments: $identifier and $maxPages. The former is a way to identify the nextPage link element on the page. This element would be auto-processed to find all the next pages programmatically. The latter is the maximum number of pages to concatenate.

This method would, in order:

  1. Make an Article API request to the original URL.
  2. Find the nextPage element and process its link to work out the URL pattern to which incrementing page numbers can be attached, thus generating the URLs of the next pages.
  3. Make an additional Article API request for each generated page, up to $maxPages pages in total.
  4. Concatenate the HTML content of all pages.
  5. Send the merged HTML content as a POST request to the Article API, for a final analysis of the entire post.
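
Steps 2 to 4 could be sketched roughly as follows. This is a language-neutral illustration in Python, not the client's actual API: `generate_page_urls`, `merge_pages`, and the `fetch_html` callable are hypothetical names, and the sketch assumes the page number is the last run of digits in the nextPage link (e.g. `?page=2` or `/article/2/`), which will not hold for every site.

```python
import re
from typing import Callable, List


def generate_page_urls(next_page_href: str, max_pages: int) -> List[str]:
    """Derive follow-up page URLs from the href of a "next page" link.

    Assumption: the page number is the LAST run of digits in the href.
    The original URL counts as page 1, so at most max_pages - 1 URLs
    are generated.
    """
    matches = list(re.finditer(r"\d+", next_page_href))
    if not matches:
        return []  # no numeric pattern to increment
    last = matches[-1]
    start = int(last.group())
    prefix = next_page_href[:last.start()]
    suffix = next_page_href[last.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, start + max_pages - 1)]


def merge_pages(
    original_url: str,
    next_page_href: str,
    max_pages: int,
    fetch_html: Callable[[str], str],
) -> str:
    """Fetch the content HTML of every page and concatenate it in order.

    fetch_html is a stand-in for whatever returns one page's content HTML
    (for example, an Article API request for that URL).
    """
    urls = [original_url] + generate_page_urls(next_page_href, max_pages)
    return "\n".join(fetch_html(url) for url in urls)
```

The merged string would then be what gets sent in the final POST request (step 5). A real implementation would also need to stop early when a generated page does not exist.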

Alternatively, in order to save Article API requests and use up only one, the client could just fetch the raw HTML of all the pages with Guzzle, extract the content HTML, merge that and send it as POST. This, however, is less reliable, as Diffbot is much better at figuring out what is content on the page and what isn't (headers, ads, comments, etc.).

Maybe make it a switch of some kind, with an additional setter?
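
One possible shape for such a switch, again as a hedged Python sketch rather than a proposal for the client's real interface: `FetchMode`, `Paginator`, `set_mode`, and the two fetcher callables are all hypothetical names. The idea is only that both strategies share the merge step and differ in how each page's HTML is obtained.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class FetchMode(Enum):
    ARTICLE_API = "article_api"  # one Article API request per page (more reliable)
    RAW_HTML = "raw_html"        # plain HTTP fetch per page (saves API requests)


class Paginator:
    def __init__(
        self,
        api_fetch: Callable[[str], str],
        raw_fetch: Callable[[str], str],
    ) -> None:
        # Both callables take a URL and return that page's content HTML.
        self._fetchers: Dict[FetchMode, Callable[[str], str]] = {
            FetchMode.ARTICLE_API: api_fetch,
            FetchMode.RAW_HTML: raw_fetch,
        }
        self._mode = FetchMode.ARTICLE_API  # default to the more reliable mode

    def set_mode(self, mode: FetchMode) -> "Paginator":
        """Setter for the strategy switch; returns self to allow chaining."""
        self._mode = mode
        return self

    def merged_html(self, urls: Iterable[str]) -> str:
        """Fetch every page with the selected strategy and concatenate."""
        fetch = self._fetchers[self._mode]
        return "\n".join(fetch(url) for url in urls)
```

Defaulting to the Article API mode keeps the more reliable behaviour unless the caller explicitly opts into the cheaper one.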
