Handling web scraper pagination

Created by Kai Sasaki, Modified on Mon, 21 Nov, 2022 at 12:59 AM by Kai Sasaki

Sometimes we want to collect information from a URL where not all of the data is shown at first.

In web development, pagination can be implemented in several ways. These are the pagination methods we currently support:

Infinite scroll (time constrained)

Infinite scroll is when you scroll to the end of a page and more items are loaded and appended to the list.

In this case, we will automatically scroll to the bottom of the page and wait for new results. Since websites have different response times, and fetching new results is not always instant, you can define a wait time (in seconds) that determines when the scraper stops.

For example, if you define 4 seconds, we will scroll to the bottom and wait 4 seconds. If more results are loaded, we will repeat the process. When no more results are loaded after 4 seconds, we will stop the scraper and return the results.
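The loop described above can be sketched roughly as follows. This is only an illustration of the idea, not the scraper's actual implementation: the function name `scroll_until_idle` and the `scroll_to_bottom` and `count_items` callbacks are hypothetical stand-ins for whatever browser automation is used.

```python
import time
from typing import Callable


def scroll_until_idle(
    scroll_to_bottom: Callable[[], None],   # hypothetical: scrolls the page to the end
    count_items: Callable[[], int],         # hypothetical: number of items currently loaded
    wait_seconds: float = 4.0,              # the user-defined wait time
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll repeatedly and stop once a full wait passes with no new items.

    Returns the number of scrolls performed.
    """
    seen = count_items()
    scrolls = 0
    while True:
        scroll_to_bottom()
        scrolls += 1
        sleep(wait_seconds)          # give the site time to load more results
        current = count_items()
        if current <= seen:          # nothing new arrived during the wait: stop
            return scrolls
        seen = current               # more results loaded: repeat the process
```

A longer wait time makes the loop more tolerant of slow websites, at the cost of a longer final wait before the scraper stops.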

Infinite scroll (text search)

This type of pagination is also an infinite scroll, but in this case, we keep scrolling until a specific text appears in the page's content.

This is useful if the page you are extracting from displays a special message when all items are loaded.

For example, the website may display a “No more items to load” message, or you may know that the items are sorted in a certain way and have a unique string from the last item.

Once we see this text, we will stop the scraper.

Pagination is resource intensive, and we cannot scroll forever. We currently have a 120-second timeout, and the scraper stops once this time has elapsed.
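As a rough sketch, the text-search variant combined with an overall timeout might look like the code below. It is an illustration under assumptions, not the scraper's real code: `scroll_until_text`, `scroll_to_bottom`, `get_page_text`, and the `wait_seconds` pause between scrolls are all hypothetical names and parameters. Only the 120-second limit comes from the description above.

```python
import time
from typing import Callable


def scroll_until_text(
    scroll_to_bottom: Callable[[], None],   # hypothetical: scrolls the page to the end
    get_page_text: Callable[[], str],       # hypothetical: returns the page's visible text
    stop_text: str,                         # e.g. "No more items to load"
    wait_seconds: float = 1.0,              # assumed pause between scrolls
    timeout_seconds: float = 120.0,         # overall limit on scrolling
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Scroll until stop_text appears or the timeout elapses.

    Returns True if the text was found, False if the timeout was reached first.
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if stop_text in get_page_text():
            return True              # the end-of-list text is visible: stop
        scroll_to_bottom()
        sleep(wait_seconds)
    # Timed out; check once more in case the last scroll revealed the text.
    return stop_text in get_page_text()
```

Returning a boolean lets the caller tell a complete run (the text was found) apart from one that was cut off by the timeout.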