fix(notubiz): missing documents from several municipalities #513

Open
BluntKatana opened this issue Feb 12, 2025 · 0 comments
Labels
bug High priority issue for (blocking) problems

Problem

I've found that the notubiz API contains several 'hidden' agenda items and documents that are currently not being scraped, resulting in a large difference between the documents actually available on the municipalities' sites and those on ORI.

I have found two main issues, both related to the properties of agenda items. Currently, an agenda item from a meeting (.agenda_items[]) is parsed only for its .documents[]. However, two more properties are relevant:

  1. .module_items[]: A module item by itself does not look interesting (see below). But once we fetch the item via its .self property, we find that a module item can contain several documents.
    (see point 6 on municipality website)
    (see 7th agenda item: https://api.notubiz.nl/events/meetings/1152031?format=json&version=1.17.0)
Image
  2. .agenda_items[]: An agenda item can itself contain several more agenda items, which (again) do not look interesting at first (see below). But when fetched directly, they can of course contain documents again (and even more agenda items or module items).
    (they have a special suffix on the municipality website)
    (see 14th agenda item: https://api.notubiz.nl/events/meetings/1161553?format=json&version=1.17.0)
Image
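
A minimal sketch of how both properties could be followed recursively. The keys `documents`, `module_items`, `agenda_items` and `self` come from the API responses described above; everything else is an assumption for illustration: the name `collect_documents`, the injected `fetch` callable, and in particular the idea that the payload returned for a `.self` URL exposes the same keys at its top level. This is not ORI's actual extractor code.

```python
from typing import Any, Callable, Dict, List, Optional, Set

Json = Dict[str, Any]
Fetch = Callable[[str], Json]  # hypothetical: takes a URL, returns parsed JSON


def collect_documents(item: Json, fetch: Fetch,
                      seen: Optional[Set[str]] = None) -> List[Json]:
    """Collect documents from an agenda item, its module items and
    its nested agenda items (recursively)."""
    if seen is None:
        seen = set()

    # Documents attached directly to this item (what is parsed today).
    docs: List[Json] = list(item.get("documents") or [])

    # Module items and nested agenda items appear only as stubs in the
    # parent response, so each one is fetched through its `self` URL.
    for key in ("module_items", "agenda_items"):
        for stub in item.get(key) or []:
            url = stub.get("self")
            if url:
                if url in seen:
                    continue  # already visited: avoid duplicates and cycles
                seen.add(url)
                full = fetch(url)  # assumption: same keys at the top level
            else:
                full = stub  # no URL to follow, use the stub as-is
            docs.extend(collect_documents(full, fetch, seen))
    return docs
```

Passing `fetch` in keeps the traversal testable without network access. In a real scraper it would wrap an HTTP GET that returns parsed JSON, probably with the same `format=json&version=1.17.0` query parameters as the URLs above.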

Some examples of missing documents

(Note that my simple scraper also misses some documents atm, but it has better coverage of the notubiz API.)

Breda

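| year | scraped_from_notubiz | scraped_from_ori | in_notubiz_not_in_ori | in_ori_not_in_notubiz |
|------|----------------------|------------------|-----------------------|-----------------------|
| 2014 | 0 | 126 | 0 | 126 |
| 2015 | 0 | 578 | 0 | 578 |
| 2016 | 3740 | 1000 | 2856 | 116 |
| 2017 | 1745 | 1000 | 933 | 188 |
| 2018 | 431 | 243 | 198 | 10 |
| 2019 | 1915 | 218 | 1698 | 1 |
| 2020 | 227 | 157 | 72 | 2 |
| 2021 | 193 | 184 | 10 | 1 |
| 2022 | 220 | 186 | 37 | 3 |
| 2023 | 240 | 221 | 22 | 3 |
| 2024 | 206 | 155 | 56 | 5 |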

Waddinxveen

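| year | scraped_from_notubiz | scraped_from_ori | in_notubiz_not_in_ori | in_ori_not_in_notubiz |
|------|----------------------|------------------|-----------------------|-----------------------|
| 2014 | 0 | 964 | 0 | 964 |
| 2015 | 0 | 1000 | 0 | 1000 |
| 2016 | 5166 | 1000 | 4391 | 225 |
| 2017 | 2362 | 0 | 2362 | 0 |
| 2018 | 1514 | 988 | 601 | 75 |
| 2019 | 1603 | 993 | 647 | 37 |
| 2020 | 1593 | 469 | 1125 | 1 |
| 2021 | 1544 | 695 | 926 | 77 |
| 2022 | 1636 | 1000 | 662 | 26 |
| 2023 | 1523 | 1000 | 554 | 31 |
| 2024 | 1120 | 698 | 459 | 37 |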

Bunschoten
Image

Enkhuizen
Image

IJsselstein
Image
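
The last two columns of the Breda and Waddinxveen tables read as per-year set differences. For example, Breda 2016 gives 3740 - 2856 = 884 and 1000 - 116 = 884, the same overlap from both sides. A rough sketch of how such a comparison could be computed, assuming documents from both sources can be matched on a shared identifier; the `(year, doc_id)` pair format and the function name `coverage_by_year` are hypothetical, not the scraper actually used for these numbers:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Doc = Tuple[int, str]  # hypothetical: (year, document identifier)


def coverage_by_year(notubiz: Iterable[Doc],
                     ori: Iterable[Doc]) -> List[Dict[str, int]]:
    """Compare two document collections per year using set differences."""
    n_by_year: Dict[int, Set[str]] = defaultdict(set)
    o_by_year: Dict[int, Set[str]] = defaultdict(set)
    for year, doc_id in notubiz:
        n_by_year[year].add(doc_id)
    for year, doc_id in ori:
        o_by_year[year].add(doc_id)

    rows = []
    for year in sorted(set(n_by_year) | set(o_by_year)):
        n = n_by_year.get(year, set())
        o = o_by_year.get(year, set())
        rows.append({
            "year": year,
            "scraped_from_notubiz": len(n),
            "scraped_from_ori": len(o),
            "in_notubiz_not_in_ori": len(n - o),  # missing from ORI
            "in_ori_not_in_notubiz": len(o - n),  # missing from this scrape
        })
    return rows
```

Which identifier is stable across both sources (a document ID, URL, or file hash) is left open here; the counts are only as reliable as that matching key.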
