Merged
101 commits
360e573
Fix command
ppinchuk Feb 12, 2026
1487cc3
Bump elm version
ppinchuk Feb 25, 2026
874561b
Minor prompt update
ppinchuk Mar 9, 2026
6fe3c86
Merge remote-tracking branch 'origin/main' into pp/ghp_start
ppinchuk Mar 10, 2026
0b0ea61
Update lockfile
ppinchuk Mar 10, 2026
7ceef99
Fix linter
ppinchuk Mar 10, 2026
9adf343
Merge remote-tracking branch 'origin/main' into pp/ghp_start
ppinchuk Mar 11, 2026
45ccb83
Merge remote-tracking branch 'origin/main' into pp/ghp_start
ppinchuk Mar 11, 2026
8f86ab7
Documentation updates
ppinchuk Mar 11, 2026
4ae511b
First pass of GHP schema
ppinchuk Mar 11, 2026
c8a0a8c
Add basic plugin config
ppinchuk Mar 12, 2026
0cee1ea
Wire up GHP plugin
ppinchuk Mar 12, 2026
1e1139e
Have function return created class
ppinchuk Mar 12, 2026
c9baf2e
Clarification for noise
ppinchuk Mar 12, 2026
d38d6c3
Clarification for setbacks
ppinchuk Mar 12, 2026
3a2f80a
Clarification
ppinchuk Mar 12, 2026
fdab685
Use general guidance instead
ppinchuk Mar 12, 2026
3a995c9
Add clarification to definitions
ppinchuk Mar 12, 2026
405c4b3
Single row instruction
ppinchuk Mar 12, 2026
7d7cc19
Add clarification
ppinchuk Mar 12, 2026
60527cd
Allow nulls
ppinchuk Mar 12, 2026
980b7c6
Add instruction
ppinchuk Mar 12, 2026
96d68a6
update instructions
ppinchuk Mar 12, 2026
ab80874
Update instructions around null
ppinchuk Mar 12, 2026
a70c030
Tighten schema
ppinchuk Mar 12, 2026
02f01d8
Updates to schema
ppinchuk Mar 12, 2026
d650176
Add debug statements
ppinchuk Mar 12, 2026
4b22349
Update prompt
ppinchuk Mar 12, 2026
135e49f
Add logging
ppinchuk Mar 12, 2026
d655335
More logging
ppinchuk Mar 12, 2026
3f2fb7e
Update descriptions
ppinchuk Mar 13, 2026
4e24cb1
Update instructions
ppinchuk Mar 13, 2026
71eaf70
Add clarification
ppinchuk Mar 13, 2026
f939b81
Add info
ppinchuk Mar 13, 2026
8e9e1e8
Add more info to logger
ppinchuk Mar 13, 2026
51284c1
Add task ids
ppinchuk Mar 13, 2026
b18d805
Trimmed
ppinchuk Mar 13, 2026
5cb0c75
Update schema
ppinchuk Mar 13, 2026
ab4ac12
Update prompt
ppinchuk Mar 13, 2026
a9e29ca
Generalize implementation of `_get_model_config` and use it
ppinchuk Mar 13, 2026
2729dcd
Update logging statement
ppinchuk Mar 13, 2026
7f7a8eb
Change logging level
ppinchuk Mar 13, 2026
6b44d19
Fix import
ppinchuk Mar 13, 2026
faa0baa
Align playwright versions
ppinchuk Mar 13, 2026
dc4d87a
Fix bug in llm config retrieval
ppinchuk Mar 13, 2026
84422f1
Provide additional context even if user submits prompt
ppinchuk Mar 15, 2026
ebff812
Fix pandas link
ppinchuk Mar 15, 2026
c563530
WIP
ppinchuk Mar 30, 2026
aa90233
Merge remote-tracking branch 'origin/main' into pp/docling
ppinchuk Apr 22, 2026
19b8298
Fix env
ppinchuk Apr 22, 2026
c5a1703
ELM updates (WIP)
ppinchuk Apr 28, 2026
d5ed66d
Minor update
ppinchuk May 2, 2026
6c776e0
Minor cleanup
ppinchuk May 2, 2026
6abaf10
Minor cleanup
ppinchuk May 2, 2026
1398e1e
Rename func
ppinchuk May 2, 2026
734f791
Reduce redundancy
ppinchuk May 3, 2026
de457ef
Minor refactor
ppinchuk May 3, 2026
12e90ff
Clarify argument
ppinchuk May 3, 2026
76c9832
Add missing docs
ppinchuk May 3, 2026
3a254e7
Add docling support
ppinchuk May 3, 2026
b4a05bb
Link to docling docs
ppinchuk May 3, 2026
c87e306
Suppress numpy NaNmean warnings
ppinchuk May 3, 2026
0deae79
Add docling-based file loaders
ppinchuk May 3, 2026
5a6333a
Add `to_md_kwargs`
ppinchuk May 3, 2026
5754309
Include docling in logs
ppinchuk May 3, 2026
ba60587
Minor updates
ppinchuk May 3, 2026
b696ec0
Use `COMPASSWebFileLoader`
ppinchuk May 3, 2026
dca0e2b
Remove bad func
ppinchuk May 3, 2026
d1caa4b
Change to FileLoader
ppinchuk May 3, 2026
dc995b7
Add tests for env
ppinchuk May 3, 2026
585363e
Fix tests
ppinchuk May 3, 2026
bbe8a62
Fix test
ppinchuk May 3, 2026
adb1141
Fix for clarity
ppinchuk May 3, 2026
f3bd60f
Minor fix
ppinchuk May 3, 2026
69b0296
Pull from env var
ppinchuk May 3, 2026
2f0a9a8
Extra logging
ppinchuk May 3, 2026
c205302
Write files with correct extension
ppinchuk May 3, 2026
d776061
Minor cleanup
ppinchuk May 3, 2026
7d3dab9
Use multiprocessing queue
ppinchuk May 3, 2026
c8538e2
Move subprocessing logging messages to `main.log`
ppinchuk May 3, 2026
59c9f52
Fix tests
ppinchuk May 3, 2026
e25c6aa
Bump elm dep
ppinchuk May 3, 2026
b4ddf06
Default to elm backend for now
ppinchuk May 3, 2026
9876c54
Fix docs
ppinchuk May 3, 2026
6a12b7c
Bug fix
ppinchuk May 3, 2026
71fa565
Delay import
ppinchuk May 3, 2026
0bd1920
Revert change
ppinchuk May 3, 2026
0690789
Update env
ppinchuk May 3, 2026
812b9ec
Minor change
ppinchuk May 4, 2026
9f108e7
Try fix rust env
ppinchuk May 4, 2026
326e801
Clean up deps slightly
ppinchuk May 4, 2026
1b56561
No frozen in CI
ppinchuk May 4, 2026
0c61811
Add mac intel to tests
ppinchuk May 4, 2026
b0ff5af
Adjust test
ppinchuk May 4, 2026
c0feea1
update tox tests
ppinchuk May 4, 2026
0158075
Update deps
ppinchuk May 4, 2026
d2eca41
Update openai dep
ppinchuk May 5, 2026
bc1e628
No docling on Python 3.13 MacOS intel
ppinchuk May 5, 2026
cee84a0
Try fix
ppinchuk May 5, 2026
fb61868
Break out pixi toml
ppinchuk May 5, 2026
3691459
Fix build
ppinchuk May 5, 2026
10 changes: 7 additions & 3 deletions .github/workflows/ci-python.yml
@@ -72,7 +72,7 @@ jobs:
fail-fast: false
max-parallel: 8
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
os: [ubuntu-latest, macos-latest, macos-26-intel, windows-latest]

steps:
- name: Checkout Repo
@@ -101,7 +101,7 @@ jobs:
fail-fast: false
max-parallel: 8
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
os: [ubuntu-latest, macos-latest, macos-26-intel, windows-latest]

steps:
- name: Checkout Repo
@@ -130,8 +130,12 @@ jobs:
fail-fast: false
max-parallel: 8
matrix:
os: [ubuntu-latest, macos-latest]
os: [ubuntu-latest, macos-latest, macos-26-intel]
python-version: ['3.13', '3.12']
exclude:
# https://docling-project.github.io/docling/getting_started/installation/#:~:text=When%20installing%20Docling,using%20Python%203.13%2B.
- os: macos-26-intel
python-version: '3.13'

steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
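Reviewer note: the `exclude` entry above removes exactly one combination from the test matrix. The sketch below is a simplified model of how such a matrix expands, not the GitHub Actions implementation; `expand_matrix` is a hypothetical helper, and it assumes the documented rule that a job is dropped when it matches every key of an exclude entry.

```python
from itertools import product


def expand_matrix(axes, exclude=()):
    """Expand a build matrix into concrete jobs, dropping excluded combos.

    A job is dropped when every key/value pair of some exclude entry
    matches it (a simplified model of the GitHub Actions rule).
    """
    keys = list(axes)
    jobs = [dict(zip(keys, combo)) for combo in product(*axes.values())]
    return [
        job
        for job in jobs
        if not any(
            all(job.get(key) == value for key, value in rule.items())
            for rule in exclude
        )
    ]


# Values copied from the workflow diff above.
axes = {
    "os": ["ubuntu-latest", "macos-latest", "macos-26-intel"],
    "python-version": ["3.13", "3.12"],
}
exclude = [{"os": "macos-26-intel", "python-version": "3.13"}]

jobs = expand_matrix(axes, exclude)
for job in jobs:
    print(job["os"], job["python-version"])
```

Under these assumptions the Intel macOS runner is still exercised on Python 3.12, and only the 3.13 pairing is skipped.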
8 changes: 1 addition & 7 deletions .github/workflows/ci-rust.yml
@@ -35,7 +35,6 @@ jobs:
- uses: prefix-dev/setup-pixi@1b2de7f3351f171c8b4dfeb558c639cb58ed4ec0 # v0.9.5
with:
pixi-version: v0.62.2
frozen: true
cache: true
cache-write: ${{ github.ref == 'refs/heads/main' }}
environments: rdev
@@ -62,7 +61,6 @@ jobs:
- uses: prefix-dev/setup-pixi@1b2de7f3351f171c8b4dfeb558c639cb58ed4ec0 # v0.9.5
with:
pixi-version: v0.62.2
frozen: true
cache: true
cache-write: ${{ github.ref == 'refs/heads/main' }}
environments: rdev
@@ -89,15 +87,14 @@ jobs:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
os: [ubuntu-latest, macos-latest, macos-26-intel, windows-latest]

steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

- uses: prefix-dev/setup-pixi@1b2de7f3351f171c8b4dfeb558c639cb58ed4ec0 # v0.9.5
with:
pixi-version: v0.62.2
frozen: true
cache: true
cache-write: ${{ github.ref == 'refs/heads/main' }}
environments: rdev
@@ -130,7 +127,6 @@ jobs:
- uses: prefix-dev/setup-pixi@1b2de7f3351f171c8b4dfeb558c639cb58ed4ec0 # v0.9.5
with:
pixi-version: v0.62.2
frozen: true
cache: true
cache-write: ${{ github.ref == 'refs/heads/main' }}
environments: rdev
@@ -161,7 +157,6 @@ jobs:
- uses: prefix-dev/setup-pixi@1b2de7f3351f171c8b4dfeb558c639cb58ed4ec0 # v0.9.5
with:
pixi-version: v0.62.2
frozen: true
cache: true
cache-write: ${{ github.ref == 'refs/heads/main' }}
environments: rdev
@@ -190,7 +185,6 @@ jobs:
- uses: prefix-dev/setup-pixi@1b2de7f3351f171c8b4dfeb558c639cb58ed4ec0 # v0.9.5
with:
pixi-version: v0.62.2
frozen: true
cache: true
cache-write: ${{ github.ref == 'refs/heads/main' }}
environments: rdev
2 changes: 1 addition & 1 deletion compass/_cli/process.py
@@ -105,7 +105,7 @@ def _setup_cli_logging(console, verbosity_level, log_level="INFO"):
if verbosity_level >= 1:
libs.append("compass")
if verbosity_level >= 2: # noqa: PLR2004
libs.append("elm")
libs.extend(("elm", "docling"))
if verbosity_level >= 3: # noqa: PLR2004
libs.append("openai")
if verbosity_level >= 4: # noqa: PLR2004
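Reviewer note: the hunk above makes `docling` log output appear at the same verbosity level as `elm`. A minimal sketch of the cumulative mapping follows; `libs_for_verbosity` is a hypothetical name, the starting list is assumed to be empty (the hunk does not show how `libs` is initialized), and levels 4 and above are left out because they are not visible in the diff.

```python
def libs_for_verbosity(verbosity_level):
    """Return the libraries whose logs are shown at a verbosity level.

    Hypothetical helper mirroring the visible part of the hunk; each
    level adds to the libraries enabled by the lower levels.
    """
    libs = []  # assumption: the real function may start with other entries
    if verbosity_level >= 1:
        libs.append("compass")
    if verbosity_level >= 2:
        libs.extend(("elm", "docling"))
    if verbosity_level >= 3:
        libs.append("openai")
    return libs


for level in range(4):
    print(level, libs_for_verbosity(level))
```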
153 changes: 84 additions & 69 deletions compass/scripts/download.py
@@ -1,19 +1,17 @@
"""Ordinance file downloading logic"""

import pprint
import logging
from contextlib import AsyncExitStack

from elm.web.document import PDFDocument
from elm.web.search.run import (
load_docs,
search_with_fallback,
web_search_links_as_docs,
)
from elm.web.search.run import load_docs, search_with_fallback
from elm.web.website_crawl import (
_SCORE_KEY, # noqa: PLC2701
ELMWebsiteCrawler,
ELMLinkScorer,
)
from elm.web.file_loader import AsyncLocalFileLoader
from elm.web.utilities import filter_documents

from compass.extraction import check_for_relevant_text, extract_date
@@ -23,9 +21,9 @@
JurisdictionValidator,
JurisdictionWebsiteValidator,
)
from compass.web.file_loader import COMPASSWebFileLoader
from compass.web.website_crawl import COMPASSCrawler, COMPASSLinkScorer
from compass.utilities.enums import LLMTasks
from compass.utilities.io import load_local_docs
from compass.pb import COMPASS_PB


@@ -74,13 +72,19 @@ async def download_known_urls(

file_loader_kwargs = file_loader_kwargs or {}
file_loader_kwargs.update({"file_cache_coroutine": TempFileCachePB.call})
logger.trace(
"kwargs for COMPASSWebFileLoader:\n%s",
pprint.PrettyPrinter().pformat(file_loader_kwargs),
)
file_loader = COMPASSWebFileLoader(
browser_semaphore=browser_semaphore, **file_loader_kwargs
)

async with COMPASS_PB.file_download_prog_bar(
jurisdiction.full_name, len(urls)
):
try:
out_docs = await load_docs(
urls, browser_semaphore=browser_semaphore, **file_loader_kwargs
)
out_docs = await load_docs(urls, file_loader)
except KeyboardInterrupt:
raise
except Exception as e:
@@ -130,11 +134,16 @@ async def load_known_docs(jurisdiction, fps, local_file_loader_kwargs=None):
local_file_loader_kwargs.update(
{"file_cache_coroutine": TempFileCachePB.call}
)
logger.trace(
"kwargs for AsyncLocalFileLoader:\n%s",
pprint.PrettyPrinter().pformat(local_file_loader_kwargs),
)
fl = AsyncLocalFileLoader(**local_file_loader_kwargs)
async with COMPASS_PB.file_download_prog_bar(
jurisdiction.full_name, len(fps)
):
try:
out_docs = await load_local_docs(fps, **local_file_loader_kwargs)
out_docs = await load_docs(fps, fl)
except KeyboardInterrupt:
raise
except Exception as e:
@@ -216,7 +225,7 @@ async def find_jurisdiction_website(
queries=[query_1, query_2],
num_urls=3,
ignore_url_parts=url_ignore_substrings,
browser_sem=search_semaphore,
browser_semaphore=search_semaphore,
task_name=jurisdiction.full_name,
**kwargs,
)
@@ -332,17 +341,23 @@ async def _crawl_hook(*__, **___): # noqa: RUF029
"""Update progress bar as pages are searched"""
COMPASS_PB.update_website_crawl_task(pb_jurisdiction_name, advance=1)

file_loader_kwargs = file_loader_kwargs or {}
file_loader_kwargs.update({"file_cache_coroutine": TempFileCache.call})
flk = {"verify_ssl": False}
flk.update(file_loader_kwargs or {})
flk.update({"file_cache_coroutine": TempFileCache.call})

browser_config_kwargs = browser_config_kwargs or {}
pw_launch_kwargs = file_loader_kwargs.get("pw_launch_kwargs", {})
pw_launch_kwargs = flk.get("pw_launch_kwargs", {})
browser_config_kwargs["headless"] = pw_launch_kwargs.get("headless", True)

logger.trace(
"kwargs for COMPASSWebFileLoader:\n%s",
pprint.PrettyPrinter().pformat(flk),
)
afl = COMPASSWebFileLoader(**flk)
crawler = ELMWebsiteCrawler(
validator=_doc_heuristic,
async_file_loader=afl,
url_scorer=ELMLinkScorer(keyword_points).score,
file_loader_kwargs=file_loader_kwargs,
browser_config_kwargs=browser_config_kwargs,
crawler_config_kwargs=crawler_config_kwargs,
include_external=True,
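Reviewer note: the new `flk` dictionary in this hunk layers three sources of keyword arguments: a default (`verify_ssl=False`), the caller's values, and a key that is always forced. The sketch below illustrates that ordering in isolation; `layer_loader_kwargs` and its `cache_coroutine` parameter are hypothetical names introduced for illustration only.

```python
def layer_loader_kwargs(user_kwargs=None, cache_coroutine=None):
    """Merge loader kwargs so that: defaults < user values < forced keys.

    Hypothetical helper; the key names are taken from the hunk above.
    """
    merged = {"verify_ssl": False}     # default, may be overridden
    merged.update(user_kwargs or {})   # caller's values win over defaults
    # Forced last, so a caller-supplied value for this key is replaced.
    merged.update({"file_cache_coroutine": cache_coroutine})
    return merged


user = {"verify_ssl": True, "file_cache_coroutine": "caller-value"}
merged = layer_loader_kwargs(user, cache_coroutine="forced-value")
print(merged)
print(user)  # the caller's dictionary is left unchanged
```

Because the merge starts from a fresh dictionary, the caller's dictionary is not mutated, whereas the removed lines called `update` directly on the object that was passed in.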
@@ -545,41 +560,29 @@ async def download_jurisdiction_ordinance_using_search_engine(
jurisdiction.full_name, description="Searching web..."
)

pb_store = []

async def _download_hook(urls): # noqa: RUF029
"""Update progress bar as file download starts"""
if not urls:
return

COMPASS_PB.update_jurisdiction_task(
jurisdiction.full_name, description="Downloading files..."
)
pb, task = COMPASS_PB.start_file_download_prog_bar(
jurisdiction.full_name, len(urls)
)
pb_store.append((pb, task, len(urls)))

kwargs.update(file_loader_kwargs or {})
kwargs.update({"file_cache_coroutine": TempFileCachePB.call})
try:
out_docs = await _docs_from_web_search(
query_templates=query_templates,
jurisdiction=jurisdiction,
docs = await _docs_from_web_search(
query_templates,
num_urls=num_urls,
search_semaphore=search_semaphore,
browser_semaphore=browser_semaphore,
url_ignore_substrings=url_ignore_substrings,
on_search_complete_hook=_download_hook,
ignore_url_parts=url_ignore_substrings,
jurisdiction_full_name=jurisdiction.full_name,
**kwargs,
)
finally:
if pb_store:
pb, task, num_urls = pb_store[0]
await COMPASS_PB.tear_down_file_download_prog_bar(
jurisdiction.full_name, num_urls, pb, task
)
except KeyboardInterrupt:
raise
except Exception as e:
msg = (
"Encountered error of type %r while searching web for docs for %s:"
)
err_type = type(e)
logger.exception(msg, err_type, jurisdiction.full_name)
docs = []

return out_docs
return docs


async def filter_ordinance_docs(
@@ -701,43 +704,55 @@ async def filter_ordinance_docs(

async def _docs_from_web_search(
query_templates,
jurisdiction,
num_urls,
search_semaphore,
browser_semaphore,
url_ignore_substrings,
on_search_complete_hook,
ignore_url_parts,
jurisdiction_full_name,
**kwargs,
):
"""Download documents from the web using jurisdiction queries"""
"""Retrieve top ``N`` search results as document instances"""

queries = [
query.format(jurisdiction=jurisdiction.full_name)
query.format(jurisdiction=jurisdiction_full_name)
for query in query_templates
]
kwargs.update({"file_cache_coroutine": TempFileCachePB.call})
urls = await search_with_fallback(
queries,
num_urls=num_urls,
ignore_url_parts=ignore_url_parts,
browser_semaphore=search_semaphore,
task_name=jurisdiction_full_name,
**kwargs,
)
if not urls:
return []

try:
docs = await web_search_links_as_docs(
queries,
num_urls=num_urls,
search_semaphore=search_semaphore,
browser_semaphore=browser_semaphore,
ignore_url_parts=url_ignore_substrings,
task_name=jurisdiction.full_name,
on_search_complete_hook=on_search_complete_hook,
**kwargs,
)
except KeyboardInterrupt:
raise
except Exception as e:
msg = (
"Encountered error of type %r while searching web for docs for %s:"
)
err_type = type(e)
logger.exception(msg, err_type, jurisdiction.full_name)
docs = []
return await _docs_from_urls(
urls, jurisdiction_full_name, browser_semaphore, **kwargs
)

return docs

async def _docs_from_urls(
urls, jurisdiction_full_name, browser_semaphore, **kwargs
):
"""Load documents from a list of URLs using AsyncWebFileLoader"""
logger.debug("Downloading documents for URLS: \n\t-%s", "\n\t-".join(urls))
logger.trace(
"kwargs for COMPASSWebFileLoader:\n%s",
pprint.PrettyPrinter().pformat(kwargs),
)
file_loader = COMPASSWebFileLoader(
browser_semaphore=browser_semaphore, **kwargs
)

COMPASS_PB.update_jurisdiction_task(
jurisdiction_full_name, description="Downloading files..."
)
async with COMPASS_PB.file_download_prog_bar(
jurisdiction_full_name, len(urls)
):
return await load_docs(urls, file_loader)


async def _down_select_docs_correct_jurisdiction(
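Reviewer note: after this change the web-search path is split into two steps (search for URLs, then load them), with an early return when nothing is found and the error handling moved up to the caller. The sketch below shows that control flow with stand-in coroutines; `fake_search`, `fake_load`, `docs_from_search`, and `safe_docs_from_search` are hypothetical names and do not reproduce the ELM or COMPASS APIs.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def fake_search(queries, num_urls):
    """Hypothetical stand-in for a search call; returns up to num_urls URLs."""
    urls = [f"https://example.com/{q.replace(' ', '-')}" for q in queries]
    return urls[:num_urls]


async def fake_load(urls):
    """Hypothetical stand-in for a document loader; one 'doc' per URL."""
    return [{"source": url} for url in urls]


async def docs_from_search(
    query_templates, jurisdiction_name, num_urls,
    search=fake_search, load=fake_load,
):
    """Step 1: search for URLs. Step 2: load them, unless none were found."""
    queries = [
        template.format(jurisdiction=jurisdiction_name)
        for template in query_templates
    ]
    urls = await search(queries, num_urls=num_urls)
    if not urls:
        return []  # nothing to download, skip the loading step entirely
    return await load(urls)


async def safe_docs_from_search(*args, **kwargs):
    """Caller-side wrapper: log any failure and fall back to no documents."""
    try:
        return await docs_from_search(*args, **kwargs)
    except Exception:
        # KeyboardInterrupt is not an Exception subclass, so it propagates.
        logger.exception("Search failed; continuing with no documents")
        return []


docs = asyncio.run(
    safe_docs_from_search(["{jurisdiction} wind ordinance"], "Example County", 3)
)
print(docs)
```

Keeping the fallback in the outer wrapper means a failure in either step yields an empty list rather than aborting the run for the remaining jurisdictions.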
4 changes: 2 additions & 2 deletions compass/scripts/process.py
@@ -25,7 +25,7 @@
from compass.validation.location import JurisdictionWebsiteValidator
from compass.llm import OpenAIConfig
from compass.services.cpu import (
PDFLoader,
FileLoader,
OCRPDFLoader,
read_pdf_doc,
read_pdf_doc_ocr,
@@ -598,7 +598,7 @@ def _base_services(self):
self.dirs.out / "jurisdictions.json",
tpe_kwargs=self.tpe_kwargs,
),
PDFLoader(**(self.process_kwargs.ppe_kwargs or {})),
FileLoader(**(self.process_kwargs.ppe_kwargs or {})),
HTMLFileLoader(**self.tpe_kwargs),
]
