fix: Add unit tests #51

Merged: 46 commits, Aug 30, 2024
f5eb49a
add the pytests
PeterStaar-IBM Aug 26, 2024
12eea84
renamed the test folder and added the toplevel test
PeterStaar-IBM Aug 26, 2024
2c66075
updated the toplevel function test
PeterStaar-IBM Aug 26, 2024
b7debe7
need to start running all tests successfully
PeterStaar-IBM Aug 26, 2024
6464033
added the reference converted documents
PeterStaar-IBM Aug 26, 2024
c64489a
added first test for json and md output
PeterStaar-IBM Aug 26, 2024
24c0b9d
ran pre-commit
PeterStaar-IBM Aug 26, 2024
08364df
replaced deprecated json function with model_dump_json
PeterStaar-IBM Aug 26, 2024
35bd7b9
replaced deprecated json function with model_dump_json
PeterStaar-IBM Aug 26, 2024
4980b71
reformatted code
PeterStaar-IBM Aug 27, 2024
e59ea8e
Fix backend tests
cau-git Aug 27, 2024
774704a
Merge branch 'main' of github.com:DS4SD/docling into dev/add-strict-t…
cau-git Aug 27, 2024
b548687
commented out the drawing
PeterStaar-IBM Aug 27, 2024
40d754f
ci: avoid duplicate runs
dolfim-ibm Aug 27, 2024
f517e63
Fix backend tests
cau-git Aug 27, 2024
93bdaf0
Fix backend tests
cau-git Aug 27, 2024
3dbd678
commented out json verification for now
PeterStaar-IBM Aug 27, 2024
0d4fd90
added verification of input cells
PeterStaar-IBM Aug 28, 2024
f853d0a
reformat code
PeterStaar-IBM Aug 28, 2024
e6ed6f4
added test to verify the cells in the pages
PeterStaar-IBM Aug 28, 2024
c6440c8
added test to verify the cells in the pages (2)
PeterStaar-IBM Aug 28, 2024
0f172cc
added test to verify the cells in the pages (3)
PeterStaar-IBM Aug 28, 2024
e1c8d69
Merge branch 'main' into dev/add-strict-tests
PeterStaar-IBM Aug 28, 2024
a39449d
run all examples in CI
dolfim-ibm Aug 28, 2024
2ad85fd
make sure examples return failures
dolfim-ibm Aug 28, 2024
9b6b8c4
raise a failure if examples fail
dolfim-ibm Aug 28, 2024
03fbb51
fix examples
dolfim-ibm Aug 28, 2024
6c9fa58
run examples after tests
dolfim-ibm Aug 28, 2024
2d63143
Add tests and update top_level_tests using only datamodels
cau-git Aug 28, 2024
c09d2bc
Merge branch 'dev/add-strict-tests' of github.com:DS4SD/docling into …
cau-git Aug 28, 2024
2cc940f
Remove unnecessary code
cau-git Aug 28, 2024
52b25bf
Merge branch 'dev/add-strict-tests' of github.com:DS4SD/docling into …
dolfim-ibm Aug 28, 2024
07ec034
Validate conversion status on e2e test
cau-git Aug 28, 2024
e447916
Merge branch 'dev/add-strict-tests' of github.com:DS4SD/docling into …
cau-git Aug 28, 2024
a700411
package verify utils and add more tests
dolfim-ibm Aug 28, 2024
c10f555
reduce docs in example, since they are already in the tests
dolfim-ibm Aug 29, 2024
d7a4476
skip batch_convert
dolfim-ibm Aug 29, 2024
237aa13
Merge branch 'main' of github.com:DS4SD/docling into dev/add-strict-t…
cau-git Aug 29, 2024
85304ca
pin docling-parse 1.1.2
dolfim-ibm Aug 29, 2024
4bd5deb
updated the error messages
PeterStaar-IBM Aug 29, 2024
b14675f
Merge branch 'dev/add-strict-tests' of github.com:DS4SD/docling into …
cau-git Aug 29, 2024
07538c0
commented out the json verification for now
PeterStaar-IBM Aug 30, 2024
c8cfd44
bumped GLM version
PeterStaar-IBM Aug 30, 2024
28aad8f
Merge branch 'dev/add-strict-tests' of github.com:DS4SD/docling into …
cau-git Aug 30, 2024
0bf89c7
Fix lockfile
cau-git Aug 30, 2024
408a158
Pin new docling-parse v1.1.3
cau-git Aug 30, 2024
19 changes: 19 additions & 0 deletions .github/workflows/checks.yml
@@ -14,3 +14,22 @@ jobs:
           python-version: ${{ matrix.python-version }}
       - name: Run styling check
         run: poetry run pre-commit run --all-files
+      - name: Install with poetry
+        run: poetry install --all-extras
+      - name: Testing
+        run: |
+          poetry run pytest -v tests
+      - name: Run examples
+        run: |
+          for file in examples/*.py; do
+            # Skip batch_convert.py
+            if [[ "$(basename "$file")" == "batch_convert.py" ]]; then
+                echo "Skipping $file"
+                continue
+            fi
+
+            echo "Running example $file"
+            poetry run python "$file" || exit 1
+          done
+      - name: Build with poetry
+        run: poetry build
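The "Run examples" step above loops over examples/*.py, skips batch_convert.py, and stops at the first script that exits non-zero. For running the same check locally without bash, a minimal Python sketch might look like the following. The helper name `run_examples` and its parameters are hypothetical and not part of this PR, and it calls `sys.executable` directly instead of `poetry run python`, which is an assumption about the local environment.

```python
import subprocess
import sys
from pathlib import Path


def run_examples(examples_dir, skip=("batch_convert.py",)):
    """Run each example script in order and stop at the first failure.

    Hypothetical helper mirroring the CI loop; not part of the PR.
    Returns the names of the scripts that ran successfully.
    """
    ran = []
    for script in sorted(Path(examples_dir).glob("*.py")):
        if script.name in skip:
            print(f"Skipping {script}")
            continue
        print(f"Running example {script}")
        # check=True raises CalledProcessError on a non-zero exit code,
        # which plays the role of `|| exit 1` in the shell loop.
        subprocess.run([sys.executable, str(script)], check=True)
        ran.append(script.name)
    return ran
```

Calling `run_examples("examples")` from the repository root would then raise as soon as one example fails, in the same way the CI step aborts.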
4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -2,7 +2,7 @@ name: "Run CI"

 on:
   pull_request:
-    types: [opened, reopened, synchronize, ready_for_review]
+    types: [opened, reopened]
   push:
     branches:
       - "**"
@@ -25,4 +25,4 @@ jobs:
# - uses: ./.github/actions/setup-poetry
# - name: Build docs
# run: poetry run mkdocs build --verbose --clean


4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -4,15 +4,15 @@ repos:
     hooks:
       - id: system
         name: Black
-        entry: poetry run black docling examples
+        entry: poetry run black docling examples tests
         pass_filenames: false
         language: system
         files: '\.py$'
   - repo: local
     hooks:
       - id: system
         name: isort
-        entry: poetry run isort docling examples
+        entry: poetry run isort docling examples tests
         pass_filenames: false
         language: system
         files: '\.py$'
6 changes: 3 additions & 3 deletions docling/datamodel/base_models.py
@@ -238,9 +238,9 @@ class EquationPrediction(BaseModel):

 class PagePredictions(BaseModel):
     layout: LayoutPrediction = None
-    tablestructure: TableStructurePrediction = None
-    figures_classification: FigureClassificationPrediction = None
-    equations_prediction: EquationPrediction = None
+    tablestructure: Optional[TableStructurePrediction] = None
+    figures_classification: Optional[FigureClassificationPrediction] = None
+    equations_prediction: Optional[EquationPrediction] = None


 PageElement = Union[TextElement, TableElement, FigureElement]
6 changes: 5 additions & 1 deletion docling/models/ds_glm_model.py
@@ -16,8 +16,12 @@
 class GlmModel:
     def __init__(self, config):
         self.config = config
+        self.model_names = self.config.get(
+            "model_names", ""
+        )  # "language;term;reference"
         load_pretrained_nlp_models()
-        model = init_nlp_model(model_names="language;term;reference")
+        # model = init_nlp_model(model_names="language;term;reference")
+        model = init_nlp_model(model_names=self.model_names)
         self.model = model

     def __call__(self, conv_res: ConversionResult) -> DsDocument:
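With this change the model names are read from the config, and the fallback is an empty string rather than the previously hard-coded "language;term;reference". The sketch below isolates that lookup. `resolve_model_names` is a hypothetical helper written for illustration only, and how `init_nlp_model` treats an empty string is not shown in this diff.

```python
def resolve_model_names(config):
    """Mirror the lookup in GlmModel.__init__: the config value, or ""."""
    return config.get("model_names", "")


# Passing the names explicitly reproduces the previously hard-coded value.
explicit = resolve_model_names({"model_names": "language;term;reference"})

# Omitting the key yields an empty string, not the old default.
fallback = resolve_model_names({})
```

Callers that relied on the old behaviour would therefore need to set `model_names` in the config they pass to `GlmModel`.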
11 changes: 10 additions & 1 deletion docling/models/table_structure_model.py
@@ -44,7 +44,16 @@ def draw_table_and_cells(self, page: Page, tbl_list: List[TableElement]):

             for tc in table_element.table_cells:
                 x0, y0, x1, y1 = tc.bbox.as_tuple()
-                draw.rectangle([(x0, y0), (x1, y1)], outline="blue")
+                if tc.column_header:
+                    width = 3
+                else:
+                    width = 1
+                draw.rectangle([(x0, y0), (x1, y1)], outline="blue", width=width)
+                draw.text(
+                    (x0 + 3, y0 + 3),
+                    text=f"{tc.start_row_offset_idx}, {tc.start_col_offset_idx}",
+                    fill="black",
+                )

         image.show()

20 changes: 14 additions & 6 deletions examples/batch_convert.py
@@ -49,17 +49,18 @@ def export_documents(
         f"of which {failure_count} failed "
         f"and {partial_success_count} were partially converted."
     )
+    return success_count, partial_success_count, failure_count


 def main():
     logging.basicConfig(level=logging.INFO)

     input_doc_paths = [
-        Path("./test/data/2206.01062.pdf"),
-        Path("./test/data/2203.01017v2.pdf"),
-        Path("./test/data/2305.03393v1.pdf"),
-        Path("./test/data/redp5110.pdf"),
-        Path("./test/data/redp5695.pdf"),
+        Path("./tests/data/2206.01062.pdf"),
+        Path("./tests/data/2203.01017v2.pdf"),
+        Path("./tests/data/2305.03393v1.pdf"),
+        Path("./tests/data/redp5110.pdf"),
+        Path("./tests/data/redp5695.pdf"),
     ]

     # buf = BytesIO(Path("./test/data/2206.01062.pdf").open("rb").read())
@@ -73,12 +74,19 @@ def main():
     start_time = time.time()

     conv_results = doc_converter.convert(input)
-    export_documents(conv_results, output_dir=Path("./scratch"))
+    success_count, partial_success_count, failure_count = export_documents(
+        conv_results, output_dir=Path("./scratch")
+    )

     end_time = time.time() - start_time

     _log.info(f"All documents were converted in {end_time:.2f} seconds.")

+    if failure_count > 0:
+        raise RuntimeError(
+            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
+        )
+

 if __name__ == "__main__":
     main()
15 changes: 11 additions & 4 deletions examples/custom_convert.py
@@ -42,14 +42,14 @@ def export_documents(
         f"Processed {success_count + failure_count} docs, of which {failure_count} failed"
     )

+    return success_count, failure_count
+

 def main():
     logging.basicConfig(level=logging.INFO)

     input_doc_paths = [
-        Path("./test/data/2206.01062.pdf"),
-        Path("./test/data/2203.01017v2.pdf"),
-        Path("./test/data/2305.03393v1.pdf"),
+        Path("./tests/data/2206.01062.pdf"),
     ]

     ###########################################################################
@@ -114,12 +114,19 @@ def main():
     start_time = time.time()

     conv_results = doc_converter.convert(input)
-    export_documents(conv_results, output_dir=Path("./scratch"))
+    success_count, failure_count = export_documents(
+        conv_results, output_dir=Path("./scratch")
+    )

     end_time = time.time() - start_time

     _log.info(f"All documents were converted in {end_time:.2f} seconds.")

+    if failure_count > 0:
+        raise RuntimeError(
+            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
+        )
+

 if __name__ == "__main__":
     main()
12 changes: 11 additions & 1 deletion examples/export_figures.py
@@ -22,7 +22,7 @@ def main():
     logging.basicConfig(level=logging.INFO)

     input_doc_paths = [
-        Path("./test/data/2206.01062.pdf"),
+        Path("./tests/data/2206.01062.pdf"),
     ]
     output_dir = Path("./scratch")

@@ -41,10 +41,13 @@ def main():

     conv_results = doc_converter.convert(input_files)

+    success_count = 0
+    failure_count = 0
     output_dir.mkdir(parents=True, exist_ok=True)
     for conv_res in conv_results:
         if conv_res.status != ConversionStatus.SUCCESS:
             _log.info(f"Document {conv_res.input.file} failed to convert.")
+            failure_count += 1
             continue

         doc_filename = conv_res.input.file.stem
@@ -66,10 +69,17 @@ def main():
             with element_image_filename.open("wb") as fp:
                 image.save(fp, "PNG")

+        success_count += 1
+
     end_time = time.time() - start_time

     _log.info(f"All documents were converted in {end_time:.2f} seconds.")

+    if failure_count > 0:
+        raise RuntimeError(
+            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
+        )
+

 if __name__ == "__main__":
     main()
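The three example scripts above now share one pattern: count the failed conversions, then raise so the script exits non-zero and the CI loop stops. A minimal, self-contained sketch of that pattern follows. `Status`, `Result`, and `check_results` are stand-ins invented for illustration; they are not docling's `ConversionStatus` or `ConversionResult`, and the error message wording is likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):  # stand-in for docling's ConversionStatus
    SUCCESS = auto()
    FAILURE = auto()


@dataclass
class Result:  # stand-in for a conversion result
    name: str
    status: Status


def check_results(results):
    """Count outcomes and raise if any conversion failed."""
    results = list(results)
    failure_count = sum(r.status != Status.SUCCESS for r in results)
    success_count = len(results) - failure_count
    if failure_count > 0:
        # An uncaught exception makes the interpreter exit non-zero,
        # which is what lets a CI step detect the failure.
        raise RuntimeError(
            f"{failure_count} of {len(results)} conversions failed."
        )
    return success_count, failure_count
```

Raising only after the loop, rather than on the first failure, keeps the log messages for every document while still failing the run.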