Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a schema validator / checker #129

Open
greglucas opened this issue Feb 5, 2025 · 11 comments
Open

Add a schema validator / checker #129

greglucas opened this issue Feb 5, 2025 · 11 comments
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@greglucas
Copy link
Collaborator

It would be helpful to add a schema validation step or point to advice for how to do this for new users just setting up their initial XTCE documents. There is an official xsd document: https://www.omg.org/spec/XTCE/20180204/SpaceSystem.xsd
This would also catch incorrect implementations on our end, like with the float naming identified in #128

A quick search shows there are some packages that do this, although I have no idea how robust/useful they would be: https://xmlschema.readthedocs.io/en/latest/usage.html

Some questions to think about:

  • Do we want to add a dependency for schema validation, or just point to an external tool to do this. Could be a package extra too and not on by default.
  • Do we want to commit the official xsd to the repository for checking? I have had that site go down on me several times now...
@cgobat
Copy link
Collaborator

cgobat commented Feb 6, 2025

I agree that this would be a useful addition. I know that lxml has some XSD-based validation capabilities already. Since lxml is already a dependency, maybe we can look at whether it would suffice?

@greglucas
Copy link
Collaborator Author

That looks great if we can leverage what we already have as a dependency!

@cgobat
Copy link
Collaborator

cgobat commented Feb 7, 2025

Does it make the most sense to automatically try to validate against the schema upon instantiation of the XtcePacketDefinition class?

@greglucas
Copy link
Collaborator Author

Does it make the most sense to automatically try to validate against the schema upon instantiation of the XtcePacketDefinition class?

Probably if it is a quick check, we already have the built-in dependency, and it doesn't raise on valid but complex schemas. It would be good to make sure wherever we add it that we add useful information saying why the schema match failed.

@medley56
Copy link
Member

Here is a little example of doing schema validation with lxml. TIL that author is not a valid field in the XTCE Header and that validationStatus is a required field...

from lxml import etree
import requests

defdoc = "tests/test_data/jpss/jpss1_geolocation_xtce_v1.xml"

xsi_uri = "http://www.w3.org/2001/XMLSchema-instance"

tree = etree.parse(defdoc)

root = tree.getroot()
print(root.attrib)

# This attribute must always be a pair of strings separated by a space: e.g.
# xsi:schemaLocation="http://www.omg.org/spec/XTCE/20180204 https://www.omg.org/spec/XTCE/20180204/SpaceSystem.xsd"
xsd, xsd_url = root.attrib[f"{{{xsi_uri}}}schemaLocation"].split()

xsd_doc = requests.get(xsd_url)
#print(xsd_doc.content)

xsd_tree = etree.XML(xsd_doc.content)

xmlschema = etree.XMLSchema(xsd_tree)

result = xmlschema.validate(tree)
print(result)

The results of this validation are absolutely useless for debugging (True/False) so I used this tool to fix things:
https://www.freeformatter.com/xml-validator-xsd.html

@medley56 medley56 added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Feb 11, 2025
@cgobat
Copy link
Collaborator

cgobat commented Feb 12, 2025

Indeed.

BTW, I started exploring this—I have a branch called feature/xml-schema-validation if you want to take a look.

I've found that using the assertion-based validation is a lot better/more useful than just the boolean .validate() method.

@medley56
Copy link
Member

@cgobat That's great! Do you want to own this feature or would you rather someone else use that as a starting point? I would like to add automatic retrieval of XTCE XSD via the xsi:schemaLocation attribute (if provided) to this ticket.

I'm almost ready with a PR that handles namespaces much better, which might help with this.

@cgobat
Copy link
Collaborator

cgobat commented Feb 16, 2025

Yeah I'm happy to own the initial implementation at least. That's good to know that you're in favor of automatic retrieval of the specified XSD. I was going to do that but held off since I wasn't sure how we felt about fetching content from online. Are you okay with adding requests as a dependency? There's a good chance it's already a dependency of an existing dependency, but I'm not positive.

@cgobat
Copy link
Collaborator

cgobat commented Feb 16, 2025

Actually, nevermind. I think I can do it with just the standard library urllib if that's preferable.

@medley56
Copy link
Member

Yeah, let's avoid another dependency if we don't need it. If we start doing a lot of HTTP requests for some other reason, then let's consider including requests.

@medley56
Copy link
Member

And as for fetching XML online, it is technically a security vulnerability since parsing XML with lxml is vulnerable to things like XML bombs if you're using an old version of libxml.

We could limit the XSD retrieval to a validation method that is only called if the user explicitly requests it. This has the dual purpose of not inherently slowing down definition instantiation from XTCE, which has been a problem in the past.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants