# Code Quality Is a Solved Problem. Governance Isn't.

*We certified 201 open-source repositories across 23 industries. Here's what the data says.*

---

The hypothesis was simple: popular repos have poor code quality, and that's what we expected to find.

## What We Actually Found

Code quality across popular open-source repos is **remarkably consistent**. Most repos score above 85 on code quality — solidly in the A and B range. The tools work. The linters work. Developers lint their code.

What varies wildly — and what determines whether an enterprise should trust a dependency — is **governance**.

| Dimension | Median Score | Variance |
|-----------|-------------|----------|
| Code Quality | 91.2 | Low (σ = 10.3) |
| Governance | 65.0 | **High** (σ = 18.7) |
| Dependencies | 100.0 | Low |

Governance scores range from 20 (Pygame) to 100 (MONAI). Code quality clusters between 85 and 98. **Governance is the actual risk surface.**

## The Evidence

### LangChain: 95.4 code, 45 governance

The most popular LLM framework in the world. Used by thousands of enterprises. Scores 95.4 on code quality — excellent by any measure. But only 45/100 on governance:

- No CONTRIBUTING.md
- No SECURITY.md
- No issue/PR templates
- No Code of Conduct
- No DCO ([Developer Certificate of Origin](https://developercertificate.org/)) or CLA process

Arbiter certification: **PROVISIONAL**. Not because the code is bad. Because the governance infrastructure doesn't exist.
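Checks like the ones above reduce to file-presence scans. A minimal sketch of the idea, with hypothetical filenames and no claim to match Arbiter's actual check list:

```python
from pathlib import Path

# Illustrative governance artifacts, mirroring the checklist above.
# Real checkers also accept variants (e.g. .rst extensions, docs/ paths).
GOVERNANCE_FILES = [
    "CONTRIBUTING.md",
    "SECURITY.md",
    "CODE_OF_CONDUCT.md",
    ".github/ISSUE_TEMPLATE",
    ".github/PULL_REQUEST_TEMPLATE.md",
]

def missing_governance(repo_root: str) -> list[str]:
    """Return the governance artifacts absent from a checked-out repo."""
    root = Path(repo_root)
    return [name for name in GOVERNANCE_FILES if not (root / name).exists()]
```

Run against a fresh LangChain checkout, a scan like this is what turns a 95-point codebase into a 45-point governance score.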

### MONAI: The Gold Standard at 98.2

NVIDIA's healthcare AI toolkit scores 98.2 — the highest of all 201 repos. Perfect governance: 100/100. LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, issue templates, PR templates, CI/CD, DCO. Every box checked.

This is what enterprises should require. Almost nobody does.

### Sentry: Near-Perfect Code, Still PROVISIONAL

Sentry scores 98.5 on code quality — among the highest we tested. But it lands at **PROVISIONAL** because of 109 unpinned dependencies. Arbiter's dependency scoring penalizes unversioned dependency declarations because they make builds unreproducible and expand the supply chain attack surface. The risk isn't the code — it's the supply chain.
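Flagging unpinned declarations is mechanically simple. A minimal sketch that treats only exact `==` pins in a `requirements.txt` as reproducible (real tooling, Arbiter's included, also handles extras, environment markers, hashes, `pyproject.toml`, and lockfiles):

```python
import re

# A line counts as pinned only if it declares an exact version with '=='.
PINNED = re.compile(r"^\s*[A-Za-z0-9._-]+\s*==")

def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    lines = [
        ln.strip() for ln in requirements_text.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    return [ln for ln in lines if not PINNED.match(ln)]
```

By this measure, `flask>=2.0` and a bare `click` both count against the score: either can resolve to a different artifact tomorrow than it did today.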

### Flask and Click: Foundational, PROVISIONAL

Two of the most fundamental Python libraries. Flask powers millions of web apps. Click powers thousands of CLIs. Both score PROVISIONAL due to 45/100 governance. No Code of Conduct. No security policy. No DCO.

If your enterprise depends on Flask, you're trusting a library with no documented security response process.

## The Pattern Across 23 Categories

| Category | Repos | Certification Rate |
|----------|-------|-------------------|
| Developer Tools | 7 | 86% CERTIFIED |
| Healthcare | 4 | 75% CERTIFIED |
| Web Frameworks | 6 | 67% CERTIFIED |
| Fintech | 5 | 60% CERTIFIED |
| Databases/ORMs | 5 | 60% CERTIFIED |
| Networking | 5 | 60% CERTIFIED |
| ML Platforms | 6 | 50% CERTIFIED |
| Testing | 4 | 50% CERTIFIED |
| LLM Frameworks | 5 | 40% CERTIFIED |
| Gaming | 5 | 20% CERTIFIED |
| Cybersecurity Tools | 4 | 0% CERTIFIED |

**Developer tools lead**: pytest, pip, tox. The people who build tools for quality also practice quality. **Cybersecurity tools** (pwntools, nmap, sqlmap, routersploit) all landed PROVISIONAL — strong code, weak governance artifacts. With only 4 repos in the sample, this likely reflects the tooling culture's bias toward speed over process rather than an industry-wide gap.

## What This Means for Enterprise AI Adoption

These aren't academic distinctions. When AI tooling fails in production, the investigation often traces back to governance gaps, not code bugs.

If you're evaluating open-source AI tools for enterprise use, stop asking "is the code good?" It almost certainly is. Start asking:

1. **Is there a SECURITY.md?** Can you report vulnerabilities privately?
These aren't nice-to-haves. They're the difference between a dependency you can trust and one you can't.

Arbiter scores three dimensions:

- **Code Quality** (50%): lint, security, complexity via ruff, bandit, radon, vulture, shellcheck across Python and Shell
- **Governance Maturity** (30%): 10 checks for LICENSE, CONTRIBUTING, SECURITY, Code of Conduct, CI/CD, issue/PR templates, DCO
- **Dependency Health** (20%): version pinning, dependency count, known-good packages

**Certification thresholds**: CERTIFIED ≥ 80 overall, PROVISIONAL ≥ 60, FAILED < 60. Deterministic — same repo always gets the same score. No AI in the scoring path.

When code quality is unscorable (e.g., a Go or Rust repo without installed analyzers), Arbiter reweights to Governance 60% + Dependencies 40% rather than penalizing.
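The published weights, thresholds, and fallback reweighting fit in a few lines. A sketch using the numbers above (the function and names are illustrative, not Arbiter's actual code):

```python
# Weights from the methodology above; reweighted when code quality
# can't be scored (e.g. no analyzers for the repo's language).
WEIGHTS = {"code_quality": 0.5, "governance": 0.3, "dependencies": 0.2}
FALLBACK = {"governance": 0.6, "dependencies": 0.4}

def certify(code_quality, governance, dependencies):
    """Combine dimension scores (0-100) into an overall score and tier.

    Pass code_quality=None when that dimension is unscorable.
    """
    if code_quality is None:
        overall = (FALLBACK["governance"] * governance
                   + FALLBACK["dependencies"] * dependencies)
    else:
        overall = (WEIGHTS["code_quality"] * code_quality
                   + WEIGHTS["governance"] * governance
                   + WEIGHTS["dependencies"] * dependencies)
    if overall >= 80:
        tier = "CERTIFIED"
    elif overall >= 60:
        tier = "PROVISIONAL"
    else:
        tier = "FAILED"
    return round(overall, 1), tier
```

Note how the weighting plays out in practice: with governance at only 30%, a repo needs either strong governance or near-perfect code *and* dependencies to clear the 80-point CERTIFIED bar.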

## Try It Yourself
