MD-SOAR Monthly Statistics Pipeline

This repository contains a Python-based pipeline for generating monthly usage statistics for MD-SOAR. It processes raw analytics reports, enriches them with repository metadata, and produces structured outputs for reporting.

Overview

The pipeline automates the end-to-end workflow for monthly statistics generation:

Preprocess raw analytics data
Collect unique repository handles from Solr
Generate page view statistics
Generate download statistics

All steps are orchestrated through a single entry script.

Project Structure

mdsoar_stats_pipeline/
│
├── RunMonthlyStatsPipeline.py   # Main pipeline orchestrator
├── Preprocessing.py             # Cleans raw analytics CSV
├── CollectUniqueHandles.py      # Retrieves handles from Solr
├── PageViewStats.py             # Computes page view statistics
├── DownloadStats.py             # Handles download stats (placeholder/partial)
├── campusuuidlist.csv           # Campus UUID → name mapping
├── requirements.txt             # Python dependencies
└── .gitignore

Requirements

Python 3.8+
Access to Solr instance (for handle collection)
Required Python packages:

pip install -r requirements.txt

Usage

Apply the port-forwarding for Solr:

kubectl port-forward mdsoar-solr-0 8983:8983

In a separate tab run the full monthly pipeline:

python3 RunMonthlyStatsPipeline.py <input_csv> <campus_csv> <start_date> <end_date> --label <LABEL>

Example

(report_February_2026.csv is the Google Analytics generated csv file.)

python3 RunMonthlyStatsPipeline.py \
  report_February_2026.csv \
  campusuuidlist.csv \
  "2026-02-01T00:00:00Z" \
  "2026-02-28T23:59:59Z" \
  --label "February_2026"

Pipeline Steps

1. Preprocessing

Removes unnecessary header lines from raw analytics reports
Outputs a clean CSV for downstream processing

2. Collect Unique Handles

Queries Solr to retrieve all repository item handles
Deduplicates results

3. Page View Statistics

Matches analytics data with repository items
Aggregates views by campus and item

4. Download Statistics

Placeholder for download metrics (currently minimal implementation)

Output

The pipeline generates output under:

Monthy_Stats_Report/
├── PageView_Statistics/
│   └── <LABEL>_PageView.csv
└── Download_Statistics/

Key Features

Modular pipeline design which can be used inside a crobjob in the future
Handles large datasets
Integrates analytics data with repository metadata
Extensible for additional metrics

Notes

Ensure the input CSV format matches expected analytics export structure
The script assumes a working Solr endpoint for handle retrieval

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
mdsoar_stats_pipeline		mdsoar_stats_pipeline
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MD-SOAR Monthly Statistics Pipeline

Overview

Project Structure

Requirements

Usage

Example

Pipeline Steps

1. Preprocessing

2. Collect Unique Handles

3. Page View Statistics

4. Download Statistics

Output

Key Features

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MD-SOAR Monthly Statistics Pipeline

Overview

Project Structure

Requirements

Usage

Example

Pipeline Steps

1. Preprocessing

2. Collect Unique Handles

3. Page View Statistics

4. Download Statistics

Output

Key Features

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages