End-to-end data pipeline and market analysis built on scraped listings from Hispasonic, the largest Spanish-language music technology community. The project covers the full cycle from raw CSV files to supply/demand modelling across Spanish cities.
hispasonic/
├── data/
│ ├── raw/ # 12 source CSVs (Aug 2022 – May 2024)
│ └── processed/ # Unified dataset (gitignored)
├── notebooks/
│ ├── 01_etl.ipynb # ETL — load, normalise, clean and unify all CSVs
│ ├── 02_eda.ipynb # EDA — temporal, price, brand, city, engagement
│ └── 03_supply_demand.ipynb # Supply & demand analysis by city and price
├── reports/
│ ├── figures/ # 27 auto-generated charts
│ └── README.md # Chart gallery
├── jupyterlab_new.yaml # Conda environment (Python 3.9)
├── requirements.txt # Legacy pip requirements (Python 3.8)
└── README.md
git clone https://github.com/albertjimrod/hispasonic.git
cd hispasonic
# Create and activate the conda environment
conda env create -f jupyterlab_new.yaml
conda activate jupyterlab_new
# Run notebooks in order
jupyter lab notebooks/01_etl.ipynb
jupyter lab notebooks/02_eda.ipynb
jupyter lab notebooks/03_supply_demand.ipynbNote:
data/processed/is gitignored. Run01_etl.ipynbfirst to generatehispasonic_unified.csvbefore running the other notebooks.
The project runs inside the jupyterlab_new conda environment (Python 3.9). All dependencies are pinned in jupyterlab_new.yaml.
| Category | Package | Version |
|---|---|---|
| Data | pandas |
2.2.3 |
| Data | numpy |
2.0.1 |
| Data | pyarrow |
21.0.0 |
| Visualisation | matplotlib |
3.9.2 |
| Visualisation | seaborn |
0.13.2 |
| Visualisation | altair |
5.5.0 |
| Statistics / ML | scipy |
1.13.1 |
| Statistics / ML | scikit-learn |
1.6.1 |
| Statistics / ML | xgboost |
2.1.1 |
| Statistics / ML | shap |
0.49.1 |
| Statistics / ML | numba |
0.60.0 |
| Notebooks | jupyterlab |
4.4.4 |
| Apps | streamlit |
1.50.0 |
A legacy
requirements.txt(Python 3.8 environment) is kept for reference but is not the recommended setup.
Load 12 heterogeneous CSV files scraped from Hispasonic across different dates, detect and resolve structural inconsistencies, and consolidate everything into a single clean dataset.
- 12 CSV files in
data/raw/, spanning August 2022 to May 2024 - File naming conventions vary:
2022_08_26.csv,hpw2024517.csv, etc. - Total raw rows: 5,962 listings
Each CSV is loaded with zero rows and its column list compared against a defined CORE_COLUMNS schema of 16 fields. The inspection reveals three types of extra columns present in different files:
| Extra column | Files affected | Action |
|---|---|---|
user |
2022_08_26, 2022_08_30, 2022_10_01 onwards | Dropped (not in core schema) |
anon_user |
2022_09_01 onwards | Dropped |
Unnamed: 0 |
hpw2024517, hpw2024526 | Dropped (auto-generated index) |
All 12 files contain all 16 core columns — no column is missing from any file.
Each CSV is loaded in full, non-core columns are dropped, and a source_file column is added for traceability. All 12 frames are then concatenated into a single unified DataFrame.
2022_08_26.csv → 764 rows │ 2023_03_01.csv → 270 rows
2022_08_30.csv → 760 rows │ 2023_06_04.csv → 261 rows
2022_09_01.csv → 805 rows │ 2023_06_07.csv → 261 rows
2022_10_01.csv → 837 rows │ 2023_08_07.csv → 425 rows
2022_11_01.csv → 784 rows │ hpw2024517.csv → 264 rows
2022_12_31.csv → 264 rows │ hpw2024526.csv → 267 rows
─────────────────────────────────────────────────────────────
Total 5,962 rows × 17 columns
| Column(s) | Original type | Converted to | Reason |
|---|---|---|---|
published, expire, date_scrapped |
object (string) |
datetime64[ns] |
Enable date arithmetic and time-series grouping |
price |
int64 |
float64 (coerce errors) |
Handle edge cases and future NaN compatibility |
seen |
int64 |
Int64 (nullable) |
Preserve NaN capability |
urgent, buy, change, sell, gift, search, repair, parts |
int64 |
Int64 (nullable) |
All are binary 0/1 flags |
After concatenation, only 3 nulls are found, all in the description column. All other columns are complete. The price column contains valid zeros (free/gift listings) which are intentional.
Using a composite key of [description, price, published, city], 2,059 duplicate rows are identified. These are listings that appear in multiple scrape snapshots (the same listing was still active when a later scrape ran). The duplicates are intentionally retained in the dataset because they carry temporal information — a listing appearing in 3 scrapes means it was active for 3 consecutive periods. The supply/demand analysis in notebook 03 relies on per-scrape-date counts, so removing duplicates would distort the temporal signal.
The unified dataset is saved to data/processed/hispasonic_unified.csv.
Final shape: 5,962 rows × 17 columns
Date range: 2022-08-26 → 2024-05-26
Unique brands: 249
Unique cities: 49
Scrape dates: 12
| Column | Type | Description |
|---|---|---|
urgent |
Int64 (0/1) | Listing marked as urgent |
buy |
Int64 (0/1) | Active buy request (demand signal) |
change |
Int64 (0/1) | Swap/trade offer |
sell |
Int64 (0/1) | Item for sale (supply signal) |
price |
float64 | Asking price in euros (0 = free or price on request) |
gift |
Int64 (0/1) | Free item |
search |
Int64 (0/1) | Wanted/search listing (demand signal) |
repair |
Int64 (0/1) | Repair service listing |
parts |
Int64 (0/1) | Spare parts listing |
synt_brand |
object | Extracted brand name (249 unique values; "-" = unidentified) |
description |
object | Free-text listing description |
city |
object | City of the listing (49 unique values) |
published |
datetime | Original listing publication date |
expire |
datetime | Listing expiry date |
date_scrapped |
datetime | Date the listing was captured by the scraper |
seen |
Int64 | Total views accumulated by the listing |
source_file |
object | Source CSV filename (traceability) |
- 2,059 rows (34.5%) are duplicates by composite key — expected and intentional. They reflect listings that were still active when a later scrape ran and carry temporal information required by the supply/demand analysis in notebook 03.
- 3 null values in
descriptiononly; all other 16 columns are 100% complete. - Zeros in
priceare legitimate (free items, gift listings, or "price on request") — do not treat as missing. - The 2022-08-30 session has a broken brand-extraction: all 760 listings are classified as
synt_brand = "-". Confirmed and documented in notebook 02. - Three sessions (2022-12-31, 2023-03-01, 2023-06-04) have price recorded as 0 for all sell listings — a scraping artefact confirmed in notebook 02. These sessions must be excluded from any price-based analysis.
Deep exploratory analysis of the unified dataset. The focus is on temporal evolution, price dynamics, brand and city patterns, engagement metrics, and correlation groups. All 18 figures are saved automatically to reports/figures/.
- Listing volume peaks at ~800 listings/scrape in late 2022, then drops to ~260–430 from early 2023 onwards. A gap with no scraping is visible between Oct 2023 and May 2024. This irregular cadence must be accounted for in any temporal analysis.
- Three scraping sessions (2022-12-31, 2023-03-01, 2023-06-04) have missing price data (all prices recorded as 0 €) — a confirmed data quality artefact, not a real market event. The 2022-08-30 session shows a broken brand extraction where all listings fall into the "-" bucket.
- Sell listings consistently represent ~90–95% of all listing types at every point in time. Buy and Change together never exceed ~7%. The marketplace composition is structurally stable throughout the full period.
- The median sell price is remarkably stable at ~300–350 € across all sessions. The mean (~520–820 €) is systematically inflated by a thin premium segment, confirming the distribution is right-skewed with a long tail.
- Median sell price: ~250 €. IQR: €75–550. Mean: ~637 € (distorted by outliers).
- At the 99th percentile the cap is €5,500. Above that, 47 listings (0.86%) are considered outliers and likely represent high-end vintage instruments.
- The price distribution is strongly right-skewed: most instruments trade below €500, but the tail extends legitimately to several thousand euros.
- The top entry by volume is "-" (~1,530 listings) — listings where no brand could be extracted from the free text. This is a brand-detection limitation, not a real brand.
- Among identifiable brands: Roland (~580), Yamaha (~420), Korg (~390) lead by volume. Eurorack appears in 4th place but represents a format, not a manufacturer — it captures modular listings without a specific maker name.
- Hammond (~1,400 €) leads median sell price by a large margin, followed by Elektron (~620 €) and Moog (~500 €). The brands with the highest listing counts (Roland, Yamaha, Korg) sit in the mid-range at ~300–350 €, reflecting their broad product catalogues.
- Korg shows the highest and most consistent median sell price over time (~300–550 €). Roland is the most price-stable brand (±50 € across the full period). Yamaha is the most volatile.
- The brand × scrape date heatmap reveals the 2022-08-30 anomaly clearly: all named brands show 0 while "-" shows 760 — confirming a data pipeline failure on that date.
- Madrid (~1,450 listings) and Barcelona (~1,040) together account for ~42% of all listings. Girona is a surprising third (~535), which may reflect province-level geographic tagging rather than genuine city activity.
- A counterintuitive inverse relationship exists between listing volume and median price: the highest-volume cities (Madrid ~215 €, Barcelona ~230 €) show the lowest median prices, while low-volume cities (Tarragona ~390 €, Sevilla ~370 €) show the highest. Greater competition in large markets compresses prices; smaller markets see more selective, higher-value listings.
- All top cities follow the same temporal pattern: peak activity late 2022, sharp drop in early 2023, partial recovery. Girona is an outlier — it collapsed to near zero after early 2023 and never recovered, likely a data capture issue rather than a real market exit.
- The
seendistribution is strongly right-skewed with a Pareto-like shape: most listings receive 100–300 views, while a small minority reaches 4,000–6,000+. Engagement is highly concentrated. - Price vs views scatter shows no strong positive correlation. There is a weak moderate relationship (+0.37 per the correlation matrix), but visually the scatter is dominated by the dense low-price cluster. High-priced items attract a narrower, more targeted audience.
- Listings with price = 0 ("price on request") often attract high views — curiosity-driven traffic independent of price.
sell↔buy(−0.68) andsell↔change(−0.69): strong negatives by construction — the listing type flags are mutually exclusive. These reflect the data encoding scheme, not a market phenomenon.price↔seen= +0.37: the most meaningful correlation — higher-priced listings attract somewhat more views, likely because they represent more desirable or aspirational instruments.repair↔parts= +0.37: co-occurrence of repair and parts listings makes intuitive sense.parts↔seen= +0.38: parts listings attract above-average views, as a single spare part can satisfy multiple potential buyers simultaneously.- Most other pairs are near zero, indicating low multicollinearity — a favourable property for downstream modelling.
- The brand × city heatmap confirms Madrid as the dominant market for every brand. Roland (145), Yamaha (122), Korg (112) and Eurorack (83) all peak there. Elektron is noticeably more concentrated in large cities (Madrid 27, Barcelona 22), consistent with its premium boutique positioning.
- Median price by brand and listing type reveals an anomaly: Behringer "Change" listings show a median of ~1,875 €, far above Behringer's typical sell price (~110 €). This suggests trade-up dynamics — sellers offering Behringer equipment as part of a bundle exchange for a single high-value item.
- The brand × time price chart confirms that Korg commands the highest and most consistent median price among the top brands, while Roland is the most price-stable and Yamaha the most volatile.
- Data quality issues are significant — three sessions with zeroed prices and one with broken brand extraction must be excluded or imputed before any modelling.
- Hispasonic is structurally a selling platform — over 90% of listings are sales, consistently across the full period.
- Prices are right-skewed with a stable median of ~250–350 € — robust estimators or log-transformation are recommended for any regression task.
- High volume ≠ high value — the brands with the most listings (Roland, Yamaha, Korg) are not the most expensive. Hammond, Elektron, and Moog lead on price with far fewer listings.
- Larger cities have lower prices — the inverse volume-price relationship across cities reflects competitive dynamics, not geographic pricing differences.
- Engagement follows a Pareto distribution — price is a weak predictor of views (+0.37). Most traffic concentrates on a small fraction of listings.
- Eurorack should be treated as a category, not a brand — it aggregates hundreds of small module manufacturers and its statistics are not comparable to single-brand entries.
- Irregular scraping cadence — volume trends over time are as much a function of scraping frequency as real market dynamics.
| Fig | File | Description |
|---|---|---|
| 01 | 01_listing_volume_over_time.png |
Listing volume per scrape date |
| 02 | 02_price_evolution_over_time.png |
Median and mean sell price over time (with IQR band) |
| 03 | 03_listing_type_share_over_time.png |
Sell / Buy / Change / Search share over time |
| 04 | 04_price_distribution.png |
Price histogram and boxplot (sell listings, ≤p99) |
| 05 | 05_top_brands_by_listings.png |
Top 15 brands by listing count |
| 06 | 06_median_price_per_brand.png |
Median sell price per brand |
| 07 | 07_brand_evolution_over_time.png |
Top 8 brands — listing volume over time |
| 08 | 08_brand_time_heatmap.png |
Brand × scrape date heatmap |
| 09 | 09_top_cities_by_listings.png |
Top 15 cities by listing count |
| 10 | 10_median_price_per_city.png |
Median sell price per city |
| 11 | 11_city_evolution_over_time.png |
Top 6 cities — listing volume over time |
| 12 | 12_seen_distribution.png |
Views distribution |
| 13 | 13_price_vs_seen_scatter.png |
Price vs views scatter |
| 14 | 14_median_seen_per_brand.png |
Median views per brand |
| 15 | 15_correlation_matrix.png |
Correlation matrix — all numeric variables |
| 16 | 16_brand_city_heatmap.png |
Brand × City listing count heatmap |
| 17 | 17_price_by_brand_and_type.png |
Median price by brand and listing type |
| 18 | 18_price_evolution_by_brand.png |
Median sell price over time — top 5 brands |
Determine whether a relationship exists between the supply/demand balance per city and median sell prices. Test whether the D/S ratio at time t can predict prices at time t+1. All 9 figures are saved automatically to reports/figures/.
| Concept | Column(s) | Logic |
|---|---|---|
| Supply | sell == 1 |
Someone is offering an item for sale |
| Demand | buy == 1 or search == 1 |
Someone is actively looking to buy |
| D/S ratio | demand / supply × 100 | Expressed as %. > 100% = seller's market |
| Passive demand | seen |
Total views — proxy for passive interest |
- Cross-sectional: Cities with a higher D/S ratio should show higher median sell prices.
- Temporal: When the D/S ratio rises in a city at scrape t, prices should follow at scrape t+1.
| Metric | Value |
|---|---|
| Total supply listings | 5,480 |
| Total demand listings | 242 |
| Global D/S ratio | 4.42% |
| Mean D/S ratio over time | 3.9% |
| HHI (market concentration) | 1,638 (moderately concentrated) |
| Madrid + Barcelona volume share | 51.4% |
| Madrid + Barcelona HHI contribution | 82.6% |
| Fully visible cities (all thresholds met) | 5 of 15 |
| Cities with significant lagged correlation | 1 of 28 (Valencia only) |
Supply bars dominate demand in every single city. The demand signal is near-invisible at the chart scale. Four properties define the market structure:
- Supply and demand are structurally imbalanced across the entire geography
- Liquidity is concentrated in a few large hubs (Madrid, Barcelona)
- In medium and small cities, active demand is essentially absent
- The imbalance worsens proportionally as city volume decreases
The Herfindahl-Hirschman Index (HHI = 1,638) classifies the market as moderately concentrated (DOJ/FTC range: 1,500–2,500):
- Madrid alone contributes 53.9% of the HHI; Barcelona adds 28.6% — together 82.6%
- Girona appears as an anomalous 3rd contributor (~8.1%), likely reflecting supply-only batch listings rather than organic activity
- The remaining 12 cities collectively add only 9.3% — they are statistically irrelevant at national scale
The Pareto curve confirms that 6 cities (40%) account for 80% of total volume: Madrid, Barcelona, Girona, Valencia, Zaragoza and Alicante. The step from Madrid to Barcelona alone adds 21 percentage points — the steepest single step on the curve.
Cities are classified against three simultaneous criteria: market share > 2%, volume > 100 listings, active demand > 10 listings.
Only 5 cities pass all three filters: Madrid, Barcelona, Valencia, Zaragoza and Alicante — representing 68% of total market volume. Notable failure: Girona (3rd by volume, ~541 listings) is excluded because it has only 6 active demand listings. Its volume is supply-only activity — many sellers, almost no buyers.
The Market Attractiveness Score (35% Volume + 25% HHI weight + 30% D/S Ratio + 10% Visibility) ranks all 15 cities on a 0–100 scale:
| City | Score | Zone | Limiting factor |
|---|---|---|---|
| Madrid | ~79.2 | MEDIUM | Low D/S ratio (3.49%) caps demand component |
| Barcelona | ~56.3 | MEDIUM | Even lower D/S ratio (2.84%) |
| Valencia | ~41 | LOW | Good D/S ratio but insufficient volume |
| Alicante | ~35 | LOW | Same profile as Valencia but smaller |
| Baleares | ~33 | LOW | Best D/S ratio (11.36%) but tiny volume |
No city reaches HIGH (≥80). Madrid scores 79.2 — just below the threshold. The HIGH zone is structurally unreachable because no city simultaneously combines high volume, high HHI weight, D/S ratio > 10% and full visibility.
Every bar is blue — every city is a buyer's market. Three clusters emerge by ratio level:
- Cold (ratio < 3%): Bizkaia (0.86%), Girona (1.12%), Barcelona (2.84%). Near-zero demand signal despite economic weight. Barcelona's large supply base dilutes its 28 demand listings to 2.84%.
- Warm (3–10%): Madrid (3.49%), Valencia, Zaragoza, Sevilla, Alicante, Granada. Most of the market lives here.
- Above 10% — only Baleares (11.36%): the single city closest to a seller's market; the only one above the 10% demand health threshold.
High-ratio outliers (Ciudad Real 57%, La Rioja 38%, Gipuzkoa 36%) are small-denominator artefacts: a handful of demand listings over near-zero supply. They are not actionable.
Pearson correlation between D/S ratio and median price across cities:
- r = −0.291, p = 0.0529 — not statistically significant
- Direction is negative (counterintuitive): cities with more relative demand tend to have lower prices
Two explanations for the negative direction: (1) high-ratio cities are small markets trading cheap instruments — buyers seek affordable gear; (2) synthesiser prices are nationally referenced — local D/S ratio gives no pricing power because buyers can order from any city. Local ratio analysis must be combined with brand/instrument-type data to understand price drivers.
Bizkaia is the most striking counterexample: lowest D/S ratio in the chart (0.86% — almost no buyers) yet one of the highest median prices (~€350). This makes the independence between local demand ratio and price visually undeniable. Price is driven by the composition of the local listing mix, not by local supply/demand dynamics.
Supply and demand are tracked per city across the 12 scrape dates. The demand line runs flat near zero for all cities throughout 2022–2024 — supply and demand never converge.
- The early-2023 supply collapse is the dominant temporal event. Supply halved synchronously across all cities simultaneously in January 2023. This synchrony confirms it is a platform-wide event (scraping frequency change or platform restructuring), not a genuine market contraction.
- Girona's collapse: supply peaked at ~125 in late 2022 then dropped to near zero after January 2023, never recovering. The resulting D/S ratio spike to 100% in July 2023 is a zero-denominator artefact (one demand listing against zero supply), not a real market event.
- Valencia is the only city that consistently maintains a slightly elevated D/S ratio (5–13%) relative to the others, foreshadowing its significance in the lagged correlation analysis.
At national level, supply halved from ~750 to ~250 listings in January 2023 and never recovered. The mean D/S ratio is 3.9% with no trend toward equilibrium. The price time series shows two sparsity artefacts in early 2023 (near-zero median, then a sharp recovery) — these are data quality events, not real market movements.
Mean views per city vs median price:
- r = −0.003, p = 0.985 — essentially zero correlation
- Cheap listings attract more views, pulling the aggregate against any positive price signal
- Views reflect curiosity, not purchase intent; geographic distribution of viewers is unknown
seenmay be useful as a listing-level engagement indicator but loses all predictive signal when aggregated by city
For each city with ≥4 scrape periods of non-zero data (28 of 49 cities pass this filter), Pearson correlation is computed between D/S ratio(t) and median price(t+1):
- Valencia: r = 0.819, p = 0.002 — the only statistically significant result (robust: probability of observing r ≥ 0.819 by chance with n = 11 is 0.2%)
- All other 27 cities: not significant (p > 0.05)
Why only Valencia? It is the one city where three conditions coincide: sufficient temporal coverage with non-zero demand in most periods, a moderately consistent buyer base (27 demand listings spread over time), and enough price variance. Madrid and Barcelona have more data but their D/S ratio is so consistently low that there is no meaningful variation for the correlation to capture.
- Permanent buyer's market. The global D/S ratio is 4.42% — for every 100 sell listings, only 4 active buyers exist. This condition holds consistently across all cities and all time periods without exception. The market would need to roughly double its active demand, or halve its supply, to approach equilibrium.
- Central hypothesis rejected. D/S ratio does not predict price (r = −0.291, p = 0.0529). Synthesiser prices are determined by the instrument itself (brand, model, category) and by national reference pricing — not by local supply/demand pressure.
- Market concentrated in two cities. Madrid + Barcelona account for 51.4% of volume and 82.6% of the HHI. Any national-level model is primarily shaped by these two markets. City-specific segmentation is methodologically required.
- Only 5 cities are strategically actionable under the composite visibility threshold (Madrid, Barcelona, Valencia, Zaragoza, Alicante). Together they represent 68% of total market volume.
- No city justifies priority investment. The highest composite score achieved is 79.2 (Madrid), just below the HIGH threshold of 80. The HIGH zone is structurally unreachable at current market conditions.
- Valencia is the only exception. It is the only city where D/S ratio(t) predicts median price(t+1) with statistical significance (r = 0.819, p = 0.002). It is the only market in the dataset where the classic supply/demand pricing mechanism is empirically active.
- Views are not a price predictor at the city level (r ≈ 0, p = 0.985). The
seenmetric measures passive curiosity, not purchase intent. - The early-2023 structural break is the main methodological risk. Supply dropped by half synchronously across all cities in January 2023 and never recovered. Any longitudinal model that spans this break should treat it as a regime change, not a continuous series.
| Fig | File | Description |
|---|---|---|
| 19 | 19_supply_vs_demand_by_city.png |
Supply vs demand side by side — top 15 cities |
| 20 | 20_sd_ratio_ranking_by_city.png |
D/S ratio ranking — all markets are buyer's markets |
| 21 | 21_sd_ratio_vs_price_scatter.png |
D/S ratio vs median price scatter (with regression) |
| 22 | 22_sd_ratio_and_price_dual_axis.png |
D/S ratio and median price dual axis — top 12 cities |
| 23 | 23_supply_demand_over_time_by_city.png |
Supply and demand over time — top 5 cities |
| 24 | 24_sd_ratio_evolution_by_city.png |
D/S ratio evolution over time — top 5 cities |
| 25 | 25_global_supply_demand_ratio_price.png |
Global market: supply, demand, ratio and price over time |
| 26 | 26_seen_vs_price_by_city.png |
Mean views (passive demand) vs median price per city |
| 27 | 27_lagged_correlation_sd_ratio_vs_price.png |
Lagged correlation: does D/S ratio at (t) predict price at (t+1)? |
| Category | Tools |
|---|---|
| Language | Python 3.9 |
| Scraping | requests, BeautifulSoup4 |
| Data | pandas 2.2, numpy 2.0, pyarrow |
| Visualisation | matplotlib 3.9, seaborn 0.13, altair 5 |
| Statistics / ML | scipy.stats, scikit-learn 1.6, XGBoost 2.1, shap, numba |
| Environment | JupyterLab 4.4, streamlit 1.50 |
| Package manager | conda (jupyterlab_new.yaml) |
Alberto Jiménez — datablogcafe.com | GitHub
Repository: github.com/albertjimrod/hispasonic
MIT License — feel free to use and modify.