The idea is simple: humans are naturally great at creating mosaic art. From the Roman Empire to French Neo-Impressionism, we can effortlessly place individual strokes to form a larger, coherent image, balancing local action with global structure.
Large Language Models, however, struggle with this because they fundamentally lack spatial grounding. Autoregressive Mosaics is an attempt to force an LLM trained only on text to paint a picture one discrete pixel at a time. The system gives the model a blank grid (M x N) and a text prompt; the model must infer where to place structure and color step-by-step using only its linguistic priors.
The results are often visually primitive, unstable, or unintentionally abstract, but that is exactly the point. They offer a raw look into how text-only models represent (and fracture) geometry, shape, and visual concepts.
As with any art, outputs are open to interpretation. Squint a little: what do you see? Does the result resemble what you asked for?
This project currently uses:
Qwen/Qwen2.5-14B-Instruct
Qwen2.5-14B-Instruct is a text-first instruction-following language model. It is trained on large-scale mixed corpora (natural language + code) and tuned for instruction completion, reasoning, and structured generation. It is not a native image model in this setup, and it does not receive pixel tensors or vision encoder features here.
That makes the behavior in this project interesting: the model can still produce outputs that resemble visual structure, even though it is only generating text tokens.
If a language model is trained primarily to model text and code, to what extent can it still recover coherent 2D visual concepts when forced to act as a pixel-level or programmatic painter?
Autoregressive Mosaics treats this as an empirical question by constraining generation and observing where geometry emerges, degrades, or collapses.
To explore this phenomenon, the project includes two distinct generation pipelines.
In this approach, the model behaves like a literal cell-by-cell painter.
- In a single forward pass, the LLM generates:
- an ASCII topology grid inside
<ascii>...</ascii> - a symbol-to-color map inside
<palette>...</palette>
- an ASCII topology grid inside
- Each grid cell is directly represented in text, so the model must make an explicit decision per position.
- The backend parses, sanitizes, and force-fits the result to exact
M x Nshape, then maps characters to HEX colors.
Why this fails interestingly:
- The model predicts tokens in a strict 1D sequence.
- 2D consistency (object boundaries, symmetry, position memory) is hard to sustain over long generations.
- Shapes can drift, tear, collapse, or mutate across rows, producing fragmented but often compelling abstractions.
In this approach, the model behaves like a mosaic artist who writes code.
- Instead of raw pixels, the LLM outputs Python rendering logic (
render(canvas)). - The code uses a constrained drawing API (
fill,set_pixel,rect,line,circle,triangle). - A deterministic renderer executes that code and rasterizes the final grid.
Why this performs better:
- The model can express intent in compact symbolic form ("draw a circle at center") rather than committing to every cell token.
- Deterministic geometry handles exact spatial bookkeeping.
- This aligns with LLM strengths: symbolic decomposition, procedural logic, and code synthesis.
- The result is a neuro-symbolic pipeline: language model for high-level plan, strict engine for spatial execution.
ver2-asciicanvas/- ASCII topology + palette generation backend and UI.ver3-codecanvas/- Code-generation neuro-symbolic backend and UI.results/- Sample outputs, visualization script, and project banner.backend.py,index.html- earlier root-level prototype files.
- Python 3.10+
- PyTorch + Transformers stack
- GPU recommended for Qwen 14B
Install typical dependencies in your environment (example names may vary by setup):
pip install fastapi uvicorn torch transformers acceleratecd ver2-asciicanvas
python backend.pyThen open: http://localhost:8123
cd ver3-codecanvas
python backend.pyThen open: http://localhost:8123
Note: both versions default to port 8123, so run one backend at a time.
- This is not a production image generator.
- This is an interpretability-flavored art experiment probing the boundary between text autoregression and spatial reasoning.
- Failures are part of the signal, not just noise.
Copyright © 2026. All Rights Reserved.
This code is provided for viewing purposes only in conjunction with the CVPR art gallery. Copying, modification, distribution, and derivative works without citations are prohibited.
If you reference this work or repository, please cite it as follows:
Plain Text: A. Nedungadi, "Autoregressive Mosaics." GitHub, 2026. [Online]. Available: https://github.com/ashwin-ned/autoregressive-mosaics
BibTeX:
@misc{ned2026autoregressivemosaics,
author = {Nedungadi, Ashwin},
title = {Autoregressive Mosaics},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{[https://github.com/ashwin-ned/autoregressive-mosaics](https://github.com/ashwin-ned/autoregressive-mosaics)}}
}
