Main by lukefryer1234 · Pull Request #1 · lukefryer1234/Script-for-mafiawiki

lukefryer1234 · 2025-06-17T19:19:30Z

No description provided.


This commit introduces a Python script (`wiki_scraper.py`) designed to scrape textual content from a wiki or website, starting from a given URL and following links within the same domain.

Key features implemented:

- HTML Fetching: Downloads web page content using the `requests` library.
- HTML Parsing: Uses `BeautifulSoup` to parse HTML, extract text, and identify hyperlinks.
- Content Extraction: Allows specifying a CSS selector via a command-line argument (`--selector`) to target the main content area of pages. Defaults to scraping the entire `<body>`.
- Crawling: Implements a breadth-first crawling mechanism using a queue to visit pages and a set to track visited URLs, preventing redundant processing and loops.
- Link Filtering: Ensures that the crawler only follows links that belong to the same domain as the starting URL.
- Content Aggregation: Consolidates text from all scraped pages into a single output.
- File Output: Saves the aggregated content to a user-specified file (defaulting to `scraped_content.txt`) in UTF-8 encoding.
- Command-Line Interface: Provides a CLI using `argparse` for specifying the start URL (required), output file name (optional), and content CSS selector (optional).
- Documentation: Includes comprehensive docstrings within the script and an updated `README.md` with installation prerequisites and usage instructions.

The script is designed to be a flexible tool for you when you need to extract and consolidate information from web-based wikis or similar sites.
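The crawling and link-filtering behaviour described in this commit can be illustrated with a minimal sketch. This is not the code from `wiki_scraper.py`: the names `crawl_same_domain`, `LinkCollector`, `fetch`, `max_pages`, and the `wiki.example` pages are hypothetical, and the standard library's `html.parser` is swapped in for `BeautifulSoup` and an in-memory dictionary for `requests` so the example runs without dependencies or network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_same_domain(start_url, fetch, max_pages=50):
    """Breadth-first crawl restricted to the start URL's domain.

    `fetch(url)` is a hypothetical hook returning HTML text or None.
    Returns the URLs in the order they were successfully visited.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    visited = set()              # prevents loops and repeat visits
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return order


# Tiny in-memory "site" (assumption) standing in for real HTTP requests.
PAGES = {
    "https://wiki.example/a": '<a href="/b">b</a><a href="https://other.example/x">x</a>',
    "https://wiki.example/b": '<a href="/a">a</a><a href="/c#top">c</a>',
    "https://wiki.example/c": "no links",
}
crawled = crawl_same_domain("https://wiki.example/a", PAGES.get)
```

Here the link to `other.example` is never queued because its domain differs from the start URL, and the back-link from `/b` to `/a` is skipped because `/a` is already in the visited set.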

This commit introduces two main enhancements:

1. Markdown Output Format:
   * The `wiki_scraper.py` script now supports outputting scraped content in Markdown (`.md`) format, in addition to the existing plain text (`.txt`) format.
   * A new command-line argument `--format` (choices: `txt`, `md`; default: `txt`) allows you to specify your desired output format.
   * The output filename extension is automatically adjusted based on the selected format (e.g., `data.md` if `md` is chosen).
   * The content separator between pages in `crawl_wiki` has been changed to `\n\n` for better Markdown compatibility.
2. Comprehensive "How to Use" Document:
   * A new `HOW_TO_USE.md` file has been added to the repository.
   * This document provides detailed instructions on the script's features, installation, command-line arguments (with examples), tips for finding CSS selectors for content extraction, and basic troubleshooting advice.

These changes make the script more versatile in its output capabilities and provide you with better documentation for its usage.
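As a rough illustration of the `--format` option and the automatic extension adjustment, the sketch below uses `argparse` and `pathlib`. Only `--selector`, `--format`, its choices, and the `scraped_content.txt` default come from the commit messages; the positional `url` argument, the `--output` flag name, and the helper names `build_parser` and `adjust_extension` are assumptions made for this example.

```python
import argparse
from pathlib import Path


def build_parser():
    """Builds a CLI resembling the one described (argument names partly assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the scraper CLI")
    parser.add_argument("url", help="start URL (assumed to be positional)")
    parser.add_argument("--output", default="scraped_content.txt",
                        help="output file name (flag name is an assumption)")
    parser.add_argument("--selector", default=None,
                        help="CSS selector for the main content area")
    parser.add_argument("--format", choices=["txt", "md"], default="txt",
                        help="output format")
    return parser


def adjust_extension(filename, fmt):
    """Returns `filename` with its extension replaced by the chosen format."""
    return str(Path(filename).with_suffix("." + fmt))


args = build_parser().parse_args(["https://wiki.example/Main", "--format", "md"])
output_name = adjust_extension(args.output, args.format)
```

With these assumed names, choosing `md` while leaving the output file at its default yields `scraped_content.md`, and a bare name such as `data` becomes `data.md`.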

This commit introduces a graphical user interface (GUI) for the wiki scraper, built using Tkinter, and includes corresponding documentation updates.

Key changes:

1. **New GUI Scraper (`gui_scraper.py`):**
   * A new `gui_scraper.py` script provides a user-friendly graphical interface for the web scraper.
   * Features include input fields for Start URL and CSS Selector, radio buttons for output format (TXT/MD), a "Browse..." button to select the output file path, a "START SCRAPING" button, and a scrollable status area for progress and error messages.
   * The GUI runs the core scraping logic (`crawl_wiki` from `wiki_scraper.py`) in a separate thread to prevent UI freezing.
   * GUI updates (status messages, button states) are handled in a thread-safe manner using `root.after()`.
2. **Enhanced `wiki_scraper.py` for GUI Integration:**
   * The `crawl_wiki` function in `wiki_scraper.py` has been updated to accept an optional `status_callback` function. This allows it to send detailed progress updates to the GUI.
   * If no callback is provided, it defaults to printing status updates to the console, maintaining its CLI functionality.
3. **Updated Documentation:**
   * `README.md` has been updated to include information about both the CLI (`wiki_scraper.py`) and the new GUI (`gui_scraper.py`) versions, explaining how to run each.
   * `HOW_TO_USE.md` has been significantly revised to provide comprehensive instructions for both interfaces. It now has dedicated sections for CLI and GUI usage, installation, troubleshooting, and tips for CSS selectors, applicable to both.

This GUI makes the scraper more accessible if you prefer a graphical interface over a command-line tool.
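The optional `status_callback` pattern and the background-thread arrangement can be sketched without a live Tkinter window. The function `crawl_wiki_stub` below is a hypothetical stand-in for `crawl_wiki` (its real signature is not shown in this pull request), and a `queue.Queue` is swapped in for the GUI's status area so the example runs headless.

```python
import queue
import threading


def crawl_wiki_stub(urls, status_callback=None):
    """Hypothetical stand-in for crawl_wiki; reports progress via a callback."""
    # Fall back to console output when no callback is given (CLI behaviour).
    report = status_callback if status_callback is not None else print
    results = []
    for url in urls:
        report(f"Scraping {url}")
        results.append(url)
    report("Done")
    return results


# The queue stands in for the GUI's status area (assumption for this sketch).
messages = queue.Queue()

# Run the crawl in a worker thread so the caller stays responsive.
worker = threading.Thread(
    target=crawl_wiki_stub,
    args=(["page-1", "page-2"],),
    kwargs={"status_callback": messages.put},
    daemon=True,
)
worker.start()
worker.join()

# Drain the messages in the order the worker produced them.
received = []
while not messages.empty():
    received.append(messages.get_nowait())
```

In a Tkinter application, the callback would presumably hand each message to the main loop instead, for example by scheduling a widget update with `root.after()` as the commit message describes; the exact wiring used in `gui_scraper.py` is not visible here.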

google-labs-jules Bot added 4 commits June 17, 2025 18:34

Jules was unable to complete the task in time. Please review the work…

ece3366

… done so far and provide feedback for Jules to continue.

lukefryer1234 wants to merge 4 commits into main-old from main

lukefryer1234 commented Jun 17, 2025
