Main#1

Open
lukefryer1234 wants to merge 4 commits into main-old from main

Conversation

@lukefryer1234

Owner

No description provided.

… done so far and provide feedback for Jules to continue.
This commit introduces a Python script (`wiki_scraper.py`) designed to scrape textual content from a wiki or website, starting from a given URL and following links within the same domain.

Key features implemented:
- HTML Fetching: Downloads web page content using the `requests` library.
- HTML Parsing: Uses `BeautifulSoup` to parse HTML, extract text, and identify hyperlinks.
- Content Extraction: Allows specifying a CSS selector via a command-line argument (`--selector`) to target the main content area of pages. Defaults to scraping the entire `<body>`.
- Crawling: Implements a breadth-first crawling mechanism using a queue to visit pages and a set to track visited URLs, preventing redundant processing and loops.
- Link Filtering: Ensures that the crawler only follows links that belong to the same domain as the starting URL.
- Content Aggregation: Consolidates text from all scraped pages into a single output.
- File Output: Saves the aggregated content to a user-specified file (defaulting to `scraped_content.txt`) in UTF-8 encoding.
- Command-Line Interface: Provides a CLI using `argparse` for specifying the start URL (required), output file name (optional), and content CSS selector (optional).
- Documentation: Includes comprehensive docstrings within the script and an updated `README.md` with installation prerequisites and usage instructions.
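
The crawling and link-filtering behaviour listed above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the code in `wiki_scraper.py`: it uses the standard library's `html.parser` in place of `BeautifulSoup`, and the names `crawl`, `fetch`, `max_pages`, `LinkCollector`, and `same_domain` are hypothetical. Passing the fetcher in as a function is an assumption made so the sketch can run without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain(url, start_url):
    """True when url has the same network location as start_url."""
    return urlparse(url).netloc == urlparse(start_url).netloc


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure.

    Returns the URLs in the order they were successfully visited.
    """
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    visited = {start_url}        # set prevents re-queuing and loops
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if same_domain(link, start_url) and link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

With `requests`, the `fetch` argument could be a small wrapper around `requests.get(url, timeout=10)` that returns `response.text` on success and `None` on an error; that wrapper is likewise an assumption rather than something taken from this PR.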

The script is designed as a flexible tool for extracting and consolidating information from web-based wikis or similar sites.

This commit introduces two main enhancements:

1.  Markdown Output Format:
    *   The `wiki_scraper.py` script now supports outputting scraped content in Markdown (`.md`) format, in addition to the existing plain text (`.txt`) format.
    *   A new command-line argument `--format` (choices: `txt`, `md`; default: `txt`) allows you to specify your desired output format.
    *   The output filename extension is automatically adjusted based on the selected format (e.g., `data.md` if `md` is chosen).
    *   The content separator between pages in `crawl_wiki` has been changed to `\n\n` for better Markdown compatibility.

2.  Comprehensive "How to Use" Document:
    *   A new `HOW_TO_USE.md` file has been added to the repository.
    *   This document provides detailed instructions on the script's features, installation, command-line arguments (with examples), tips for finding CSS selectors for content extraction, and basic troubleshooting advice.
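
As a rough illustration of the `--format` handling described in item 1, the sketch below shows one way the argument and the extension adjustment could fit together. Only `--selector`, `--format`, its choices, the `scraped_content.txt` default, and the `\n\n` separator come from the commit message; the positional `start_url`, the `--output` flag name, and the helpers `build_parser`, `output_path`, and `join_pages` are assumptions made for the example.

```python
import argparse
from pathlib import Path


def build_parser():
    """CLI sketch; the positional name and --output flag are assumed."""
    parser = argparse.ArgumentParser(description="Scrape a wiki into one file.")
    parser.add_argument("start_url", help="URL to start crawling from")
    parser.add_argument("--output", default="scraped_content.txt")
    parser.add_argument("--selector", default=None)
    parser.add_argument("--format", choices=["txt", "md"], default="txt")
    return parser


def output_path(filename, fmt):
    """Swap (or add) the extension so it matches the chosen format."""
    return Path(filename).with_suffix("." + fmt)


def join_pages(pages):
    """Separate pages with a blank line, which Markdown treats as a paragraph break."""
    return "\n\n".join(pages)
```

For example, parsing `["https://wiki.example", "--format", "md", "--output", "data.txt"]` and passing the result to `output_path` yields `data.md`.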

These changes give the script more flexible output options and better usage documentation.

This commit introduces a graphical user interface (GUI) for the wiki scraper,
built using Tkinter, and includes corresponding documentation updates.

Key changes:

1.  **New GUI Scraper (`gui_scraper.py`):**
    *   A new `gui_scraper.py` script provides a user-friendly graphical
      interface for the web scraper.
    *   Features include input fields for Start URL and CSS Selector,
      radio buttons for output format (TXT/MD), a "Browse..." button
      to select the output file path, a "START SCRAPING" button, and
      a scrollable status area for progress and error messages.
    *   The GUI runs the core scraping logic (`crawl_wiki` from
      `wiki_scraper.py`) in a separate thread to prevent UI freezing.
    *   GUI updates (status messages, button states) are handled in a
      thread-safe manner using `root.after()`.

2.  **Enhanced `wiki_scraper.py` for GUI Integration:**
    *   The `crawl_wiki` function in `wiki_scraper.py` has been updated
      to accept an optional `status_callback` function. This allows
      it to send detailed progress updates to the GUI.
    *   If no callback is provided, it defaults to printing status
      updates to the console, maintaining its CLI functionality.

3.  **Updated Documentation:**
    *   `README.md` has been updated to include information about both
      the CLI (`wiki_scraper.py`) and the new GUI (`gui_scraper.py`)
      versions, explaining how to run each.
    *   `HOW_TO_USE.md` has been significantly revised to provide
      comprehensive instructions for both interfaces. It now has
      dedicated sections for CLI and GUI usage, installation,
      troubleshooting, and tips for CSS selectors, applicable to both.
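
The callback and threading arrangement described in items 1 and 2 can be sketched without Tkinter. This is an illustration under stated assumptions, not the code from this PR: the `crawl_wiki` below is a stand-in whose `urls` parameter is hypothetical (only the optional `status_callback` comes from the commit message), and handing messages over through a `queue.Queue` is one possible approach that the PR does not confirm.

```python
import queue
import threading


def crawl_wiki(urls, status_callback=None):
    """Stand-in for the real crawl_wiki; only the callback handling is shown."""
    if status_callback is None:
        status_callback = print  # CLI behaviour: report to the console
    for url in urls:
        status_callback(f"Scraping: {url}")
    status_callback("Done.")


def run_in_background(urls, messages):
    """Start the crawl on a worker thread; status lines land in `messages`."""
    worker = threading.Thread(
        target=crawl_wiki, args=(urls, messages.put), daemon=True
    )
    worker.start()
    return worker


def drain(messages):
    """Collect every message currently queued, without blocking."""
    lines = []
    while True:
        try:
            lines.append(messages.get_nowait())
        except queue.Empty:
            return lines
```

In a Tkinter GUI, a function that calls `drain` and appends the lines to the status widget could reschedule itself with `root.after(100, ...)`, so that widgets are only touched from the main thread.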

This GUI makes the scraper more accessible if you prefer a graphical
interface over a command-line tool.