Conversation

allisonwang-db (Owner) commented Aug 4, 2025

Summary by CodeRabbit

  • New Features

    • Added documentation and README sections describing code formatting and linting with Ruff, including example commands.
    • Added a Ruff code style badge to the README.
  • Chores

    • Added Ruff as a development dependency and configured its formatting and linting rules.
  • Style

    • Applied consistent code formatting and style improvements across the codebase, including string quoting, whitespace, indentation, and line breaks.
    • Reformatted test files for improved readability and style consistency.
  • Documentation

    • Updated documentation to include instructions for using Ruff for code formatting and linting.

coderabbitai bot commented Aug 4, 2025

Walkthrough

This change set introduces and documents the use of the Ruff tool for code formatting and linting across the project. Documentation files were updated to describe Ruff usage and add a badge. The Ruff dependency and configuration were added to the project configuration. The remainder of the changes are code formatting adjustments—such as string quoting, whitespace, and line breaks—applied consistently across code and test files, with no changes to logic or functionality.

Changes

Cohort / File(s) and change summary:

Documentation Updates (README.md, CLAUDE.md):
Added a Ruff code style badge to the README and introduced a "Code Formatting" section in both documentation files, providing instructions and example commands for using Ruff.

Ruff Tool Integration (pyproject.toml):
Added Ruff as a development dependency and configured its formatting and linting settings, including line length, quote style, and enabled lint codes.

Formatting: Data Source Implementations (pyspark_datasources/arrow.py, pyspark_datasources/fake.py, pyspark_datasources/github.py, pyspark_datasources/googlesheets.py, pyspark_datasources/huggingface.py, pyspark_datasources/kaggle.py, pyspark_datasources/lance.py, pyspark_datasources/opensky.py, pyspark_datasources/salesforce.py, pyspark_datasources/simplejson.py, pyspark_datasources/stock.py, pyspark_datasources/weather.py):
Applied Ruff-driven code formatting: consistent double quotes, whitespace adjustments, line breaks, and trailing commas. No logic or control flow changes.

Formatting: Example and Test Files (examples/salesforce_sink_example.py, tests/test_data_sources.py, tests/test_google_sheets.py):
Reformatted code for consistent style, including string quoting, line breaks, whitespace, and argument formatting. No changes to test logic or assertions.

Sequence Diagram(s)

Not applicable: All changes are related to documentation, configuration, and code formatting, with no new features or control flow modifications.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Poem

Ruff has hopped into our code,
With double quotes and lines bestowed.
Whitespace trimmed, the docs now gleam,
Formatting’s neat as a rabbit’s dream!
🐇✨
No logic changed, just style refined—
A tidy warren, code aligned!

coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (2)
pyspark_datasources/github.py (1)

55-63: Authentication header is built but never sent

header is prepared (and may include an OAuth token) yet the request is executed without it, causing rate-limit issues and failures on private repos.

-        response = requests.get(url)
+        response = requests.get(url, headers=header)

Please add a test for both public and token-authenticated calls.
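
For illustration, a minimal pytest sketch of such a test. The fetch_pulls helper below is hypothetical and only stands in for the reader's request logic; the real method name and auth scheme may differ.

from unittest.mock import MagicMock, patch

import requests


def fetch_pulls(url: str, token: str | None = None) -> list:
    # Hypothetical helper standing in for the GithubDataSource request logic.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.json()


@patch("requests.get")
def test_public_and_token_authenticated_calls(mock_get):
    mock_get.return_value = MagicMock(status_code=200, json=lambda: [])

    # Public call: no Authorization header should be sent.
    fetch_pulls("https://api.github.com/repos/apache/spark/pulls")
    _, kwargs = mock_get.call_args
    assert kwargs["headers"] == {}

    # Token call: the header must actually reach requests.get.
    fetch_pulls("https://api.github.com/repos/apache/spark/pulls", token="t0k3n")
    _, kwargs = mock_get.call_args
    assert kwargs["headers"]["Authorization"] == "Bearer t0k3n"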

pyspark_datasources/opensky.py (1)

76-82: Wildcard import violates Ruff (F405) and obscures dependencies.

from pyspark.sql.types import * masks the module’s real dependencies and is explicitly banned by Ruff. Replace with the concrete symbols you use:

-from pyspark.sql.types import *
+from pyspark.sql.types import (
+    StructType,
+    StructField,
+    TimestampType,
+    StringType,
+    DoubleType,
+    BooleanType,
+    ArrayType,
+    IntegerType,
+)

This eliminates the F405 errors on every downstream type reference and clarifies the public surface.

🧹 Nitpick comments (8)
pyproject.toml (1)

40-40: Pin Ruff to an exact version to avoid CI breakages

ruff evolves rapidly and 0.x releases do not guarantee backwards-compatibility. Relying on a caret range (^0.6.0) means the next minor bump (0.7) could suddenly fail the build or re-format the entire repo. Consider pinning to the exact version you have vetted (or at least the minor version) and upgrading deliberately.

-ruff = "^0.6.0"
+ruff = "==0.6.0"
pyspark_datasources/arrow.py (1)

154-162: Catch specific exceptions instead of the broad Exception

Catching every possible exception masks programming errors and makes debugging harder. In this context you mainly need to handle I/O and Arrow parsing problems.

-        except Exception as e:
-            raise RuntimeError(f"Failed to read Arrow file {file_path}: {str(e)}")
+        except (FileNotFoundError, pa.ArrowInvalid, OSError) as e:
+            raise RuntimeError(f"Failed to read Arrow file {file_path}: {e}") from e

This also preserves the original traceback via from e.
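
As a standalone illustration of the narrower handler (a sketch assuming a plain pyarrow IPC file read, not the datasource's actual method):

import pyarrow as pa


def read_arrow_table(file_path: str) -> pa.Table:
    """Read an Arrow IPC file, wrapping only I/O and Arrow parsing failures."""
    try:
        reader = pa.ipc.open_file(file_path)
        return reader.read_all()
    except (FileNotFoundError, OSError, pa.ArrowInvalid) as e:
        # "from e" keeps the original traceback attached to the RuntimeError.
        raise RuntimeError(f"Failed to read Arrow file {file_path}: {e}") from e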

README.md (2)

4-4: Badge URL: confirm it stays evergreen

The badge links directly to the main branch of Ruff. If Ruff ever reorganises its assets the endpoint may break.
Consider pinning to a specific tag or mirroring the JSON to avoid a future 404.


104-117: Great to document Ruff—provide an automated gate

Docs tell contributors how to run Ruff manually, but nothing enforces it.
Recommend adding a CI step (e.g. GitHub Actions) or a pre-commit hook so formatting & linting are automatically verified on every PR.

CLAUDE.md (1)

69-76: Mirror README content to avoid drift

The Ruff section duplicates the README. To prevent these docs diverging, consider extracting the commands into a single Markdown include or linking to the README section.

pyspark_datasources/opensky.py (2)

184-186: Use builtin generics to silence UP006 and modernise typing.

Ruff flags Dict[str, int] here (and similar List, Tuple elsewhere). Python 3.9+ allows the builtin dict, list, tuple with subscripts—no need for typing aliases.

-    def initialOffset(self) -> Dict[str, int]:
+    def initialOffset(self) -> dict[str, int]:
         return {"last_fetch": 0}

Apply the same replacement to readBetweenOffsets, read, and any other signatures for full compliance.
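
For reference, a small sketch of the modernised annotations on Python 3.9+; the class and the second method are illustrative, not the exact opensky.py signatures.

class ExampleStreamReader:
    def initialOffset(self) -> dict[str, int]:
        return {"last_fetch": 0}

    def plan(self, start: dict[str, int], end: dict[str, int]) -> list[tuple[int, int]]:
        # Builtin generics (dict/list/tuple) replace typing.Dict/List/Tuple here.
        return [(start["last_fetch"], end["last_fetch"])]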


187-195: Minor docstring typo hampers grepability.

"""Ensure e MIN_REQUEST_INTERVAL seconds between requests""" contains an stray “e”. It should read:

-        """Ensure e MIN_REQUEST_INTERVAL seconds between requests"""
+        """Ensure MIN_REQUEST_INTERVAL seconds between requests"""

Purely cosmetic but worth fixing while touching the file.
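
For context, the docstring describes a simple client-side throttle. A self-contained sketch of that pattern, using illustrative names rather than the module's actual attributes:

import time

MIN_REQUEST_INTERVAL = 10.0  # seconds; illustrative value


class ThrottledClient:
    def __init__(self) -> None:
        self._last_request = 0.0

    def _ensure_rate_limit(self) -> None:
        """Ensure MIN_REQUEST_INTERVAL seconds between requests."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < MIN_REQUEST_INTERVAL:
            time.sleep(MIN_REQUEST_INTERVAL - elapsed)
        self._last_request = time.monotonic()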

pyspark_datasources/lance.py (1)

86-88: Minor naming nitpick (optional)

msg inside the list-comprehension is perfectly valid, but renaming to commit_msg (or similar) would be marginally clearer and avoid shadowing the more generic term “message” used elsewhere.

-        fragments = [fragment for msg in messages for fragment in msg.fragments]
+        fragments = [fragment for commit_msg in messages for fragment in commit_msg.fragments]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8985da2 and a08f5f0.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • CLAUDE.md (1 hunks)
  • README.md (2 hunks)
  • examples/salesforce_sink_example.py (2 hunks)
  • pyproject.toml (1 hunks)
  • pyspark_datasources/arrow.py (4 hunks)
  • pyspark_datasources/fake.py (0 hunks)
  • pyspark_datasources/github.py (1 hunks)
  • pyspark_datasources/googlesheets.py (2 hunks)
  • pyspark_datasources/huggingface.py (2 hunks)
  • pyspark_datasources/kaggle.py (1 hunks)
  • pyspark_datasources/lance.py (2 hunks)
  • pyspark_datasources/opensky.py (10 hunks)
  • pyspark_datasources/salesforce.py (6 hunks)
  • pyspark_datasources/simplejson.py (2 hunks)
  • pyspark_datasources/stock.py (3 hunks)
  • pyspark_datasources/weather.py (4 hunks)
  • tests/test_data_sources.py (5 hunks)
  • tests/test_google_sheets.py (2 hunks)
💤 Files with no reviewable changes (1)
  • pyspark_datasources/fake.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: in pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(t...
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.695Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.

Applied to files:

  • pyspark_datasources/opensky.py
🪛 Ruff (0.12.2)

pyspark_datasources/opensky.py

  • 184: Use dict instead of Dict for type annotation (UP006)
  • 249: Use list instead of List for type annotation (UP006)
  • 249: Use tuple instead of Tuple for type annotation (UP006)
  • 436: Use dict instead of Dict for type annotation (UP006)
  • 452, 453: StructType may be undefined, or defined from star imports (F405)
  • 455: StructField, TimestampType may be undefined, or defined from star imports (F405)
  • 456: StructField, StringType may be undefined, or defined from star imports (F405)
  • 457: StructField, StringType may be undefined, or defined from star imports (F405)
  • 458: StructField, StringType may be undefined, or defined from star imports (F405)
  • 459: StructField, TimestampType may be undefined, or defined from star imports (F405)
  • 460: StructField, TimestampType may be undefined, or defined from star imports (F405)
  • 461: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 462: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 463: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 464: StructField, BooleanType may be undefined, or defined from star imports (F405)
  • 465: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 466: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 467: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 468: StructField, ArrayType, IntegerType may be undefined, or defined from star imports (F405)
  • 469: StructField, DoubleType may be undefined, or defined from star imports (F405)
  • 470: StructField, StringType may be undefined, or defined from star imports (F405)
  • 471: StructField, BooleanType may be undefined, or defined from star imports (F405)
  • 472: StructField, IntegerType may be undefined, or defined from star imports (F405)

pyspark_datasources/salesforce.py

  • 275: Line too long (101 > 100) (E501)
  • 303: Use dict instead of Dict for type annotation (UP006)
  • 327: Use list instead of List for type annotation (UP006)
  • 327: Use dict instead of Dict for type annotation (UP006)
  • 375: Line too long (146 > 100) (E501)
  • 396: Line too long (109 > 100) (E501)

tests/test_data_sources.py

  • 81: SalesforceDataSource may be undefined, or defined from star imports (F405)

examples/salesforce_sink_example.py

  • 352: f-string without any placeholders; remove the extraneous f prefix (F541)
  • 404: Local variable custom_data is assigned to but never used; remove the assignment (F841)

🔇 Additional comments (12)
pyproject.toml (1)

42-52: Ruff config looks minimal – verify that the chosen rule set is sufficient

You currently enable the default flake8 (E, F), isort (I) and pyupgrade (UP) rules only.
If you expect Ruff to mirror Black’s formatting (or additional static-analysis rules such as complexity or docstring checks), remember to extend select = [...] accordingly or document why they are intentionally left out.
No change required, but please double-check this matches the team’s lint policy.

pyspark_datasources/github.py (1)

67-71: Nice style clean-up

Keyword arguments are now formatted consistently; nothing else changed.

pyspark_datasources/simplejson.py (1)

80-80: Change is purely stylistic—no action required

Switching to double-quoted "utf-8" keeps the file in sync with the new Ruff quoting style. No functional impact.

pyspark_datasources/kaggle.py (1)

79-79: LGTM – style improvement only

Refactoring the ValueError onto one line matches the new formatter rules. No behaviour change.

tests/test_google_sheets.py (2)

25-25: Single-line chain is fine—mind long literals

The inline URL pushes the line well past typical 88/100-character limits. Ruff and Black exempt long string literals from reformatting, but an E501-style check would flag this line if enabled later.


48-48: Same note on line length

Applies to this long-URL call chain as well.

pyspark_datasources/stock.py (1)

78-86: Trailing comma complies with Ruff’s comma-dangle rule – good catch
Adding the trailing comma future-proofs the tuple literal and keeps diffs minimal when new fields are appended. Nothing else to address here.

pyspark_datasources/weather.py (1)

61-68: Schema block reformatted correctly
Wrapping the StructType fields in a list inside parentheses satisfies Ruff’s line-length and indentation rules while improving readability. Looks good.

pyspark_datasources/googlesheets.py (1)

32-33: Condensed URL-validation improves clarity
Collapsing the multi-line condition into a single readable line keeps the guard tight and passes Ruff’s complexity checks. Implementation is unchanged — LGTM.

pyspark_datasources/huggingface.py (2)

63-65: Readability tweak acknowledged—no action required

Adding a blank line after the local load_dataset_builder import complies with Ruff’s E402 guidance and improves visual separation between imports and business logic. No functional impact.


91-93: Consistent formatting looks good

Same comment as above—the blank line after the load_dataset import keeps the style consistent across methods.

pyspark_datasources/lance.py (1)

45-47: Docstring formatting improved

The inserted blank line makes the “Notes” section stand out and matches neighbouring datasource docs. No further changes needed.

Comment on lines 404 to +410

custom_data = streaming_df.select(
    col("value").cast("string").alias("Product_Code__c"),
    lit("Sample Product").alias("Name"),
    (col("value") * 29.99).cast("double").alias("Price__c"),
-   current_timestamp().alias("Created_Date__c")
+   current_timestamp().alias("Created_Date__c"),
)


⚠️ Potential issue

custom_data variable triggers Ruff F841 (assigned but never used)
The example intentionally leaves the write call commented out, so Ruff marks custom_data as unused. Either prefix with an underscore or suppress with # noqa: F841:

-        custom_data = streaming_df.select(
+        custom_data = streaming_df.select(  # noqa: F841

This preserves the illustrative variable while keeping the example Ruff-clean.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

custom_data = streaming_df.select(  # noqa: F841
    col("value").cast("string").alias("Product_Code__c"),
    lit("Sample Product").alias("Name"),
    (col("value") * 29.99).cast("double").alias("Price__c"),
    current_timestamp().alias("Created_Date__c"),
)
🧰 Tools
🪛 Ruff (0.12.2)

  • 404: Local variable custom_data is assigned to but never used; remove the assignment (F841)

🤖 Prompt for AI Agents
In examples/salesforce_sink_example.py around lines 404 to 410, the variable
custom_data is assigned but never used, triggering Ruff F841. To fix this,
either rename the variable to _custom_data or add a comment # noqa: F841 at the
end of the assignment line to suppress the warning while keeping the example
code intact.

Comment on lines 268 to 272

written = self._write_to_salesforce(sf, records_buffer, batch_id)
-logger.info(f"✅ Batch {batch_id}: Successfully wrote {written} records (buffer flush)")
+logger.info(
+    f"✅ Batch {batch_id}: Successfully wrote {written} records (buffer flush)"
+)
total_records_written += written

⚠️ Potential issue

Several log-message lines still violate the line-length = 100 Ruff policy
These >100-character f-strings will fail CI once Ruff is enforced. Split the message string and leverage logging’s built-in formatting to avoid eager f-string evaluation:

-                    logger.info(
-                        f"✅ Batch {batch_id}: Successfully wrote {written} records (buffer flush)"
-                    )
+                    logger.info(
+                        "✅ Batch %s: Successfully wrote %s records (buffer flush)",
+                        batch_id,
+                        written,
+                    )
@@
-            logger.info(
-                f"✅ Batch {batch_id}: Successfully wrote {total_records_written} records (total)"
-            )
+            logger.info(
+                "✅ Batch %s: Successfully wrote %s records (total)",
+                batch_id,
+                total_records_written,
+            )
@@
-                logger.error(
-                    f"Error in batch creation for batch {i//self.batch_size + 1}: {str(e)}"
-                )
+                logger.error(
+                    "Error in batch creation for batch %s: %s",
+                    (i // self.batch_size) + 1,
+                    str(e),
+                )

Applying the same pattern to the other long log lines keeps everything within the configured limit and removes unnecessary string interpolation overhead.

Also applies to: 298-300, 381-383
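
A small runnable illustration (standalone, not project code) of why the %-style arguments matter: when the logger's level filters a record out, no interpolation happens at all.

import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("salesforce_sink_example")

batch_id, written = 7, 1200

# f-string: the message is built eagerly, even though INFO is filtered out here.
logger.info(f"Batch {batch_id}: wrote {written} records")

# %-style arguments: interpolation is deferred to record emission, so a filtered
# record costs no string formatting at all.
logger.info("Batch %s: wrote %s records", batch_id, written)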

🤖 Prompt for AI Agents
In pyspark_datasources/salesforce.py around lines 268 to 272, the log message
uses an f-string exceeding the 100-character line length limit. To fix this,
split the log message into multiple parts or use logging's built-in formatting
by passing the message template and variables separately instead of an f-string.
Apply the same fix to the similar long log lines at 298-300 and 381-383 to
comply with the Ruff line-length policy and avoid eager string interpolation.

Comment on lines +79 to 83

# Test that the datasource is registered with correct name
assert SalesforceDataSource.name() == "salesforce"

# Test that the data source is streaming-only (no batch writer)

⚠️ Potential issue

Star-import trips Ruff’s F405 – replace with explicit imports.

from pyspark_datasources import * makes every symbol look undefined to Ruff; the very next assertions referencing SalesforceDataSource already trigger F405. This will fail the new Ruff gate on CI.

-from pyspark_datasources import *
+from pyspark_datasources import (
+    GithubDataSource,
+    FakeDataSource,
+    KaggleDataSource,
+    OpenSkyDataSource,
+    SalesforceDataSource,
+    ArrowDataSource,
+)

Besides satisfying Ruff, this improves readability and avoids accidental name clashes.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Test that the datasource is registered with correct name
assert SalesforceDataSource.name() == "salesforce"

# Test that the data source is streaming-only (no batch writer)

-from pyspark_datasources import *
+from pyspark_datasources import (
+    GithubDataSource,
+    FakeDataSource,
+    KaggleDataSource,
+    OpenSkyDataSource,
+    SalesforceDataSource,
+    ArrowDataSource,
+)
🧰 Tools
🪛 Ruff (0.12.2)

  • 81: SalesforceDataSource may be undefined, or defined from star imports (F405)

🤖 Prompt for AI Agents
In tests/test_data_sources.py around lines 79 to 83, replace the star-import
from pyspark_datasources with explicit imports of only the needed symbols such
as SalesforceDataSource. This will fix Ruff's F405 error by making all imported
names explicit, improving readability and preventing accidental name clashes.
Identify all used classes or functions from pyspark_datasources in this file and
import them explicitly instead of using a wildcard import.

allisonwang-db merged commit 7c4e881 into master Aug 4, 2025
5 checks passed