Use ruff linter #20
Conversation
Walkthrough

This change set introduces and documents the use of the Ruff tool for code formatting and linting across the project. Documentation files were updated to describe Ruff usage and add a badge. The Ruff dependency and configuration were added to the project configuration. The remainder of the changes are code formatting adjustments, such as string quoting, whitespace, and line breaks, applied consistently across code and test files, with no changes to logic or functionality.
Sequence Diagram(s)

Not applicable: all changes are related to documentation, configuration, and code formatting, with no new features or control flow modifications.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~4 minutes
Actionable comments posted: 3
🔭 Outside diff range comments (2)
pyspark_datasources/github.py (1)
55-63: Authentication header is built but never sent

`header` is prepared (and may include an OAuth token), yet the request is executed without it, causing rate-limit issues and failures on private repos.

```diff
-        response = requests.get(url)
+        response = requests.get(url, headers=header)
```

Please add a test for both public and token-authenticated calls.
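A minimal sketch of what those tests could look like is shown below; the `make_request` helper, the URL, and the `Bearer` header format are placeholders rather than github.py's actual code, so adapt the patch target and the call to the real request path.

```python
from unittest.mock import MagicMock, patch

import requests


def make_request(url, token=None):
    """Stand-in for the request code in github.py; the header format is an assumption."""
    header = {"Authorization": f"Bearer {token}"} if token else {}
    return requests.get(url, headers=header)


@patch("requests.get", return_value=MagicMock(status_code=200))
def test_token_is_forwarded(mock_get):
    make_request("https://api.github.com/repos/octocat/hello-world/pulls", token="dummy")
    _, kwargs = mock_get.call_args
    assert kwargs["headers"] == {"Authorization": "Bearer dummy"}


@patch("requests.get", return_value=MagicMock(status_code=200))
def test_public_call_without_token(mock_get):
    make_request("https://api.github.com/repos/octocat/hello-world/pulls")
    _, kwargs = mock_get.call_args
    assert kwargs["headers"] == {}
```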
pyspark_datasources/opensky.py (1)
76-82: Wildcard import violates Ruff (F405) and obscures dependencies.

`from pyspark.sql.types import *` masks the module's real dependencies and is explicitly banned by Ruff. Replace with the concrete symbols you use:

```diff
-from pyspark.sql.types import *
+from pyspark.sql.types import (
+    StructType,
+    StructField,
+    TimestampType,
+    StringType,
+    DoubleType,
+    BooleanType,
+    ArrayType,
+    IntegerType,
+)
```

This eliminates the F405 errors on every downstream type reference and clarifies the public surface.
🧹 Nitpick comments (8)
pyproject.toml (1)
40-40: Pin Ruff to an exact version to avoid CI breakages

`ruff` evolves rapidly and 0.x releases do not guarantee backwards compatibility. Relying on a caret range (`^0.6.0`) means the next minor bump (0.7) could suddenly fail the build or re-format the entire repo. Consider pinning to the exact version you have vetted (or at least the minor version) and upgrading deliberately.

```diff
-ruff = "^0.6.0"
+ruff = "==0.6.0"
```

pyspark_datasources/arrow.py (1)
154-162: Catch specific exceptions instead of the broad `Exception`

Catching every possible exception masks programming errors and makes debugging harder. In this context you mainly need to handle I/O and Arrow parsing problems.

```diff
-        except Exception as e:
-            raise RuntimeError(f"Failed to read Arrow file {file_path}: {str(e)}")
+        except (FileNotFoundError, pa.ArrowInvalid, OSError) as e:
+            raise RuntimeError(f"Failed to read Arrow file {file_path}: {e}") from e
```

This also preserves the original traceback via `from e`.
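As a standalone illustration of what the `from e` chaining buys (generic file-reading code, not the arrow.py implementation):

```python
def load_bytes(path: str) -> bytes:
    try:
        with open(path, "rb") as f:
            return f.read()
    except (FileNotFoundError, OSError) as e:
        # "from e" stores the original exception as __cause__, so the traceback
        # shows both the low-level failure and this higher-level context.
        raise RuntimeError(f"Failed to read Arrow file {path}: {e}") from e
```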
README.md (2)
4-4: Badge URL: confirm it stays evergreen

The badge links directly to the `main` branch of Ruff. If Ruff ever reorganises its assets the endpoint may break. Consider pinning to a specific tag or mirroring the JSON to avoid a future 404.
104-117: Great to document Ruff, but provide an automated gate

Docs tell contributors how to run Ruff manually, but nothing enforces it. Recommend adding a CI step (e.g. GitHub Actions) or a pre-commit hook so formatting and linting are automatically verified on every PR.

CLAUDE.md (1)
69-76: Mirror README content to avoid drift

The Ruff section duplicates the README. To prevent these docs diverging, consider extracting the commands into a single Markdown include or linking to the README section.
pyspark_datasources/opensky.py (2)
184-186: Use builtin generics to silence UP006 and modernise typing.

Ruff flags `Dict[str, int]` here (and similar `List`, `Tuple` elsewhere). Python 3.9+ allows the builtin `dict`, `list`, `tuple` with subscripts, so there is no need for the `typing` aliases.

```diff
-    def initialOffset(self) -> Dict[str, int]:
+    def initialOffset(self) -> dict[str, int]:
         return {"last_fetch": 0}
```

Apply the same replacement to `readBetweenOffsets`, `read`, and any other signatures for full compliance.
187-195: Minor docstring typo hampers grepability.

`"""Ensure e MIN_REQUEST_INTERVAL seconds between requests"""` contains a stray "e". It should read:

```diff
-        """Ensure e MIN_REQUEST_INTERVAL seconds between requests"""
+        """Ensure MIN_REQUEST_INTERVAL seconds between requests"""
```

Purely cosmetic but worth fixing while touching the file.
pyspark_datasources/lance.py (1)
86-88: Minor naming nitpick (optional)

`msg` inside the list-comprehension is perfectly valid, but renaming it to `commit_msg` (or similar) would be marginally clearer and avoid shadowing the more generic term "message" used elsewhere.

```diff
-        fragments = [fragment for msg in messages for fragment in msg.fragments]
+        fragments = [fragment for commit_msg in messages for fragment in commit_msg.fragments]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (18)
- CLAUDE.md (1 hunks)
- README.md (2 hunks)
- examples/salesforce_sink_example.py (2 hunks)
- pyproject.toml (1 hunks)
- pyspark_datasources/arrow.py (4 hunks)
- pyspark_datasources/fake.py (0 hunks)
- pyspark_datasources/github.py (1 hunks)
- pyspark_datasources/googlesheets.py (2 hunks)
- pyspark_datasources/huggingface.py (2 hunks)
- pyspark_datasources/kaggle.py (1 hunks)
- pyspark_datasources/lance.py (2 hunks)
- pyspark_datasources/opensky.py (10 hunks)
- pyspark_datasources/salesforce.py (6 hunks)
- pyspark_datasources/simplejson.py (2 hunks)
- pyspark_datasources/stock.py (3 hunks)
- pyspark_datasources/weather.py (4 hunks)
- tests/test_data_sources.py (5 hunks)
- tests/test_google_sheets.py (2 hunks)
💤 Files with no reviewable changes (1)
- pyspark_datasources/fake.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: in pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(t...
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.695Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
Applied to files:
pyspark_datasources/opensky.py
🪛 Ruff (0.12.2)

pyspark_datasources/opensky.py

- 184-184: Use `dict` instead of `Dict` for type annotation (UP006)
- 249-249: Use `list` instead of `List` for type annotation (UP006)
- 249-249: Use `tuple` instead of `Tuple` for type annotation (UP006)
- 436-436: Use `dict` instead of `Dict` for type annotation (UP006)
- 452-452, 453-453: `StructType` may be undefined, or defined from star imports (F405)
- 455-455: `StructField`, `TimestampType` may be undefined, or defined from star imports (F405)
- 456-456: `StructField`, `StringType` may be undefined, or defined from star imports (F405)
- 457-457: `StructField`, `StringType` may be undefined, or defined from star imports (F405)
- 458-458: `StructField`, `StringType` may be undefined, or defined from star imports (F405)
- 459-459: `StructField`, `TimestampType` may be undefined, or defined from star imports (F405)
- 460-460: `StructField`, `TimestampType` may be undefined, or defined from star imports (F405)
- 461-461: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 462-462: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 463-463: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 464-464: `StructField`, `BooleanType` may be undefined, or defined from star imports (F405)
- 465-465: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 466-466: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 467-467: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 468-468: `StructField`, `ArrayType`, `IntegerType` may be undefined, or defined from star imports (F405)
- 469-469: `StructField`, `DoubleType` may be undefined, or defined from star imports (F405)
- 470-470: `StructField`, `StringType` may be undefined, or defined from star imports (F405)
- 471-471: `StructField`, `BooleanType` may be undefined, or defined from star imports (F405)
- 472-472: `StructField`, `IntegerType` may be undefined, or defined from star imports (F405)

pyspark_datasources/salesforce.py

- 275-275: Line too long (101 > 100) (E501)
- 303-303: Use `dict` instead of `Dict` for type annotation (UP006)
- 327-327: Use `list` instead of `List` for type annotation (UP006)
- 327-327: Use `dict` instead of `Dict` for type annotation (UP006)
- 375-375: Line too long (146 > 100) (E501)
- 396-396: Line too long (109 > 100) (E501)

tests/test_data_sources.py

- 81-81: `SalesforceDataSource` may be undefined, or defined from star imports (F405)

examples/salesforce_sink_example.py

- 352-352: f-string without any placeholders; remove extraneous `f` prefix (F541)
- 404-404: Local variable `custom_data` is assigned to but never used; remove assignment to unused variable `custom_data` (F841)
🔇 Additional comments (12)
pyproject.toml (1)
42-52: Ruff config looks minimal – verify that the chosen rule set is sufficient

You currently enable the default flake8 (`E`, `F`), isort (`I`) and pyupgrade (`UP`) rules only. If you expect Ruff to mirror Black's formatting (or additional static-analysis rules such as complexity or docstring checks), remember to extend `select = [...]` accordingly or document why they are intentionally left out. No change required, but please double-check this matches the team's lint policy.
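For illustration, an extended selection might look like the following; the extra rule groups and settings shown here are common choices, not the project's current configuration:

```toml
[tool.ruff]
line-length = 100
target-version = "py39"

[tool.ruff.lint]
# E/F = pycodestyle/pyflakes, I = isort, UP = pyupgrade (the current selection),
# plus bugbear, comprehensions, and simplify as commonly added extras.
select = ["E", "F", "I", "UP", "B", "C4", "SIM"]
```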
pyspark_datasources/github.py (1)
67-71: Nice style clean-up

Keyword arguments are now formatted consistently; nothing else changed.
pyspark_datasources/simplejson.py (1)
80-80: Change is purely stylistic; no action required

Switching to double-quoted `"utf-8"` keeps the file in sync with the new Ruff quoting style. No functional impact.

pyspark_datasources/kaggle.py (1)
79-79: LGTM – style improvement only

Refactoring the `ValueError` onto one line matches the new formatter rules. No behaviour change.

tests/test_google_sheets.py (2)
25-25: Single-line chain is fine; mind long literals

The inline URL pushes the line well past typical 88/100-char limits. Ruff/Black exempts long string literals, but if you later enable E501-style checks it will flag this.

48-48: Same note on line length

Applies to this long-URL call chain as well.
pyspark_datasources/stock.py (1)
78-86: Trailing comma complies with Ruff's trailing-comma formatting – good catch

Adding the trailing comma future-proofs the tuple literal and keeps diffs minimal when new fields are appended. Nothing else to address here.

pyspark_datasources/weather.py (1)
61-68: Schema block reformatted correctly

Wrapping the `StructType` fields in a list inside parentheses satisfies Ruff's line-length and indentation rules while improving readability. Looks good.

pyspark_datasources/googlesheets.py (1)
32-33: Condensed URL-validation improves clarity

Collapsing the multi-line condition into a single readable line keeps the guard tight and passes Ruff's complexity checks. Implementation is unchanged. LGTM.

pyspark_datasources/huggingface.py (2)
63-65: Readability tweak acknowledged; no action required

Adding a blank line after the local `load_dataset_builder` import complies with Ruff's E402 guidance and improves visual separation between imports and business logic. No functional impact.

91-93: Consistent formatting looks good

Same comment as above: the blank line after the `load_dataset` import keeps the style consistent across methods.

pyspark_datasources/lance.py (1)
45-47: Docstring formatting improved

The inserted blank line makes the "Notes" section stand out and matches neighbouring datasource docs. No further changes needed.
examples/salesforce_sink_example.py

```diff
 custom_data = streaming_df.select(
     col("value").cast("string").alias("Product_Code__c"),
     lit("Sample Product").alias("Name"),
     (col("value") * 29.99).cast("double").alias("Price__c"),
-    current_timestamp().alias("Created_Date__c")
+    current_timestamp().alias("Created_Date__c"),
 )
```
`custom_data` variable triggers Ruff F841 (assigned but never used)

The example intentionally leaves the write call commented out, so Ruff marks `custom_data` as unused. Either prefix it with an underscore or suppress with `# noqa: F841`:

```diff
-custom_data = streaming_df.select(
+custom_data = streaming_df.select(  # noqa: F841
```

This preserves the illustrative variable while keeping the example Ruff-clean.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
custom_data = streaming_df.select(  # noqa: F841
    col("value").cast("string").alias("Product_Code__c"),
    lit("Sample Product").alias("Name"),
    (col("value") * 29.99).cast("double").alias("Price__c"),
    current_timestamp().alias("Created_Date__c"),
)
```
🧰 Tools
🪛 Ruff (0.12.2)
404-404: Local variable `custom_data` is assigned to but never used; remove assignment to unused variable `custom_data` (F841)
🤖 Prompt for AI Agents
In examples/salesforce_sink_example.py around lines 404 to 410, the variable
custom_data is assigned but never used, triggering Ruff F841. To fix this,
either rename the variable to _custom_data or add a comment # noqa: F841 at the
end of the assignment line to suppress the warning while keeping the example
code intact.
pyspark_datasources/salesforce.py

```diff
         written = self._write_to_salesforce(sf, records_buffer, batch_id)
-        logger.info(f"✅ Batch {batch_id}: Successfully wrote {written} records (buffer flush)")
+        logger.info(
+            f"✅ Batch {batch_id}: Successfully wrote {written} records (buffer flush)"
+        )
         total_records_written += written
```
Several log-message lines still violate the `line-length = 100` Ruff policy

These >100-character f-strings will fail CI once Ruff is enforced. Split the message string and leverage logging's built-in formatting to avoid eager f-string evaluation:

```diff
-            logger.info(
-                f"✅ Batch {batch_id}: Successfully wrote {written} records (buffer flush)"
-            )
+            logger.info(
+                "✅ Batch %s: Successfully wrote %s records (buffer flush)",
+                batch_id,
+                written,
+            )
@@
-            logger.info(
-                f"✅ Batch {batch_id}: Successfully wrote {total_records_written} records (total)"
-            )
+            logger.info(
+                "✅ Batch %s: Successfully wrote %s records (total)",
+                batch_id,
+                total_records_written,
+            )
@@
-            logger.error(
-                f"Error in batch creation for batch {i//self.batch_size + 1}: {str(e)}"
-            )
+            logger.error(
+                "Error in batch creation for batch %s: %s",
+                (i // self.batch_size) + 1,
+                str(e),
+            )
```

Applying the same pattern to the other long log lines keeps everything within the configured limit and removes unnecessary string interpolation overhead.
Also applies to: 298-300, 381-383
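For reference, the difference between eager and lazy formatting is easy to see in a standalone snippet (illustrative only, not project code):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

batch_id, written = 7, 42

# Eager: the f-string is always built, even though INFO is below the WARNING threshold.
logger.info(f"Batch {batch_id}: wrote {written} records")

# Lazy: arguments are only interpolated if the record is actually emitted.
logger.info("Batch %s: wrote %s records", batch_id, written)
```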
🤖 Prompt for AI Agents
In pyspark_datasources/salesforce.py around lines 268 to 272, the log message
uses an f-string exceeding the 100-character line length limit. To fix this,
split the log message into multiple parts or use logging's built-in formatting
by passing the message template and variables separately instead of an f-string.
Apply the same fix to the similar long log lines at 298-300 and 381-383 to
comply with the Ruff line-length policy and avoid eager string interpolation.
tests/test_data_sources.py

```python
    # Test that the datasource is registered with correct name
    assert SalesforceDataSource.name() == "salesforce"

    # Test that the data source is streaming-only (no batch writer)
```
Star-import trips Ruff's F405 – replace with explicit imports.

`from pyspark_datasources import *` makes every symbol look undefined to Ruff; the very next assertions referencing `SalesforceDataSource` already trigger F405. This will fail the new Ruff gate on CI.

```diff
-from pyspark_datasources import *
+from pyspark_datasources import (
+    GithubDataSource,
+    FakeDataSource,
+    KaggleDataSource,
+    OpenSkyDataSource,
+    SalesforceDataSource,
+    ArrowDataSource,
+)
```

Besides satisfying Ruff, this improves readability and avoids accidental name clashes.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-from pyspark_datasources import *
+from pyspark_datasources import (
+    GithubDataSource,
+    FakeDataSource,
+    KaggleDataSource,
+    OpenSkyDataSource,
+    SalesforceDataSource,
+    ArrowDataSource,
+)
```
🧰 Tools
🪛 Ruff (0.12.2)
81-81: `SalesforceDataSource` may be undefined, or defined from star imports (F405)
🤖 Prompt for AI Agents
In tests/test_data_sources.py around lines 79 to 83, replace the star-import
from pyspark_datasources with explicit imports of only the needed symbols such
as SalesforceDataSource. This will fix Ruff's F405 error by making all imported
names explicit, improving readability and preventing accidental name clashes.
Identify all used classes or functions from pyspark_datasources in this file and
import them explicitly instead of using a wildcard import.
Summary by CodeRabbit
New Features
Chores
Style
Documentation