
♻️ refactor scanner.py to reduce repetitive if statements. #228

Merged · 15 commits · Jul 6, 2023

Conversation

@sudiptob2 (Contributor) commented Jun 25, 2023

closes #227
related to #153

I've created this PR to improve the way we express the idea. Most of the crawlers have already been implemented, but we still have a few more to go, specifically #229, #230, and #231.

However, we won't be able to completely remove the old crawl.py just yet. We need to discuss the following points:

  • Where should we put the remaining pieces of code from crawl.py?
  • We couldn't include the storage_bucket and gke-related crawlers in the double-loop method. Any ideas on how to handle that?
  • The signature of the project_list crawler does not match the interface; how should we refactor this?

Also, let's brainstorm ways to polish and enhance this draft.

mentions: @mshudrak @0xDeva @peb-peb and others.


@under-hill (Collaborator)

What if we add scan_config.get(config_setting, {}) as a third parameter to ICrawler.crawl, allowing each crawler implementation to have access to resource-specific information from the scan_config? This should help with "storage_buckets" and others that may want more specific config settings in the future.

@under-hill (Collaborator)

For GKE, we can move the specific credential logic into a gke_client.py's get_service(), although this is playing a bit fast and loose with the IClient interface's stated contract for get_service.
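For illustration, a minimal sketch of that idea under stated assumptions — the file, class, and parameter names here are hypothetical, not the actual implementation:

```python
# Hypothetical gke_client.py: keep the GKE-specific credential logic inside
# get_service() so the scan loop can treat GKE like any other client.
# Note: this bends the IClient contract, which normally returns a
# googleapiclient discovery.Resource.
from google.cloud import container_v1


class GKEClient:

  def get_service(self, credentials) -> container_v1.ClusterManagerClient:
    # Build the GKE cluster-manager client directly from the credentials.
    return container_v1.ClusterManagerClient(credentials=credentials)
```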

@sudiptob2 (Contributor, Author)

> What if we add scan_config.get(config_setting, {}) as a third parameter to ICrawler.crawl, allowing each crawler implementation to have access to resource-specific information from the scan_config? This should help with "storage_buckets" and others that may want more specific config settings in the future.

Makes sense 👍. Should we make the third config parameter optional? There might be places that just want to call the crawl method.

@under-hill (Collaborator)

> Makes sense 👍. Should we make the third config parameter optional? There might be places that just want to call the crawl method.

Yep, optional makes sense.
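A minimal sketch of what the agreed-upon change could look like; the base crawl signature is the one quoted later in this thread, while the config parameter name and default are assumptions:

```python
# Sketch of ICrawler with an optional, crawler-specific config parameter.
# The real interface lives in interface_crawler.py and may differ in detail.
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

from googleapiclient import discovery


class ICrawler(ABC):

  @abstractmethod
  def crawl(
      self,
      service: discovery.Resource,
      config: Optional[Dict[str, Any]] = None,
  ) -> List[Dict[str, Any]]:
    """Crawls a resource; config carries resource-specific scan settings."""


# The scan loop would then hand each crawler its own slice of scan_config:
#   crawler.crawl(service, scan_config.get(config_setting, {}))
```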

@mshudrak (Collaborator)

Ok, let's discuss the questions one by one, starting with the last one.

> the signature of the project_list crawler does not match the interface, how to refactor this?

Could you please elaborate on that? Do you mean that def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]: here does not match the others, or something else?

@mshudrak (Collaborator)

I've fallen a bit behind on the refactoring; apologies if this was already discussed.

> We couldn't include the storage_bucket and gke-related crawlers in the double-loop method. Any ideas on how to handle that?

An optional third parameter seems to make sense here. As for GKE, yeah, one option is to move this logic into get_service() (obtaining gke_client: container_v1.services.cluster_manager.client.ClusterManagerClient for get_gke_clusters).

@mshudrak (Collaborator)

> Where should we put the remaining pieces of code from crawl.py?

So, if we address storage and GKE, we have the following functions left:

  • def infinite_defaultdict() — I'd just move that into scanner.py for now, or create another misc.py file and move models.py into misc.py too. There are some functions in scanner.py that can be added there too. (See the sketch after this list.)
  • def get_sas_for_impersonation( — is #231 addressing it?
  • def get_service_accounts — is #231 addressing it?
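For reference, infinite_defaultdict is the usual recursive-defaultdict helper; a minimal sketch of what would move into misc.py (or scanner.py), assuming the standard implementation:

```python
from collections import defaultdict


def infinite_defaultdict():
  """Returns a defaultdict whose missing keys yield nested defaultdicts."""
  return defaultdict(infinite_defaultdict)


# Arbitrarily deep assignment then works without intermediate KeyErrors:
res = infinite_defaultdict()
res['projects']['my-project']['buckets'] = []
```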

@sudiptob2 (Contributor, Author)

> the signature of the project_list crawler does not match the interface, how to refactor this?
>
> Could you please elaborate on that? Do you mean that def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]: here does not match the others, or something else?

Hi @mshudrak, thanks for your response 🙏 You might already be up to speed with the conversation, but I'll attempt to provide further clarification in this comment. Hopefully, it will be helpful 💪

The Crawler interface is defined in interface_crawler.py. Now, if we look at get_bucket_names, get_gke_clusters, get_gke_images, and get_sas_for_impersonation in crawl.py, these methods have slightly different signatures than the interface, so that's what we are discussing 🙂 (see the sketch below).

If we can address the above-mentioned functions, the only leftover will be def infinite_defaultdict(), and moving it into misc.py makes sense. But if we want to keep it quick for now, we can move it to scanner.py.

#231 addresses get_service_accounts. My plan is to discuss and refactor the ones with different signatures (including get_sas_for_impersonation) in this PR. We can create subtasks if required.
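To make the mismatch concrete, an illustrative side-by-side; the interface signature is the one quoted above, while the get_gke_clusters parameters are paraphrased from this thread and may not match the actual code exactly:

```python
from typing import Any, Dict, List

from google.cloud import container_v1
from googleapiclient import discovery


class ICrawler:
  # What the interface expects (signature quoted from the discussion above):
  def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]:
    ...


# What some crawl.py functions take instead, e.g. a GKE client rather than
# a discovery.Resource (paraphrased):
def get_gke_clusters(
    project_name: str,
    gke_client: container_v1.ClusterManagerClient,
) -> List[Dict[str, Any]]:
  ...
```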

@mshudrak (Collaborator) commented Jun 27, 2023

Thanks for the explanation. All right, I think I understood it correctly. It makes sense to have an optional flag for storage-related calls and a specific implementation in get_service for get_gke_clusters. As for get_gke_images, I think we can easily refactor it to receive credentials instead of access_token; the access token can easily be fetched from the credentials. We actually do it here:

project_result['gke_images'] = crawl.get_gke_images(project_id,

before passing it into this function.

get_sas_for_impersonation is a bit different. It might not actually be a good fit for the crawler interface, since it is just parsing iam_policy. The best place would be the future misc.py, but for now we can move it into scanner.py.
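A minimal sketch of that refactor — deriving the access token inside get_gke_images instead of passing it in. The refresh call is standard google-auth; the surrounding function body is assumed:

```python
from google.auth.transport.requests import Request


def get_gke_images(project_name, credentials):
  # Refresh if needed, then read the token off the credentials object
  # instead of receiving access_token as a separate argument.
  if not credentials.valid:
    credentials.refresh(Request())
  access_token = credentials.token
  # ... query the GKE image registry with access_token as before ...
```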

@sudiptob2 (Contributor, Author) commented Jun 29, 2023

@under-hill, as I was refactoring get_bucket_names(), I discovered that the return type of this method is different, causing a violation of the interface contract. I'm unsure whether to keep it within the double loop or relocate it to misc.py and call it outside of the double loop.

Edit: I have created PR #234 with a possible refactoring attempt. Please take a look @under-hill @0xDeva @mshudrak.

sudiptob2 mentioned this pull request on Jun 29, 2023.
@mshudrak (Collaborator) commented Jul 2, 2023

@sudiptob2 I'd relocate it to misc.py.

Edit: Ok, I see you created a separate PR with refactoring. No strong preference on that. Let's keep it as you implemented, then.

@sudiptob2 (Contributor, Author)
Hello everyone, this PR is getting quite large. As part of my work, I have moved the gke crawlers to misc_crawler.py and relocated other helper methods to scanner.py.

At this point, I believe it's appropriate to mark this PR as ready for review. I would love to hear your thoughts on this approach. If there are no bugs and it gets merged, we can address any refinement suggestions through separate tickets. What do you think?

mentions: @mshudrak @under-hill @0xDeva

sudiptob2 marked this pull request as ready for review on July 5, 2023, 04:52.
@mshudrak (Collaborator) commented Jul 5, 2023

Makes sense, I will review this one :)

mshudrak self-requested a review on July 5, 2023, 17:15.
@mshudrak (Collaborator) left a review

Overall looks good. Thanks, Sudipto. I left some minor comments. PTAL.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Licensed under the Apache License, Version 2.0 (the "License");
Collaborator: Do we really need this extra space here and below?

Contributor (Author): Updated.

# limitations under the License.


"""The module to query GCP resources via RestAPI.
Collaborator: Could you please add a new description for this file?

Contributor (Author): Description added.

import requests
from google.cloud import container_v1
Collaborator: Why do you want to change the order of these two?

Contributor (Author): Conventionally, imports are sorted in most-generic to least-generic order. Basically, I have used the following convention:

  • Standard library imports
  • Third-party imports
  • Application-specific imports

This is also described in Google's Python style guide here. [Refer to the example code block also.]
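For example, the grouping described above looks like this (the application-specific import path is illustrative, not the actual module layout):

```python
# Standard library imports
import collections
import logging

# Third-party imports
import requests
from google.cloud import container_v1

# Application-specific imports
from src.gcp_scanner import models  # illustrative module path
```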

previous_response=response
)
while request is not None:
response = request.execute()
Collaborator: I am curious why pylint did not catch it...

Contributor (Author): Hmm, interesting. The 4 spaces shouldn't be there in the first place.

crawler_config = scan_config.get(crawler_name)
# add gcs output path to the config.
# this path is used by the storage bucket crawler as of now.
crawler_config['gcs_output_path'] = gcs_output_path
Collaborator: I wonder if there is a better solution rather than sending it into every crawler. We can think about that in the future, but for now it is fine.

Contributor (Author): Right, this would be a good refinement ticket. I will create one once this PR is complete.

@@ -435,14 +257,14 @@ def crawl_loop(initial_sa_tuples: List[Tuple[str, Credentials, List[str]]],
if impers is not None and impers.get('impersonate', False) is True:
iam_client = iam_client_for_credentials(credentials)
if is_set(scan_config, 'iam_policy') is False:
- iam_policy = crawl.get_iam_policy(
+ iam_policy = CrawlerFactory.create_crawler('iam_policy').crawl(
Collaborator: We can add a short comment on why this crawler is used outside of the new loop.

@sudiptob2 (Contributor, Author) commented Jul 6, 2023: I moved the additional crawlers just after the new loop and added a section comment. Now I think it is more understandable why they are outside of the loop.

sudiptob2 requested a review from mshudrak on July 6, 2023, 07:02.
@mshudrak (Collaborator) left a review

LGTM, thanks.

mshudrak merged commit 27fd173 into google:main on Jul 6, 2023.
Successfully merging this pull request may close these issues:

  • refactor repetitive if statements in the scanner.py