
♻️ refactor scanner.py to reduce repetitive if statements. #228

Merged · 15 commits · Jul 6, 2023

Conversation

@sudiptob2 (Contributor) commented Jun 25, 2023

closes #227
related to #153

I've created this PR to improve the way we express the idea. Most of the crawlers have already been implemented, but we still have a few more to go, specifically #229, #230, and #231.

However, we won't be able to completely remove the old crawl.py just yet. We need to discuss the following points:

  • Where should we put the remaining pieces of code from crawl.py?
  • We couldn't include the storage_bucket and gke-related crawlers in the double-loop method. Any ideas on how to handle that?
  • The signature of the project_list crawler does not match the interface; how should we refactor this?

Also, let's brainstorm ways to polish and enhance this draft.

mentions: @mshudrak @0xDeva @peb-peb and others.


@under-hill (Collaborator)

What if we add scan_config.get(config_setting, {}) as a third parameter to ICrawler.crawl, allowing each crawler implementation to have access to resource-specific information from the scan_config? This should help with "storage_buckets" and others that may want more specific config settings in the future.

@under-hill (Collaborator)

For GKE, we can move the specific credential logic into a gke_client.py's get_service(), although this is playing a bit fast and loose with the IClient interface's stated contract for get_service.
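For illustration, a minimal sketch of that idea under stated assumptions — the file, class, and parameter names here are hypothetical, not the actual implementation:

```python
# Hypothetical gke_client.py: keep the GKE-specific credential logic inside
# get_service() so the scan loop can treat GKE like any other client.
# Note: this bends the IClient contract, which normally returns a
# googleapiclient discovery.Resource.
from google.cloud import container_v1


class GKEClient:

  def get_service(self, credentials) -> container_v1.ClusterManagerClient:
    # Build the GKE cluster-manager client directly from the credentials.
    return container_v1.ClusterManagerClient(credentials=credentials)
```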

@sudiptob2 (Contributor, Author)

> What if we add scan_config.get(config_setting, {}) as a third parameter to ICrawler.crawl, allowing each crawler implementation to have access to resource-specific information from the scan_config? This should help with "storage_buckets" and others that may want more specific config settings in the future.

Makes sense 👍. Should we make the third config parameter optional? There might be places that just want to call the crawl method.

@under-hill (Collaborator)

> Makes sense 👍. Should we make the third config parameter optional? There might be places that just want to call the crawl method.

Yep, optional makes sense.
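A minimal sketch of what the agreed-upon change could look like; the base crawl signature is the one quoted later in this thread, while the config parameter name and default are assumptions:

```python
# Sketch of ICrawler with an optional, crawler-specific config parameter.
# The real interface lives in interface_crawler.py and may differ in detail.
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

from googleapiclient import discovery


class ICrawler(ABC):

  @abstractmethod
  def crawl(
      self,
      service: discovery.Resource,
      config: Optional[Dict[str, Any]] = None,
  ) -> List[Dict[str, Any]]:
    """Crawls a resource; config carries resource-specific scan settings."""


# The scan loop would then hand each crawler its own slice of scan_config:
#   crawler.crawl(service, scan_config.get(config_setting, {}))
```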

@mshudrak (Collaborator)

Ok, let's discuss the questions one by one, starting with the last one.

> the signature of the project_list crawler does not match the interface, how to refactor this?

Could you please elaborate on that? Do you mean that def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]: here does not match the others, or something else?

@mshudrak (Collaborator)

I've fallen a bit behind on the refactoring; apologies if this was already discussed.

> We couldn't include the storage_bucket and gke-related crawlers in the double-loop method. Any ideas on how to handle that?

An optional third parameter seems to make sense here. As for GKE, yeah, one option is to move this logic into get_service() (obtaining gke_client: container_v1.services.cluster_manager.client.ClusterManagerClient for get_gke_clusters).

@mshudrak (Collaborator)

> Where should we put the remaining pieces of code from crawl.py?

So, if we address storage and GKE, we have the following functions left:

  • def infinite_defaultdict() — I'd just move that into scanner.py for now, or create another misc.py file and move models.py into misc.py too. There are some functions in scanner.py that can be added there too. (See the sketch after this list.)
  • def get_sas_for_impersonation( — is #231 addressing it?
  • def get_service_accounts — is #231 addressing it?
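For reference, infinite_defaultdict is the usual recursive-defaultdict helper; a minimal sketch of what would move into misc.py (or scanner.py), assuming the standard implementation:

```python
from collections import defaultdict


def infinite_defaultdict():
  """Returns a defaultdict whose missing keys yield nested defaultdicts."""
  return defaultdict(infinite_defaultdict)


# Arbitrarily deep assignment then works without intermediate KeyErrors:
res = infinite_defaultdict()
res['projects']['my-project']['buckets'] = []
```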

@sudiptob2 (Contributor, Author)

> the signature of the project_list crawler does not match the interface, how to refactor this?
>
> Could you please elaborate on that? Do you mean that def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]: here does not match the others, or something else?

Hi @mshudrak, thanks for your response 🙏 You might already be up to speed with the conversation, but I'll attempt to provide further clarification in this comment. Hopefully, it will be helpful 💪

The Crawler interface is defined in interface_crawler.py. Now, if we look at get_bucket_names, get_gke_clusters, get_gke_images, and get_sas_for_impersonation in crawl.py, these methods have slightly different signatures than the interface, so that's what we are discussing 🙂 (see the sketch below).

If we can address the above-mentioned functions, the only leftover will be def infinite_defaultdict(), and moving it into misc.py makes sense. But if we want to keep it quick for now, we can move it to scanner.py.

#231 addresses get_service_accounts. My plan is to discuss and refactor the ones with different signatures (including get_sas_for_impersonation) in this PR. We can create subtasks if required.
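To make the mismatch concrete, an illustrative side-by-side; the interface signature is the one quoted above, while the get_gke_clusters parameters are paraphrased from this thread and may not match the actual code exactly:

```python
from typing import Any, Dict, List

from google.cloud import container_v1
from googleapiclient import discovery


class ICrawler:
  # What the interface expects (signature quoted from the discussion above):
  def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]:
    ...


# What some crawl.py functions take instead, e.g. a GKE client rather than
# a discovery.Resource (paraphrased):
def get_gke_clusters(
    project_name: str,
    gke_client: container_v1.ClusterManagerClient,
) -> List[Dict[str, Any]]:
  ...
```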

@mshudrak (Collaborator) commented Jun 27, 2023

Thanks for the explanation. All right, I think I understood it correctly. It makes sense to have an optional flag for storage-related calls and a specific implementation in get_service for get_gke_clusters. As for get_gke_images, I think we can easily refactor it to receive credentials instead of access_token; the access token can easily be fetched from the credentials. We actually do it here:

project_result['gke_images'] = crawl.get_gke_images(project_id,

before passing it into this function.

get_sas_for_impersonation is a bit different. It might not actually be a good fit for the crawler interface, since it is just parsing iam_policy. The best place would be the future misc.py, but for now we can move it into scanner.py.
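A minimal sketch of that refactor — deriving the access token inside get_gke_images instead of passing it in. The refresh call is standard google-auth; the surrounding function body is assumed:

```python
from google.auth.transport.requests import Request


def get_gke_images(project_name, credentials):
  # Refresh if needed, then read the token off the credentials object
  # instead of receiving access_token as a separate argument.
  if not credentials.valid:
    credentials.refresh(Request())
  access_token = credentials.token
  # ... query the GKE image registry with access_token as before ...
```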

@sudiptob2 (Contributor, Author) commented Jun 29, 2023

@under-hill, as I was refactoring get_bucket_names(), I discovered that the return type of this method is different, causing a violation of the interface contract. I'm unsure whether to keep it within the double loop or relocate it to misc.py and call it outside of the double loop.

Edit: I have created PR #234 with a possible refactoring attempt. Please take a look @under-hill @0xDeva @mshudrak.

sudiptob2 mentioned this pull request on Jun 29, 2023.
@mshudrak (Collaborator) commented Jul 2, 2023

@sudiptob2 I'd relocate it to misc.py.

Edit: Ok, I see you created a separate PR with refactoring. No strong preference on that. Let's keep it as you implemented, then.

@sudiptob2 (Contributor, Author)
Hello everyone, this PR is getting quite large. As part of my work, I have moved the gke crawlers to misc_crawler.py and relocated other helper methods to scanner.py.

At this point, I believe it's appropriate to mark this PR as ready for review. I would love to hear your thoughts on this approach. If there are no bugs and it gets merged, we can address any refinement suggestions through separate tickets. What do you think?

mentions: @mshudrak @under-hill @0xDeva

sudiptob2 marked this pull request as ready for review on July 5, 2023, 04:52.
@mshudrak (Collaborator) commented Jul 5, 2023

Makes sense, I will review this one :)

mshudrak self-requested a review on July 5, 2023, 17:15.
@mshudrak (Collaborator) left a review

Overall looks good. Thanks, Sudipto. I left some minor comments. PTAL.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Licensed under the Apache License, Version 2.0 (the "License");
Collaborator: Do we really need this extra space here and below?

Contributor (Author): Updated.

# limitations under the License.


"""The module to query GCP resources via RestAPI.
Collaborator: Could you please add a new description for this file?

Contributor (Author): Description added.

import requests
from google.cloud import container_v1
Collaborator: Why do you want to change the order of these two?

Contributor (Author): Conventionally, imports are sorted in most-generic to least-generic order. Basically, I have used the following convention:

  • Standard library imports
  • Third-party imports
  • Application-specific imports

This is also described in Google's Python style guide here. [Refer to the example code block also.]
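For example, the grouping described above looks like this (the application-specific import path is illustrative, not the actual module layout):

```python
# Standard library imports
import collections
import logging

# Third-party imports
import requests
from google.cloud import container_v1

# Application-specific imports
from src.gcp_scanner import models  # illustrative module path
```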

previous_response=response
)
while request is not None:
response = request.execute()
Collaborator: I am curious why pylint did not catch it...

Contributor (Author): Hmm, interesting. The 4 spaces shouldn't be there in the first place.

crawler_config = scan_config.get(crawler_name)
# add gcs output path to the config.
# this path is used by the storage bucket crawler as of now.
crawler_config['gcs_output_path'] = gcs_output_path
Collaborator: I wonder if there is a better solution rather than sending it into every crawler. We can think about that in the future, but for now it is fine.

Contributor (Author): Right, this would be a good refinement ticket. I will create one once this PR is complete.

@@ -435,14 +257,14 @@ def crawl_loop(initial_sa_tuples: List[Tuple[str, Credentials, List[str]]],
if impers is not None and impers.get('impersonate', False) is True:
iam_client = iam_client_for_credentials(credentials)
if is_set(scan_config, 'iam_policy') is False:
- iam_policy = crawl.get_iam_policy(
+ iam_policy = CrawlerFactory.create_crawler('iam_policy').crawl(
Collaborator: We can add a short comment on why this crawler is used outside of the new loop.

@sudiptob2 (Contributor, Author) commented Jul 6, 2023: I moved the additional crawlers just after the new loop and added a section comment. Now I think it is more understandable why they are outside of the loop.

sudiptob2 requested a review from mshudrak on July 6, 2023, 07:02.
@mshudrak (Collaborator) left a review

LGTM, thanks.

mshudrak merged commit 27fd173 into google:main on Jul 6, 2023.
Successfully merging this pull request may close these issues:

  • refactor repetitive if statements in the scanner.py