Perceval git backend error "item['data']['Author']" #841

Closed
lukaszgryglicki opened this issue Apr 10, 2020 · 11 comments

Comments

@lukaszgryglicki
Contributor

2020-04-10 19:11:27,931 Error enriching raw from git (https://github.com/cloudfoundry/bosh-aws-cpi-release): 'Author'
Traceback (most recent call last):
  File "/repos/grimoirelab-elk/grimoire_elk/elk.py", line 525, in enrich_backend
    total_ids = load_identities(ocean_backend, enrich_backend)
  File "/repos/grimoirelab-elk/grimoire_elk/elk.py", line 285, in load_identities
    for identity in identities:
  File "/repos/grimoirelab-elk/grimoire_elk/enriched/git.py", line 139, in get_identities
    if item['data']['Author']:
KeyError: 'Author'
@valeriocos
Member

Hi @lukaszgryglicki, I'm transferring this issue to ELK.

@valeriocos valeriocos transferred this issue from chaoss/grimoirelab-perceval Apr 11, 2020
@valeriocos
Member

I'm not able to replicate the issue (I'm using the latest versions of ELK and Perceval) on the repo https://github.com/cloudfoundry/bosh-aws-cpi-release.

Collection for git: starting...
  2020-04-11 10:10:05,062 Reading projects data from  ./projects.json 
  2020-04-11 10:10:05,063 [git] collection phase starts
  2020-04-11 10:10:05,063 [git] collection starts for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:05,254 Created index http://localhost:9200/git_cloudfoundry_x
  2020-04-11 10:10:05,318 Alias {'alias': 'git-raw', 'index': 'git_cloudfoundry_x'} created on http://localhost:9200/git_cloudfoundry_x.
  2020-04-11 10:10:05,325 [git] Incremental from: None for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:22,452 Fetching commits: 'https://github.com/cloudfoundry/bosh-aws-cpi-release' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
  2020-04-11 10:10:27,126 Fetch process completed: 1521 commits fetched
  2020-04-11 10:10:27,209 [git] Done collection for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:27,210 [git] collection finished for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:27,210 [git] collection phase finished in 00:00:22
Collection for git: finished after 00:00:22 hours
  2020-04-11 10:10:27,235 Loading raw data finished!
  2020-04-11 10:10:27,236 Reading projects data from  ./projects.json 
  2020-04-11 10:10:37,315 [git] enrichment phase starts
  2020-04-11 10:10:37,521 Created index http://localhost:9200/git_cloudfoundryenriched_x
  2020-04-11 10:10:37,533 [git] enrichment starts for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:37,637 Alias {'alias': 'git', 'index': 'git_cloudfoundryenriched_x'} created on http://localhost:9200/git_cloudfoundryenriched_x.
  2020-04-11 10:10:37,651 Alias {'alias': 'git_author', 'index': 'git_cloudfoundryenriched_x'} created on http://localhost:9200/git_cloudfoundryenriched_x.
  2020-04-11 10:10:37,665 Alias {'alias': 'git_enrich', 'index': 'git_cloudfoundryenriched_x'} created on http://localhost:9200/git_cloudfoundryenriched_x.
  2020-04-11 10:10:37,679 Alias {'alias': 'affiliations', 'index': 'git_cloudfoundryenriched_x'} created on http://localhost:9200/git_cloudfoundryenriched_x.
  2020-04-11 10:10:37,694 Alias {'alias': 'all_enriched', 'index': 'git_cloudfoundryenriched_x'} created on http://localhost:9200/git_cloudfoundryenriched_x.
  2020-04-11 10:10:52,449 [git] Done enrichment for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:52,450 [git] enrichment finished for https://github.com/cloudfoundry/bosh-aws-cpi-release
  2020-04-11 10:10:52,450 [git] enrichment phase finished in 0:00:15
  2020-04-11 10:10:52,451 [git] data retention start
  2020-04-11 10:10:52,472 [git] data retention end
  2020-04-11 10:10:52,472 [git] identities retention end
  2020-04-11 10:10:52,472 [git] autorefresh start
  2020-04-11 10:10:52,502 [git] Refreshing identities
  2020-04-11 10:11:05,570 [git] autorefresh end
  2020-04-11 10:11:05,571 [git] no studies phase
  2020-04-11 10:11:05,571 [git] autorefresh for studies start
  2020-04-11 10:11:05,571 [git] autorefresh for studies end
  2020-04-11 10:11:05,571 Loading enriched data finished!

Process finished with exit code 0

@lukaszgryglicki
Contributor Author

I'm using all the GrimoireLab tools from master (image built about 3 days ago). The exact command line:

p2o.py --enrich --index xyz-raw --index-enrich xyz -e [redacted] -g --bulk-size 500 --scroll-size 500 --db-host [redacted] --db-[redacted] [redacted] --db-user [redacted] --db-password [redacted] git https://github.com/cloudfoundry/auction

Hmm, and I see that error for many other cloudfoundry repos, strange...

@valeriocos
Member

I don't see that error with the master branch. I have also checked the logs of the cloudfoundry instance Bitergia maintains, and the error doesn't appear there.

  2020-04-11 11:33:53,712 Reading projects data from  ./projects.json 
  2020-04-11 11:33:53,712 [git] collection phase starts
  2020-04-11 11:33:53,713 [git] collection starts for https://github.com/cloudfoundry/auction
Collection for git: starting...
  2020-04-11 11:33:53,756 [git] Incremental from: None for https://github.com/cloudfoundry/auction
  2020-04-11 11:33:55,259 Fetching commits: 'https://github.com/cloudfoundry/auction' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
  2020-04-11 11:33:56,217 Fetch process completed: 295 commits fetched
Collection for git: finished after 00:00:02 hours
  2020-04-11 11:33:56,349 [git] Done collection for https://github.com/cloudfoundry/auction
  2020-04-11 11:33:56,350 [git] collection finished for https://github.com/cloudfoundry/auction
  2020-04-11 11:33:56,350 [git] collection phase finished in 00:00:02
  2020-04-11 11:33:56,381 Loading raw data finished!
  2020-04-11 11:33:56,381 Reading projects data from  ./projects.json 
  2020-04-11 11:34:06,405 [git] enrichment phase starts
  2020-04-11 11:34:06,489 [git] enrichment starts for https://github.com/cloudfoundry/auction
  2020-04-11 11:34:10,401 [git] Done enrichment for https://github.com/cloudfoundry/auction
  2020-04-11 11:34:10,402 [git] enrichment finished for https://github.com/cloudfoundry/auction
  2020-04-11 11:34:10,402 [git] enrichment phase finished in 0:00:03
  2020-04-11 11:34:10,402 [git] data retention start
  2020-04-11 11:34:10,422 [git] data retention end
  2020-04-11 11:34:10,422 [git] identities retention end
  2020-04-11 11:34:10,422 [git] autorefresh start
  2020-04-11 11:34:10,451 [git] Refreshing identities
  2020-04-11 11:34:33,569 [git] autorefresh end
  2020-04-11 11:34:33,569 [git] no studies phase
  2020-04-11 11:34:33,569 [git] autorefresh for studies start
  2020-04-11 11:34:33,569 [git] autorefresh for studies end
  2020-04-11 11:34:33,569 Loading enriched data finished!

Process finished with exit code 0

The problem you see could be related to a recent change to anonymize Git data (see pointers below), which hasn't been applied to p2o. However, the anonymize param is set to False by default, so it shouldn't affect how p2o works.

Can you try to add the anonymize param here https://github.com/chaoss/grimoirelab-elk/blob/master/utils/p2o.py#L58 and set the default value to False? If it is convenient for you, you could improve the parser by adding the anonymize param (ref: https://github.com/chaoss/grimoirelab-elk/blob/master/grimoire_elk/utils.py#L321).

Thanks (and sorry for the inconvenience)

@lukaszgryglicki
Contributor Author

Will definitely do after the holidays, thanks!

@vchrombie
Member

Hi
Recently, I saw a similar issue, #803 (comment).

I'm not sure if it is relevant, but I thought I'd reference it here.

@valeriocos
Member

Hi @vchrombie, thank you for the pointer! I'm not sure the errors are related, since this one seems to be linked to a new feature introduced in #824.

@lukaszgryglicki
Contributor Author

So adding the anonymize param doesn't change anything. The error is exactly here: https://github.com/chaoss/grimoirelab-elk/blob/master/grimoire_elk/enriched/git.py#L139

The code assumes that item['data']['Author'] is always present, which is not always the case.

I've set a breakpoint:

import pdb

if 'Author' not in item['data']:
    pdb.set_trace()

And here you go:

(Pdb) item
{'search_fields': {'item_id': '560485695', 'owner': 'cloudfoundry', 'repo': 'auction'}, 'tag': 'https://github.com/cloudfoundry/auction', 'timestamp': 1584425106.573957, 'origin': 'https://github.com/cloudfoundry/auction', 'category': 'issue', 'uuid': '194628cca35bb6fb73bd1130004a20b8eae6305c', 'metadata__timestamp': '2020-03-17T06:05:06.573957+00:00', 'backend_version': '0.25.1', 'updated_on': 1582223197.0, 'metadata__updated_on': '2020-02-20T18:26:37+00:00', 'backend_name': 'GitHub', 'perceval_version': '0.12.27', 'data': {'assignees_data': [], 'assignee_data': {}, 'comments': 1, 'assignees': [], 'reactions': {'-1': 0, 'hooray': 0, 'url': 'https://api.github.com/repos/cloudfoundry/auction/issues/8/reactions', 'confused': 0, 'heart': 0, 'total_count': 0, '+1': 0, 'eyes': 0, 'rocket': 0, 'laugh': 0}, 'assignee': None, 'updated_at': '2020-02-20T18:26:37Z', 'user_data': {'received_events_url': 'https://api.github.com/users/pommi/received_events', 'followers': 26, 'following_url': 'https://api.github.com/users/pommi/following{/other_user}', 'public_repos': 42, 'repos_url': 'https://api.github.com/users/pommi/repos', 'updated_at': '2020-02-29T12:42:56Z', 'site_admin': False, 'blog': 'http://pommi.nethuis.nl/', 'starred_url': 'https://api.github.com/users/pommi/starred{/owner}{/repo}', 'html_url': 'https://github.com/pommi', 'avatar_url': 'https://avatars0.githubusercontent.com/u/548668?v=4', 'email': None, 'company': '@mendix ', 'subscriptions_url': 'https://api.github.com/users/pommi/subscriptions', 'events_url': 'https://api.github.com/users/pommi/events{/privacy}', 'login': 'pommi', 'following': 2, 'name': 'Pim van den Berg', 'organizations_url': 'https://api.github.com/users/pommi/orgs', 'node_id': 'MDQ6VXNlcjU0ODY2OA==', 'gists_url': 'https://api.github.com/users/pommi/gists{/gist_id}', 'organizations': [{'members_url': 'https://api.github.com/orgs/mendix/members{/member}', 'login': 'mendix', 'avatar_url': 'https://avatars2.githubusercontent.com/u/133443?v=4', 
'repos_url': 'https://api.github.com/orgs/mendix/repos', 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjEzMzQ0Mw==', 'description': 'Mendix is the fastest & easiest low-code platform used by businesses to develop mobile & web apps at scale', 'public_members_url': 'https://api.github.com/orgs/mendix/public_members{/member}', 'url': 'https://api.github.com/orgs/mendix', 'hooks_url': 'https://api.github.com/orgs/mendix/hooks', 'issues_url': 'https://api.github.com/orgs/mendix/issues', 'id': 133443, 'events_url': 'https://api.github.com/orgs/mendix/events'}], 'followers_url': 'https://api.github.com/users/pommi/followers', 'bio': None, 'url': 'https://api.github.com/users/pommi', 'hireable': None, 'location': 'Netherlands', 'gravatar_id': '', 'created_at': '2011-01-05T13:50:08Z', 'public_gists': 16, 'type': 'User', 'id': 548668}, 'labels_url': 'https://api.github.com/repos/cloudfoundry/auction/issues/8/labels{/name}', 'repository_url': 'https://api.github.com/repos/cloudfoundry/auction', 'labels': [{'url': 'https://api.github.com/repos/cloudfoundry/auction/labels/scheduled', 'default': False, 'color': 'BFD4F2', 'node_id': 'MDU6TGFiZWwzNzA0MTU4Mjg=', 'name': 'scheduled', 'description': None, 'id': 370415828}], 'pull_request': {'diff_url': 'https://github.com/cloudfoundry/auction/pull/8.diff', 'url': 'https://api.github.com/repos/cloudfoundry/auction/pulls/8', 'patch_url': 'https://github.com/cloudfoundry/auction/pull/8.patch', 'html_url': 'https://github.com/cloudfoundry/auction/pull/8'}, 'html_url': 'https://github.com/cloudfoundry/auction/pull/8', 'comments_url': 'https://api.github.com/repos/cloudfoundry/auction/issues/8/comments', 'body': 'This PR is part of multiple PRs across [rep](https://github.com/cloudfoundry/rep), [auctioneer](https://github.com/cloudfoundry/auctioneer) and [auction](https://github.com/cloudfoundry/auction) to add an optional weighted bin pack first fit component to the scheduling algorithm of Cloud Foundry Diego for scheduling LRPs.\r\n\r\nPRs/issues 
involved:\r\n* [rep#30](https://github.com/cloudfoundry/rep/pull/30)\r\n* [auctioneer#8](https://github.com/cloudfoundry/auctioneer/pull/8)\r\n* [auction#8](https://github.com/cloudfoundry/auction/pull/8)\r\n* [diego-release#448](https://github.com/cloudfoundry/diego-release/pull/448)\r\n\r\n### What is this change about?\r\n\r\nThese PRs combined introduce a new setting for auctioneer:\r\n```\r\ndiego.auctioneer.bin_pack_first_fit_weight\r\n  description: "Factor to bias against BOSH instance index number of a cell. Instead of spreading containers equally across all cell\r\ns, cells with a lower index number will be deployed to first when this setting is > 0. (0.0 - 1.0)"\r\n  default: 0.0\r\n```\r\n\r\nWhen `bin_pack_first_fit_weight` is set to a value > 0, it will make diego-cells with a lower BOSH instance index number more attractive to deploy LRPs to by adding "weight x diego-cell index" to the score of a diego-cell. Diego-cells will be filled up more instead spreading the LRPs across all diego-cells equally.\r\n\r\nSetting `bin_pack_first_fit_weight` to 0.0 (the default) will effectively disable the optional weighted bin pack first fit component. Everything still keeps working as it was previously.\r\n\r\n### What problem it is trying to solve?\r\n\r\nWith the current deployment algorithm in diego it spreads the LRPs across all diego-cell instances equally. At Mendix we have 64GB memory diego-cells. We need to have the possibility for our customers to deploy 16G containers at all times. This means that we need to have at least 1 diego-cell with 16G memory available at all times. When the LRPs are spread equally you end up with a situation where on average 25% (16/64) of the memory on all our diego-cells is not used.\r\n\r\n### What is the impact if the change is not made?\r\n\r\nIn our case 25% of our diego-cell resources are wasted. 
We have been running a diego-release with these changes on top for the past months in our production Cloud Foundry foundations. We are saving ten-thousands of $$ monthly on AWS EC2 costs.\r\n\r\n### How should this change be described in diego-release release notes?\r\n\r\nauctioneer\r\n* Add an optional weighted bin pack first fit component to the scheduling algorithm of Cloud Foundry Diego for scheduling LRPs\r\n\r\n### Please provide any contextual information.\r\n\r\nBlog post with more information: https://cloud-infra.engineer/saving-costs-with-a-new-scheduler-in-cloud-foundry-diego/\r\n\r\n### Tag your pair, your PM, and/or team!\r\n\r\nI\'m not sure who to tag.', 'reactions_data': [], 'user': {'received_events_url': 'https://api.github.com/users/pommi/received_events', 'following_url': 'https://api.github.com/users/pommi/following{/other_user}', 'login': 'pommi', 'followers_url': 'https://api.github.com/users/pommi/followers', 'organizations_url': 'https://api.github.com/users/pommi/orgs', 'node_id': 'MDQ6VXNlcjU0ODY2OA==', 'gists_url': 'https://api.github.com/users/pommi/gists{/gist_id}', 'site_admin': False, 'repos_url': 'https://api.github.com/users/pommi/repos', 'starred_url': 'https://api.github.com/users/pommi/starred{/owner}{/repo}', 'url': 'https://api.github.com/users/pommi', 'type': 'User', 'html_url': 'https://github.com/pommi', 'avatar_url': 'https://avatars0.githubusercontent.com/u/548668?v=4', 'gravatar_id': '', 'id': 548668, 'subscriptions_url': 'https://api.github.com/users/pommi/subscriptions', 'events_url': 'https://api.github.com/users/pommi/events{/privacy}'}, 'locked': False, 'node_id': 'MDExOlB1bGxSZXF1ZXN0MzcxNDY3NTc5', 'closed_at': None, 'comments_data': [{'reactions_data': [], 'issue_url': 'https://api.github.com/repos/cloudfoundry/auction/issues/8', 'user': {'received_events_url': 'https://api.github.com/users/cf-gitbot/received_events', 'following_url': 'https://api.github.com/users/cf-gitbot/following{/other_user}', 'login': 
'cf-gitbot', 'followers_url': 'https://api.github.com/users/cf-gitbot/followers', 'organizations_url': 'https://api.github.com/users/cf-gitbot/orgs', 'node_id': 'MDQ6VXNlcjU1ODkzNjg=', 'gists_url': 'https://api.github.com/users/cf-gitbot/gists{/gist_id}', 'site_admin': False, 'repos_url': 'https://api.github.com/users/cf-gitbot/repos', 'starred_url': 'https://api.github.com/users/cf-gitbot/starred{/owner}{/repo}', 'url': 'https://api.github.com/users/cf-gitbot', 'type': 'User', 'html_url': 'https://github.com/cf-gitbot', 'avatar_url': 'https://avatars3.githubusercontent.com/u/5589368?v=4', 'gravatar_id': '', 'id': 5589368, 'subscriptions_url': 'https://api.github.com/users/cf-gitbot/subscriptions', 'events_url': 'https://api.github.com/users/cf-gitbot/events{/privacy}'}, 'node_id': 'MDEyOklzc3VlQ29tbWVudDU4MjQ5MTY1MA==', 'updated_at': '2020-02-05T16:28:38Z', 'user_data': {'received_events_url': 'https://api.github.com/users/cf-gitbot/received_events', 'followers': 4, 'following_url': 'https://api.github.com/users/cf-gitbot/following{/other_user}', 'public_repos': 1, 'repos_url': 'https://api.github.com/users/cf-gitbot/repos', 'updated_at': '2020-02-03T19:47:54Z', 'site_admin': False, 'blog': '', 'starred_url': 'https://api.github.com/users/cf-gitbot/starred{/owner}{/repo}', 'html_url': 'https://github.com/cf-gitbot', 'avatar_url': 'https://avatars3.githubusercontent.com/u/5589368?v=4', 'email': None, 'company': None, 'subscriptions_url': 'https://api.github.com/users/cf-gitbot/subscriptions', 'events_url': 'https://api.github.com/users/cf-gitbot/events{/privacy}', 'login': 'cf-gitbot', 'following': 0, 'name': None, 'organizations_url': 'https://api.github.com/users/cf-gitbot/orgs', 'node_id': 'MDQ6VXNlcjU1ODkzNjg=', 'gists_url': 'https://api.github.com/users/cf-gitbot/gists{/gist_id}', 'organizations': [{'members_url': 'https://api.github.com/orgs/pivotal-cf/members{/member}', 'login': 'pivotal-cf', 'avatar_url': 
'https://avatars0.githubusercontent.com/u/5497370?v=4', 'repos_url': 'https://api.github.com/orgs/pivotal-cf/repos', 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjU0OTczNzA=', 'description': '', 'public_members_url': 'https://api.github.com/orgs/pivotal-cf/public_members{/member}', 'url': 'https://api.github.com/orgs/pivotal-cf', 'hooks_url': 'https://api.github.com/orgs/pivotal-cf/hooks', 'issues_url': 'https://api.github.com/orgs/pivotal-cf/issues', 'id': 5497370, 'events_url': 'https://api.github.com/orgs/pivotal-cf/events'}], 'followers_url': 'https://api.github.com/users/cf-gitbot/followers', 'bio': None, 'url': 'https://api.github.com/users/cf-gitbot', 'hireable': None, 'location': None, 'gravatar_id': '', 'created_at': '2013-10-01T21:01:48Z', 'public_gists': 0, 'type': 'User', 'id': 5589368}, 'id': 582491650, 'url': 'https://api.github.com/repos/cloudfoundry/auction/issues/comments/582491650', 'reactions': {'-1': 0, 'hooray': 0, 'url': 'https://api.github.com/repos/cloudfoundry/auction/issues/comments/582491650/reactions', 'confused': 0, 'heart': 0, 'total_count': 0, '+1': 0, 'eyes': 0, 'rocket': 0, 'laugh': 0}, 'html_url': 'https://github.com/cloudfoundry/auction/pull/8#issuecomment-582491650', 'created_at': '2020-02-05T16:28:38Z', 'body': 'We have created an issue in Pivotal Tracker to manage this: \n\nhttps://www.pivotaltracker.com/story/show/171108075 \n\nThe labels on this github issue will be updated when the story is started.', 'author_association': 'COLLABORATOR'}], 'events_url': 'https://api.github.com/repos/cloudfoundry/auction/issues/8/events', 'author_association': 'NONE', 'url': 'https://api.github.com/repos/cloudfoundry/auction/issues/8', 'title': 'Bin Pack First Fit', 'state': 'open', 'number': 8, 'created_at': '2020-02-05T16:28:34Z', 'milestone': None, 'id': 560485695}, 'classified_fields_filtered': None}
(Pdb) item['data']['Author']
*** KeyError: 'Author'
(Pdb) item['data']['Commit']
*** KeyError: 'Commit'

I will create a PR that fixes this by skipping enrichment for items without Author or Commit.

The code does if item['data']['Author'], while it should do if 'Author' in item['data'], IMHO.
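A minimal sketch of the guard described above, assuming raw items are plain dicts shaped like the one in the Pdb dump. The helper name iter_commit_items is hypothetical and is not part of grimoirelab-elk; it only illustrates why a membership test avoids the KeyError that a direct lookup raises.

```python
def iter_commit_items(items):
    """Yield only raw items that look like git commits (hypothetical helper)."""
    for item in items:
        data = item.get('data', {})
        # data['Author'] raises KeyError when the key is missing;
        # a membership test is safe for items from other backends.
        if 'Author' not in data or 'Commit' not in data:
            continue
        yield item


# Made-up sample: one commit-like item and one GitHub-issue-like item.
raw_items = [
    {'backend_name': 'Git',
     'data': {'Author': 'A <a@example.com>', 'Commit': 'A <a@example.com>'}},
    {'backend_name': 'GitHub',
     'data': {'title': 'Bin Pack First Fit'}},
]
kept = list(iter_commit_items(raw_items))
```

Skipping silently hides the underlying problem, so a real fix would probably also log the skipped items.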

@lukaszgryglicki
Contributor Author

Ooooh, I have an idea. This data looks like data from GitHub... That index is very old; maybe somebody put GitHub data into it, and now p2o.py is trying to enrich it as git data. I'll try dropping that index and running the command on a new index instead.

@lukaszgryglicki
Contributor Author

Yes, it seems to be due to junk data in those old indices. After dropping the old raw and enriched indexes, the run is OK, so this can be closed.
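One way to spot this kind of contamination before enriching is to count the backend_name values of the raw items, since the dumped item carries 'backend_name': 'GitHub' and 'category': 'issue'. The sketch below works on an in-memory list of items. The helper names and the expected value 'Git' are assumptions, and fetching the items from Elasticsearch is left out.

```python
from collections import Counter


def backend_counts(items):
    """Count raw items per backend_name (hypothetical helper)."""
    return Counter(item.get('backend_name', '<missing>') for item in items)


def foreign_items(items, expected='Git'):
    """Return items whose backend_name differs from the expected one."""
    return [item for item in items if item.get('backend_name') != expected]


# Made-up sample mimicking a raw index that mixes two backends.
sample = [
    {'backend_name': 'Git', 'category': 'commit', 'uuid': 'a1'},
    {'backend_name': 'GitHub', 'category': 'issue', 'uuid': 'b2'},
    {'backend_name': 'GitHub', 'category': 'issue', 'uuid': 'c3'},
]
counts = backend_counts(sample)
junk = foreign_items(sample)
```

If more than one backend shows up in the counts for an index that is meant to hold git data, recreating the index, as done here, is the simplest remedy.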

@valeriocos
Member

Great, thank you for your feedback, @lukaszgryglicki.
