Skip to content

RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Jedoku
Copy link
Contributor

@Jedoku Jedoku commented Mar 2, 2025

No description provided.

@Jedoku Jedoku requested review from akihikodaki and kostyanf14 March 2, 2025 23:04
Copy link
Contributor

@akihikodaki akihikodaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The title and the content of the change does not match.
Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

@Jedoku
Copy link
Contributor Author

Jedoku commented Mar 12, 2025

The title and the content of the change does not match. Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

This bug on Microsoft side, and i cant reproduce it locally

@akihikodaki
Copy link
Contributor

The title and the content of the change does not match. Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

This bug on Microsoft side, and i cant reproduce it locally

I think it is a bug of rToolsHCK. listtestresults in tools/toolsHCK.ps1 creates a JSON array. It is not written by Microsoft.

@kostyanf14
Copy link
Contributor

Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

@akihikodaki

When we look into logs with this error, we can see action/queue_test: test = bbcc1b46-d0bf-46c8-85b4-2cd62df34a20 and then list_test_results did not find any results.
I do not agree that list_test_results must return an error in this case. The list_test_results is like a filter for all HLK results. We provide data that does not have any results. Ok maybe return [] instead of nil - I agree that nil is strange.

Just a small example in ruby:

3.3.7 :001 > x = [1,2,3]
 => [1, 2, 3] 
3.3.7 :002 > x.select { |t| t > 1 }
 => [2, 3] 
3.3.7 :003 > x.select { |t| t > 5 }
 => [] 

What do you think? [] or nil is a correct behavior?

@Jedoku I don't agree with your solution. Tools#list_test_results is just a wrapper and it should not decide what to do if data is provided without any error. Currently, nil is a VALID data. Maybe someone tried to get all results and provided a test without it, then he will get nil - that means no result.

The fix for this issue should be in the Tests class. The Tests class knows that the test was queued and then results MUST exist. If the results are missing in this case it means action/queue_test failed (IDK the reason for missing error from API). So, the Tests class must perform the queue_test operation again.

You fix do nothing with this problem.

@kostyanf14
Copy link
Contributor

@Jedoku

Solition looks like this:

    def wait_queued_test(id)
      # Add timeout to avoid infinite loop ? (5 minutes)
      # Need to check existing logs to see timeout cases
      loop do
        sleep 5
        results = @tools.list_test_results(id, @target['key'], @client.name, @tag)
        return false if results == nil

        last_result = results.max_by { |k| k['instanceid'].to_i }

        check_test_queued_time

        return true if last_result['status'] == 'InQueue'
        return true if last_result['status'] == 'Running'
        return true if test_finished?(last_result)
      end
    end
    
    def queue_test(test, wait: false)
      for i in 1..5 do
        @tools.queue_test(test_id: test['id'],
                          target_key: @target['key'],
                          machine: @client.name,
                          tag: @tag,
                          support: test_support(test),
                          parameters: test_parameters(test['name']))

        @tests_extra[test['id']] ||= {}
        @tests_extra[test['id']]['queued_at'] = DateTime.now

        @last_queued_id = test['id']

        return unless wait

        return if wait_queued_test(test['id'])
      end

      raise "Failed to queue test #{test['name']} after 5 attempts"
    end

@akihikodaki
Copy link
Contributor

Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

@akihikodaki

When we look into logs with this error, we can see action/queue_test: test = bbcc1b46-d0bf-46c8-85b4-2cd62df34a20 and then list_test_results did not find any results. I do not agree that list_test_results must return an error in this case. The list_test_results is like a filter for all HLK results. We provide data that does not have any results. Ok maybe return [] instead of nil - I agree that nil is strange.

Just a small example in ruby:

3.3.7 :001 > x = [1,2,3]
 => [1, 2, 3] 
3.3.7 :002 > x.select { |t| t > 1 }
 => [2, 3] 
3.3.7 :003 > x.select { |t| t > 5 }
 => [] 

What do you think? [] or nil is a correct behavior?

@kostyanf14 listtestresults in tools/toolsHCK.ps1 always creates a JSON array even when it finds no results. Here, nil is more likely to represent:

  • a bug in the communication code
  • a failure to handle an exception in PowerShell or Ruby

In either case, a correct fix would be to raise an error and trigger a retry.

@kostyanf14
Copy link
Contributor

listtestresults in tools/toolsHCK.ps1 always creates a JSON array even when it finds no results.

NO.
For example https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1841, no results - exceptions

In logs, we see 5 times (tools already implement retry mechanism)

W, [2025-03-16T12:16:59.503462 #734694]  WARN -- : Tools action failure
   -- A target that matches the target's key given was not found in the specified machine, aborting...

this is from https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1820

@akihikodaki
Copy link
Contributor

listtestresults in tools/toolsHCK.ps1 always creates a JSON array even when it finds no results.

NO. For example https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1841, no results - exceptions

In logs, we see 5 times (tools already implement retry mechanism)

W, [2025-03-16T12:16:59.503462 #734694]  WARN -- : Tools action failure
   -- A target that matches the target's key given was not found in the specified machine, aborting...

this is from https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1820

Oh, I see. Raising an exception in such a case is a questionable behavior, but at least the exception should be delivered to the user instead of returning nil.

@kostyanf14
Copy link
Contributor

Raising an exception in such a case is a questionable behavior,

Are you about the A target that matches the target's key given was not found in the specified machine, error? Yes, this is strange, but this is HLK API behavior. The same question to HLK API why QueueTest does not raise an exception when it fails?

but at least the exception should be delivered to the user instead of returning nil.

This is done @logger.warn("Tools action failure#{failure_message}") in Tools#act_with_tools. We got this error and returned nil. Can change this to raise an exception but it will require rewriting all the code.

I am not sure that raising exceptions will help us to fix the real issue.

@akihikodaki
Copy link
Contributor

Raising an exception in such a case is a questionable behavior,

Are you about the A target that matches the target's key given was not found in the specified machine, error? Yes, this is strange, but this is HLK API behavior. The same question to HLK API why QueueTest does not raise an exception when it fails?

I referred to the throw statement in the mentioned line.

but at least the exception should be delivered to the user instead of returning nil.

This is done @logger.warn("Tools action failure#{failure_message}") in Tools#act_with_tools. We got this error and returned nil. Can change this to raise an exception but it will require rewriting all the code.

I am not sure that raising exceptions will help us to fix the real issue.

It's better to change it to raise an exception. Returning nil does not tell about the real cause of the error and makes diagnostics difficult.

@kostyanf14
Copy link
Contributor

I referred to the throw statement in the mentioned line.

Ok.
But A target that matches the target's key given was not found in the specified machine is also a strange thing. Target should exist but test or results no. Why HLK does not report targets is a question.

@kostyanf14
Copy link
Contributor

It's better to change it to raise an exception. Returning nil does not tell about the real cause of the error and makes diagnostics difficult.

Can we split this fix into 2 PRs:

  1. Fix the issue with the current nil behavior for a quick fix
  2. raise an exception and rewrite all code that expects nil

What do you think?

@akihikodaki
Copy link
Contributor

But A target that matches the target's key given was not found in the specified machine is also a strange thing. Target should exist but test or results no. Why HLK does not report targets is a question.

I have no idea. Usually this kind of bugs should be investigated by adding more logs; it will be nice if HLK has some kind of operation logs but I'm not sure if HLK has such a feature.

Can we split this fix into 2 PRs:

1. Fix the issue with the current `nil` behavior for a quick fix

2. raise an exception and rewrite all code that expects `nil`

What do you think?

I think we can skip the quick fix and just do 2. This crash is something hard to reproduce (i.e., not frequent) so I guess we don't need to hurry to make a quick fix.

@akihikodaki
Copy link
Contributor

By the way, I think this pull request should be closed. Apparently Tools#act_with_tools already performs retries so this change is redundant.

@kostyanf14
Copy link
Contributor

I have no idea. Usually this kind of bugs should be investigated by adding more logs; it will be nice if HLK has some kind of operation logs but I'm not sure if HLK has such a feature.

Unfortunately no, but we can just try to dump all the HLK data that we got and then try to analyze it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants