RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612

Jedoku · 2025-03-02T23:04:16Z

No description provided.

akihikodaki

The title and the content of the change does not match.
Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

Jedoku · 2025-03-12T09:22:08Z

The title and the content of the change does not match. Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

This bug on Microsoft side, and i cant reproduce it locally

akihikodaki · 2025-03-12T09:31:25Z

The title and the content of the change does not match. Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

This bug on Microsoft side, and i cant reproduce it locally

I think it is a bug of rToolsHCK. listtestresults in tools/toolsHCK.ps1 creates a JSON array. It is not written by Microsoft.

kostyanf14 · 2025-03-17T15:21:01Z

Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

@akihikodaki

When we look into logs with this error, we can see action/queue_test: test = bbcc1b46-d0bf-46c8-85b4-2cd62df34a20 and then list_test_results did not find any results.
I do not agree that list_test_results must return an error in this case. The list_test_results is like a filter for all HLK results. We provide data that does not have any results. Ok maybe return [] instead of nil - I agree that nil is strange.

Just a small example in ruby:

3.3.7 :001 > x = [1,2,3]
 => [1, 2, 3] 
3.3.7 :002 > x.select { |t| t > 1 }
 => [2, 3] 
3.3.7 :003 > x.select { |t| t > 5 }
 => []

What do you think? [] or nil is a correct behavior?

@Jedoku I don't agree with your solution. Tools#list_test_results is just a wrapper and it should not decide what to do if data is provided without any error. Currently, nil is a VALID data. Maybe someone tried to get all results and provided a test without it, then he will get nil - that means no result.

The fix for this issue should be in the Tests class. The Tests class knows that the test was queued and then results MUST exist. If the results are missing in this case it means action/queue_test failed (IDK the reason for missing error from API). So, the Tests class must perform the queue_test operation again.

You fix do nothing with this problem.

kostyanf14 · 2025-03-17T16:18:05Z

@Jedoku

Solition looks like this:

    def wait_queued_test(id)
      # Add timeout to avoid infinite loop ? (5 minutes)
      # Need to check existing logs to see timeout cases
      loop do
        sleep 5
        results = @tools.list_test_results(id, @target['key'], @client.name, @tag)
        return false if results == nil

        last_result = results.max_by { |k| k['instanceid'].to_i }

        check_test_queued_time

        return true if last_result['status'] == 'InQueue'
        return true if last_result['status'] == 'Running'
        return true if test_finished?(last_result)
      end
    end
    
    def queue_test(test, wait: false)
      for i in 1..5 do
        @tools.queue_test(test_id: test['id'],
                          target_key: @target['key'],
                          machine: @client.name,
                          tag: @tag,
                          support: test_support(test),
                          parameters: test_parameters(test['name']))

        @tests_extra[test['id']] ||= {}
        @tests_extra[test['id']]['queued_at'] = DateTime.now

        @last_queued_id = test['id']

        return unless wait

        return if wait_queued_test(test['id'])
      end

      raise "Failed to queue test #{test['name']} after 5 attempts"
    end

akihikodaki · 2025-03-18T04:16:37Z

Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.

@akihikodaki

When we look into logs with this error, we can see action/queue_test: test = bbcc1b46-d0bf-46c8-85b4-2cd62df34a20 and then list_test_results did not find any results. I do not agree that list_test_results must return an error in this case. The list_test_results is like a filter for all HLK results. We provide data that does not have any results. Ok maybe return [] instead of nil - I agree that nil is strange.

Just a small example in ruby:
3.3.7 :001 > x = [1,2,3]
 => [1, 2, 3] 
3.3.7 :002 > x.select { |t| t > 1 }
 => [2, 3] 
3.3.7 :003 > x.select { |t| t > 5 }
 => [] 
What do you think? [] or nil is a correct behavior?

@kostyanf14 listtestresults in tools/toolsHCK.ps1 always creates a JSON array even when it finds no results. Here, nil is more likely to represent:

a bug in the communication code
a failure to handle an exception in PowerShell or Ruby

In either case, a correct fix would be to raise an error and trigger a retry.

kostyanf14 · 2025-03-18T06:45:51Z

listtestresults in tools/toolsHCK.ps1 always creates a JSON array even when it finds no results.

NO.
For example https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1841, no results - exceptions

In logs, we see 5 times (tools already implement retry mechanism)

W, [2025-03-16T12:16:59.503462 #734694]  WARN -- : Tools action failure
   -- A target that matches the target's key given was not found in the specified machine, aborting...

this is from https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1820

akihikodaki · 2025-03-18T06:51:27Z

listtestresults in tools/toolsHCK.ps1 always creates a JSON array even when it finds no results.

NO. For example https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1841, no results - exceptions

In logs, we see 5 times (tools already implement retry mechanism)
W, [2025-03-16T12:16:59.503462 #734694]  WARN -- : Tools action failure
   -- A target that matches the target's key given was not found in the specified machine, aborting...
this is from https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1820

Oh, I see. Raising an exception in such a case is a questionable behavior, but at least the exception should be delivered to the user instead of returning nil.

kostyanf14 · 2025-03-18T07:48:23Z

Raising an exception in such a case is a questionable behavior,

Are you about the A target that matches the target's key given was not found in the specified machine, error? Yes, this is strange, but this is HLK API behavior. The same question to HLK API why QueueTest does not raise an exception when it fails?

but at least the exception should be delivered to the user instead of returning nil.

This is done @logger.warn("Tools action failure#{failure_message}") in Tools#act_with_tools. We got this error and returned nil. Can change this to raise an exception but it will require rewriting all the code.

I am not sure that raising exceptions will help us to fix the real issue.

akihikodaki · 2025-03-18T07:53:44Z

Raising an exception in such a case is a questionable behavior,

Are you about the A target that matches the target's key given was not found in the specified machine, error? Yes, this is strange, but this is HLK API behavior. The same question to HLK API why QueueTest does not raise an exception when it fails?

I referred to the throw statement in the mentioned line.

but at least the exception should be delivered to the user instead of returning nil.

This is done @logger.warn("Tools action failure#{failure_message}") in Tools#act_with_tools. We got this error and returned nil. Can change this to raise an exception but it will require rewriting all the code.

I am not sure that raising exceptions will help us to fix the real issue.

It's better to change it to raise an exception. Returning nil does not tell about the real cause of the error and makes diagnostics difficult.

kostyanf14 · 2025-03-18T07:58:45Z

I referred to the throw statement in the mentioned line.

Ok.
But A target that matches the target's key given was not found in the specified machine is also a strange thing. Target should exist but test or results no. Why HLK does not report targets is a question.

kostyanf14 · 2025-03-18T08:00:52Z

It's better to change it to raise an exception. Returning nil does not tell about the real cause of the error and makes diagnostics difficult.

Can we split this fix into 2 PRs:

Fix the issue with the current nil behavior for a quick fix
raise an exception and rewrite all code that expects nil

What do you think?

akihikodaki · 2025-03-18T08:11:28Z

But A target that matches the target's key given was not found in the specified machine is also a strange thing. Target should exist but test or results no. Why HLK does not report targets is a question.

I have no idea. Usually this kind of bugs should be investigated by adding more logs; it will be nice if HLK has some kind of operation logs but I'm not sure if HLK has such a feature.

Can we split this fix into 2 PRs:

1. Fix the issue with the current `nil` behavior for a quick fix

2. raise an exception and rewrite all code that expects `nil`

What do you think?

I think we can skip the quick fix and just do 2. This crash is something hard to reproduce (i.e., not frequent) so I guess we don't need to hurry to make a quick fix.

akihikodaki · 2025-03-18T08:14:35Z

By the way, I think this pull request should be closed. Apparently Tools#act_with_tools already performs retries so this change is redundant.

kostyanf14 · 2025-03-18T08:19:21Z

I have no idea. Usually this kind of bugs should be investigated by adding more logs; it will be nice if HLK has some kind of operation logs but I'm not sure if HLK has such a feature.

Unfortunately no, but we can just try to dump all the HLK data that we got and then try to analyze it.

RHELMISC-7214: Try to rerun CI when queue_test failed with crash with

052d94f

Jedoku requested review from akihikodaki and kostyanf14 March 2, 2025 23:04

akihikodaki requested changes Mar 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612

RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612

Uh oh!

Jedoku commented Mar 2, 2025

Uh oh!

akihikodaki left a comment

Uh oh!

Jedoku commented Mar 12, 2025

Uh oh!

akihikodaki commented Mar 12, 2025

Uh oh!

kostyanf14 commented Mar 17, 2025

Uh oh!

kostyanf14 commented Mar 17, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

Uh oh!

RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612

Are you sure you want to change the base?

RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612

Uh oh!

Conversation

Jedoku commented Mar 2, 2025

Uh oh!

akihikodaki left a comment

Choose a reason for hiding this comment

Uh oh!

Jedoku commented Mar 12, 2025

Uh oh!

akihikodaki commented Mar 12, 2025

Uh oh!

kostyanf14 commented Mar 17, 2025

Uh oh!

kostyanf14 commented Mar 17, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

akihikodaki commented Mar 18, 2025

Uh oh!

kostyanf14 commented Mar 18, 2025

Uh oh!

Uh oh!