(Codex-friendly problem spec)
- Goal: Implement a single-file program (e.g. `main.c`) that:
  - Reads one test case from stdin in the format below.
  - Outputs one line: a sequence of actions (a_1,\dots,a_T) that maximizes the expected total reward minus switching penalty.
  - Works for all valid inputs within the constraints, not just the sample.
- Language: C (C11 / GCC 10).
- I/O rules (very important):
  - Input: read from `stdin` only.
  - Output: exactly one line in the format `[` + integers separated by `", "` + `]`, e.g. `[1, 1, 1, 0, 0]`.
  - No extra debug prints.
You control a smart sleep-aid pillow. Each minute (t = 1,2,\dots,T), the brain is in one of (N) discrete states (e.g. Awake, Light Sleep, Deep Sleep, REM, Waked Up).
You can choose one of (M) soundscapes (actions) to play: [ a_t \in {0,1,\dots,M-1} ]
- The brain’s state transitions stochastically according to transition matrices provided in the input.
- Each state has a reward (R(s, t)), possibly time-varying (periodic cosine wandering).
- Changing soundscape from minute (t-1) to (t) incurs a switching penalty depending on the state transition.
Your job is to output a sequence of actions (A = [a_1,\dots,a_T]) that maximizes the expected total sleep score over the session.
Let:
- (T): total duration (minutes).
- (N): number of sleep states.
- (M): number of actions (soundscapes).
- (S_t \in {S_0,\dots,S_{N-1}}): state at minute (t).
- (a_t \in {0,\dots,M-1}): chosen action at minute (t).
- (L): cycle length of reward wandering (if (L = 0), rewards are static).
- (\mu): average reward centroid.
- (R_{\text{initial}}(s)): base reward of state (s) given in input.
- (R(s,t)): reward of state (s) at time (t).
- (P^{(a)}_{i,j}): probability of transitioning from state (i) to state (j) under action (a).
- (C_{i,j}): switching penalty when the brain transitions from state (i) to state (j) and we changed action at this step.
- (I(\cdot)): indicator function (1 if condition holds, else 0).
- If (L = 0): [ R(s,t) = R_{\text{initial}}(s) ]
- If (L > 0): reward wanders periodically: [ \mu = \frac{1}{N} \sum_{i=0}^{N-1} R_{\text{initial}}(S_i) ] [ R(s,t) = \mu + (R_{\text{initial}}(s) - \mu)\cdot \cos\left(\frac{2\pi t}{L}\right) ]
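For illustration, here is a minimal C helper that evaluates (R(s,t)) exactly as defined above. The function name `rewardAt` is illustrative, not part of the spec; it computes pi via `acos(-1.0)` to stay within strict C11.

```c
#include <math.h>

/* R(s, t) as defined above: static if L == 0, otherwise a cosine that
 * wanders around the mean reward mu with period L. */
double rewardAt(int s, int t, int L, int N, const int *baseReward) {
    if (L == 0) return (double)baseReward[s];
    double mu = 0.0;
    for (int i = 0; i < N; ++i) mu += baseReward[i];
    mu /= N;
    const double PI = acos(-1.0);
    return mu + (baseReward[s] - mu) * cos(2.0 * PI * t / L);
}
```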
At time (t \ge 2), if the action changes ((a_t \neq a_{t-1})), a penalty is applied that depends on the state transition ((S_{t-1} \to S_t)):
[ \text{Penalty}_t = I(a_t \neq a_{t-1}) \cdot C_{S_{t-1},S_t} ]
At (t=1), treat (a_0 = 0) (the “initial” action) for the indicator.
Uniform penalty mode: In the actual input format used here, Type=0 means “no switching cost”: all (C_{i,j}=0). Type=1 means a general penalty matrix with values from the input.
Given an action sequence (A=[a_1,\dots,a_T]), the scoring system computes the expected total sleep score:
[ J(A) = \mathbb{E}\left[\sum_{t=1}^{T} \left( R(S_t,t) - \text{Penalty}_t \right)\right] ]
The expectation is over the random Markov transitions defined by the matrices (P^{(a)}).
Your algorithm does not need to simulate randomness for scoring. You just need to output (A); the judge will compute (J(A)) using exact matrix operations.
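Still, an exact local evaluator is handy for testing a planner. As a sketch of what such an evaluation looks like: propagate the state distribution through the chosen transition matrix each minute and accumulate the expected reward and expected penalty. The initial state distribution is not stated in this spec, so the code below assumes a point mass on state 0; `P` and `C` are flattened row-major arrays and `rewardAt` is the helper sketched earlier — all of these names are illustrative.

```c
/* Hedged local evaluator for J(A). Assumptions not fixed by the spec:
 * the session starts deterministically in state 0; a_0 = 0 is the initial
 * action for the switching indicator (as stated above).
 * P is M*N*N row-major (P[a*N*N + i*N + j]), C is N*N, a[0] holds a_1. */
double evaluate(const int *a, int T, int N, int L,
                const double *P, const int *C, const int *baseReward) {
    double dist[64] = {0.0}, next[64];   /* assumes N <= 64 */
    dist[0] = 1.0;                       /* assumed initial state */
    int prevAction = 0;
    double J = 0.0;
    for (int t = 1; t <= T; ++t) {
        int act = a[t - 1];
        const double *Pa = P + (size_t)act * N * N;
        for (int j = 0; j < N; ++j) next[j] = 0.0;
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                double p = dist[i] * Pa[i * N + j];
                next[j] += p;                      /* distribution update */
                if (act != prevAction)
                    J -= p * C[i * N + j];         /* expected switching penalty */
            }
        }
        for (int j = 0; j < N; ++j) {
            J += next[j] * rewardAt(j, t, L, N, baseReward);  /* expected reward */
            dist[j] = next[j];
        }
        prevAction = act;
    }
    return J;
}
```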
All data is read from standard input.
High-level structure:
T N M L Type
R0 R1 ... R(N-1)
[Switching Penalty Matrix] // N x N integers
[Transition Matrix for Action 0] // N x N doubles
[Transition Matrix for Action 1] // N x N doubles
...
[Transition Matrix for Action M-1] // N x N doubles
T N M L Type
- `T` – int – Total duration (minutes).
  - Constraints: (1 \le T \le 10000).
- `N` – int – Number of states.
  - Constraints: (1 \le N \le 10).
- `M` – int – Number of actions.
  - Constraints: (1 \le M \le 500).
- `L` – int – Reward wandering cycle length.
  - If `L = 0`, rewards are static: (R(s,t) = R_{\text{initial}}(s)).
- `Type` – int (mode identifier).
  - `Type = 0`: uniform penalty mode (effectively no switching penalty, all (C_{i,j}=0)).
  - `Type = 1`: general matrix penalty mode (use full (C_{i,j}) from input).
R0 R1 ... R(N-1)
`N` integers. `Rk` is the base reward (R_{\text{initial}}(S_k)) for state (S_k).
Next is an (N \times N) integer matrix:
// N lines follow:
c00 c01 ... c0(N-1)
c10 c11 ... c1(N-1)
...
c(N-1)0 ... c(N-1)(N-1)
- If `Type = 1`: `C[i][j] = cij` is the cost of switching soundscape when the brain transitions from state `i` to state `j`.
- If `Type = 0`:
  - All entries are zero.
  - The matrix is still provided for format consistency, but your algorithm can treat all penalties as zero.
Then we have M blocks. Each block is an (N \times N) matrix of doubles, giving the transition probabilities for one action:
For each action a = 0, 1, ..., M-1:
// block for action a, consists of N lines:
p00 p01 ... p0(N-1) // from state 0
p10 p11 ... p1(N-1) // from state 1
...
p(N-1)0 ... p(N-1)(N-1) // from state N-1
- `pij` is (P^{(a)}_{i,j} = P(S_t = j | S_{t-1} = i, a_t = a)).
- All `pij` are floating-point numbers in `[0.0, 1.0]`.
- For each row (fixed `i`), the sum over `j` is guaranteed to be 1.0.
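A minimal parsing sketch in C matching the layout above, using flattened allocations (identifiers such as `P`, `C`, and `baseReward` are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int T, N, M, L, Type;
    if (scanf("%d %d %d %d %d", &T, &N, &M, &L, &Type) != 5) return 1;

    int    *baseReward = malloc((size_t)N * sizeof *baseReward);
    int    *C = malloc((size_t)N * N * sizeof *C);        /* C[i*N + j]         */
    double *P = malloc((size_t)M * N * N * sizeof *P);    /* P[a*N*N + i*N + j] */
    if (!baseReward || !C || !P) return 1;

    for (int i = 0; i < N; ++i) scanf("%d", &baseReward[i]);
    for (int i = 0; i < N * N; ++i) scanf("%d", &C[i]);   /* all zeros when Type == 0 */
    for (long k = 0; k < (long)M * N * N; ++k) scanf("%lf", &P[k]);

    /* ... plan the action sequence and print it here ... */

    free(P); free(C); free(baseReward);
    return 0;
}
```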
You must print a single line:
[a1, a2, ..., aT]
- The line must:
  - Start with `[` and end with `]`.
  - Contain exactly `T` integers.
  - Each integer is an action ID in the range `0` to `M-1`.
  - Between two integers, use exactly `", "` (comma + space).
- Example (for `T = 20`):
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0]
Any illegal output (e.g. action ID < 0 or ≥ M, wrong format) will cause that test case’s score to be 0.
The online judge (e.g. CodeGrade) will run your program on 10 different test inputs. You only see the sample; the rest are hidden.
Three groups:
- Basic cases (30%)
  - `Type = 0` (no penalty). `L = 0` (static rewards).
  - Rough size: `T ≤ 50`, `M = 10`, `N = 5`.
- Performance cases (40%)
  - `Type = 1` (matrix penalty). `L = 0` (static rewards).
  - Rough size: `1000 ≤ T ≤ 3000`, `M = 80`, `N = 40`.
- Ultimate cases (30%)
  - `Type = 1` (matrix penalty). `L = T` (rewards fully time-variant over the whole horizon).
  - Rough size: `1000 ≤ T ≤ 3000`, `M = 80`, `N = 40`.
Your single program must handle all of these ranges robustly (time and memory).
For each of the 10 hidden test cases (i=1,...,10):
- The judge computes:
  - ( \text{ScoreRaw}_i = J(A_i) ) (expected total reward of your sequence (A_i)).
  - There is a standard baseline score (\text{ScoreStandard}_i) for each test.
  - Your base score on that test: [ \text{ScoreBase}_i = \min(\text{ScoreRaw}_i, \text{ScoreStandard}_i) ]
  - If you beat the baseline ((\text{ScoreRaw}_i > \text{ScoreStandard}_i)), you also get a bonus that depends on:
    - How much you exceed the standard.
    - Your program's running time (T^{\text{use}}_i) on that test.
Your overall competition score is the sum over all 10 test cases of base + bonus.
Key implications for the code generator:
- Correctness first: invalid output or runtime error ⇒ 0 on that test.
- Quality of policy matters: try to genuinely maximize expected reward.
- Speed matters: asymptotically faster algorithms can earn more bonus on large tests.
When generating the C solution, obey the following:
- Input parsing
  - Use `scanf`/`fscanf(stdin, ...)` to read: `T`, `N`, `M`, `L`, `Type`; `N` integers for base rewards; `N x N` integers for the penalty matrix (store or ignore if `Type == 0`); `M` blocks of `N x N` doubles for transition matrices.
  - Allocate arrays dynamically based on `N` and `M` (e.g. `malloc`).
- Data structures
  - Suggested: `double P[M][N][N]`, or a flattened `double *P` if the stack is insufficient; `int C[N][N]` for penalties; `int baseReward[N]`.
  - For planning, you may maintain state distributions `double dist[N]` and/or DP arrays.
- Algorithm (high-level only)
  - This is a finite-horizon stochastic control / MDP problem.
  - Naive brute-force over all action sequences is impossible ((M^T) sequences).
  - The algorithm should:
    - Use the transition matrices and rewards to evaluate the expected value of candidate policies.
    - Consider the switching penalty when changing actions.
    - Exploit the relatively small `N` (≤ 10) to do dynamic programming or clever approximate planning; a greedy baseline sketch is given after this list.
- Output
  - After computing your action sequence `a[0..T-1]`, print it once as:
      printf("[");
      for (int t = 0; t < T; ++t) { if (t > 0) printf(", "); printf("%d", a[t]); }
      printf("]");
  - Optionally append a newline.
- No extra output
  - Do not print debugging lines, labels, or spaces outside the required format.
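One simple baseline consistent with the guidance above is a greedy one-step lookahead: track the state distribution minute by minute and, at each minute, pick the action whose immediate expected reward minus expected switching penalty is largest. This is only a sketch of an approximate planner (it ignores future consequences of the current choice, and it assumes a start in state 0, which the spec does not fix); `rewardAt` and the flattened `P`/`C` layout are as in the earlier sketches.

```c
/* Greedy one-step-lookahead sketch (approximate planning, not guaranteed optimal).
 * Assumptions: dist[] starts as a point mass on state 0 (initial state not given
 * by the spec); P is M*N*N row-major; C is N*N; rewardAt() as sketched earlier. */
void greedyPlan(int *a, int T, int N, int M, int L,
                const double *P, const int *C, const int *baseReward) {
    double dist[64] = {0.0}, next[64], best[64];   /* assumes N <= 64 */
    dist[0] = 1.0;
    int prevAction = 0;                            /* a_0 = 0 per the spec */
    for (int t = 1; t <= T; ++t) {
        double bestVal = -1e300;
        int bestAct = 0;
        for (int act = 0; act < M; ++act) {
            const double *Pa = P + (size_t)act * N * N;
            double val = 0.0;
            for (int j = 0; j < N; ++j) next[j] = 0.0;
            for (int i = 0; i < N; ++i) {
                for (int j = 0; j < N; ++j) {
                    double p = dist[i] * Pa[i * N + j];
                    next[j] += p;                              /* distribution after action */
                    if (act != prevAction) val -= p * C[i * N + j];  /* expected penalty */
                }
            }
            for (int j = 0; j < N; ++j)
                val += next[j] * rewardAt(j, t, L, N, baseReward); /* expected reward */
            if (val > bestVal) {
                bestVal = val;
                bestAct = act;
                for (int j = 0; j < N; ++j) best[j] = next[j];
            }
        }
        a[t - 1] = bestAct;
        for (int j = 0; j < N; ++j) dist[j] = best[j];
        prevAction = bestAct;
    }
}
```

Beam search or a deeper lookahead over the same quantities can improve on this greedy baseline at the cost of more computation. The sample input reproduced below can be used to exercise such a program end to end.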
20 5 3 20 1
0 20 50 100 0
0 10 20 50 0
10 0 10 20 0
20 10 0 10 0
50 20 10 0 0
0 0 0 0 0
1.0 0.0 0.0 0.0 0.0
0.9 0.1 0.0 0.0 0.0
0.5 0.5 0.0 0.0 0.0
0.1 0.1 0.8 0.0 0.0
0.0 0.0 0.1 0.9 0.0
0.5 0.5 0.0 0.0 0.0
0.1 0.8 0.1 0.0 0.0
0.0 0.2 0.7 0.1 0.0
0.0 0.0 0.2 0.8 0.0
1.0 0.0 0.0 0.0 0.0
0.2 0.2 0.2 0.2 0.2
0.2 0.2 0.2 0.2 0.2
0.2 0.2 0.2 0.2 0.2
0.2 0.2 0.2 0.2 0.2
1.0 0.0 0.0 0.0 0.0
- `T=20`, `N=5`, `M=3`, `L=20`, `Type=1`.
- Next line: the 5 base rewards; the following 5 lines are the `5x5` penalty matrix (see the full PDF if needed).
- Next: 3 blocks of `5x5` doubles: transition matrices for actions 0, 1, 2.
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0]
This is just an example action sequence. The judge will compute its expected score using the given model.