Add arena helper functions and improve block insertion #210

Merged: 1 commit merged into sysprog21:master from the chore/arena branch on Jun 2, 2025

Contributor

@AW-AlanWu AW-AlanWu commented May 31, 2025

This PR adds several helper functions to the arena allocator and refactors the block insertion strategy in arena_alloc to improve performance. In the benchmarks below, the average elapsed time for compiling src/main.c drops from about 0.31 s to 0.23 s at Stage-0 and from about 2.48 s to 1.76 s at Stage-1 (under qemu-arm).
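
For readers unfamiliar with the idea, here is a minimal sketch of head insertion in an arena's block list. It is not the code from this PR: the type names, field names, `BLOCK_SIZE`, and the 8-byte alignment are all assumptions made for illustration. The point is that a new block is linked in front of the list, so the allocator only ever inspects the head block and never walks the list.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096 /* assumed default capacity, not shecc's value */

/* Hypothetical layout: each block is a bump region linked to older blocks. */
typedef struct arena_block {
    char *memory;
    size_t capacity;
    size_t offset;
    struct arena_block *next;
} arena_block_t;

typedef struct {
    arena_block_t *head; /* newest block; allocations are served from here */
} arena_t;

static arena_block_t *block_create(size_t capacity)
{
    arena_block_t *block = malloc(sizeof(*block));
    if (!block)
        return NULL;
    block->memory = malloc(capacity);
    if (!block->memory) {
        free(block);
        return NULL;
    }
    block->capacity = capacity;
    block->offset = 0;
    block->next = NULL;
    return block;
}

void *arena_alloc(arena_t *arena, size_t size)
{
    assert(arena);
    size = (size + 7) & ~(size_t) 7; /* keep returned pointers 8-byte aligned */

    arena_block_t *block = arena->head;
    if (!block || block->capacity - block->offset < size) {
        /* Head block is missing or full: push a fresh block onto the front. */
        block = block_create(size > BLOCK_SIZE ? size : BLOCK_SIZE);
        if (!block)
            return NULL;
        block->next = arena->head;
        arena->head = block;
    }

    void *ptr = block->memory + block->offset;
    block->offset += size;
    return ptr;
}

/* Release every block at once; individual allocations are never freed. */
void arena_free(arena_t *arena)
{
    assert(arena);
    while (arena->head) {
        arena_block_t *next = arena->head->next;
        free(arena->head->memory);
        free(arena->head);
        arena->head = next;
    }
}
```

In this sketch both the fast path and the new-block path are O(1). The trade-off is that leftover space in older blocks is not revisited once a newer block becomes the head.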

Below are the benchmark scripts and the corresponding results:

Hardware info

CPU: AMD Ryzen 5 7535HS with Radeon Graphics
Cores/Threads: 12 threads
RAM: 30 GiB
OS: Ubuntu 24.04.2 LTS
Kernel: 6.11.0-26-generic
Compiler: gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Benchmark Script on Stage-0

Benchmark Script
#!/bin/bash
export LC_ALL=C

n=15
warmup=5

total_user=0
total_sys=0
total_elapsed=0
total_rss=0

tmp_file="$(mktemp)"

echo "Warming up $warmup times..."
for i in $(seq 1 $warmup); do
    ./out/shecc ./src/main.c >/dev/null 2>&1
done

echo "Running $n benchmarks..."
for i in $(seq 1 $n); do
    /usr/bin/time -f "%U %S %e %M" -o "$tmp_file" ./out/shecc ./src/main.c >/dev/null 2>&1

    # time(1) wrote "user sys elapsed maxrss" to $tmp_file; -r keeps backslashes literal
    read -r user sys elapsed rss < "$tmp_file"

    echo "Run $i: user=${user}s sys=${sys}s elapsed=${elapsed}s maxrss=${rss}KB"

    total_user=$(awk "BEGIN {printf \"%.6f\", $total_user + $user}")
    total_sys=$(awk "BEGIN {printf \"%.6f\", $total_sys + $sys}")
    total_elapsed=$(awk "BEGIN {printf \"%.6f\", $total_elapsed + $elapsed}")
    total_rss=$(awk "BEGIN {print $total_rss + $rss}")
done

rm -f "$tmp_file"

avg_user=$(awk "BEGIN {printf \"%.6f\", $total_user / $n}")
avg_sys=$(awk "BEGIN {printf \"%.6f\", $total_sys / $n}")
avg_elapsed=$(awk "BEGIN {printf \"%.6f\", $total_elapsed / $n}")
avg_rss=$(awk "BEGIN {printf \"%.2f\", $total_rss / $n}")

echo "----------------------------------"
echo "Average user time:    ${avg_user}s"
echo "Average system time:  ${avg_sys}s"
echo "Average elapsed time: ${avg_elapsed}s"
echo "Average max RSS:      ${avg_rss} KB"

Before (Stage-0)

Result
Warming up 5 times...
Running 15 benchmarks...
Run 1: user=0.16s sys=0.15s elapsed=0.32s maxrss=291848KB
Run 2: user=0.16s sys=0.15s elapsed=0.32s maxrss=291528KB
Run 3: user=0.14s sys=0.16s elapsed=0.30s maxrss=291656KB
Run 4: user=0.14s sys=0.16s elapsed=0.31s maxrss=291912KB
Run 5: user=0.16s sys=0.16s elapsed=0.32s maxrss=291912KB
Run 6: user=0.15s sys=0.17s elapsed=0.33s maxrss=291652KB
Run 7: user=0.15s sys=0.16s elapsed=0.32s maxrss=291848KB
Run 8: user=0.16s sys=0.15s elapsed=0.32s maxrss=291720KB
Run 9: user=0.15s sys=0.15s elapsed=0.31s maxrss=291524KB
Run 10: user=0.15s sys=0.16s elapsed=0.31s maxrss=291848KB
Run 11: user=0.15s sys=0.16s elapsed=0.32s maxrss=291720KB
Run 12: user=0.15s sys=0.15s elapsed=0.31s maxrss=291912KB
Run 13: user=0.16s sys=0.14s elapsed=0.31s maxrss=291720KB
Run 14: user=0.14s sys=0.16s elapsed=0.31s maxrss=291784KB
Run 15: user=0.16s sys=0.14s elapsed=0.31s maxrss=291656KB
----------------------------------
Average user time:    0.152000s
Average system time:  0.154667s
Average elapsed time: 0.314667s
Average max RSS:      291749.33 KB

After (Stage-0)

Result
Warming up 5 times...
Running 15 benchmarks...
Run 1: user=0.08s sys=0.15s elapsed=0.23s maxrss=294344KB
Run 2: user=0.09s sys=0.14s elapsed=0.24s maxrss=294344KB
Run 3: user=0.06s sys=0.16s elapsed=0.22s maxrss=294408KB
Run 4: user=0.06s sys=0.15s elapsed=0.22s maxrss=294536KB
Run 5: user=0.06s sys=0.16s elapsed=0.22s maxrss=294344KB
Run 6: user=0.06s sys=0.15s elapsed=0.22s maxrss=294344KB
Run 7: user=0.05s sys=0.16s elapsed=0.22s maxrss=294472KB
Run 8: user=0.07s sys=0.15s elapsed=0.22s maxrss=294408KB
Run 9: user=0.06s sys=0.16s elapsed=0.23s maxrss=294408KB
Run 10: user=0.06s sys=0.15s elapsed=0.22s maxrss=294600KB
Run 11: user=0.06s sys=0.15s elapsed=0.22s maxrss=294472KB
Run 12: user=0.07s sys=0.17s elapsed=0.25s maxrss=294088KB
Run 13: user=0.07s sys=0.15s elapsed=0.23s maxrss=294024KB
Run 14: user=0.08s sys=0.14s elapsed=0.23s maxrss=294088KB
Run 15: user=0.06s sys=0.15s elapsed=0.21s maxrss=294536KB
----------------------------------
Average user time:    0.066000s
Average system time:  0.152667s
Average elapsed time: 0.225333s
Average max RSS:      294361.07 KB

Benchmark Script on Stage-1

Benchmark Script
#!/bin/bash
export LC_ALL=C

n=15
warmup=5

total_user=0
total_sys=0
total_elapsed=0
total_rss=0

tmp_file="$(mktemp)"

echo "Warming up $warmup times..."
for i in $(seq 1 $warmup); do
    qemu-arm ./out/shecc-stage1.elf ./src/main.c >/dev/null 2>&1
done

echo "Running $n benchmarks..."
for i in $(seq 1 $n); do
    /usr/bin/time -f "%U %S %e %M" -o "$tmp_file" qemu-arm ./out/shecc-stage1.elf ./src/main.c >/dev/null 2>&1

    # time(1) wrote "user sys elapsed maxrss" to $tmp_file; -r keeps backslashes literal
    read -r user sys elapsed rss < "$tmp_file"

    echo "Run $i: user=${user}s sys=${sys}s elapsed=${elapsed}s maxrss=${rss}KB"

    total_user=$(awk "BEGIN {printf \"%.6f\", $total_user + $user}")
    total_sys=$(awk "BEGIN {printf \"%.6f\", $total_sys + $sys}")
    total_elapsed=$(awk "BEGIN {printf \"%.6f\", $total_elapsed + $elapsed}")
    total_rss=$(awk "BEGIN {print $total_rss + $rss}")
done

rm -f "$tmp_file"

avg_user=$(awk "BEGIN {printf \"%.6f\", $total_user / $n}")
avg_sys=$(awk "BEGIN {printf \"%.6f\", $total_sys / $n}")
avg_elapsed=$(awk "BEGIN {printf \"%.6f\", $total_elapsed / $n}")
avg_rss=$(awk "BEGIN {printf \"%.2f\", $total_rss / $n}")

echo "----------------------------------"
echo "Average user time:    ${avg_user}s"
echo "Average system time:  ${avg_sys}s"
echo "Average elapsed time: ${avg_elapsed}s"
echo "Average max RSS:      ${avg_rss} KB"

Before (Stage-1)

Result
Warming up 5 times...
Running 15 benchmarks...
Run 1: user=1.36s sys=1.12s elapsed=2.49s maxrss=638864KB
Run 2: user=1.35s sys=1.14s elapsed=2.50s maxrss=639104KB
Run 3: user=1.38s sys=1.10s elapsed=2.49s maxrss=638936KB
Run 4: user=1.35s sys=1.14s elapsed=2.49s maxrss=638864KB
Run 5: user=1.36s sys=1.10s elapsed=2.47s maxrss=638664KB
Run 6: user=1.36s sys=1.09s elapsed=2.46s maxrss=638884KB
Run 7: user=1.34s sys=1.10s elapsed=2.46s maxrss=638920KB
Run 8: user=1.36s sys=1.10s elapsed=2.47s maxrss=638796KB
Run 9: user=1.39s sys=1.10s elapsed=2.50s maxrss=638600KB
Run 10: user=1.34s sys=1.10s elapsed=2.45s maxrss=638800KB
Run 11: user=1.40s sys=1.08s elapsed=2.49s maxrss=638684KB
Run 12: user=1.36s sys=1.09s elapsed=2.47s maxrss=639060KB
Run 13: user=1.41s sys=1.07s elapsed=2.49s maxrss=638712KB
Run 14: user=1.36s sys=1.12s elapsed=2.49s maxrss=639016KB
Run 15: user=1.35s sys=1.10s elapsed=2.48s maxrss=638500KB
----------------------------------
Average user time:    1.364667s
Average system time:  1.103333s
Average elapsed time: 2.480000s
Average max RSS:      638826.93 KB

After (Stage-1)

Result
Warming up 5 times...
Running 15 benchmarks...
Run 1: user=0.73s sys=1.03s elapsed=1.77s maxrss=435132KB
Run 2: user=0.72s sys=1.06s elapsed=1.80s maxrss=434956KB
Run 3: user=0.70s sys=1.05s elapsed=1.76s maxrss=435108KB
Run 4: user=0.69s sys=1.05s elapsed=1.76s maxrss=435104KB
Run 5: user=0.67s sys=1.06s elapsed=1.74s maxrss=435448KB
Run 6: user=0.69s sys=1.04s elapsed=1.74s maxrss=435168KB
Run 7: user=0.68s sys=1.05s elapsed=1.75s maxrss=435396KB
Run 8: user=0.68s sys=1.06s elapsed=1.75s maxrss=435556KB
Run 9: user=0.69s sys=1.09s elapsed=1.79s maxrss=434892KB
Run 10: user=0.69s sys=1.05s elapsed=1.74s maxrss=435560KB
Run 11: user=0.69s sys=1.06s elapsed=1.76s maxrss=434988KB
Run 12: user=0.72s sys=1.02s elapsed=1.75s maxrss=434968KB
Run 13: user=0.68s sys=1.05s elapsed=1.75s maxrss=435432KB
Run 14: user=0.69s sys=1.05s elapsed=1.76s maxrss=434868KB
Run 15: user=0.71s sys=1.04s elapsed=1.76s maxrss=435092KB
----------------------------------
Average user time:    0.695333s
Average system time:  1.050667s
Average elapsed time: 1.758667s
Average max RSS:      435177.87 KB
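
To put the averages above in relative terms, here is a small sketch that computes the percentage reduction for each metric. The helper names (`reduction_pct`, `print_row`, `print_summary`) are made up for this example; the input values are copied from the Stage-0 and Stage-1 (qemu-arm) averages reported above. By this arithmetic, elapsed time is roughly 28% lower at Stage-0 and roughly 29% lower at Stage-1, Stage-1 max RSS is roughly 32% lower, and Stage-0 max RSS is slightly higher (about 0.9%).

```c
#include <assert.h>
#include <stdio.h>

/* Relative reduction in percent; a negative result means the value grew. */
double reduction_pct(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* Print one before/after row with its relative change. */
void print_row(const char *label, double before, double after)
{
    printf("%-22s %12.6f -> %12.6f  %6.1f%% lower\n", label, before, after,
           reduction_pct(before, after));
}

/* Summarize the averages copied from the benchmark output above. */
void print_summary(void)
{
    print_row("Stage-0 user (s)", 0.152000, 0.066000);
    print_row("Stage-0 elapsed (s)", 0.314667, 0.225333);
    print_row("Stage-0 max RSS (KB)", 291749.33, 294361.07);
    print_row("Stage-1 user (s)", 1.364667, 0.695333);
    print_row("Stage-1 elapsed (s)", 2.480000, 1.758667);
    print_row("Stage-1 max RSS (KB)", 638826.93, 435177.87);
}
```

Calling `print_summary()` from `main` prints one row per metric. The Stage-0 max RSS row comes out negative, which reflects the small increase in peak memory there.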

Summary by Bito

This pull request enhances arena memory management by introducing new helper functions and refactoring the block insertion strategy in `arena_alloc`, leading to improved performance and reduced allocation times. Additionally, error handling has been improved with clearer failure messages, enhancing code robustness.

@jserv jserv requested review from ChAoSUnItY, DrXiao and fennecJ June 1, 2025 06:10
@AW-AlanWu AW-AlanWu force-pushed the chore/arena branch 2 times, most recently from 9575ef2 to 53bec08 Compare June 1, 2025 07:57
Collaborator

@DrXiao DrXiao left a comment

The new functions arena_calloc(), arena_realloc(), arena_strdup(), and arena_memdup() have been added but are not called anywhere. I am unsure whether it is appropriate to keep unused functions in the codebase.

@AW-AlanWu
Contributor Author

AW-AlanWu commented Jun 1, 2025

Based on my current observations, most memory allocations in shecc follow a unified allocate-and-release model. This makes almost all of them suitable for replacing the current use of malloc, calloc, and free with the arena allocator. I plan to start working on this replacement soon.

In the short term, I'm also working on improving the current strbuf_t implementation by introducing a general-purpose dynamic array in shecc. This new array structure can not only replace the existing strbuf_t, but also serve as a replacement for many of the fixed-size arrays currently used in shecc. In scenarios with frequent push operations, using arena_realloc can lead to performance improvements.
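
As a rough illustration of why a realloc-style arena call can help push-heavy code, here is a sketch under stated assumptions. It uses a fixed-size bump arena and an `arena_realloc` that grows the most recent allocation in place instead of copying. The names `int_vec_t` and `vec_push`, the `ARENA_CAPACITY` constant, and the in-place growth rule are all hypothetical; nothing here is taken from shecc, and the planned dynamic array may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_CAPACITY 4096 /* assumed size for the sketch (multiple of 8) */

typedef struct {
    char buf[ARENA_CAPACITY];
    size_t offset;
} arena_t;

static size_t align8(size_t n)
{
    return (n + 7) & ~(size_t) 7;
}

void *arena_alloc(arena_t *arena, size_t size)
{
    assert(arena);
    if (size > ARENA_CAPACITY)
        return NULL;
    size = align8(size);
    if (size > ARENA_CAPACITY - arena->offset)
        return NULL;
    void *ptr = arena->buf + arena->offset;
    arena->offset += size;
    return ptr;
}

/* Grow in place when 'old' is the newest allocation, else allocate and copy. */
void *arena_realloc(arena_t *arena, void *old, size_t old_size, size_t new_size)
{
    assert(arena);
    if (!old)
        return arena_alloc(arena, new_size);
    if (new_size <= old_size)
        return old;
    if (new_size > ARENA_CAPACITY)
        return NULL;
    size_t extra = align8(new_size) - align8(old_size);
    if ((char *) old + align8(old_size) == arena->buf + arena->offset) {
        if (extra > ARENA_CAPACITY - arena->offset)
            return NULL;
        arena->offset += extra; /* no copy needed */
        return old;
    }
    void *ptr = arena_alloc(arena, new_size);
    if (ptr)
        memcpy(ptr, old, old_size);
    return ptr;
}

/* Hypothetical growable array of int backed by the arena. */
typedef struct {
    int *data;
    size_t len, cap;
} int_vec_t;

/* Append a value, doubling capacity when full. Returns 0 on success. */
int vec_push(arena_t *arena, int_vec_t *vec, int value)
{
    if (vec->len == vec->cap) {
        size_t new_cap = vec->cap ? vec->cap * 2 : 8;
        int *data = arena_realloc(arena, vec->data, vec->cap * sizeof(int),
                                  new_cap * sizeof(int));
        if (!data)
            return -1;
        vec->data = data;
        vec->cap = new_cap;
    }
    vec->data[vec->len++] = value;
    return 0;
}
```

When the array is the only thing being allocated, every growth step in this sketch just bumps the offset, so repeated pushes never copy the existing elements.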

Additionally, I’m exploring optimizations like improving cache locality by replacing the linked list used in the internal structure of the hashmap.

Collaborator

@jserv jserv left a comment

Always make Git commit messages informative: explain the motivation and the proposed code changes.

Collaborator

@jserv jserv left a comment

Avoid using backticks in commit messages. Backticks can be easily confused with single quotes on some terminals, reducing readability. Plain text or single quotes provide sufficient clarity and emphasis.

Collaborator

@ChAoSUnItY ChAoSUnItY left a comment

Overall changes are fine. Can you also show how this patch would affect stage 1 binary compilation performance?

@AW-AlanWu
Contributor Author

> Overall changes are fine. Can you also show how this patch would affect stage 1 binary compilation performance?

The Stage-1 benchmark has already been added to the PR description above.

Since I don't have ARM hardware on hand, I can only use QEMU-ARM for simulation to run benchmarks.
Are the results obtained this way acceptable?

@jserv
Collaborator

jserv commented Jun 2, 2025

> Since I don't have ARM hardware on hand, I can only use QEMU-ARM for simulation to run benchmarks. Are the results obtained this way acceptable?

No, wait for empirical confirmation from @ChAoSUnItY.

Refactor 'arena_alloc' to insert newly allocated blocks at the head of
the block list, improving allocation efficiency.
This reduces pointer chasing during allocation and enables recently
freed memory to be reused more quickly.

Introduce 'arena_calloc', 'arena_memdup', 'arena_strdup', and
'arena_realloc' to extend the arena allocator's API:

- 'arena_calloc': Allocate zero-initialized memory blocks.
- 'arena_memdup': Duplicate arbitrary data into arena-managed memory.
- 'arena_strdup': Copy null-terminated strings into the arena.
- 'arena_realloc': Resize existing arena-managed allocations.

These improvements lay the foundation for replacing standard 'malloc',
'calloc', 'realloc', and 'free' with the arena allocator, providing
better performance, consistent memory handling, and simpler batch
deallocation throughout the project.
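
A sketch of what the four helpers described above might look like when layered on a basic arena_alloc. This is an illustration, not the merged code: the signatures, the fixed-buffer `arena_t` stand-in, and the copy-on-grow behavior of `arena_realloc` are all assumptions, and shecc's real arena manages a list of blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARENA_CAPACITY 4096 /* assumed size for the sketch (multiple of 8) */

/* Stand-in arena: one fixed buffer with a bump offset. */
typedef struct {
    char buf[ARENA_CAPACITY];
    size_t offset;
} arena_t;

void *arena_alloc(arena_t *arena, size_t size)
{
    assert(arena);
    if (size > ARENA_CAPACITY)
        return NULL;
    size = (size + 7) & ~(size_t) 7; /* 8-byte alignment */
    if (size > ARENA_CAPACITY - arena->offset)
        return NULL;
    void *ptr = arena->buf + arena->offset;
    arena->offset += size;
    return ptr;
}

/* Allocate n * size bytes, all zero; rejects multiplication overflow. */
void *arena_calloc(arena_t *arena, size_t n, size_t size)
{
    if (size && n > SIZE_MAX / size)
        return NULL;
    void *ptr = arena_alloc(arena, n * size);
    if (ptr)
        memset(ptr, 0, n * size);
    return ptr;
}

/* Copy 'size' bytes of arbitrary data into arena-managed memory. */
void *arena_memdup(arena_t *arena, const void *data, size_t size)
{
    assert(data);
    void *ptr = arena_alloc(arena, size);
    if (ptr)
        memcpy(ptr, data, size);
    return ptr;
}

/* Copy a null-terminated string, including its terminator. */
char *arena_strdup(arena_t *arena, const char *str)
{
    assert(str);
    return arena_memdup(arena, str, strlen(str) + 1);
}

/* Resize by allocating a larger region and copying; the old region stays
 * in the arena until the whole arena is released. */
void *arena_realloc(arena_t *arena, void *old, size_t old_size, size_t new_size)
{
    if (new_size <= old_size)
        return old;
    void *ptr = arena_alloc(arena, new_size);
    if (ptr && old)
        memcpy(ptr, old, old_size);
    return ptr;
}
```

Note that this version of `arena_realloc` needs the old size from the caller, because a bump allocator of this kind keeps no per-allocation header.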
Collaborator

@ChAoSUnItY ChAoSUnItY left a comment

I've confirmed the following benchmark result generated from stage 1 compilation on Raspberry Pi 4B:

Before

Result
Warming up 5 times...
Running 15 benchmarks...
Run 1: user=5.28s sys=1.92s elapsed=7.21s maxrss=628864KB
Run 2: user=5.12s sys=2.02s elapsed=7.15s maxrss=628864KB
Run 3: user=5.16s sys=1.96s elapsed=7.13s maxrss=628864KB
Run 4: user=5.26s sys=1.92s elapsed=7.20s maxrss=628864KB
Run 5: user=5.17s sys=2.00s elapsed=7.18s maxrss=628864KB
Run 6: user=5.19s sys=1.97s elapsed=7.16s maxrss=628864KB
Run 7: user=5.25s sys=1.95s elapsed=7.21s maxrss=628736KB
Run 8: user=5.20s sys=1.96s elapsed=7.18s maxrss=628864KB
Run 9: user=5.13s sys=1.96s elapsed=7.11s maxrss=628864KB
Run 10: user=5.24s sys=1.89s elapsed=7.14s maxrss=628864KB
Run 11: user=5.26s sys=1.93s elapsed=7.21s maxrss=628864KB
Run 12: user=5.33s sys=1.90s elapsed=7.24s maxrss=628864KB
Run 13: user=5.21s sys=1.99s elapsed=7.21s maxrss=628864KB
Run 14: user=5.25s sys=1.96s elapsed=7.22s maxrss=628864KB
Run 15: user=5.06s sys=2.11s elapsed=7.18s maxrss=628864KB
----------------------------------
Average user time:    5.207333s
Average system time:  1.962667s
Average elapsed time: 7.182000s
Average max RSS:      628855.47 KB

After

Result
Warming up 5 times...
Running 15 benchmarks...
Run 1: user=0.74s sys=1.71s elapsed=2.47s maxrss=424832KB
Run 2: user=0.73s sys=1.77s elapsed=2.51s maxrss=424960KB
Run 3: user=0.74s sys=1.71s elapsed=2.47s maxrss=424960KB
Run 4: user=0.80s sys=1.66s elapsed=2.47s maxrss=424832KB
Run 5: user=0.81s sys=1.64s elapsed=2.46s maxrss=424960KB
Run 6: user=0.72s sys=1.73s elapsed=2.47s maxrss=424960KB
Run 7: user=0.70s sys=1.74s elapsed=2.46s maxrss=424960KB
Run 8: user=0.77s sys=1.69s elapsed=2.47s maxrss=424832KB
Run 9: user=0.77s sys=1.68s elapsed=2.47s maxrss=424960KB
Run 10: user=0.75s sys=1.73s elapsed=2.49s maxrss=424960KB
Run 11: user=0.66s sys=1.80s elapsed=2.46s maxrss=424960KB
Run 12: user=0.66s sys=1.79s elapsed=2.46s maxrss=424960KB
Run 13: user=0.71s sys=1.75s elapsed=2.47s maxrss=424832KB
Run 14: user=0.75s sys=1.70s elapsed=2.47s maxrss=424960KB
Run 15: user=0.70s sys=1.77s elapsed=2.48s maxrss=424960KB
----------------------------------
Average user time:    0.734000s
Average system time:  1.724667s
Average elapsed time: 2.472000s
Average max RSS:      424925.87 KB

Benchmark machine

User: chaos@raspberrypi
OS: Raspbian GNU/Linux 12 (bookworm) aarch64
Host: Raspberry Pi 4 Model B Rev 1.5
Kernel: 6.6.51+rpt-rpi-v8
Uptime: 32 mins
Packages: 1604 (dpkg)
Shell: bash 5.2.15
Resolution: 1920x1080
DE: wlroots
Theme: PiXflat [GTK3]
Icons: PiXflat [GTK3]
Terminal: lxterminal
Terminal Font: Monospace 10
CPU: (4) @ 1.800GHz
Memory: 692MiB / 3791MiB
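
As a quick reading of the Raspberry Pi 4B averages above, here is the arithmetic written out. The inputs are copied from the two result blocks; the rounding is mine.

```latex
\[
\text{elapsed speedup} = \frac{7.182000}{2.472000} \approx 2.91\times,
\qquad
\text{user-time speedup} = \frac{5.207333}{0.734000} \approx 7.09\times
\]
\[
\text{max RSS reduction} = \frac{628855.47 - 424925.87}{628855.47} \approx 32.4\%
\]
```

System time changes much less (about 1.96 s before, 1.72 s after), so nearly all of the gain on this machine comes from user time.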

@jserv jserv merged commit c508a38 into sysprog21:master Jun 2, 2025
6 checks passed
@jserv
Collaborator

jserv commented Jun 2, 2025

Thanks, @AW-AlanWu, for contributing!

@AW-AlanWu AW-AlanWu deleted the chore/arena branch June 2, 2025 07:59