
Commit eb4b1a1

Merge pull request #134 from JinZhou5042/jinzhou
update Q2 progress & next steps
2 parents a2832fc + 1caad37


1 file changed: +18 −3 lines changed


pages/postdocs/jinzhou.md

Lines changed: 18 additions & 3 deletions
@@ -14,9 +14,10 @@ institution: University of Notre Dame
 
 project_title: Scalable Data Analysis Applications for High Energy Physics
 project_goal: >
-  - Accelerate the execution of CMS analysis applications.
-  - Reduce storage consumption to enable more ambitious computations.
-  - Enhance fault tolerance by breaking long tasks into smaller ones and implementing effective checkpointing strategies.
+  - Accelerate CMS analysis workflows, focusing on those using Coffea, Dask, and TaskVine.
+  - Reduce storage usage in data-intensive workflows to support more ambitious computations.
+  - Improve fault tolerance on unreliable clusters through replication and checkpointing.
+  - Explore graph optimization strategies to minimize makespan using real-time information.
 
 mentors:
   - Douglas Thain (Cooperative Computing Lab, University of Notre Dame)
@@ -37,4 +38,18 @@ current_status: >
 * Develop an algorithm that divides long-running tasks in DV5 into smaller ones; this reduces the overhead of rerunning tasks after worker evictions but increases the latency of scheduling many small tasks, so the next step is to strike a balance between scheduling overhead and fault tolerance.
 * Develop an algorithm that checkpoints remote temp files in a timely manner to reduce the risk of losing critical files.
 
+<br>
+<b>2025 Q2</b>
+<br>
+
+* Progress
+    * Paper “Effectively Exploiting Node-Local Storage For Data-Intensive Scientific Workflows” submitted to SC'25.
+    * Implemented checkpointing and replication strategies in TaskVine; both significantly improve workflow performance on unreliable clusters.
+    * Resolved fundamental issues and inefficiencies in TaskVine; the scheduler now handles very large workflows efficiently. Most recently, we completed an 8-million-task workflow in 20 hours.
+    * Developing a web-based visualization tool for TaskVine logs, optimized for fast log parsing, CSV generation, and display of key statistics. Available on [GitHub](https://github.com/cooperative-computing-lab/taskvine-report-tool).
+* Next steps
+    * Work with team members to improve scheduling efficiency by better handling pending and ready tasks, an issue that has caused severe slowdowns on unreliable clusters and has remained unresolved for over half a year.
+    * Finalize recent fixes and improvements in TaskVine and ensure a stable Conda release by the end of June that users are happy to adopt.
+    * Study the implications and challenges of scheduling massive workflows with millions of tasks.
+
 ---
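The task-splitting trade-off described in the status report (smaller tasks lose less work when a worker is evicted, but every extra task adds scheduling latency) can be sketched numerically. The following is a minimal model, assuming memoryless worker evictions and restart-from-scratch on failure; all names and parameters are illustrative and are not TaskVine's API:

```python
import math

def expected_cost(total_work, n_tasks, eviction_rate, sched_overhead):
    """Expected sequential cost of `total_work` units of work split into
    `n_tasks` equal tasks, on a worker evicted at rate `eviction_rate`
    (exponential inter-failure times), restarting a task from scratch on
    each eviction.  `sched_overhead` is the fixed cost of scheduling one
    task.  This is an illustrative model, not TaskVine's scheduler."""
    s = total_work / n_tasks
    # Classic restart-on-failure result for exponential failures:
    # E[time to finish one task of length s] = (e^{lambda*s} - 1) / lambda
    per_task = (math.exp(eviction_rate * s) - 1) / eviction_rate
    return n_tasks * (sched_overhead + per_task)

def best_split(total_work, eviction_rate, sched_overhead, max_tasks=500):
    """Brute-force the task count minimizing expected cost."""
    return min(range(1, max_tasks + 1),
               key=lambda n: expected_cost(total_work, n,
                                           eviction_rate, sched_overhead))
```

With negligible eviction rates the optimum is one big task (no scheduling overhead wasted); as evictions become frequent relative to task length, the optimum shifts toward many small tasks, which is exactly the balance the DV5 splitting algorithm has to strike.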

0 commit comments
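The checkpointing work above reduces to a per-file cost/benefit decision: copy a remote temp file to stable storage when the expected loss from regenerating it outweighs the copy cost. A hypothetical greedy policy in that spirit (the field names and the budget mechanism are assumptions, not TaskVine's actual strategy):

```python
from dataclasses import dataclass

@dataclass
class TempFile:
    name: str
    recompute_cost: float   # seconds to regenerate the file if lost
    checkpoint_cost: float  # seconds to copy it to stable storage
    loss_prob: float        # estimated chance its hosting worker is evicted

def worth_checkpointing(f: TempFile) -> bool:
    """Checkpoint when expected recomputation loss exceeds the copy cost."""
    return f.loss_prob * f.recompute_cost > f.checkpoint_cost

def pick_checkpoints(files, budget):
    """Greedy: checkpoint the highest expected-savings files that fit
    within a transfer-time budget (illustrative policy only)."""
    chosen, spent = [], 0.0
    ranked = sorted(files,
                    key=lambda f: f.loss_prob * f.recompute_cost - f.checkpoint_cost,
                    reverse=True)
    for f in ranked:
        if worth_checkpointing(f) and spent + f.checkpoint_cost <= budget:
            chosen.append(f.name)
            spent += f.checkpoint_cost
    return chosen
```

Under this kind of rule, cheap-to-recompute files are never copied, while expensive intermediates on unreliable workers are checkpointed first, which is the "on time, before the file is lost" behavior the report aims for.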

Comments
 (0)
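The pending-vs-ready scheduling issue in the next steps can be illustrated with a standard ready-queue (Kahn-style) dispatcher: tasks wait in a pending set until their unfinished-dependency count reaches zero, so the dispatcher only ever touches the ready queue instead of rescanning every waiting task. A self-contained sketch under those assumptions (not TaskVine's scheduler; `deps` maps each task to its prerequisite tasks):

```python
from collections import deque

def dispatch_order(deps):
    """Return a dispatch order for `deps` (task -> set of prerequisites).
    Pending tasks carry a count of unfinished prerequisites; a task moves
    to the ready deque exactly once, when that count hits zero, so each
    dependency edge is processed once instead of rescanning all tasks."""
    remaining = {t: len(d) for t, d in deps.items()}
    dependents = {}
    for t, d in deps.items():
        for p in d:
            dependents.setdefault(p, []).append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()           # dispatch a ready task
        order.append(t)
        for child in dependents.get(t, []):
            remaining[child] -= 1     # one prerequisite finished
            if remaining[child] == 0:
                ready.append(child)   # pending -> ready, exactly once
    return order
```

The point of the structure is complexity: with millions of tasks, an O(tasks) rescan per dispatch becomes quadratic overall, while the counter-plus-queue approach stays linear in tasks and edges.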