Commit e0b163b

Author: Jieying Luo

Incorporating the suggestions from comments.

1 parent 3819fa6 · commit e0b163b

1 file changed: +29 −10 lines


rfcs/20230123-pjrt-plugin.md

```diff
@@ -80,6 +80,11 @@ Status LoadPjrtPlugin(string library_path, map<string, string> config_values) {
   if (function_prt != nullptr) {
     PJRT_Api* api = function_prt();
     plugin_name = parse(library_path);
+    if (auto it = global_pjrt_plugin_map.find(plugin_name);
+        it != global_pjrt_plugin_map.end()) {
+      // Allows loading the same plugin multiple times.
+      return;
+    }
     PluginInfo plugin_info(api, config_values);
     global_pjrt_plugin_map[plugin_name] = plugin_info;
   }
```
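The duplicate-load check added above can be sketched as a standalone snippet. This is a hedged illustration rather than the RFC's actual implementation: `RegisterPjrtPlugin`, the `PJRT_Api` stand-in, and the exact map layout are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for the real PJRT_Api struct from pjrt_c_api.h.
struct PJRT_Api {};

struct PluginInfo {
  const PJRT_Api* api = nullptr;
  std::map<std::string, std::string> config_values;
};

// Global registry keyed by plugin name, as in the RFC's LoadPjrtPlugin.
std::map<std::string, PluginInfo> global_pjrt_plugin_map;

// Returns true if the plugin was newly registered; returns false (a no-op,
// not an error) if a plugin with the same name was already loaded.
bool RegisterPjrtPlugin(const std::string& plugin_name, const PJRT_Api* api,
                        std::map<std::string, std::string> config_values) {
  if (auto it = global_pjrt_plugin_map.find(plugin_name);
      it != global_pjrt_plugin_map.end()) {
    return false;  // Loading the same plugin twice keeps the first entry.
  }
  global_pjrt_plugin_map[plugin_name] =
      PluginInfo{api, std::move(config_values)};
  return true;
}
```

Loading the same library path a second time therefore leaves the registry unchanged, which is the behavior the added early `return` encodes.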
```diff
@@ -159,15 +164,14 @@ Open questions:
     client? For example, these two functions in
     [tpu_initializer_helper.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/stream_executor/tpu/tpu_initializer_helper.cc#L269-L270)
     should be run when initializing TPU. Currently they are called every time a
-    PJRT TPU client is created. Shall we add another method InitializePlugin and
-    only run it once? Alternatively, the plugin can implement it in
-    `PJRT_Client_Create` and run it once in the first time a client was created.
-*   Do we want to create PJRT clients for every plugin that is found? Will that
-    involve some initialization for the device which should not run multiple
-    times?
+    PJRT TPU client is created. One solution is to place this initialization in
+    the `GetPjrtApi` method. It will run the first time that a plugin is loaded.
+*   It is not desirable to create a PJRT client for every plugin that is found,
+    as that would increase resource utilization. Each framework decides what
+    behavior it prefers.
 *   We may need to store loaded PluginInfo in a nested map `map<device_type,
-    <plugin_name, PluginInfo>>` if we want to know what plugins are
-    available for a device (e.g. current PJRT GPU and IREE GPU) and only
+    <plugin_name, PluginInfo>>` if the framework wants to know what plugins
+    are available for a device (e.g. current PJRT GPU and IREE GPU) and only
     create one PJRT client per device.

 ### <a name="heading=h.bjuf0soco0sj"></a> Config values for creating PJRT client(s)
```
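One way to realize the `GetPjrtApi`-based one-time initialization suggested above is a function-local static, whose initializer C++ guarantees runs exactly once even under concurrent callers. This is a sketch under assumed names (`InitializeDevice`, the visible counter), not the actual plugin code.

```cpp
#include <cassert>

struct PJRT_Api {};

static PJRT_Api global_api;
int init_count = 0;  // Exposed only so the example can show "runs once".

void InitializeDevice() { ++init_count; }  // e.g. one-time TPU setup

const PJRT_Api* GetPjrtApi() {
  // The lambda runs exactly once, on the first call; later calls just
  // return the cached API table.
  static const bool initialized = [] {
    InitializeDevice();
    return true;
  }();
  (void)initialized;
  return &global_api;
}
```

Repeated calls to `GetPjrtApi` then return the same table without re-running the device setup, which is the property the open question asks for.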
```diff
@@ -207,15 +211,30 @@ it is used, or created in a plugin custom op kernel. Created PJRT clients will
 be saved in the
 [global TensorFlow ResourceManager](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/tfrt/common/pjrt_util.h#L26).

+Whether multiple frameworks can create a client for the same device is up to the
+specific hardware and plugin implementation. If the hardware only allows
+exclusive access, then the software will abide by that constraint. Otherwise,
+some plugins may use the hardware in an exclusive way (i.e. allocate all memory).
+The OpenXLA plugin that we envision now will default to allowing multiple clients
+and supporting on-demand memory allocation (with a possible option for greedier
+access as an optimization for cases that need it).
+
 For more information about PJRT, please consult the PJRT
 [C header](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/pjrt/c/pjrt_c_api.h),
 [C++ header](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/pjrt/pjrt_client.h),
 [plugin mechanism design doc](https://docs.google.com/document/d/1FieBwiflD6DT9Z38wFIrnNxrUto1AOP3UuibQ4vpvEQ/edit)
 and
 [integration guide](https://docs.google.com/document/d/1EL62DkASMFZbk89K6MIPNGUFBTbkC21v5-uL5fB8VsM/edit).

-## Future improvments
-*   Windows support.
+## Open questions and future improvements from the comments
+*   Windows support.
+*   Test suites for PJRT implementations.
+*   How to handle versioning or variants (e.g. x86, x86-64 and ARM CPUs, different
+    CUDA compute capability GPUs, DPUs, etc.).
+*   How does a device broadcast its capabilities or supported API calls?
+*   Support unloading and reloading a PJRT plugin (hot-swap/plug).
+*   Single- vs. multi-process access/ownership (maybe covered by IFRT).
+*   Command queue/buffer management API access, simple commands vs. transactional.

 ## Initial contribution
```
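The per-device client policy discussed in the open questions (at most one PJRT client per device type, reused by later lookups) could be sketched as a cache keyed by device type. `GetOrCreatePjrtClient` and the simplified `PjRtClient` struct are illustrative assumptions, standing in for the TensorFlow ResourceManager plumbing mentioned above.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Simplified stand-in for xla::PjRtClient.
struct PjRtClient {
  std::string device_type;
};

std::map<std::string, std::shared_ptr<PjRtClient>> client_cache;
int create_count = 0;  // For illustration: counts real client creations.

std::shared_ptr<PjRtClient> GetOrCreatePjrtClient(
    const std::string& device_type) {
  if (auto it = client_cache.find(device_type); it != client_cache.end()) {
    return it->second;  // Reuse the client already made for this device.
  }
  ++create_count;
  auto client = std::make_shared<PjRtClient>(PjRtClient{device_type});
  client_cache[device_type] = client;
  return client;
}
```

With this shape, two frameworks (or two lookups by one framework) asking for the same device type share one client, while distinct device types each get their own.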
