JNI to Java with project panama

Sorry about the formatting.

This code is present in zcl on the foreign-abi branch.

- ---------------------------------------------------------------------- -

		   JNI to Java with project panama

			   Michael  Zucchi
			   notzed at gmail

- ---------------------------------------------------------------------- -

This is a summary of the work of converting the zcl binding for OpenCL
from JNI to jdk.incubator.foreign aka 'project panama'.


Background.
===========

zcl is a library for calling OpenCL from Java that maps the OpenCL 2.1
api to a set of Java classes.  The OpenCL api is highly object
oriented so it maps consistently and logically this way.  In
addition it supports garbage collection for reclaiming all allocated
resources, or manual reclamation should it be desired.

It was partially an exploratory work testing ideas on API design and
comparing auto-generated with hand-rolled bindings that leveraged the
C compilation environment for code reuse.

The garbage collection allows for some interesting use of lambda
functions, for example functions which retain references to
working buffers to avoid the overhead of dynamic allocation.


Issues.
=======

Problems encountered.


-- memory segments --

Java-allocated MemorySegments have some restrictions on use (but only
from Java).  The most severe of these is that such segments must be
closed on the same thread on which they were created.  This means they
cannot be garbage collected using a reference queue via a cleaner
thread.
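
For illustration, a minimal sketch of the restriction - assuming the
incubator API of the time (MemorySegment.allocateNative() and
close()); the exact behaviour and exception may differ between builds:

import jdk.incubator.foreign.MemorySegment;

public class ConfinedDemo {
    public static void main(String[] args) throws Exception {
        MemorySegment seg = MemorySegment.allocateNative(1024);

        // A cleaner thread draining a reference queue would have to do
        // this, but a confined segment rejects access from any thread
        // other than its owner, so this close() fails at run time.
        Thread cleaner = new Thread(() -> seg.close());
        cleaner.start();
        cleaner.join();

        // Only the creating thread may close it.
        seg.close();
    }
}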

OpenCL has a function which allows an application-allocated memory
block to be the backing for a buffer or image - using
CL_MEM_USE_HOST_PTR.  This must remain allocated beyond the lifetime
of the creation function - for the lifetime of the buffer or image.
zcl/jni simply retains a reference to the ByteBuffer used at buffer
creation time and this is automatically reclaimed by the system.

One possible workaround for this is to use malloc() and free() to
create and manage the memory explicitly which would allow it to be
freed anywhere.  But there's no public way to tell who allocated the
segment so in the context of a function which accepts a MemorySegment
this is either a potential crash or requires additional types to
manage.  The latter is not ideal since it is just recreating
MemorySegment functionality.

zcl/panama punts this whole problem to the application with a new API
requirement (and not just a recommendation) that CLMemory types
created in this way be explicitly closed.  This is doable as although
it supports GC for all objects they can also be explicitly released at
any time.  Forgetting to call release() (or calling it on the wrong
thread) will at worst result in a memory leak.
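
In use it ends up looking something like this - the zcl calls shown
here (a createBuffer() that takes flags and a MemorySegment, release(),
and the ctx object) are paraphrased, not exact signatures:

 MemorySegment host = MemorySegment.allocateNative(size);
 CLMemory buf = ctx.createBuffer(CL_MEM_USE_HOST_PTR, host);
 try {
     // ... use buf; 'host' must stay alive for its whole lifetime ...
 } finally {
     // Required by the new API rule: release on this thread so the
     // backing segment can be closed.  Forgetting it only leaks.
     buf.release();
 }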

OpenCL accepts some other application-created data whose lifetime
matches an object: clSetEventCallback(),
clSetMemObjectDestructorCallback(), and the pfn_notify parameter to
clCreateContext().  A native up-call stub must be created to be passed
to these functions.  Fortunately these up-call stubs have special rules
which allow them to be freed from any thread; thus they can be easily
cleaned up by GC or safely changed.  This is important for
clSetEventCallback() as a typical use-case is a hidden event -
possibly created in a lambda function - that has no direct way of
being undone.

Static constants are another area where the MemorySegment API is
currently fiddly.  To be able to use them from other threads one
must call acquire() every time in every accessor and every time they
are passed to a MethodHandle.  This is special-case code that is only
required due to the current design of MemorySegment and only affects
access from Java.

This occurs in zcl/panama with CLImageFormat.  The solution is the
same one used in zcl/jni - instead of being structures they are simple
POJOs and they are marshalled every time they are used or retrieved.
This is acceptable for OpenCL as it only has two small structures in
the API.
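
A sketch of that POJO approach for cl_image_format (two cl_uint
fields); the class shape and marshalling helpers here are
illustrative, not the actual zcl code:

import java.nio.ByteOrder;
import jdk.incubator.foreign.MemorySegment;

public class CLImageFormat {
    public final int channelOrder;       // cl_channel_order (cl_uint)
    public final int channelDataType;    // cl_channel_type (cl_uint)

    public CLImageFormat(int channelOrder, int channelDataType) {
        this.channelOrder = channelOrder;
        this.channelDataType = channelDataType;
    }

    // Marshalled into an 8-byte native block each time it is passed in.
    void toNative(MemorySegment seg) {
        seg.asByteBuffer().order(ByteOrder.nativeOrder())
            .putInt(0, channelOrder)
            .putInt(4, channelDataType);
    }

    // And unmarshalled each time a format is queried back.
    static CLImageFormat fromNative(MemorySegment seg) {
        var bb = seg.asByteBuffer().order(ByteOrder.nativeOrder());
        return new CLImageFormat(bb.getInt(0), bb.getInt(4));
    }
}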


-- direct byte buffers --

zcl/jni uses direct byte buffers for some apis.

One particular function of import is enqueueMapBuffer.  JNI can
trivially wrap a native memory block in a ByteBuffer and trivially get
the same address back later when it needs it.

To retain the same interface requires auxiliary data structures to
track the relationship between a ByteBuffer and a MemoryAddress.  Hash
tables cannot be used directly because ByteBuffer hashes on its state
(position, limit, content) rather than its identity.
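
An identity map does work, since it keys on object identity rather
than the buffer's value - a sketch, with the surrounding map/unmap
plumbing omitted:

import java.nio.ByteBuffer;
import java.util.IdentityHashMap;
import java.util.Map;
import jdk.incubator.foreign.MemoryAddress;

class MappedBuffers {
    // A plain HashMap would break: ByteBuffer.hashCode()/equals() change
    // with position, limit and contents.  IdentityHashMap does not.
    private final Map<ByteBuffer, MemoryAddress> mapped = new IdentityHashMap<>();

    synchronized void add(ByteBuffer bb, MemoryAddress addr) {
        mapped.put(bb, addr);
    }

    // Called at unmap time to recover the address originally returned
    // for this exact ByteBuffer instance.
    synchronized MemoryAddress remove(ByteBuffer bb) {
        return mapped.remove(bb);
    }
}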

Also of note is that even implementing the function requires the use
of the "scary" ForeignUnsafe.

Really the zcl API should probably be changed; enqueueMapImage in
particular is quite shit.


-- direct array access --

zcl/jni uses two approaches for accessing Java arrays of primitives.

For short arrays it just copies the data using the ArrayRegion
functions.

For potentially large ones it uses GetPrimitiveArrayCritical to allow
direct access to the memory.  These functions are only called in
non-wait contexts so seem to work OK in practice.  As a big bonus the
function works for any primitive type so the C code can share almost
all of the implementation.

zcl/panama also approaches this in two ways.

For short arrays they are just marshalled to/from java types and
stack-allocated memory.

For large arrays there is no equivalent of
GetPrimitiveArrayCritical, so all transfers must also be marshalled via
holding buffers.  A further complication is that all array access in
Java is type-specific, and to handle that one either has to cut and
paste or use MethodHandles (which are completely unchecked until run
time) and/or lambda functions to reduce code duplication.
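
A rough sketch of the lambda route (names are illustrative, and the
real code also has to deal with offsets, lengths and the enqueue
itself):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jdk.incubator.foreign.MemorySegment;

class ArrayMarshal {
    // One tiny copier per primitive type...
    interface Copier<A> {
        void copy(A array, ByteBuffer dst);
    }

    static final Copier<int[]>   INTS   = (a, bb) -> bb.asIntBuffer().put(a);
    static final Copier<float[]> FLOATS = (a, bb) -> bb.asFloatBuffer().put(a);
    static final Copier<long[]>  LONGS  = (a, bb) -> bb.asLongBuffer().put(a);

    // ...so the marshalling itself is written only once.
    static <A> MemorySegment toNative(A array, long bytes, Copier<A> copier) {
        MemorySegment seg = MemorySegment.allocateNative(bytes);
        copier.copy(array, seg.asByteBuffer().order(ByteOrder.nativeOrder()));
        return seg;
    }
}

So an int[] transfer is ArrayMarshal.toNative(data, 4L * data.length,
ArrayMarshal.INTS), and the float and long cases reuse the same body.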

This last issue already causes a lot of bulk in zcl for the sake of
convenience, and with the added performance hit it's hard to say if
it's even worth providing such interfaces.  In zcl/jni it was worth it
because it was faster and easier to use than going via a ByteBuffer
when synchronous transfers were adequate.


-- bulk --

C is just a more concise and flexible language than Java, and the
macro processor allows for all sorts of things that can't be done
without one.  It's easier to make polymorphic functions that work with
primitive types, which saves a lot of repetitive typing.  The JNI
interface presents some functions that work regardless of the Java
type - GetPrimitiveArrayCritical for example - which require bulky
type-specific repetition to recreate in Java.

The java code is thus bigger and less elegant.  It's just not as fun
to write.  Fighting with generics templates suxors.

In a partial state of completion the binding is about 50% larger in
terms of total class size compared to the jni classes and compiled
libzcl.so.


Notes.
======

Interesting details.


-- gc --

Garbage collection is supported using the same mechanism as in
zcl/jni.  Namely the Java objects (CLObject etc) are registered to a
reference queue which is tracked based on the pointer (MemoryAddress).
When an object is no longer reachable then a class-specific static
method release(MemoryAddress) is invoked to free the resource.  In
addition, all objects are 'uniquified', so that the original object is
always retrieved from its pointer by any of the query functions, which
allows shared state to operate.  Any of the functions which set
callbacks retain a reference to the callback so it doesn't go away,
but it is also automatically reclaimed.

In addition to automatic gc, applications can directly release any
resource - but must take responsibility for ensuring they are not
further accessed.
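
The shape of that mechanism, as a sketch - the class names and the
release hook are paraphrased from the description above, not the
actual zcl code:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import jdk.incubator.foreign.MemoryAddress;

class Resources {
    static final class Ref extends WeakReference<Object> {
        final MemoryAddress addr;
        final Consumer<MemoryAddress> release;  // class-specific release(MemoryAddress)

        Ref(Object o, MemoryAddress addr, Consumer<MemoryAddress> release,
                ReferenceQueue<Object> q) {
            super(o, q);
            this.addr = addr;
            this.release = release;
        }
    }

    private static final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Uniquification table: one Java wrapper per native pointer, so
    // queries always hand back the original object (this assumes
    // MemoryAddress is usable as a map key).
    private static final Map<MemoryAddress, Ref> known = new HashMap<>();

    static synchronized void register(Object o, MemoryAddress addr,
            Consumer<MemoryAddress> release) {
        known.put(addr, new Ref(o, addr, release, queue));
    }

    // Cleaner thread: once a wrapper is unreachable, free the native side.
    static void cleanLoop() throws InterruptedException {
        for (;;) {
            Ref r = (Ref) queue.remove();
            synchronized (Resources.class) {
                known.remove(r.addr);
            }
            r.release.accept(r.addr);   // e.g. bound to clReleaseMemObject
        }
    }
}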


-- down-calls --

The code generator creates code to instantiate the method handles for
the relevant native calls, and creates stub functions which will
invoke the method handle with the correct argument types.

Currently the static methods created take the raw foreign API types
such as MemoryAddress.  This means each high level entry point must
resolve these addresses and ensure the types are correct.

Perhaps this should change so that at least the main library types are
resolved directly, as the current arrangement is akin to every
non-primitive argument being void *.  I did have a Pointer<>
abstraction to allow some compile-time type checking, but it just made
a lot of the code messier and somewhat slower, so I removed it and
moved all functions to the raw types.  I was hoping the function calls
could just be exposed as-is but there is too much call-specific
scaffolding required for it to be automated.

As MethodHandle.invoke() throws Throwable, this is a pita to deal
with at every invocation because you have to bounce any internal
exceptions in ways that make sense.  But I left it exposed from the
static bindings because an exception frame is almost always needed
anyway.
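
The general shape is something like this sketch - the handle setup is
omitted, the names are illustrative and zcl uses its own exception
types rather than bare RuntimeException:

import java.lang.invoke.MethodHandle;
import jdk.incubator.foreign.MemoryAddress;

class CLLib {
    // Resolved at start-up by the generated binding code (omitted here).
    static MethodHandle clRetainMemObject_MH;

    // Generated static stub: raw foreign types in and out, Throwable exposed.
    static int clRetainMemObject(MemoryAddress mem) throws Throwable {
        return (int) clRetainMemObject_MH.invokeExact(mem);
    }

    // Hand-written high-level entry: in zcl this takes the CLMemory
    // object, resolves its MemoryAddress, and bounces the Throwable into
    // something the application can reasonably catch.
    static void retain(MemoryAddress mem) {
        try {
            int res = clRetainMemObject(mem);
            if (res != 0)
                throw new RuntimeException("clRetainMemObject = " + res);
        } catch (RuntimeException e) {
            throw e;
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}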


-- up-calls --

Native callbacks (up-calls) are resolved by raw-type arguments to a
common interface, for example:

 void (*func)(cl_mem mem, void *data);
 void (*func)(cl_program prog, void *data);

are both mapped to:

public interface Call_pLpv_v {
    public void fn(MemoryAddress arg_0, MemoryAddress arg_1)
        throws Throwable;
}

The convention isn't great but it's based on JNI types:

 pL = pointer to object
 pv = pointer to void
 _v returning void

Each callback has a stub() method which will convert an instance of
the matching type into an up-call stub.

In addition to the low-level interface there is a high-level one
which is application visible.  For example, for the one above it is:

public interface CLNotify<T> {
    public void notify(T source);
}

To simplify application use, this Java-friendly version of the
interface provides a static call() method which will create a
Callback<> object that bounces from the raw native interface to the
Java-friendly one.  The Callback is registered with the reference
queue and can be automatically (or manually) reclaimed.  As a
convenience call() will also take a null argument and return a
Callback which resolves to MemoryAddress.NULL but otherwise does
nothing, avoiding the need for special-case code at the call sites
since these callbacks are always optional.

Thus an application can simply call:

 CLMemory.setMemObjectDestructorCallback((CLMemory m)->do_stuff(m));

setMemObjectDestructorCallback() will simply call:

 Callback cb = CLNotify.call(notify, CLMemory::create);

And then pass it to C using:

 clSetMemObjectDestructorCallback( .., cb.addr(), MemoryAddress.NULL);

The native pointer is thus linked to the application lambda through a
single mapping interface, and retaining the reference keeps it alive
as long as necessary.
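
Put together, the bridge amounts to roughly the following, using the
Call_pLpv_v interface shown earlier.  The Callback stand-in and the
stub handling are simplified guesses at the real zcl plumbing:

import java.util.function.Function;
import jdk.incubator.foreign.MemoryAddress;

// Minimal stand-in: the real Callback also holds the native up-call
// stub and is registered with the reference queue for reclamation.
class Callback<T> {
    final T target;
    final MemoryAddress addr;

    Callback(T target, MemoryAddress addr) {
        this.target = target;
        this.addr = addr;
    }

    MemoryAddress addr() {
        return addr;
    }
}

public interface CLNotify<T> {
    public void notify(T source);

    static <T> Callback<Call_pLpv_v> call(CLNotify<T> notify,
            Function<MemoryAddress, T> create) {
        // Null callbacks resolve to MemoryAddress.NULL so call sites
        // need no special casing for these always-optional arguments.
        if (notify == null)
            return new Callback<>(null, MemoryAddress.NULL);
        // Bounce from the raw native signature to the Java-friendly one.
        Call_pLpv_v raw = (obj, data) -> notify.notify(create.apply(obj));
        MemoryAddress stub = MemoryAddress.NULL;  // really: the up-call stub made from 'raw'
        return new Callback<>(raw, stub);
    }
}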

An alternative I explored was to create a higher-level interface that
contained the types.  It was just too ugly to expose to the
application, and required more complication to make use of when
mapping to application interfaces.  For example CLNotify would need a
special case for each possible type.


-- 'stack' allocator --

Most calls involve creating temporary work memory in which to marshal
arguments or store results - in C this would just be a stack variable
or a call to alloca().

Allocating many MemorySegment values and tracking them for accurate
disposal is somewhat inefficient and quickly becomes intractable when
you start nesting multiple dependent allocations.

To simplify the common case of marshalling call arguments there is a
stack allocator in API.Memory which can cheaply allocate blocks
and free them all at once.

There are many functions in api.Native and CLObject which take an
allocator, and this stack allocator is just one implementation.
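
A sketch of the idea behind it, as a simple bump allocator over one
native segment (assuming MemorySegment.allocateNative() and asSlice();
the real api.Memory interface differs):

import jdk.incubator.foreign.MemorySegment;

// Bump-allocates temporary call arguments out of one backing segment
// and frees a whole call's worth of blocks in one go - much like a C
// stack frame or alloca().
class CallStack {
    private final MemorySegment base = MemorySegment.allocateNative(64 * 1024);
    private long top;

    MemorySegment alloc(long size) {
        long start = (top + 7) & ~7L;   // keep 8-byte alignment
        top = start + size;
        return base.asSlice(start, size);
    }

    // Free everything allocated for the current call at once.
    void reset() {
        top = 0;
    }
}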


-- code generator --

The code generator uses a gcc plugin to access all the C type and
function information by hooking into the precompiled header mechanism.
This is dumped to a perl-syntax file ("perl, son") which can be easily
processed.  Unfortunately this doesn't have access to #defines, which
is how cl.h defines all its constants, so another simple script
extracts those.

The code generator is an attempt to create a reusable component
across multiple projects, but it really isn't going to be able to do
that.  It's probably better to write an API-specific bit of perl to
dump the gcc-generated metadata out to something usable.  It also got
messed up converting from jextract-style annotation interfaces to
foreign-abi low-level interfaces and experimenting with different
approaches.

So for now it's a huge disgusting mess, but zcl already consisted of
hand-written classes and OpenCL is almost entirely a handle-driven API
so it is needed primarily to generate the library methods and callback
templates.  api.Native defines all the complex code which these use.

Library method handles are resolved using a signature that is
compatible with the jextract version I used previously.  All the
metadata required, such as memory layouts, is created on the fly and
discarded immediately - including for up-calls.

OpenCL only requires two structures: cl_image_desc and
cl_image_format.  So these were hard-coded rather than using the code
generator.  The code generator's StudlyCaps rules generate shitty
type names, so those too were all overridden.


API changes.
============

Some things are different in the panama version, perhaps due to its
limitations.  Some are bug fixes.

1. createBuffer() can take a MemorySegment as well as ByteBuffer.

2. CL_MEM_USE_HOST_PTR now uses a MemorySegment to track the
   application data.  As a result:

    - release() must be called explicitly to avoid leaks.
    - release() must be called on the original creator thread.
    - getHostPtr() now creates a ByteBuffer using
      segment.asByteBuffer().order(nativeOrder())

3. All functions that use arrays require internal copying.  As the
   OpenCL functions always copy as well this is a double copy.

4. CLImageDesc uses longs rather than ints for sizes (bugfix).

5. The exceptions are still a bloody mess.

7. Things that take ByteBuffer will behave differently.  They will
   cover [position .. limit) rather than [0 .. capacity).

8. CL_MEM_USE_HOST_PTR state is now handled in CLMemory.  Thus CLImage
   now handles it too.

9. createImage and the various image copy functions don't perform data
   range validation.

A. None of the SVM stuff is implemented.  I've never had a working
   implementation to test it with anyway.

B. None of the extension stuff is implemented.

C. enqueueMap* retains the same api but it has to track all mapped
   memory blocks so the unmap operation can determine the
   MemorySegment that underlies the ByteBuffer exposed by the api.
   The api can probably be changed; enqueueMapImage in particular is
   quite shit.

Contact

notzed on various mail servers, primarily gmail.com.


Copyright (C) 2020 Michael Zucchi, All Rights Reserved.