Understanding the Vector API in Java 21


This is our second post on what's new in Java 21 and how to make the best use of it.

We discuss the Vector API in detail here because it is now in its sixth incubation round in Java 21, is fairly mature, and is relevant to modern-day computations.

The Vector API adds the following to Java:

Expresses vector computations easily within Java code that compiles to efficient vector instructions on supported CPUs
Offers a well-tested API to clearly and concisely express a wide range of vector computations, consisting of sequences of vector operations composed within loops and managed with control flow
Can achieve far better performance than equivalent scalar computations
Is CPU-architecture agnostic, enabling implementations on multiple architectures that support vector instructions
Offers reliable runtime compilation and performance on x64 as well as AArch64 architectures

It’s in incubation mainly because it’s waiting on Project Valhalla to provide value objects (i.e. class instances that lack object identity and are used only for the data values they hold); otherwise, it's almost fit to go as a preview feature. The HotSpot JVM does some vectorization on its own, but it does not vectorize every calculation that could be vectorized. Numerous domains can benefit from this explicit Vector API, including machine learning, linear algebra, cryptography, finance, and code within the JDK itself.

Let's start with a short introduction to vector computations and how they differ from equivalent scalar computations.

A scalar computation reads individual pieces of data from memory one at a time, applies an operation to them, and stores the result in another memory location.

A vector computation can read several pieces of data at once and apply an operation to all of them within a single processor instruction. E.g.

Vectors A & B and the Vector computation result

A single vector instruction would act on all 32 numbers in VectorA and VectorB above (16 in each), performing 16 additions in a single vector operation and storing the results in the ResultA+B vector, instead of running 16 iterations of a loop.

In terms of processors, a scalar processor acts on a single piece of data at a time, whereas a vector processor acts on several pieces of data (arrays and vectors) with a single instruction. A superscalar processor issues several instructions at a time, each of which operates on one piece of data. E.g. the MIPS pipelined processor is a scalar processor. Vector processors were popular for supercomputers in the 1980s and 1990s because they efficiently handled the long vectors of data common in scientific computations. Modern high-performance microprocessors are superscalar because issuing several independent instructions is more flexible than processing vectors.

However, modern processors also include hardware to handle short vectors of data that are common in multimedia and graphics applications. These are called single instruction multiple data (SIMD) units. In a SIMD unit, when a binary operation is applied to two vectors with the same number of lanes (elements), the operation is applied to the two corresponding scalar values in each lane of the two vectors. This parallelism results in more work being performed in each CPU cycle, leading to a significant performance improvement. In contrast, a scalar operation would read each pair of corresponding values one by one (sequentially), apply the operation to them, and repeat the process for all lanes in the two vectors.

The Vector API lets you do exactly that in code: you group scalar values that can be processed together into vectors and apply a single vector operation to the whole group for efficiency. This is also described as creating a multiword object (a vector) and applying a superword (vector) operation to it. Processors include SIMD registers, and these registers are extra wide: a 512-bit SIMD register can hold 16 32-bit int values, 32 16-bit short values, or 8 64-bit double values, and the same operation can be applied to all of them in one go.

Element storage capacity of Vectors determined by dividing Vector size by element size

Here, the vector size divided by the element size determines how many elements can be stored in the vector (its length, or number of lanes). Some architectures, such as Arm SVE, allow vector sizes of up to 2048 bits, which can hold 64 integers of 32 bits each.

SIMD units do not use threads; instead, they use many execution units that all perform the same operation at the same time, in the same CPU cycle, running the same program on different data values.
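The division is easy to check in plain Java. The sketch below is purely illustrative: `LaneCount` and `lanes` are names made up for this post and are not part of the Vector API.

```java
// Illustrative only: LaneCount and lanes() are hypothetical names, not Vector API classes.
final class LaneCount {

    // Number of lanes = vector size in bits / element size in bits.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(lanes(512, Integer.SIZE));  // 32-bit ints in a 512-bit register
        System.out.println(lanes(512, Short.SIZE));    // 16-bit shorts in a 512-bit register
        System.out.println(lanes(512, Double.SIZE));   // 64-bit doubles in a 512-bit register
        System.out.println(lanes(2048, Integer.SIZE)); // 32-bit ints in a 2048-bit vector
    }
}
```

Running it prints 16, 32, 8 and 64, one value per line.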

A Vector processor consists of:

Vector registers - each stores a vector of operands whose contents are operated on simultaneously.
Vectorised and pipelined functional units - apply the same operation to all elements of a single vector OR to each pair of corresponding elements in two vectors.
Vector instructions - these instructions load vector registers and perform arithmetic (integer or floating-point) operations on them.

E.g. a simple for loop like the one below:

 
for (i = 0; i < n; i++) a[i] += b[i];

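needs only one vector load per array, one vector add, and one vector store for each group of elements that fits in a vector register, instead of a load, add, and store for every single element.

When you apply a mathematical operation to each element of a vector, it is also called distributing the operation across the lanes (each element occupies a lane). We will not go into much detail about how specific hardware operates, but vector computations can speed up many computations significantly, at best by a factor close to the number of lanes.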

Let's get back to Java 21 now.

Though the HotSpot JVM supports auto-vectorization, it does so only on a limited basis and in very obvious cases. The Vector API puts that power into programmers' hands, letting them write vector algorithms that fit their specific business needs and are robust and predictable. It gives them the flexibility to apply vector algorithms to complex processes like hashing, encryption, and specialised comparisons.

Java 21 has the jdk.incubator.vector package, which holds the classes relevant to our discussion. Because it lives in an incubator module, it must be enabled with --add-modules jdk.incubator.vector when compiling and running. A vector is represented by the abstract class Vector<E>, where the supported element types (E) are Byte, Short, Integer, Long, Float, and Double, corresponding to the scalar primitive types byte, short, int, long, float, and double, respectively.

Next, there are six abstract subclasses of Vector<E>, one for each supported element type: ByteVector, ShortVector, IntVector, LongVector, FloatVector, and DoubleVector.

To make group operations simpler, the Vector API defines lane-wise methods like add and mul that combine two vectors in a single call, which the JIT compiler can map to a single vector instruction on supported hardware, e.g. vector1.add(vector2)
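As a minimal sketch of these methods, the class below adds and multiplies two small int arrays lane by lane. `AddMulDemo` and its helper methods are hypothetical names for this post; it uses the fixed 128-bit species (4 int lanes) so that the result does not depend on the host CPU, and it assumes each input array has at least 4 elements.

```java
import java.util.Arrays;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector AddMulDemo.java
// AddMulDemo is a hypothetical class name used only for illustration.
final class AddMulDemo {
    // 128 bits / 32-bit int = 4 lanes, regardless of the hardware.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // Lane-wise sum of the first 4 elements of a and b.
    static int[] add(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        return va.add(vb).toArray();
    }

    // Lane-wise product of the first 4 elements of a and b.
    static int[] mul(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        return va.mul(vb).toArray();
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        System.out.println(Arrays.toString(add(a, b))); // [11, 22, 33, 44]
        System.out.println(Arrays.toString(mul(a, b))); // [10, 40, 90, 160]
    }
}
```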

The VectorOperators class defines static constants representing lane-wise operations that can be applied to vectors, such as absolute value (ABS), division (DIV), XOR, AND, greater than (GT), and type conversions.
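A few of these constants in action might look like the sketch below. `OperatorsDemo` and its methods are hypothetical names chosen for illustration, and the example again assumes 4-element int arrays with the 128-bit species.

```java
import java.util.Arrays;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector OperatorsDemo.java
// OperatorsDemo is a hypothetical class name used only for illustration.
final class OperatorsDemo {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

    // Unary operator: absolute value of every lane.
    static int[] abs(int[] a) {
        return IntVector.fromArray(SPECIES, a, 0)
                .lanewise(VectorOperators.ABS)
                .toArray();
    }

    // Binary operator: XOR of corresponding lanes.
    static int[] xor(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        return va.lanewise(VectorOperators.XOR, vb).toArray();
    }

    // Comparison operator: yields one boolean per lane (a mask).
    static boolean[] greaterThan(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        VectorMask<Integer> mask = va.compare(VectorOperators.GT, vb);
        return mask.toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(abs(new int[]{-1, 2, -3, 4})));
        System.out.println(Arrays.toString(xor(new int[]{1, 2, 3, 4}, new int[]{1, 1, 1, 1})));
        System.out.println(Arrays.toString(greaterThan(new int[]{-1, 2, -3, 4}, new int[]{0, 0, 0, 0})));
    }
}
```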

If you want an operation to be performed on only some elements (lanes) of a vector, based on certain conditions, the API provides VectorMask<E>, a mask holding only true/false values.

Generally, it is of the same length as the vector holding the actual data values. For every true lane in the mask, the operation is performed on the corresponding lane of the data vector. For every false lane, the operation is suppressed and the lane keeps its original value; the programmer can use blend() to supply an alternative value instead. There are a few more specialized classes like VectorShuffle<E> and VectorSpecies<E>, but we won't discuss them in depth here as some of our readers might be beginners on this topic. The code samples below illustrate these concepts.
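To make masks concrete, here is a small sketch that adds two vectors only in the lanes where the first one is positive. `MaskDemo` and its method names are hypothetical; the 128-bit species and the 4-element inputs are assumptions made to keep the output predictable.

```java
import java.util.Arrays;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector MaskDemo.java
// MaskDemo is a hypothetical class name used only for illustration.
final class MaskDemo {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

    // Adds b to a only where a > 0; the other lanes keep a's original value.
    static int[] addWherePositive(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        VectorMask<Integer> positive = va.compare(VectorOperators.GT, 0);
        return va.add(vb, positive).toArray();
    }

    // Same condition, but lanes where the mask is false are set to 0 via blend().
    static int[] addWherePositiveElseZero(int[] a, int[] b) {
        IntVector va = IntVector.fromArray(SPECIES, a, 0);
        IntVector vb = IntVector.fromArray(SPECIES, b, 0);
        VectorMask<Integer> positive = va.compare(VectorOperators.GT, 0);
        IntVector zeros = IntVector.zero(SPECIES);
        // blend takes the sum in lanes where the mask is true, and zeros elsewhere.
        return zeros.blend(va.add(vb), positive).toArray();
    }

    public static void main(String[] args) {
        int[] a = {-1, 2, -3, 4};
        int[] b = {10, 10, 10, 10};
        System.out.println(Arrays.toString(addWherePositive(a, b)));         // [-1, 12, -3, 14]
        System.out.println(Arrays.toString(addWherePositiveElseZero(a, b))); // [0, 12, 0, 14]
    }
}
```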

A scalar program to add the integers in two arrays would look like this:

 
int[] scalarA = {1, 2, 3, 4, 5};
int[] scalarB = {6, 7, 8, 9, 10};
int[] scalarResult = new int[5];

// Inefficient loop: one addition per iteration
for (int i = 0; i < scalarResult.length; i++) {
    scalarResult[i] = scalarA[i] + scalarB[i];
}

The for loop in the above program runs 5 times; in every run it reads 2 integers from memory, adds them, and stores the result in a third memory location.

The HotSpot JVM might perform auto-vectorization internally, but we want more control over vectorization, hence we write our own code.

The same program when written using Vector computation will be:

 
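// Requires: import jdk.incubator.vector.IntVector; import jdk.incubator.vector.VectorSpecies;
// and the flag --add-modules jdk.incubator.vector when compiling and running.

// SPECIES_PREFERRED is the IntVector species best suited to the current platform.
// Declare it as a static field of the enclosing class: making it static final
// helps the compiler better optimize the vector computation.
static final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;

// Each array must hold at least species.length() elements,
// otherwise fromArray throws an IndexOutOfBoundsException.
int[] scalarA = ...;
int[] scalarB = ...;
int[] scalarResult = new int[scalarA.length];

var vectorA = IntVector.fromArray(species, scalarA, 0); // 0 is the array index to start from
var vectorB = IntVector.fromArray(species, scalarB, 0);
var interimResult = vectorA.add(vectorB);

interimResult.intoArray(scalarResult, 0);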

With a single vector add (plus two vector loads and one vector store), we have added the first species.length() elements of the two arrays and stored the result in scalarResult, without the need to loop through them one by one.

Now, an obvious question would be: what if the arrays have 25 elements and our SIMD unit can hold only 8 integers because it is only 256 bits wide?

Fortunately, the Vector API gives us a length() method to find out the capacity limit and a loopBound() method to work out how much of the data can be processed in full-sized vectors, so that we can design our data and code accordingly.

i.e. the VectorSpecies<E> interface has a length() method that returns the number of lanes in a vector of this species on your platform (your SIMD unit).

It also has a loopBound(int length) method, which returns the largest multiple of VLENGTH (the species' lane count) that is less than or equal to the given length value.

E.g. if your SIMD unit holds only 8 integers and your array length is 25, loopBound(25) will return 24, i.e. the first 24 elements can be processed as 3 vectors of 8 lanes each, so the vector addition takes 3 iterations of the for loop (you decide the best way to handle the remaining 25th element). The program below processes 24 integers via vector addition and the remaining element separately.
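Both methods can be tried out directly. The sketch below pins the species to 256 bits instead of using SPECIES_PREFERRED, purely so that the numbers are the same on every machine; `LoopBoundDemo` and its helper methods are hypothetical names.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector LoopBoundDemo.java
// LoopBoundDemo is a hypothetical class name used only for illustration.
final class LoopBoundDemo {
    // Fixed 256-bit species: 256 / 32 = 8 int lanes on any platform.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    // Number of int lanes in one vector of this species.
    static int lanes() {
        return SPECIES.length();
    }

    // Largest multiple of lanes() that is <= arrayLength.
    static int bound(int arrayLength) {
        return SPECIES.loopBound(arrayLength);
    }

    public static void main(String[] args) {
        System.out.println(lanes());   // 8
        System.out.println(bound(25)); // 24: three full vectors, one element left over
        System.out.println(bound(7));  // 0: not even one full vector
    }
}
```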

 
static final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;

int i = 0; // declared outside the loop so the tail loop below can continue from it
int upperBound = species.loopBound(scalarA.length);
for (; i < upperBound; i += species.length()) {
  var vectorA = IntVector.fromArray(species, scalarA, i); // i is the next index to start from, 0 initially
  var vectorB = IntVector.fromArray(species, scalarB, i);
  var interimResult = vectorA.add(vectorB);

  interimResult.intoArray(scalarResult, i); // store at the same index the data was loaded from
}
// Now for the remaining data (or 25th element in our example above)
for (; i < scalarA.length; i++) {
  scalarResult[i] = scalarA[i] + scalarB[i];
}

In the real world though, it's better to either use padding or arrange/align your data to fit into the size allowed by your SIMD and rearrange the computation results later.
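Another option the API offers for the leftover elements is a mask: VectorSpecies has an indexInRange(offset, limit) method that returns a mask which is true only for lanes that still fall inside the array. The sketch below uses it in every iteration, so no separate scalar loop is needed. `MaskedTailDemo` is a hypothetical name, the code assumes b is at least as long as a, and whether this beats a scalar tail loop depends on the hardware, so it is worth measuring.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector MaskedTailDemo.java
// MaskedTailDemo is a hypothetical class name used only for illustration.
final class MaskedTailDemo {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Element-wise sum of a and b; assumes b.length >= a.length.
    static int[] add(int[] a, int[] b) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            // True for each lane whose index (i + lane) is still below a.length.
            VectorMask<Integer> inRange = SPECIES.indexInRange(i, a.length);
            // Masked loads read only the in-range lanes; the rest are set to zero.
            IntVector va = IntVector.fromArray(SPECIES, a, i, inRange);
            IntVector vb = IntVector.fromArray(SPECIES, b, i, inRange);
            // Masked store writes only the in-range lanes.
            va.add(vb).intoArray(result, i, inRange);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = new int[25];
        int[] b = new int[25];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        int[] sum = add(a, b);
        System.out.println(sum[24]); // 72, the 25th element handled by the mask
    }
}
```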

Today, as data volumes keep increasing, vectorization can speed up computations many times over. Even simple operations like string conversions or array comparisons can benefit from it.

For more fine-grained information on the performance gains for various operations and data types, you can check the OpenJDK benchmarks.

Pratik Dwivedi

November 16, 2023