| Interface | Description |
|---|---|
| BinarySparseDataset | Binary sparse dataset. |
| DataFrame | An immutable collection of data organized into named columns. |
| Dataset<T> | An immutable collection of data objects. |
| Instance<T> | An immutable instance. |
| SparseDataset | List of Lists sparse matrix format. |
| Tuple | A tuple is an immutable finite ordered list (sequence) of elements. |

| Class | Description |
|---|---|
| AbstractTuple | Abstract tuple base class. |
| IndexDataFrame | A data frame with a new index instead of the default [0, n) row index. |

| Enum | Description |
|---|---|
| CategoricalEncoder | Categorical variable encoder. |

The cost of having many objects is that each object in a JVM must carry some metadata, such as the java.lang.Class value that represents the type of the object or, for an array object, the length of the array. The most common approach is to place this metadata at the start of the object, creating an object header.
For a large or complex object, the size of the header is relatively insignificant. For a small object, however, the size of the header can become significant. For byte[1], 64 bits of metadata are often required for a single 8-bit value. Additionally, the JVM is likely to add at least 3 bytes of padding to ensure that the subsequent object in the heap starts on an aligned address. The total extra memory requirement for 8 bits of data is therefore 88 bits. Every object has a similar associated overhead, so the more objects you have, the greater the effect on system resources.
The structure of Java arrays can exaggerate this overhead. Consider an array of Complex objects. Each instance of the Complex class has two double values, of 64 bits each, plus the object header. Assuming that the header is just the class reference, and occupies only 32 bits, each Complex instance is 16 bytes of data and 4 bytes of extra overhead. An array of 10 Complex objects consists of the header (class + length = 8 bytes), plus 10 object references (assuming 4 bytes each = 40 bytes). If each element of the array contains a unique Complex object, the total is 160 bytes of data, but 88 bytes of additional overhead: 8 bytes for the array header, 40 bytes for the references, and 40 bytes for the ten object headers.
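The arithmetic above can be reproduced with a small sketch. The class name `OverheadEstimate` and its constants are hypothetical; the sizes are the simplified assumptions from the text (4-byte object header, 8-byte array header, 4-byte references), not values measured on any particular JVM.

```java
// Back-of-the-envelope estimate for an array of n Complex objects,
// using the simplified sizes assumed in the text (not measured values).
final class OverheadEstimate {
    static final int OBJECT_HEADER_BYTES = 4;     // class reference only (assumed)
    static final int ARRAY_HEADER_BYTES = 8;      // class + length (assumed)
    static final int REFERENCE_BYTES = 4;         // one array slot (assumed)
    static final int DOUBLE_BYTES = Double.BYTES; // 8 bytes per double

    // Payload: two doubles (real and imaginary part) per element.
    static int dataBytes(int n) {
        return n * 2 * DOUBLE_BYTES;
    }

    // Overhead: one array header, plus one reference and one object header per element.
    static int overheadBytes(int n) {
        return ARRAY_HEADER_BYTES + n * REFERENCE_BYTES + n * OBJECT_HEADER_BYTES;
    }

    public static void main(String[] args) {
        int n = 10;
        System.out.println("data bytes:     " + dataBytes(n));
        System.out.println("overhead bytes: " + overheadBytes(n));
    }
}
```

Under these assumptions the overhead grows by 8 bytes for every additional element, so it stays above half the size of the payload no matter how large the array becomes.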
The data locality of a tree of objects also has a large impact on computational efficiency. Modern hardware relies heavily on caching and prefetching to provide efficient access. Caching exploits the observation that memory that was recently accessed is likely to be accessed again soon, so keeping the most recently accessed data in very fast memory usually results in the best performance. Data is cached in small blocks, which are known as cache lines, to exploit another observation: data that is stored in sequence is often accessed in sequence. Code that accesses array[i] often proceeds to access array[i+1].
When a data structure is composed of many different objects, an operation on the information might need to access several objects to locate the actual data. However, a tree of related objects cannot be guaranteed to be close enough in memory to appear in the same block of cached memory. Some JVM configurations attempt to keep related objects close to each other in memory, but this is not always possible. Even when the JVM can place objects next to each other, the space that is required by each object header lies between the objects, which can reduce the benefit.
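To make the contrast concrete, the following sketch stores the same complex numbers in two ways: as an array of references to separate objects, and as one flat double[] with the real and imaginary parts interleaved. The names `ComplexLayouts`, `Complex`, `flatten`, `sumRealObjects`, and `sumRealFlat` are hypothetical and chosen only for illustration. This is not a benchmark and makes no claim about actual timings; it only shows that the flat layout reaches every value by stepping through a single array.

```java
// Two layouts for the same data: an array of objects versus one flat primitive array.
final class ComplexLayouts {
    // Hypothetical value class: each instance is a separate heap object with its own header.
    static final class Complex {
        final double re;
        final double im;

        Complex(double re, double im) {
            this.re = re;
            this.im = im;
        }
    }

    // Object layout: every access follows a reference to wherever the JVM placed the object.
    static double sumRealObjects(Complex[] values) {
        double sum = 0.0;
        for (Complex c : values) {
            sum += c.re;
        }
        return sum;
    }

    // Copies the values into one array: real part at index 2*i, imaginary part at 2*i + 1.
    static double[] flatten(Complex[] values) {
        double[] flat = new double[2 * values.length];
        for (int i = 0; i < values.length; i++) {
            flat[2 * i] = values[i].re;
            flat[2 * i + 1] = values[i].im;
        }
        return flat;
    }

    // Flat layout: the loop walks one array in index order, with no references to follow.
    static double sumRealFlat(double[] interleaved) {
        double sum = 0.0;
        for (int i = 0; i < interleaved.length; i += 2) {
            sum += interleaved[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        Complex[] values = {
            new Complex(1.0, 2.0), new Complex(3.0, 4.0), new Complex(5.0, 6.0)
        };
        System.out.println(sumRealObjects(values));
        System.out.println(sumRealFlat(flatten(values)));
    }
}
```

Both methods compute the same sum; the difference lies only in how the data is laid out in memory and therefore in how many separate objects must be visited to reach it.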