
Repository queries taking collection parameters are limited to 32K collection items. (This is caused by the PostgreSQL JDBC driver used by Hibernate having a limit of 32,767 bind variables per statement.)
For example, given a FragmentRepository method:

Code Block
languagejava
    List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths);

This would throw an exception if supplied with a collection of 32,767 xpaths (the anchor ID takes one more bind variable, bringing the total over the 32,767 limit).

The reason is that JPA/Hibernate translates the method call to the following SQL:

Code Block
languagesql
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)

Each collection item (xpath) is supplied as an SQL query parameter. Hibernate with Postgres has a limit of 32,767 query parameters per SQL statement, so an exception will be thrown when there are too many.
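
To illustrate that the limit comes from the driver itself rather than from CPS code, here is a minimal stand-alone JDBC sketch (connection details are placeholder assumptions):

Code Block
languagejava
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;

public class BindVariableLimitDemo {
    public static void main(final String[] args) throws Exception {
        final int parameterCount = 40_000; // deliberately over the 32,767 limit
        final String placeholders = String.join(",", Collections.nCopies(parameterCount, "?"));
        final String sql = "SELECT * FROM fragment WHERE xpath IN (" + placeholders + ")";
        // Connection details below are placeholders; point them at any Postgres instance to reproduce
        try (Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/cpsdb", "cps", "cps");
             PreparedStatement statement = connection.prepareStatement(sql)) {
            for (int parameterIndex = 1; parameterIndex <= parameterCount; parameterIndex++) {
                statement.setString(parameterIndex, "/dummy/xpath-" + parameterIndex);
            }
            // Fails with an exception: the Postgres protocol encodes the bind-variable count in 2 bytes
            statement.executeQuery();
        }
    }
}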

CPS-1574 aims to remove those limits (and to add tests verifying the fix).

Tests

As part of CPS-1574, tests have been added to confirm the presence of the limit:

Code Block
languagegroovy
    def 'Delete datanodes from multiple anchors limit exceeded: 32,766 (~ 2^15) anchors.'() {
        given: 'more than 32,766 anchor names'
            def anchorNames = (0..32_766).collect { "size-of-this-name-does-not-matter-for-limit-" + it }
        when: 'single operation is executed to delete all datanodes in given anchors'
            objectUnderTest.deleteDataNodes(CPS_PERFORMANCE_TEST_DATASPACE, anchorNames, OffsetDateTime.now())
        then: 'a database exception is thrown'
            thrown(DataAccessResourceFailureException.class)
    }

Once the solution is implemented, these tests will be updated to verify that no exception is thrown, even with more items, e.g. 40,000.

Proposal #1: Batching at 32,000

This solution adds a wrapper method that partitions the input collection into lists of 32,000 and calls the original repository method on each batch:

Code Block
languagejava
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true)
List<FragmentEntity> findAllByAnchorIdAndXpathIn(@Param("anchorId") int anchorId, @Param("xpaths") Collection<String> xpaths);

default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) {
    final List<FragmentEntity> results = new ArrayList<>(xpaths.size());
    // Lists.partition is from Guava; each batch yields one SQL query within the bind-variable limit
    for (final List<String> xpathsBatch : Lists.partition(xpaths, 32000)) {
        results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch));
    }
    return results;
}
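
Hypothetical usage (variable names are illustrative): callers can pass collections of any size, as the wrapper splits them into batches of at most 32,000:

Code Block
languagejava
// Illustrative only: 100,000 xpaths result in four SQL queries (3 x 32,000 + 1 x 4,000)
final List<String> xpaths = new ArrayList<>();
for (int i = 0; i < 100_000; i++) {
    xpaths.add("/bookstore/categories[@code='" + i + "']");
}
final List<FragmentEntity> fragments = fragmentRepository.findAllByAnchorIdAndXpathInBatch(anchorId, xpaths);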

This solution still involves having a separate query parameter for each collection item - we are avoiding the limit by simply splitting into multiple SQL queries:

Code Block
languagesql
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?)

Observations

Performance testing is inconclusive (so far)

Testing with a batch size of 1,000 (not 32,000) showed a performance drop of 30% at 20,000 items. More testing is needed to determine whether this also applies with a batch size of 32,000, or whether the result was an outlier.

Advantage: Solution is easy to understand

This code is largely self-documenting, as the 32K limit is clearly seen in the code.

Batching on repository methods versus API methods

This solution should work equally well if the batching were done at the public API methods instead of at each repository method.
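
For illustration, a minimal sketch of the same partitioning applied at the service layer (the class, collaborator, and method signatures are assumptions for illustration, not existing CPS code):

Code Block
languagejava
import com.google.common.collect.Lists;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch only: names and signatures below are assumed for illustration
public class BatchingAtApiLevelExample {

    private final CpsDataPersistenceService cpsDataPersistenceService;

    public BatchingAtApiLevelExample(final CpsDataPersistenceService cpsDataPersistenceService) {
        this.cpsDataPersistenceService = cpsDataPersistenceService;
    }

    public void deleteDataNodes(final String dataspaceName, final Collection<String> anchorNames) {
        // Partition once at the API boundary so every repository call below stays under the 32K limit
        for (final List<String> anchorNamesBatch : Lists.partition(new ArrayList<>(anchorNames), 32000)) {
            cpsDataPersistenceService.deleteDataNodes(dataspaceName, anchorNamesBatch);
        }
    }
}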

Proposal #2: Postgres arrays

This solution uses a third-party library to add Postgres array support to queries.

Adding the dependency to cps-ri/pom.xml:

Code Block
languagexml
<dependency>
    <groupId>io.hypersistence</groupId>
    <artifactId>hypersistence-utils-hibernate-52</artifactId>
</dependency>

This solution involves creating two versions of each repository method taking a collection parameter - one taking a Java array, and one using a Collection:

Code Block
languagejava
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath = ANY(:xpaths)", nativeQuery = true)
List<FragmentEntity> findAllByAnchorIdAndXpathIn(@Param("anchorId") int anchorId, @Param("xpaths") String[] xpaths);

default List<FragmentEntity> findAllByAnchorIdAndXpathIn(final int anchorId, final Collection<String> xpaths) {
    return findAllByAnchorIdAndXpathIn(anchorId, xpaths.toArray(new String[0]));
}

This would result in a single SQL query, given any number of xpaths:

Code Block
languagesql
SELECT * FROM fragment WHERE anchor_id = ? AND xpath = ANY(?)

As seen, only two SQL query parameters will ever be used. This is because the array is binary-encoded and sent as a single parameter.
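
For illustration, a hypothetical caller (variable names assumed) keeps passing an ordinary Collection, and the default method performs the array conversion:

Code Block
languagejava
// Illustrative only: works the same for 3 xpaths or 300,000, as the whole array is one parameter
final Collection<String> xpaths = List.of("/bookstore", "/bookstore/categories", "/inventory");
final List<FragmentEntity> fragments = fragmentRepository.findAllByAnchorIdAndXpathIn(anchorId, xpaths);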

Testing so far has shown no noticeable performance difference compared to the master branch.

Observations

Performance is unchanged

Testing shows there was no increase or decrease in performance compared to the master branch after arrays were used.

Disadvantage: extra dependency

Extra dependencies add potential risks of dependency conflicts, security issues, etc. in the future.

The new dependency is an updated version of an existing dependency (hibernate-types-52); the author renamed the library at some point. More info here: https://vladmihalcea.com/hibernate-types-hypersistence-utils/

Disadvantage: Limited array type support

Postgres array support is limited to primitive data types and String, that is: int[], long[], float[], double[], boolean[], and String[].

For example, given a FragmentRepository method such as:

Code Block
languagejava
void deleteByAnchorIn(Collection<AnchorEntity> anchorEntities);

Changing to an array of AnchorEntity will not work. The following will throw an exception:

Code Block
languagejava
void deleteByAnchorIn(AnchorEntity[] anchorEntities);

Rather, the method would have to be changed to use anchor IDs:

Code Block
languagejava
void deleteByAnchorIdIn(int[] anchorIds);
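
A Collection-friendly bridge could still be offered as a default method; a sketch, assuming AnchorEntity exposes a getId() accessor:

Code Block
languagejava
default void deleteByAnchorIn(final Collection<AnchorEntity> anchorEntities) {
    // Convert the entities to their primitive IDs so the array-based variant can be used
    // (getId() is an assumed accessor name on AnchorEntity)
    deleteByAnchorIdIn(anchorEntities.stream().mapToInt(AnchorEntity::getId).toArray());
}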

Side note: in the above case, if using auto-generated SQL from the method name, the SQL generated using anchor IDs will be simpler than when using AnchorEntity objects: Hibernate will generate a JOIN between FragmentEntity and AnchorEntity when given entities, but will only use FragmentEntity when given IDs. So while the supported types are more restrictive, the result may be more efficient SQL.

Advantage: Code already using Postgres arrays can be simplified

The current FragmentRepository has code already using Postgres arrays to support "LIKE ANY" logic (using LIKE with a collection). However, it currently uses query parameters, and is thus limited to 32K parameters.

Code Block
languagejava
    @Override
    public void deleteByAnchorIdAndXpathLikeAny(final int anchorId, final Collection<String> xpathPatterns) {
        final String queryString = "DELETE FROM fragment f WHERE f.anchor_id = ? AND xpath LIKE ANY (array[:parameterPlaceholders])";
        executeUpdateWithAnchorIdAndCollection(queryString, anchorId, xpathPatterns);
    }

    // Accept security hotspot as placeholders in SQL query are created internally, not from user input.
    @SuppressWarnings("squid:S2077")
    private void executeUpdateWithAnchorIdAndCollection(final String sqlTemplate, final int anchorId,
                                                        final Collection<String> collection) {
        if (!collection.isEmpty()) {
            final String parameterPlaceholders = String.join(",", Collections.nCopies(collection.size(), "?"));
            final String queryStringWithParameterPlaceholders =
                sqlTemplate.replaceFirst(":parameterPlaceholders\\b", parameterPlaceholders);

            final Query query = entityManager.createNativeQuery(queryStringWithParameterPlaceholders);
            query.setParameter(1, anchorId);
            int parameterIndex = 2;
            for (final String parameterValue : collection) {
                query.setParameter(parameterIndex++, parameterValue);
            }
            query.executeUpdate();
        }
    }

Using native Postgres array support, this code can be written directly in the FragmentRepository interface using the @Query annotation, without the 32K limitation:

Code Block
languagejava
    @Modifying
    @Query(value = "DELETE FROM fragment f WHERE anchor_id = :anchorId AND xpath LIKE ANY (:xpathPatterns)", nativeQuery = true)
    void deleteByAnchorIdAndXpathLikeAny(@Param("anchorId") int anchorId, @Param("xpathPatterns") String[] xpathPatterns);

    default void deleteByAnchorIdAndXpathLikeAny(final int anchorId, final Collection<String> xpathPatterns) {
        deleteByAnchorIdAndXpathLikeAny(anchorId, xpathPatterns.toArray(new String[0]));
    }

Comparison

  • More testing is needed to confirm whether list partitioning reduces performance with large datasets.
  • Solution #1 is explicit, using 32K batches, so it is easy to understand what it does.
  • Solution #2 has shorter code, but it is not obvious to the reader what it does.
    • We could add a comment linking to this ticket to explain the 32K limit and why arrays are used.
  • Solution #2 simplifies code in the cases where we need Postgres array support.
  • It is possible to use both solutions, each where most appropriate.