Background
Repository queries taking collection parameters are limited to 32K collection items.
For example, given a FragmentRepository method:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths);
This method would throw an exception if supplied with a collection of 32,767 or more xpaths.
The reason is that JPA/Hibernate translates the query into the following SQL:
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
Each collection item (xpath) is supplied as an SQL query parameter. Hibernate with Postgres has a limit of 32,767 query parameters per SQL statement, so an exception is thrown when too many parameters are supplied.
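As a simplified illustration (this is not how Hibernate builds the statement internally, and the variables below are hypothetical), the number of bind parameters grows with the collection size:

// Simplified sketch only: one "?" placeholder per collection item, plus one for anchorId.
// (String.join and java.util.Collections.nCopies are plain JDK calls.)
final String placeholders = String.join(",", Collections.nCopies(xpaths.size(), "?"));
final String sql = "SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (" + placeholders + ")";
// With 32,767 xpaths the statement needs 32,768 bind parameters (1 + 32,767),
// which exceeds the 32,767-parameter limit and results in an exception.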
CPS-1574 is to remove those limits (and write tests to verify).
Tests
As part of CPS-1574, tests have been added to confirm the presence of the limit:
def 'Delete datanodes from multiple anchors limit exceeded: 32,766 (~ 2^15) anchors.'() {
    given: 'more than 32,766 anchor names'
        def anchorNames = (0..32_766).collect { "size-of-this-name-does-not-matter-for-limit-" + it }
    when: 'single operation is executed to delete all datanodes in given anchors'
        objectUnderTest.deleteDataNodes(CPS_PERFORMANCE_TEST_DATASPACE, anchorNames, OffsetDateTime.now())
    then: 'a database exception is thrown'
        thrown(DataAccessResourceFailureException.class)
}
Once the solution is implemented, these tests will be updated to verify that no exception is thrown, even with a larger number of items, e.g. 40,000.
Proposal #1: List partitioning
This solution adds a wrapper method that partitions the input collection into batches of at most 32,000 items (using Guava's Lists.partition) and calls the original repository method on each batch:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths); default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) { List<FragmentEntity> results = new ArrayList<>(xpaths.size()); for (final Collection<String> xpathsBatch : Lists.partition(xpaths, 32000)) { results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch)); } return results; }
In addition, to avoid the overhead of copying results into a new collection when it is not needed, we may call the original method directly when the collection has fewer than 32,000 items:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths); default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) { if (xpaths.size() < 32000) { return findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch); } List<FragmentEntity> results = new ArrayList<>(xpaths.size()); for (final Collection<String> xpathsBatch : Lists.partition(xpaths, 32000)) { results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch)); } return results; }
Proposal #2: Postgres arrays
This solution uses a third-party library to add Postgres array support to queries.
Adding the dependency to cps-ri/pom.xml:
<dependency>
    <groupId>io.hypersistence</groupId>
    <artifactId>hypersistence-utils-hibernate-52</artifactId>
</dependency>
This solution involves creating two versions of each repository method taking a collection parameter - one taking a Java array, and one using a Collection:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath = ANY(:xpaths)", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(@Param("anchorId") int anchorId, @Param("xpaths") String[] xpaths); default List<FragmentEntity> findAllByAnchorIdAndXpathIn(final int anchorId, final Collection<String> xpaths) { return findAllByAnchorIdAndXpathIn(anchorId, xpaths.toArray(new String[0])); }
This would result in the following SQL, given any number of xpaths:
SELECT * FROM fragment WHERE anchor_id = ? AND xpath = ANY(?)
As seen, only two SQL query parameters will ever be used, because the array is binary-encoded and sent as a single parameter.
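For callers, nothing changes (hypothetical usage sketch below; fragmentRepository, anchorId and getXpathsToLookUp are assumed names, not taken from the codebase): the Collection-based default method is still used, and the array conversion stays inside the repository interface.

// Hypothetical usage: callers keep invoking the Collection-based overload,
// even when the collection holds far more than 32K items.
final Collection<String> xpaths = getXpathsToLookUp();
final List<FragmentEntity> fragments = fragmentRepository.findAllByAnchorIdAndXpathIn(anchorId, xpaths);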
Using Postgres arrays can simplify existing code
The current FragmentRepository already has code using Postgres arrays to support "LIKE ANY" logic (using LIKE with a collection). However, it supplies each collection item as a query parameter, and is thus still limited to 32,767 parameters.
@Override
public void deleteByAnchorIdAndXpathLikeAny(final int anchorId, final Collection<String> xpathPatterns) {
    final String queryString =
        "DELETE FROM fragment f WHERE f.anchor_id = ? AND xpath LIKE ANY (array[:parameterPlaceholders])";
    executeUpdateWithAnchorIdAndCollection(queryString, anchorId, xpathPatterns);
}

// Accept security hotspot as placeholders in SQL query are created internally, not from user input.
@SuppressWarnings("squid:S2077")
private void executeUpdateWithAnchorIdAndCollection(final String sqlTemplate, final int anchorId,
                                                    final Collection<String> collection) {
    if (!collection.isEmpty()) {
        final String parameterPlaceholders = String.join(",", Collections.nCopies(collection.size(), "?"));
        final String queryStringWithParameterPlaceholders =
            sqlTemplate.replaceFirst(":parameterPlaceholders\\b", parameterPlaceholders);
        final Query query = entityManager.createNativeQuery(queryStringWithParameterPlaceholders);
        query.setParameter(1, anchorId);
        int parameterIndex = 2;
        for (final String parameterValue : collection) {
            query.setParameter(parameterIndex++, parameterValue);
        }
        query.executeUpdate();
    }
}
Using native Postgres array support, this code can be written directly in the FragmentRepository interface using the @Query annotation:
@Modifying @Query(value = "DELETE FROM fragment f WHERE anchor_id = :anchorId AND xpath LIKE ANY (:xpathPatterns)", nativeQuery = true) void deleteByAnchorIdAndXpathLikeAny(@Param("anchorId") int anchorId, @Param("xpathPatterns") String[] xpathPatterns); default void deleteByAnchorIdAndXpathLikeAny(int anchorId, Collection<String> xpathPatterns) { deleteByAnchorIdAndXpathLikeAny(anchorId, xpathPatterns.toArray(new String[0])); }
Comparison
- Both solutions have the same performance (as determined using Groovy performance tests modified to take 100,000 items)
- Solution #1 is explicit and easy to understand
- Solution #2 is shorter, but may not be as clear
- Solution #2 is more flexible: adding array support unlocks more Postgres features
- It's possible to combine both solutions (see the sketch below).
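As a rough sketch of combining the two (not part of the current codebase): the array-based query from Proposal #2 could still be called in batches, using Guava's Lists.partition as in Proposal #1, e.g. to keep individual statements and result sets at a manageable size. With arrays, the batch size of 32,000 would no longer be a hard limit, only a tuning choice.

// Sketch only: Proposal #1 batching combined with the array-based query from Proposal #2.
// findAllByAnchorIdAndXpathIn(int, String[]) is the array-taking method shown above.
default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) {
    final List<FragmentEntity> results = new ArrayList<>(xpaths.size());
    for (final List<String> xpathsBatch : Lists.partition(xpaths, 32000)) {
        results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch.toArray(new String[0])));
    }
    return results;
}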