Background
Repository queries taking collection parameters are limited to 32K collection items.
For example, given a FragmentRepository method:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths);
This method would throw an exception if supplied with a collection of 32,767 or more xpaths.
The reason is that JPA/Hibernate translates the query into the following SQL:
SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (?,?,?,?,?,?...,?)
Each collection item (xpath) is supplied as an SQL query parameter. Hibernate with Postgres has a limit of 32,767 query parameters per SQL statement, so an exception is thrown when too many parameters are supplied.
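As a simplified illustration (this is not how Hibernate builds the statement internally, and the variables below are hypothetical), the number of bind parameters grows with the collection size:

// Simplified sketch only: one "?" placeholder per collection item, plus one for anchorId.
// (String.join and java.util.Collections.nCopies are plain JDK calls.)
final String placeholders = String.join(",", Collections.nCopies(xpaths.size(), "?"));
final String sql = "SELECT * FROM fragment WHERE anchor_id = ? AND xpath IN (" + placeholders + ")";
// With 32,767 xpaths the statement needs 32,768 bind parameters (1 + 32,767),
// which exceeds the 32,767-parameter limit and results in an exception.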
CPS-1574 is to remove those limits (and write tests to verify).
Tests
As part of CPS-1574, tests have been added to confirm the presence of the limit:
def 'Delete datanodes from multiple anchors limit exceeded: 32,766 (~ 2^15) anchors.'() {
    given: 'more than 32,766 anchor names'
        def anchorNames = (0..32_766).collect { "size-of-this-name-does-not-matter-for-limit-" + it }
    when: 'single operation is executed to delete all datanodes in given anchors'
        objectUnderTest.deleteDataNodes(CPS_PERFORMANCE_TEST_DATASPACE, anchorNames, OffsetDateTime.now())
    then: 'a database exception is thrown'
        thrown(DataAccessResourceFailureException.class)
}
Once the solution is implemented, these tests will be updated to verify that no exception is thrown, even with a larger number of items, e.g. 40,000.
Proposal #1: List partitioning
This solution adds a wrapper method that partitions the input collection into batches of at most 32,000 items (using Guava's Lists.partition) and calls the original repository method on each batch:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths); default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) { List<FragmentEntity> results = new ArrayList<>(xpaths.size()); for (final Collection<String> xpathsBatch : Lists.partition(xpaths, 32000)) { results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch)); } return results; }
In addition, to avoid the overhead of copying results into a new collection when it is not needed, we may call the original method directly when the collection has fewer than 32,000 items:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath IN :xpaths", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(int anchorId, Collection<String> xpaths); default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) { if (xpaths.size() < 32000) { return findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch); } List<FragmentEntity> results = new ArrayList<>(xpaths.size()); for (final Collection<String> xpathsBatch : Lists.partition(xpaths, 32000)) { results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch)); } return results; }
Proposal #2: Postgres arrays
This solution uses a third-party library to add Postgres array support to queries.
Adding the dependency to cps-ri/pom.xml:
<dependency>
    <groupId>io.hypersistence</groupId>
    <artifactId>hypersistence-utils-hibernate-52</artifactId>
</dependency>
This solution involves creating two versions of each repository method taking a collection parameter - one taking a Java array, and one using a Collection:
@Query(value = "SELECT * FROM fragment WHERE anchor_id = :anchorId AND xpath = ANY(:xpaths)", nativeQuery = true) List<FragmentEntity> findAllByAnchorIdAndXpathIn(@Param("anchorId") int anchorId, @Param("xpaths") String[] xpaths); default List<FragmentEntity> findAllByAnchorIdAndXpathIn(final int anchorId, final Collection<String> xpaths) { return findAllByAnchorIdAndXpathIn(anchorId, xpaths.toArray(new String[0])); }
This would result in the following SQL, given any number of xpaths:
SELECT * FROM fragment WHERE anchor_id = ? AND xpath = ANY(?)
As seen, only two SQL query parameters will ever be used, because the array is binary-encoded and sent as a single parameter.
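For callers, nothing changes (hypothetical usage sketch below; fragmentRepository, anchorId and getXpathsToLookUp are assumed names, not taken from the codebase): the Collection-based default method is still used, and the array conversion stays inside the repository interface.

// Hypothetical usage: callers keep invoking the Collection-based overload,
// even when the collection holds far more than 32K items.
final Collection<String> xpaths = getXpathsToLookUp();
final List<FragmentEntity> fragments = fragmentRepository.findAllByAnchorIdAndXpathIn(anchorId, xpaths);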
Using Postgres arrays can simplify existing code
The current FragmentRepository already has code using Postgres arrays to support "LIKE ANY" logic (using LIKE with a collection). However, it supplies each collection item as a query parameter, and is thus still limited to 32,767 parameters.
@Override
public void deleteByAnchorIdAndXpathLikeAny(final int anchorId, final Collection<String> xpathPatterns) {
    final String queryString =
        "DELETE FROM fragment f WHERE f.anchor_id = ? AND xpath LIKE ANY (array[:parameterPlaceholders])";
    executeUpdateWithAnchorIdAndCollection(queryString, anchorId, xpathPatterns);
}

// Accept security hotspot as placeholders in SQL query are created internally, not from user input.
@SuppressWarnings("squid:S2077")
private void executeUpdateWithAnchorIdAndCollection(final String sqlTemplate, final int anchorId,
                                                    final Collection<String> collection) {
    if (!collection.isEmpty()) {
        final String parameterPlaceholders = String.join(",", Collections.nCopies(collection.size(), "?"));
        final String queryStringWithParameterPlaceholders =
            sqlTemplate.replaceFirst(":parameterPlaceholders\\b", parameterPlaceholders);
        final Query query = entityManager.createNativeQuery(queryStringWithParameterPlaceholders);
        query.setParameter(1, anchorId);
        int parameterIndex = 2;
        for (final String parameterValue : collection) {
            query.setParameter(parameterIndex++, parameterValue);
        }
        query.executeUpdate();
    }
}
Using native Postgres array support, this code can be written directly in the FragmentRepository interface using the @Query annotation:
@Modifying @Query(value = "DELETE FROM fragment f WHERE anchor_id = :anchorId AND xpath LIKE ANY (:xpathPatterns)", nativeQuery = true) void deleteByAnchorIdAndXpathLikeAny(@Param("anchorId") int anchorId, @Param("xpathPatterns") String[] xpathPatterns); default void deleteByAnchorIdAndXpathLikeAny(int anchorId, Collection<String> xpathPatterns) { deleteByAnchorIdAndXpathLikeAny(anchorId, xpathPatterns.toArray(new String[0])); }
Comparison
- Both solutions have the same performance (as determined using Groovy performance tests modified to take 100,000 items)
- Solution #1 is explicit and easy to understand
- Solution #2 is shorter, but may not be as clear
- Solution #2 is more flexible: adding array support unlocks more Postgres features
- It's possible to combine both solutions (see the sketch below).
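As a rough sketch of combining the two (not part of the current codebase): the array-based query from Proposal #2 could still be called in batches, using Guava's Lists.partition as in Proposal #1, e.g. to keep individual statements and result sets at a manageable size. With arrays, the batch size of 32,000 would no longer be a hard limit, only a tuning choice.

// Sketch only: Proposal #1 batching combined with the array-based query from Proposal #2.
// findAllByAnchorIdAndXpathIn(int, String[]) is the array-taking method shown above.
default List<FragmentEntity> findAllByAnchorIdAndXpathInBatch(final int anchorId, final List<String> xpaths) {
    final List<FragmentEntity> results = new ArrayList<>(xpaths.size());
    for (final List<String> xpathsBatch : Lists.partition(xpaths, 32000)) {
        results.addAll(findAllByAnchorIdAndXpathIn(anchorId, xpathsBatch.toArray(new String[0])));
    }
    return results;
}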