Table of Contents |
---|
...
A new Cps Path query algorithm is being proposed. The time complexity was measured assessed through code analysis and performance testing, for existing and proposed solutions, and is found to be:.
Time complexity | Existing solution | Proposed solution |
---|---|---|
Best case | O(N) | O(1) |
Worst case | O(N2) | O(N) |
For a database comprising over 1,000,000 data nodes in a single anchor, the following results were recorded:
...
/bookstore/categories[@code="1"]/books[@title@price="Matilda"15]
the following SQL query is generated:
Code Block | ||
---|---|---|
| ||
SELECT * FROM FRAGMENT WHERE anchor_id = 3 AND xpath ~ '/bookstore/categories\[@code=''1'']/books(\[@(?!.*\[).*?])?$' AND ( attributes @> '{"titleprice":"Matilda"15}' ) |
The use of regex potentially results in full table scans (as each node's xpath needs to be compared against the regex), severely impacting performance, and causing query time duration to grow linearly with the fragment table size (i.e. queries get slower as the database gets bigger).
In addition, the mechanism for fetching descendant nodes returned from the query causes causes quadratic complexity in the worst case quadratic performance.
Why does the existing solution have worst case quadratic complexity?
...
- An SQL query that returns nodes matching query criteria - O(N)
- For each node returned from query 1, another SQL query is issued to fetch that node with descendants, using a LIKE operator - O(N).
The LIKE operator has linear time complexity, as all nodes' xpaths in a given anchor need to be examined.
In the case of a query that matches all In the case of a query that matches all N nodes, we are issuing 1+N queries, each with complexity O(N). Thus the overall complexity is O(N2).
...
/bookstore/categories[@code="1"]/books[@title@price="Matilda"15]
The new approach will used nested sub-queries to search for each path component, first looking for "bookstore", and using that as the parent node, look for ''categories[@code='1']", and using that as parent, look for "books" before finally applying leaf condition checks.
...
Code Block | ||
---|---|---|
| ||
SELECT * FROM fragment WHERE parent_id IN= ( SELECT id FROM fragment WHERE xpath_component = 'categories[@code=''1'']' AND parent_id = ( SELECT id FROM fragment WHERE xpath_component = 'bookstore' AND anchor_id = 3 AND parent_id IS NULL ) ) AND ( xpath_component = 'books' OR xpath_component LIKE 'books[%' ) AND ( attributes @> '{"titleprice":"Matilda"15}' ) |
In addition, a new algorithm for fetching descendants is implemented.
...
Code Block | ||
---|---|---|
| ||
WITH RECURSIVE descendant_search AS ( SELECT id, 0 AS depth FROM fragment WHERE id IN (:fragmentIds) UNION SELECT child.id, depth + 1 FROM fragment child INNER JOIN descendant_search ds ON child.parent_id = ds.id WHERE depth <= :maxDepth ) SELECT f.id, anchor_id AS anchorId, xpath, f.parent_id AS parentId, CAST(attributes AS TEXT) AS attributes FROM fragment f INNER JOIN descendant_search ds ON f.id = ds.id |
There is existing code in CpsDataPersistenceService similar to this, so minimal code changes are required to implement this.
Comparison with existing solution
...
Here is a visualization of both algorithms, executing a query for /bookstore/categories[@code='1']/books[@title@price='Matilda15']:
Existing solution | Proposed solution |
---|---|
Because the existing solution uses a regex, all nodes in anchor 3 need to be examined to see if they match the regex. | Because the The proposed solution uses sub-queries to do indexed look-up of each path component, so only relevant nodes are examined. |
|
|
If the number of nodes is doubled, the number of xpaths checked against the regex is also doubled (linear complexity). | If the number of nodes is doubled, the same number of nodes is still scanned in this case (constant complexity). |
...
These tests below were performed using Groovy/Spock tests, directly calling the CPS-Service Java API (thus these tests do not include any HTTP service overhead as when REST APIs are used).
The test data is using the openroadm model, present in the integration-test module of CPS source code repository. In this case, each 'device node' has 86 data nodes. In these tests, four anchors were populated using the same data. Thus, for 3000 device nodes, there are 3000 devices x 86 data nodes x 4 anchors + parent node = 1,032,000 001 total data nodes.
The test timings used to generate the charts in this document is available as a spreadsheet: CPS-1646 CpsPath queries using path-component lookup.xlsx
...
Case | Query one out of many using descendant cps path | Query one out of many using absolute cps path |
---|---|---|
Query | //openroadm-device[@device-id="C201-7-1A-14"] | /openroadm-devices/openroadm-device[@device-id="C201-7-1A-19"] |
Comparison graph | ||
Graph detail | ||
Time complexity of | O(N) | O(N) |
Time complexity of proposed solution | O(N) | O(1) |
...
Presently, Cps Path queries limit leaf-condition, text-condition, etc. to the final path component. The proposed solution allows for future improvement where leaf conditions could be applied to any or all path components.
...
Presently, only case 1 will return data (because 'code' is the list key for 'categories'), and cases 2 and 3 return nothing. Given the proposed solution uses an SQL sub-query for each path-component, the query generator can be adapted to support cases 2 and 3. The main work required for this would be changes to the Cps Path parser, with no expected performance impact.
Other operations can be accelerated
The same algorithm to improve query performance could also be used to improve performance of other operations, such as GET, UPDATE, and DELETE data nodes. However, some of these existing operations have "plural" implementations taking Collections as parameters, such as getDataNodesForMultipleXpaths. Additional investigation is needed for these cases.
Optimization: Index-only lookup
...
when list key is used in leaf condition
Given the following CPS path query:
...
We can optimize for this case, by checking if xpath_component = "books[@title='Matilda']", thus avoiding the need to examine the attributes field in the fragment table (thus the entire operation involves only database indexes).
In the PoC implementation, the following is a snippet from the SQL generated from the CPS path query:
Code Block | ||
---|---|---|
| ||
AND ( xpath_component = 'books' OR xpath_component LIKE 'books[%' ) AND ( attributes @> '{"title":"Matilda"}' ) |
it It may be replaced with this to add the new functionality:
Code Block | ||
---|---|---|
| ||
AND ( xpath_component = 'books[@title=''Matilda'']' OR ( ( xpath_component = 'books' OR xpath_component LIKE 'books[%') ) AND (attributes @> '{"title":"Matilda"}') AND) ( attributes @> '{"title":"Matilda"}' ) ) ) |
This would allow for an index-only lookup of the node, without needing the LIKE comparison or the attribute check.
Work Breakdown for Implementation
In addition to the changes outlined above, there is additional work remaining for this change to be production-ready.
The main algorithm was mostly done during the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.
It may be possible to incorporate the recursive SQL descendant search into the existing solution as a starting point.
DB upgrade
...
) |
This would allow for an index-only lookup of the node, without needing the LIKE comparison or the attribute check.
Work Breakdown for Implementation
Beside the work done in the PoC implementation, there is additional work for this change to be production-ready. The main algorithm is mostly complete in the PoC (all integration tests are passing for the PoC). The existing PoC code can thus be refactored to make it production ready.
Cps Path Parser changes
The PoC uses String.split for parsing path components, which means paths containing '/' in the leaf condition are not parsed correctly, such as //books[@title='GNU/Linux']. CpsPathBuilder and CpsPathQuery classes from cps-path-parser module will need to be updated to provide the individual path components (in normalized form).
DB upgrade
Because a new column is being added to the Fragment table, the new column needs to be populated. An SQL script will be needed to provide a value for of the new xpath_component field based on existing xpath field.