Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Summary

This is an assessment of current performance of CPS operations, as of August 2023.

Since January 2023, performance of many CPS and NCMP operations have been increased by orders of magnitude - some highlights:

  • NCMP CM-handle de-registration moved from average quadratic to linear time (constant rate). De-registering 20,000 CM-handles is thousands of times faster now.
  • CPS Path Queries have moved from worst-case quadratic to linear time complexity. Queries that took days previously take minutes or seconds now.
  • Memory usage during read, query, and update operations has been decreased by more than 90% (10 times less memory).
  • Not only has performance increased, but reliability and robustness. It is now possible for a single batch operation to handle tens of thousands of elements, where it would have failed before.

CPS Core Performance

Test Environment

...

HardwareSpecifications
CPU13th Gen Intel© Core™ i9-13900H (24 MB cache, 14 cores, up to 5.40 GHz turbo mode)
Memory32 GB RAM (16 GB x 2, DDR5, 4800 MHz)
Storage1 TB M.2 PCIe NVMe SSD
Operating SystemLinux Mint 21.2 Cinnamon (based on Ubuntu 22.04)

Test setup

The performance tests are written in Groovy (a JVM language). The tests were executed from Intelli-J using the Amazon Corretto 17 JDK.

As all CPS Core operations are synchronous, the results here are to be considered as single-threaded operation performance only.

Test data

Test data used complies with uses the Open ROADM YANG model - a real-world model for optical devices. Specifically, openroadm-device nodes, each consisting of 86 fragments, are created. For example, a test that creates 1000 1,000 device nodes will result in 86,000 fragments in the database. Some tests use up to 3000 3,000 device nodes (258,000 fragments - equivalent to around 20,000 CM-handles in NCMP), with four anchors replicating the data, meaning that the system has been tested up to 1 million fragments.

...

A varying number of Open ROADM device nodes will be stored using CpsDataService::saveData.

1000
Created device nodes11002003004005006007008009001,000
Fragments created868,60017,20025,80034,40043,00051,60060,20068,80077,40086,000

Time (seconds)

0.302.364.367.159.7611.5014.7718.4319.7922.1626.54
Fragments per second2873,6443,9453,6083,5253,7393,4943,2663,4773,4933,240

Graph of time taken to store device nodes

Observations

  • Storing data nodes has linear time complexity (as expected).
  • Raw performance is roughly 3000 3,500 fragments per second for the given test setup.
  • Performance can be improved by enabling write batching (CPS-1795)
  • There are edge cases with exponential complexity (adding books to the bookstore model).

...

The current database schema does not allow for Hibernate to use JDBC write batching. There is work worke in progress to address this.

...

There is work in progress to address this behavior using a custom algorithm using JDBC. The use of a graph database would implicitly fix such issues.

Updating data nodes

In this scenario, 1000 Open ROADM device nodes are already defined. A number of these existing data nodes will be updated using CpsDataService::updateDataNodeAndDescendants.

...

Time (seconds)

...

69.46

...

Image Removed

Observations

  • Updating data nodes has linear time complexity (as expected).
  • Raw performance is roughly 600 fragments per second for the given model and test setup.
  • Updating data nodes is 5 times slower than storing data nodes.

Commentary

This is the weak spot of all write operations. A custom algorithm comparing existing data against updated data (delta algorithm) may be required to improve performance here.

This could also indicate that Hibernate is not being used effectively.See here: https://vladmihalcea.com/the-best-way-to-map-a-onetomany-association-with-jpa-and-hibernate/ for explanation on why this occurs. Basically, unidirectional mapping OneToMany suffers from poor performance due to the order in which Hibernate writes out the entities. Switching to either unidirectional ManyToOne, or bidirectional OneToMany and ManyToOne could solve the issue using pure Hibernate, but they come with drawbacks. (To Do: More details on this.)

Updating data nodes

NOTE: I have removed the previous test results here, as the test analysis was flawed. Analysis is ongoing as per CPS-1674.

Performance tests to support new analysis are provisionally available at: https://gerrit.nordix.org/c/onap/cps/+/19086

Updating data leaves

In this scenario, 1000 Open ROADM , 1,000 OpenROADM device nodes are already defined. The data leaves of a number of these existing data nodes will be updated using CpsDataService::updateNodeLeaves.

...

Test Results

1000
Updated device nodes11002003004005006007008009001,000
Fragments updated11002003004005006007008009001,000

Time (seconds)

0.200.270.280.280.320.380.390.470.490.52

0.56

Fragments per second53707141,0711,2501,3161,5391,4891,6331,7311,786

Graph of updating data leaves of device nodes

Observations

  • Updating data leaves has linear time complexity.
  • Raw performance is about 3000 around 1,500 fragments per second.
  • This is very fast compared to updating whole data nodes. I recommend that NCMP use this API for updating CM-handle state, and should be preferred where possible.

Commentary

The performance of this function is excellent, and yet it may be improved by write batching.

 I recommend that NCMP use this API for updating CM-handle state.

Deleting data nodes

In this scenario, 300 Open ROADM OpenROADM device nodes are already defined. A number of these data nodes will be deleted using CpsDataService::deleteDataNodes. The types of nodes will be varied, for example, deleting container nodes, list elements, or whole lists.

Test results

N =
50100150200250300Example xpath
Delete top-level container node-----0.63/openroadm-devices
Batch delete N/300 container nodes0.150.260.380.450.550.69/openroadm-devices/openroadm-device[@device-id='C201-7-1A-10']/org-openroadm-device
Batch delete N/300 lists elements0.130.250.340.450.550.67/openroadm-devices/openroadm-device[@device-id='C201-7-1A-49']
Batch delete N/300 whole lists0.511.051.401.852.132.56/openroadm-devices/openroadm-device[@device-id='C201-7-1A-293']/org-openroadm-device/degree
Try batch delete N/300 non-existing0.250.540.670.951.151.32/path/to/non-existing/node[@id='27']

...

  • Delete performance is linear on the amount of data being deleted (as expected).
  • Raw performance of deleting containers of list elements container nodes is around 4035,000 fragments per second. (So we can delete data nodes around 10x faster than creating them.)
  • Deleting lists is much slower than deleting the parent container of the list (which this can be easily improved).
  • Of note, attempting to delete non-existing data nodes takes longer than actually deleting the equivalent amount of nodes with descendants - it is a slow operation.

Suggested improvement: For whole list deletion, add a condition to the WHERE clause in the SQL for deleting lists, to narrow the search space to children of the parent. For example:

SELECT *DELETE FROM fragment WHERE (existing conditions)
  AND parent_id = (SELECT id FROM fragment WHERE xpath = '/parent-xpath')

...

Reading the top-level container node with no descendants:

3000
Total device nodes5001,0001,5002,0002,50010001500200025003,000
Fragments read111111
Time (milliseconds)475248564847

The above data clearly indicates constant time.

Reading the top-level container node with all descendants:

3000
Total device nodes50010001500200025001,0001,5002,0002,5003,000
Fragments read43,00086,000129,000172,000215,000258,000
Time (seconds)0.421.191.542.162.532.67
Fragments per second102,38172,26983,76679,63085,65796,629

e

Graph of time taken to read top-level container node with all descendants

Observations

  • Reading a single top-level container node with no descendants has constant time (as expected).
  • Reading a single top-level container node with all descendants has linear time (as expected).
  • Raw performance of reading with all descendants is roughly 100,000 fragments per second.

Commentary

This is the fastest operation in CPS, in terms of fragments per second. This is not surprising, as a simple read should be the fastest operation.

Reading data nodes for multiple xpaths

...

Test results

3000
Total device nodes5001,00050010001500200025001,5002,0002,5003,000
Fragments read43,00086,000129,000172,000215,000258,000
Time (seconds)0.611.151.522.142.963.97

Special case: attempting to read multiple non-existing data nodes

In this case, we attempt to read many non-existing data nodes:

...

The above case appears to be constant time, but theoretically must be linear - we'll call it negligible time.

Observations

  • Reading many data nodes with all descendants has linear time (as expected).
  • Attempting to read many non-existing data nodes takes negligible time.
  • Raw performance of reading many with all descendants is roughly 80,000 fragments per second.

...

Recently, functionality was added to enable reading whole lists (CPS-1696). Here we compare performance of reading a container node containing a list, versus reading the list (with all descendants).

3000
Total device nodes50010001500200025001,0001,5002,0002,5003,000xpath
Reading container0.3860.7121.5292.6671.7593.112/openroadm-devices
Reading list0.5851.3352.0362.8602.7693.949/openroadm-devices/openroadm-device

As can be seen, it is slower reading the list than reading the parent node containing the list.

...

This should narrow the performance gap in this case.

Cps Path Queries

...

There have been massive orders-of-magnitude improvements to query performance. Formerly, queries had quadratic complexity, which is now linear. In the last month alone, performance has improved by 100x.

With a simple improvement, some cases can be made constant time instead of linear.

In the graphs below, time is measured in milliseconds.

Query one device out of many using absolute cps path

In this case, a query that matches a single device node using an absolute cps path is executed.

Fetch descendantsOmit DescendantsDirect DescendantsAll Descendants
Query

//openroadm-device[@device-id="C201-7-1A-14"]

Graph

Image Added

Image Added

Image Added

Time ComplexityLinear on total number of fragments in databaseLinear on total number of fragments in databaseLinear on total number of fragments in database
CommentsThese operations are as fast as reading with getDataNodes API.

Query one device out of many using descendant cps path

In this case, a query that matches a single device node using a descendant cps path is executed.

Fetch descendantsOmit DescendantsDirect DescendantsAll Descendants
Query//openroadm-device[@device-id="C201-7-1A-14"]
Graph

Image Added

Image Added

Image Added

Time ComplexityLinear on total number of fragments in databaseLinear on total number of fragments in databaseLinear on total number of fragments in database
CommentsThese operations are as fast as reading with getDataNodes API.

Query all devices using absolute cps path

In this case, a query that matches all device nodes using an absolute cps path is executed.

Fetch descendantsOmit DescendantsDirect DescendantsAll Descendants
Query

/openroadm-devices/openroadm-device[@status="success"]

Graph

Image Added

Image Added

Image Added

Time ComplexityLinear on the amount of fragments returned from queryLinear on the amount of fragments returned from queryLinear on the amount of fragments returned from query
CommentsThis operation is around half the speed as getDataNodes.This operation is around half the speed as getDataNodes.This has same speed as reading with getDataNodes.

Query all devices using descendant cps path

In this case, a query that matches all device nodes using a descendant cps path is executed.

Fetch descendantsOmit DescendantsDirect DescendantsAll Descendants
Query//openroadm-device[@ne-state="inservice"]
Graph

Image Added

Image Added

Image Added

Time ComplexityLinear on the amount of fragments returned from queryLinear on the amount of fragments returned from queryLinear on the amount of fragments returned from query
CommentsThis operation is around half the speed as getDataNodes.This operation is around half the speed as getDataNodes.This has same speed as reading with getDataNodes.

Commentary

Suggested improvement: For absolute cps path queries, the complexity can be reduced to best-case constant time (so it is no longer linear on the total number of fragments in the database).

Add a condition to the WHERE clause in the SQL to narrow the search space to children of the parent. For example:

SELECT * FROM fragment WHERE (existing conditions)
  AND parent_id IN (SELECT id FROM fragment WHERE xpath = '/parent-xpath')

(Note we used "parent_id IN" as there may be multiple parents with the same xpath during query across anchors.

Comparison of NCMP Performance across versions

CM-handle registration

40000
ReleaseDateCmHandlesCM-Handles removed
10050010002000500010000200001,0002,0005,00010,00020,00040,000Comments
KohnOct 2022Time taken
8 sec16 sec17 19 sec33 sec1 min3 minERRORERRORError due to DB running out shared memory
LondonMay 2023Time taken
6 sec7 sec12 sec22 sec1 min2 3 min10 min32 min
currentAug 2023Time taken
6 sec7 sec10 sec16 sec31 sec57 sec71 sec108 sec

Commentary

CM-handle registration is multi-threaded, so performance may appear to scale better than linear, until the CPU cores are maxed out.

...

CM-handle de-registration

ReleaseDateCmHandlesCM-Handles removed
1005001000200050001000020000400001,0002,0005,00010,00020,00040,000Comments
KohnOct 2022Time taken
7 sec2 min7 min25 min2.5 hourest: 10 hourest: 2 daysest: 7 daysSome values estimated due to time constraints
LondonMay 2023Time taken
< 1 sec2 sec3 sec5 sec17 sec37 sec85 secERRORError due to 32K limit
current

Aug 2023

Time taken
< 1 sec1 sec3 sec4 sec14 sec23 sec65 sec2 min

Current release has exactly linear performance for CM-handle de-registration (on a single thread):

CM-handles RemovedTime (sec)CM-handles/sec
5001.5327
10002.6377
500013.2377
1000025.9385
2000056.1356

Commentary

CM-handle de-registration is a synchronous operation. Since the above tests send a single large batch, the operation is performed in a single thread. The linear performance reflects this. If the operation was broken into many smaller requests, the expectation is that it would complete many times faster - but this has not been tested.