...
Link to Project Proposal training materials
Project Name:
- Proposed name for the project: DataLake
- Proposed name for the repository: datalake
Project
...
Goal
Build permanent storage to persist the data that flows through ONAP, and build data analytics tools on it.
Project
...
Description
DMaaP data is read and processed by many ONAP components. DMaaP is backed by Kafka, a publish-subscribe system that is not suitable for data query and data analytics. Additionally, Kafka is not meant to be permanent storage, and its data gets deleted after a certain retention period. Thus it is useful to persist the data that flows through DMaaP to databases, with the following benefits:
...
In this project, we provide a systematic way to ingest DMaaP data in real time into Couchbase, a distributed document-oriented database, and Druid, a data store designed for real-time OLAP analytics. We also provide sophisticated and ready-to-use data analytics tools built on the data in permanent storage.
DataLake's goals are:
- To provide a systematic way to ingest DMaaP data in real time into Couchbase, a distributed document-oriented database, and Druid, a data store designed for real-time OLAP analytics.
- To serve as a common, easily accessible data store for other ONAP components.
- To provide APIs and other ways for ONAP components and external systems (e.g. BSS/OSS) to consume the data.
- To provide sophisticated and ready-to-use data analytics tools built on the data.
Architecture
...
Scope
...
Data Sources
Monitor all or selected data topics, read the data in real time, and persist it.
Other ONAP components can use DataLake as storage for application-specific data, through DMaaP or the DataLake REST APIs.
Other data sources will be supported if needed.
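The monitoring behaviour described above can be illustrated with a minimal sketch. It assumes messages arrive as `(topic, payload)` pairs and uses an in-memory dictionary as a stand-in for the real data stores; the function name `ingest` and its parameters are hypothetical and are not part of the DataLake or DMaaP APIs.

```python
from collections import defaultdict


def ingest(messages, sink, selected_topics=None):
    """Persist messages from all topics, or only from the selected ones.

    messages: iterable of (topic, payload) pairs (assumed shape).
    sink: mapping from topic name to a list of payloads (stand-in store).
    selected_topics: set of topic names to monitor, or None for all topics.
    Returns the number of messages that were persisted.
    """
    stored = 0
    for topic, payload in messages:
        # Skip topics that are not being monitored.
        if selected_topics is not None and topic not in selected_topics:
            continue
        sink[topic].append(payload)
        stored += 1
    return stored


if __name__ == "__main__":
    store = defaultdict(list)
    stream = [("events", "e1"), ("metrics", "m1"), ("events", "e2")]
    count = ingest(stream, store, selected_topics={"events"})
    print(count, dict(store))
```

Passing `selected_topics=None` corresponds to monitoring all topics, while a set of names restricts ingestion to the selected ones.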
Dispatcher
Provide an admin REST API for configuration and topic management. Each topic can be configured with the data stores to which it is exported, with Couchbase and Druid supported initially. We may support more distributed databases in the future.
Provide an SDC/design-time framework UI for management, making use of the above admin REST API.
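As an illustration of the per-topic configuration that the admin API manages, the following sketch models a registry that maps each topic to the data stores it is exported to. The class and method names (`TopicConfig`, `TopicRegistry`, `configure`, `stores_for`) are assumptions made for this example; the actual REST resources and payloads are not specified in this proposal.

```python
from dataclasses import dataclass, field

# Stores named in the proposal as initially supported.
SUPPORTED_STORES = {"couchbase", "druid"}


@dataclass
class TopicConfig:
    """Hypothetical configuration record for one topic."""
    name: str
    stores: set = field(default_factory=set)


class TopicRegistry:
    """In-memory stand-in for the state behind the admin API."""

    def __init__(self):
        self._topics = {}

    def configure(self, name, stores):
        """Set the data stores a topic is exported to."""
        unknown = set(stores) - SUPPORTED_STORES
        if unknown:
            raise ValueError(f"unsupported data stores: {sorted(unknown)}")
        config = TopicConfig(name=name, stores=set(stores))
        self._topics[name] = config
        return config

    def stores_for(self, name):
        """Return the stores configured for a topic (empty if unknown)."""
        config = self._topics.get(name)
        return set(config.stores) if config else set()


if __name__ == "__main__":
    registry = TopicRegistry()
    registry.configure("events", ["couchbase", "druid"])
    print(registry.stores_for("events"))
```

Keeping the list of supported stores in one place means that adding a further database later only requires extending `SUPPORTED_STORES` in this sketch.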
Document Store
Monitor selected topics, pull the data in real time, and insert it into Couchbase, with one table for each topic, named after the topic.
Data in JSON, XML, and YAML formats is automatically converted into the native store schema. We may support additional formats. Data not in these formats is stored as a single string.
Provide a REST API for data query; applications can also access the data through the native API.
Couchbase supports running Spark directly on it, which allows complex analytics tools to be built. We may develop Spark analytics applications if needed.
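The format handling described above (structured payloads converted to documents, everything else stored as a single string) could look roughly like the sketch below. It uses only the Python standard library, so it covers JSON and XML; YAML is omitted because parsing it would need a third-party library. The function names and the `"raw"` key are assumptions for this example, and XML attributes are ignored for brevity.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_dict(elem):
    """Convert an XML element into nested dicts (attributes are ignored)."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def to_document(payload):
    """Return a dict document for a message payload.

    JSON objects and well-formed XML are converted to nested dicts.
    Anything else is stored as a single string under the "raw" key.
    """
    try:
        parsed = json.loads(payload)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    try:
        root = ET.fromstring(payload)
        return {root.tag: xml_to_dict(root)}
    except ET.ParseError:
        pass
    return {"raw": payload}


if __name__ == "__main__":
    print(to_document('{"id": 7}'))
    print(to_document("<event><id>7</id></event>"))
    print(to_document("plain text"))
```

In this sketch, JSON values that are not objects (for example a bare number or an array) also fall through to the single-string case; a real implementation might choose to wrap them instead.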
OLAP Store
Monitor selected topics, pull the data in real time, and insert it into Druid, with one datasource for each topic, named after the topic.
Extract the dimensions and metrics from JSON files, and pre-configure Druid settings for each datasource; these settings are customizable through a web interface.
Integrate Apache Superset for data exploration and visualization, and provide pre-built interactive dashboards.
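One possible way to derive an initial set of dimensions and metrics from a sample JSON record is sketched below. The heuristic is an assumption made for illustration (numeric fields become metrics, all other fields become dimensions, nested objects are flattened with dotted paths) and is not taken from the proposal; the result would only be a starting point for the customizable per-datasource settings.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def split_dimensions_metrics(sample_json):
    """Guess dimensions and metrics from one sample JSON object.

    Numeric fields are treated as metrics; everything else, including
    booleans, is treated as a dimension. Returns two sorted lists.
    """
    record = flatten(json.loads(sample_json))
    dimensions, metrics = [], []
    for path, value in record.items():
        # bool is a subclass of int in Python, so check it first.
        if isinstance(value, bool):
            dimensions.append(path)
        elif isinstance(value, (int, float)):
            metrics.append(path)
        else:
            dimensions.append(path)
    return sorted(dimensions), sorted(metrics)


if __name__ == "__main__":
    sample = '{"host": "vnf-1", "cpu": 0.75, "net": {"rx": 120, "iface": "eth0"}}'
    dims, mets = split_dimensions_metrics(sample)
    print("dimensions:", dims)
    print("metrics:", mets)
```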
Architecture Alignment
...
- How does this project fit into the rest of the ONAP Architecture?
DataLake provides both API and UI interfaces. The UI is for analysts to analyze the data, while the API is for other ONAP (and external) components to query the data. For example, UUI can use the API to retrieve historical events. Some DCAE service applications may also make use of the APIs.
- What other ONAP projects does this project depend on?
DataLake depends on DMaaP for data ingestion, and also depends on some other common services: OOM, SDC, and MSB.
- In Relation to Other ONAP Components
- DCAE focuses on being part of the automated closed control loop on VNFs; storing collected data for archiving is not covered by the DCAE scope (see ONAP wiki forum). We envision that some DCAE analytics applications may use the data in DataLake.
- PNDA is an infrastructure that bundles a wide variety of big data technologies for data processing. Applications are to be developed on the technologies provided by PNDA. The goal of DataLake is to store DMaaP and other data, and to build ready-to-use applications around the data, making use of suitable technologies, whether or not they are provided by PNDA. Currently Couchbase, Druid and Superset are not included in PNDA.
- How does this align with external standards/specifications?
- APIs/Interfaces - REST, JSON, XML, YAML
- Information/data models - Swagger JSON
- Are there dependencies with other open source projects?
- Couchbase
- Apache Druid
- Apache Superset
- Apache Spark
Other Information
...
- link to seed code (if applicable)
- Vendor Neutral
- Yes
- Meets Board policy (including IPR)
Use the above information to create a key project facts section on your project page
Key Project Facts
...
Facts | Info |
---|---|
PTL (first and last name) | Guobiao Mo |
Jira Project Name | DataLake |
Jira Key | DATALAKE |
Project ID | datalake |
Link to Wiki Space |
Release Components Name
...
Note: refer to existing project for details on how to fill out this table
Components Name | Components Repository name | Maven Group ID | Components Description |
---|---|---|---|
datalake | datalake | org.onap.datalake | Data stores for ONAP data, with a data access API and GUI data analytics tools. |
Resources committed to the Release
...
Note 1: No more than 5 committers per project. Balance the committers list and avoid members representing only one company. Ensure there are at least 3 companies supporting your proposal.
...