Introduction to CDP: Cloudera Data Platform
Important: CDP-DC was renamed to CDP Private Cloud Base.
- Cloudera Data Platform: There are two versions of Cloudera Data Platform (CDP):
- Cloudera Data Platform (Public)
- Cloudera Data Platform (Private)
CDP Private Cloud
- CDP Private Cloud supports both:
- Analytics
- Data Management
- CDP Private Cloud is deployed in private data centers.
- It integrates the following two products:
- CDP Private Cloud Base
- CDP Private Cloud Data Services
- This combined solution can be used for data analytics and artificial intelligence.
CDP Private Cloud Base
- CDP Private Cloud Base is the on-premises version of Cloudera Data Platform.
- CDP Private Cloud Base was previously known as CDP Data Center.
- You can run different kinds of custom workloads.
- CDP Private Cloud Base supports a variety of hybrid solutions.
- Compute tasks are separated from data storage, and data can be accessed from remote clusters.
- CDP Private Cloud Base comprises a variety of components, such as:
- Bare Metals
- Cluster Management
- Apache HDFS
- Apache Ozone Object Storage
- SDX (Security, Governance and Metadata)
- Hive3
- HBase
- Custom Cluster: On CDP Private Cloud Base you can select any combination of these available services to create clusters that address your business requirements and workloads.
- Preconfigured Services: Various preconfigured packages of services are also available for common workloads.
- Cloudera Data Engineering (CDE)
- With this cluster you can process data and develop and serve predictive data models. It includes services such as HDFS, Ranger, Atlas, Hive and Hue.
- This is an all-inclusive data engineering toolset to orchestrate and automate complex data pipelines securely at any scale.
- Data Mart
- Using this you can browse, query and explore your data interactively. It includes services such as HDFS, Ranger, Atlas, Hive and Hue.
- Operational Database
- This includes services like HDFS, Ranger, Atlas and HBase.
- Cloudera Machine Learning (CML)
- This service optimizes ML workflows for deploying, serving and monitoring models.
- Cloudera DataFlow (CDF)
- Provides real-time streaming data analysis at high volume and high scale.
- Cloudera Data Warehouse (CDW)
- Delivers self-service analytics on massive amounts of data to thousands of users without compromising cost, speed and security.
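The relationship between preconfigured packages and the services they bundle can be sketched as a small lookup. This is only an illustration: the service lists are copied from the package descriptions above, and the names `PACKAGES`, `packages_covering` and `shared_services` are hypothetical, not part of any Cloudera tool.

```python
# Service lists are taken from the package descriptions in this article.
# PACKAGES, packages_covering and shared_services are hypothetical names.
PACKAGES = {
    "Data Engineering": {"HDFS", "Ranger", "Atlas", "Hive", "Hue"},
    "Data Mart": {"HDFS", "Ranger", "Atlas", "Hive", "Hue"},
    "Operational Database": {"HDFS", "Ranger", "Atlas", "HBase"},
}


def packages_covering(required):
    """Return the packages that bundle every service in `required`."""
    required = set(required)
    return sorted(
        name for name, services in PACKAGES.items() if required <= services
    )


def shared_services():
    """Return the services that every preconfigured package includes."""
    return set.intersection(*PACKAGES.values())


if __name__ == "__main__":
    print(packages_covering({"HBase"}))
    print(sorted(shared_services()))
```

If no package covers the services you need, that is the case for a custom cluster, where you pick the combination of services yourself.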
Cloudera Private Cloud Plus Edition
- Private Cloud Plus Edition includes the Base Edition, along with easy-to-use containerized machine learning and data warehousing analytics, plus a hybrid management control plane for a better user experience and lower data center costs.
Advantages of CDP Architecture
- The advantage of CDP architecture is that it is modular, and the same constructs can be transposed from a private data center to the public cloud and everything in between for a seamless hybrid experience.
- CDP Private Cloud is designed to take advantage of today's hybrid environments and allow organizations to effectively utilize their existing on-premises infrastructure while effortlessly bursting into public cloud when required.
Components of Cloudera Data Platform Private Cloud:
The components fall into two major categories:
- Tools: CDP Private Cloud Base also includes the following tools to manage and secure your deployment.
- Cloudera Manager:
- Cloudera Manager is a web application.
- Using Cloudera Manager, you can
- Monitor
- Manage
- Configure your clusters and services
- Cloudera Manager also provides an API to perform the above activities programmatically.
- You can manage one or more clusters using Cloudera Manager.
- Other things you can do:
- Manage installations
- Cluster component upgrades
- Maintenance Workflows
- Encryption
- Access Controls
- Data Replications
- Virtual Private Cluster: You can also use Cloudera Manager to create a Virtual Private Cluster that allows you to separate compute resources from data storage and to share data storage among compute resources.
- Apache Atlas:
- This component helps you trace the entire data lineage.
- This is used for Data Governance.
- Apache Atlas works as a common metadata store which is designed to exchange metadata both inside and outside of the Hadoop Stack.
- Apache Ranger and Apache Atlas are closely integrated, which enables you to define, administer and manage security and compliance policies consistently across all components of the Hadoop stack.
- Apache Ranger:
- Ranger is used to manage access control and to administer policies through a user interface.
- Ranger provides the Auditing, Authentication and Authorization functionality for CDP Private Cloud Base Clusters.
- Ranger has Centralized Reporting Capability.
- You or a security administrator can define security policies at the database, table, column and file levels, and can administer permissions for specific LDAP-based groups or individual users.
- Time-based rules are supported.
- Geolocation-based policy rules are supported.
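A column-level Ranger policy of the kind described above can be sketched as a JSON document. The field names follow the policy shape used by Ranger's public v2 REST API (`POST /service/public/v2/api/policy`) as I understand it, so verify them against your Ranger version. The service name `cm_hive`, the database, table, column and group names, and the helper `build_column_policy` are all assumptions made for illustration. Nothing is sent to a server here.

```python
import json


def build_column_policy(service, database, table, column, groups, access="select"):
    """Build a Ranger-style policy dict granting one access type on one column.

    Hypothetical helper: the dict layout mirrors Ranger's v2 policy JSON,
    but every concrete value passed in below is an illustrative assumption.
    """
    return {
        "service": service,          # name of the Ranger service (repository)
        "name": f"{database}.{table}.{column}-{access}",
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": [column]},
        },
        "policyItems": [
            {
                "groups": list(groups),   # e.g. LDAP-backed groups
                "users": [],
                "accesses": [{"type": access, "isAllowed": True}],
            }
        ],
    }


if __name__ == "__main__":
    policy = build_column_policy(
        service="cm_hive",           # assumed service name
        database="sales",
        table="orders",
        column="amount",
        groups=["analysts"],
    )
    # This JSON body is what would be posted to the policy endpoint.
    print(json.dumps(policy, indent=2))
```

The same structure extends to users instead of groups by filling the `users` list. Time-based and geolocation-based rules are expressed through policy conditions, which this sketch leaves out.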
- Cloudera Runtime:
- Cloudera Runtime includes 50+ open source projects.
- These are the components used for various activities. They are a combination of services from Hortonworks and Cloudera itself.
- Hive3, Impala, Hue, DAS (Data Analytics Studio)
- Spark3, Zeppelin
- HBase, Phoenix
- Kafka
- Knox, Ranger, RMS, Atlas
- Apache Airflow
- Ozone
- Encryption: Ranger KMS, KTS etc.
- Additional components, installed separately using parcels:
- NiFi (Cloudera DataFlow, or Cloudera Flow Management)
- CSA (Cloudera Streaming Analytics, using Flink and SQL Stream Builder)
- CDSW (Cloudera Data Science Workbench): for data science and machine learning workloads.
- Data Visualization (DataViz): To create Reports and Dashboards.
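The Cloudera Manager API mentioned under Tools is a REST API served by the Cloudera Manager web application. As a minimal sketch, the snippet below only builds (and does not send) an authenticated request that lists clusters. The host name, credentials and the helper `build_clusters_request` are hypothetical, and the API version `v41` is a placeholder: the version depends on your Cloudera Manager release, and `GET /api/version` reports the highest one your server supports. Port 7180 is the default non-TLS port.

```python
import base64
import urllib.request


def build_clusters_request(host, user, password, api_version="v41",
                           port=7180, tls=False):
    """Build a GET request for the Cloudera Manager clusters endpoint.

    Hypothetical helper. The request is constructed but not sent, so it can
    be inspected without a running Cloudera Manager instance.
    """
    scheme = "https" if tls else "http"
    url = f"{scheme}://{host}:{port}/api/{api_version}/clusters"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")  # HTTP basic auth
    request.add_header("Accept", "application/json")
    return request


if __name__ == "__main__":
    req = build_clusters_request("cm.example.com", "admin", "secret")
    print(req.get_method(), req.full_url)
    # To actually call the API, pass the request to urllib.request.urlopen(req)
    # and parse the JSON body of the response.
```

For real deployments, prefer TLS and avoid hard-coding credentials.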
CDP Private Cloud Data Services
- CDP Private Cloud Data Services is a CDP product which brings many of the public cloud benefits to the Data Center.
- CDP Private Cloud separates compute from storage.
- The Data Services capability provides containerized compute and analytic applications that scale dynamically and can be upgraded independently.
- CDP Private Cloud Data Services helps users rapidly provision and deploy, simply by using the Management Console, Data Services such as:
- Cloudera Data Warehouse
- Cloudera Machine Learning
- Cloudera Data Engineering
- Important: A CDP Private Cloud Data Services deployment requires both a Private Cloud Base cluster and a container-based cluster to run the Data Services.
- Container-based cluster: you can use either of the following:
- Red Hat OpenShift cluster
- Embedded Container Service (ECS)
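The deployment requirement above (a Base cluster plus one container platform) can be written as a small pre-flight check. This is a sketch only: `DataServicesPlan`, `missing_prerequisites` and the platform identifiers `"openshift"` and `"ecs"` are hypothetical names chosen for illustration, not Cloudera configuration keys.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifiers for the two container platforms named above.
CONTAINER_PLATFORMS = {"openshift", "ecs"}


@dataclass
class DataServicesPlan:
    has_base_cluster: bool
    container_platform: Optional[str]  # "openshift", "ecs", or None


def missing_prerequisites(plan: DataServicesPlan) -> List[str]:
    """Return human-readable problems; an empty list means the plan is complete."""
    problems = []
    if not plan.has_base_cluster:
        problems.append("a CDP Private Cloud Base cluster is required")
    if plan.container_platform not in CONTAINER_PLATFORMS:
        problems.append("choose a container platform: ecs or openshift")
    return problems


if __name__ == "__main__":
    plan = DataServicesPlan(has_base_cluster=True, container_platform=None)
    for problem in missing_prerequisites(plan):
        print("Missing:", problem)
```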
CDP Cluster
- A CDP Cluster is a distributed computing service that has access to a shared data lake and runs on either:
- Virtual Machines: Cloudera Data Hub
- Containers: Cloudera Analytic Experiences
- The image below illustrates Cloudera Analytic Experiences and Private Cloud Base.
- CDP Private Cloud offers Data Warehouse (DW), Machine Learning (ML), Data Engineering (DE), DataFlow (DF) and Operational Database (OD) as analytic experiences.
- Data Hub is part of Private Cloud Base.
- Both the analytic experiences and Data Hub are helpful in building custom business applications.
SDX: Shared Data Experience
- SDX is a data access control layer that sits on top of the backend object store.
- This provides coherent data security and governance for all the applications running within the environment.
- SDX is used to safeguard data privacy, ensure regulatory compliance, and prevent cybersecurity threats.