Use this repository as a working directory from which you run deployment commands, either against predefined virtual infrastructure managed by Vagrant or against your own infrastructure. You can customize the infrastructure and the components of your cluster, with one command per component.
To help you set up the getting-started environment, a setup script is provided at ./scripts/setup.sh.
The -e option enables features; for example, -e extras enables TDP Collection Extras. Each feature has requirements, so if you enable a feature, you MUST install its requirements BEFORE launching the setup script.
Common requirements are:
- Python >= 3.6 with virtual env package (i.e. python3-venv)
- Unzip (to execute the setup scripts)
- jq (required to execute the helper script)
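Before launching the setup script, it can help to confirm that the common tools are on your PATH. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and the tool list only mirrors the common requirements above. It does not check the Python version, the venv package, or any feature-specific requirement.

```shell
# Hypothetical helper (not provided by tdp-getting-started):
# print every tool from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which of the common requirements are missing.
# An empty result means all three commands were found.
missing="$(missing_tools python3 unzip jq)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
fi
```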
Python requirements like Ansible and Mitogen are listed in the file requirements.txt. The virtual environment is populated with these requirements. Therefore, you should not install them by yourself outside of the virtual environment. Only versions described in requirements.txt are supported.
Feature-specific requirements are:
- vagrant: see requirements in https://github.com/TOSIT-IO/tdp-vagrant
- ui: see requirements in https://github.com/TOSIT-IO/tdp-ui
The -r option selects the stable or latest version for all features. stable is recommended if you are new to TDP; we try our best to keep the stable release working. latest is recommended when you want to contribute to TDP.
By default, the setup script does not delete your custom configuration. For example, the .env file is generated by the setup script and read by tdp-lib and tdp-server; you can change it to your preference and rerun the setup script without losing your modifications.
If you want a clean configuration (i.e. remove your custom changes), add the -c flag: the setup script will then remove and recreate the files, symlinks, and directories it creates.
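To make the option syntax concrete, here is a small sketch that assembles a setup.sh command line from a release name and a list of features. `setup_command` is a hypothetical helper, not part of the repository; it only prints the command so you can review it before running it, and it assumes that -r accepts the values stable and latest as described above.

```shell
# Hypothetical helper: print (not run) a setup.sh invocation.
# $1 is the release (stable or latest); remaining args are feature names.
setup_command() {
  release="$1"
  shift
  cmd="./scripts/setup.sh"
  for feature in "$@"; do
    cmd="$cmd -e $feature"
  done
  printf '%s -r %s\n' "$cmd" "$release"
}

setup_command stable extras vagrant
# prints: ./scripts/setup.sh -e extras -e vagrant -r stable
```

Append -c to the printed command by hand when you want a clean configuration.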
The steps below deploy a TDP cluster with all features (see the line with ./scripts/setup.sh and multiple -e options), so you MUST install all requirements. If you want only specific features, modify the ./scripts/setup.sh line.
If Vagrant is enabled, the Ansible host.ini file will be generated using the hosts variable in tdp-vagrant/vagrant.yml.
# Clone project from version control
git clone https://github.com/TOSIT-IO/tdp-getting-started.git
# Move into project dir
cd tdp-getting-started
# Setup local env with stable tdp-collection (mandatory), tdp-lib (mandatory), tdp-server, tdp-ui, tdp-collection-extras, tdp-observability, tdp-collection-prerequisites, and vagrant
# Modify the line below for your needs
./scripts/setup.sh -e server -e ui -e extras -e observability -e prerequisites -e vagrant
# Activate Python virtual env and set environment variables
source ./venv/bin/activate && source .env
# To enable mitogen (optional)
export ANSIBLE_STRATEGY_PLUGINS="$(python -c 'import os,ansible_mitogen; print(os.path.dirname(ansible_mitogen.__file__))')/plugins/strategy"
export ANSIBLE_STRATEGY="mitogen_linear"
# Launch VMs
vagrant up
# Configure TDP Collection Prerequisites
ansible-playbook ansible_collections/tosit/tdp_prerequisites/playbooks/all.yml
You have four ways to deploy a TDP cluster: using TDP UI, using TDP Server API, using TDP Lib CLI, or using Ansible playbooks.
# Open a new terminal and activate python virtual env
source ./venv/bin/activate && source .env
# Start tdp-server
uvicorn tdp_server.main:app --reload
# Open a new terminal and start tdp-ui
npm --prefix ./tdp-ui run dev
In the UI, click on "Deployments", "New deployment", "Deploy from the DAG", "Preview" (by default all services are deployed), "Deploy".
You can follow the deployment in the "Deployments" page. Wait for the deployment to complete.
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
# Open a new terminal and activate python virtual env
source ./venv/bin/activate && source .env
# Start tdp-server
uvicorn tdp_server.main:app --reload
# Deploy TDP cluster core and extras services
curl -X POST http://localhost:8000/api/v1/deploy/dag
# You can see the log in the tdp-server output (the terminal where uvicorn is running)
# Wait for the deployment to finish
while ! curl -s http://localhost:8000/api/v1/deploy/status | grep -q "no deployment on-going"; do sleep 10; done
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
# Deploy TDP cluster core and extras services
tdp deploy
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
# Deploy TDP cluster core services
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/all.yml
# Deploy extras services
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy-spark3.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/zookeeper-kafka.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/kafka.yml
# Deploy observability services
ansible-playbook ansible_collections/tosit/tdp_observability/playbooks/meta/prometheus.yml
ansible-playbook ansible_collections/tosit/tdp_observability/playbooks/meta/grafana.yml
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
- HDFS NN Master 01
- HDFS NN Master 02
- YARN RM Master 01
- YARN RM Master 02
- MapReduce Job History Server
- HBase Master 01
- HBase Master 02
- Spark History Server
- Spark3 History Server
- Ranger Admin
- JupyterHub
Note: All the WebUIs are Kerberized. You need a working Kerberos client on your host, the KDC configured in your /etc/krb5.conf file, and a valid ticket. You can also access the WebUIs through Knox.
Each of the sections below gives a high-level explanation of one possible step of a TDP deployment.
Execute the setup.sh script to create the project directories needed and clone stable or latest Ansible TDP collections. It also downloads the TDP binaries from their GitHub releases (e.g., Hadoop).
Note: The list of TDP binaries needed for deployment is maintained in the scripts/tdp-release-uris.txt file.
# Get stable tdp-collection
./scripts/setup.sh
# Get latest tdp-collection, tdp-collection-extras, tdp-observability, tdp-collection-prerequisites, and vagrant
./scripts/setup.sh -e extras -e observability -e prerequisites -e vagrant -r latest
To use tdp-vagrant, pass the -e vagrant option to setup.sh.
You can define a vagrant.yml file to adjust the machine resources according to your machine's RAM and core count (3 GB of RAM and 4 cores is ideal for the master nodes). The file tdp-vagrant/vagrant.yml contains the default values.
cp tdp-vagrant/vagrant.yml .
# Now you can edit ./vagrant.yml
Important: Do not modify tdp-vagrant/vagrant.yml; leaving it untouched makes it easier to update the git submodule. The Vagrantfile reads vagrant.yml in the current directory and falls back to tdp-vagrant/vagrant.yml.
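The lookup order described above can be illustrated with a short shell sketch. This is not the Vagrantfile's own code; `active_vagrant_yml` is a hypothetical helper that only reports which of the two files would be picked for a given directory.

```shell
# Hypothetical helper mirroring the documented lookup order:
# prefer vagrant.yml in the given directory, else the tdp-vagrant default.
active_vagrant_yml() {
  dir="${1:-.}"
  if [ -f "$dir/vagrant.yml" ]; then
    printf '%s\n' "$dir/vagrant.yml"
  else
    printf '%s\n' "$dir/tdp-vagrant/vagrant.yml"
  fi
}

# Show which file would be used from the current directory.
active_vagrant_yml .
```

Running it from the project root before `vagrant up` is a quick way to confirm that your customized copy, rather than the default, is the one in effect.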
Start the VMs with the vagrant command.
vagrant up
For TDP Vagrant usage see https://github.com/TOSIT-IO/tdp-vagrant.
Note: The helper.sh script can generate the list of hosts in the cluster. Add the generated lines to your /etc/hosts file to resolve the local nodes from your shell or browser.
./scripts/helper.sh -h
To use tdp-collection-prerequisites, pass the -e prerequisites option to setup.sh.
ansible-playbook ansible_collections/tosit/tdp_prerequisites/playbooks/all.yml
This playbook deploys the following services: Chrony, a CA, an LDAP directory, a KDC, and a PostgreSQL database.
For TDP prerequisites usage see https://github.com/TOSIT-IO/tdp-collection-prerequisites.
tdp deploy
This command deploys all core and extra (if enabled during setup) services.
For TDP Lib usage see https://github.com/TOSIT-IO/tdp-lib.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/all.yml
This playbook deploys the following services: Exporter, ZooKeeper, Hadoop core (HDFS, YARN, MapReduce), Ranger, Hive, Spark (2 and 3), HBase and Knox. It does not deploy extras services (see Extras Services Deployment to deploy them).
For TDP usage see https://github.com/TOSIT-IO/tdp-collection.
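The per-service meta playbooks used throughout this document share one path layout, so the individual commands can be generated rather than typed. The sketch below is illustrative only: `meta_playbook` and `print_meta_commands` are hypothetical helpers, they print commands without running them, and the service order is simply the order in which this document presents the services, which is assumed (not confirmed here) to be a workable deployment order.

```shell
# Hypothetical helpers: build the path of a meta playbook and print
# (not run) one ansible-playbook command per service.
meta_playbook() {
  printf 'ansible_collections/tosit/%s/playbooks/meta/%s.yml\n' "$1" "$2"
}

print_meta_commands() {
  collection="$1"
  shift
  for service in "$@"; do
    printf 'ansible-playbook %s\n' "$(meta_playbook "$collection" "$service")"
  done
}

# Core services, in the order this document presents them.
print_meta_commands tdp exporter zookeeper ranger hadoop hdfs yarn hive spark spark3 hbase knox
```

The same helper works for the other collections, for example `print_meta_commands tdp_extra livy kafka`.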
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/exporter.yml
Deploys Apache ZooKeeper to the [zk] Ansible group and starts a 3-node ZooKeeper quorum.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/zookeeper.yml
Run echo stat | nc localhost 2181 from any node in the [zk] group to see its ZooKeeper status.
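The stat output includes a "Mode:" line that tells whether the node is a leader or a follower. The following sketch extracts just that value; `zk_mode` is a hypothetical helper name, and it assumes the usual "Mode: <role>" line format of the stat command.

```shell
# Hypothetical helper: read the output of ZooKeeper's "stat" command
# on stdin and print the value of its "Mode:" line.
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2 }'
}

# Usage on a node of the [zk] group (requires nc, as in the text above):
#   echo stat | nc localhost 2181 | zk_mode
```

In a healthy 3-node quorum, one node should report leader and the other two follower.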
Deploys Ranger to the [ranger_admin] Ansible group.
Note that any changes to the [ranger_admin] hosts should also be reflected in the [hadoop client group].
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/ranger.yml
The Ranger UI can be accessed at https://<master-02.tdp ip>:6182/login.jsp with the user admin and the password RangerAdmin123 (assuming the default ranger_admin_password parameter). You may need to import the root.pem certificate authority into your browser or accept the SSL exception.
Launches HDFS, YARN, and deploys MapReduce clients.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hadoop.yml
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hdfs.yml
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/yarn.yml
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
tdp_user can access and write to its HDFS user directory:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
echo "This is the first line." | hdfs dfs -put - /user/tdp_user/test-file.txt
echo "This is the second (appended) line." | hdfs dfs -appendToFile - /user/tdp_user/test-file.txt
hdfs dfs -cat /user/tdp_user/test-file.txt
Deploys Hive to the [hive_s2] Ansible group. The HDFS filesystem is created and the service is launched.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hive.yml
To interact with Hive, use the beeline CLI:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
export hive_truststore_password='Truststore123!'
# Connect to a random HiveServer2 using ZooKeeper
beeline -u "jdbc:hive2://master-01.tdp:2181,master-02.tdp:2181,master-03.tdp:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/etc/ssl/certs/truststore.jks;trustStorePassword=${hive_truststore_password}"
# Or directly to a HiveServer2
beeline -u "jdbc:hive2://master-03.tdp:10001/;principal=hive/[email protected];transportMode=http;httpPath=cliservice;ssl=true;sslTrustStore=/etc/ssl/certs/truststore.jks;trustStorePassword=${hive_truststore_password}"
# You can also use `beeline` alone which will default to the ZooKeeper mode
beeline
From the Beeline shell:
-- Create the database
CREATE DATABASE IF NOT EXISTS tdp_user LOCATION '/user/tdp_user/warehouse/tdp_user.db';
USE tdp_user;
-- Examine the database
SHOW DATABASES;
SHOW TABLES;
-- Modify the database
CREATE TABLE IF NOT EXISTS table1 (
  col1 INT COMMENT 'Integer Column',
  col2 STRING COMMENT 'String Column'
);
-- Examine the database
SHOW TABLES;
-- Modify the database table
INSERT INTO TABLE table1 VALUES (1, 'one'), (2, 'two');
-- Examine the database table
SELECT * FROM table1;
Deploys Spark installations to the [spark_hs] and [spark_client] Ansible groups.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/spark.yml
To submit a Spark application:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
# Run a spark application locally
spark-submit --class org.apache.spark.examples.SparkPi --master 'local[4]' /opt/tdp/spark/examples/jars/spark-examples_2.11-2.3.5-TDP-0.1.0-SNAPSHOT.jar 100
# Or spark-submit a spark application to yarn
spark-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/tdp/spark/examples/jars/spark-examples_2.11-2.3.5-TDP-0.1.0-SNAPSHOT.jar 100
Note: Other Spark CLIs are available: pyspark, spark-shell, spark-sql.
Deploys Spark 3 installations to the [spark3_hs] and [spark3_client] Ansible groups.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/spark3.yml
Spark 3 is installed alongside Spark 2 and can be used exactly the same way. The Spark 3 CLIs are: spark3-submit, spark3-shell, spark3-sql, pyspark3.
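Because the Spark 3 names do not all follow one pattern (pyspark3 versus spark3-*), a small lookup can help when adapting Spark 2 scripts. The sketch below is illustrative; `spark3_cli` is a hypothetical helper that only covers the CLI names listed in this document.

```shell
# Hypothetical helper: map a Spark 2 CLI name to its Spark 3 counterpart,
# using only the names listed in this document.
spark3_cli() {
  case "$1" in
    pyspark)
      printf '%s\n' "pyspark3" ;;
    spark-submit|spark-shell|spark-sql)
      printf 'spark3-%s\n' "${1#spark-}" ;;
    *)
      return 1 ;;
  esac
}

spark3_cli spark-submit
# prints: spark3-submit
```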
Deploys HBase masters, regionservers, rest and clients to the [hbase_master], [hbase_rs], [hbase_rest] and [hbase_client] Ansible groups respectively.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hbase.yml
As tdp_user on edge-01, obtain a Kerberos TGT with the command kinit -ki and access the HBase shell with the command hbase shell.
Commands such as the following can be used to test your HBase deployment:
list
list_namespace
create 'tdp_user_table', 'cf'
put 'tdp_user_table', 'row1', 'cf:testColumn', 'testValue'
scan 'tdp_user_table'
disable 'tdp_user_table'
drop 'tdp_user_table'
Deploys Knox Gateway on the [knox] Ansible group:
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/knox.yml
You can then access the WebUIs of the TDP services through Knox:
- HDFS NN
- YARN RM
- MapReduce Job History Server
- HBase Master
- Spark History Server
- Spark3 History Server
- Ranger Admin
Note: You can log in to Knox using the tdp_user that is created in the next step.
Deploys Livy Server on the [livy_server] group hosts:
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy.yml
The Livy Server can be accessed at https://edge-01.tdp:8998. After deployment, you can create a Spark session and interact with it through cURL:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
# Create a session
curl -k -u : --negotiate -X POST https://edge-01.tdp:8998/sessions \
-d '{"kind": "pyspark"}' -H 'Content-Type: application/json'
# Get the session status (wait until it is "idle")
curl -k -u : --negotiate -X GET https://edge-01.tdp:8998/sessions
# Submit a snippet of code to the session
curl -k -u : --negotiate -X POST https://edge-01.tdp:8998/sessions/0/statements \
-d '{"code": "1 + 1"}' -H 'Content-Type: application/json'
# Get the statement result
curl -k -u : --negotiate -X GET https://edge-01.tdp:8998/sessions/0/statements/0
Another Livy server is deployed for Spark 3 on the [livy-spark3_server] group hosts:
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy-spark3.yml
The default port is different from that of the regular Livy server: 8999 instead of 8998.
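On either Livy server, a session has to reach the idle state before it accepts statements, so scripted use needs a polling step. The sketch below shows one way to do that with a timeout. `wait_until` and `livy_session_idle` are hypothetical helpers, `LIVY_URL` is a hypothetical variable, and the grep pattern assumes the session JSON returned by Livy contains a "state" field with the value "idle".

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Hypothetical check: succeeds once session 0 reports the "idle" state.
# Set LIVY_URL to https://<host>:8999 for the Spark 3 Livy server.
livy_session_idle() {
  curl -k -s -u : --negotiate "${LIVY_URL:-https://edge-01.tdp:8998}/sessions/0" |
    grep -Eq '"state" *: *"idle"'
}

# Usage, from edge-01.tdp with a valid Kerberos ticket:
#   wait_until 300 10 livy_session_idle && echo "session 0 is idle"
```

Keeping the retry loop separate from the check means the same `wait_until` can wrap any other command that signals readiness through its exit status.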
Deploys Apache ZooKeeper to the [zk_kafka] Ansible group and starts a 3-node ZooKeeper quorum dedicated to Kafka.
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/zookeeper-kafka.yml
Deploys a Kafka cluster on the [kafka_broker] group hosts:
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/kafka.yml
The Kafka CLIs are available on the edge node for all users, and client properties files are in /etc/kafka/conf/*.properties. After deployment, you can interact with Kafka from edge-01.tdp:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
# Create a topic
kafka-topics.sh --create --topic test-topic \
--command-config /etc/kafka/conf/client.properties
# Write messages to the topic
kafka-console-producer.sh --topic test-topic \
--producer.config /etc/kafka/conf/producer.properties
>Hello there
>I am writing messages to a Kafka topic
>How cool is that?
>^C # CTRL+C to leave the console producer
# Read all messages from the topic
kafka-console-consumer.sh --topic test-topic --from-beginning \
  --consumer.config /etc/kafka/conf/consumer.properties
Creates, updates, or removes HDFS user home directories.
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
Additional users can be added to the Ansible variable hdfs_user_homes if required.
When adding users following the Ranger Usersync deployment, you will need to add or update Ranger policies including these new users. You must wait for Ranger Usersync to poll users from LDAP or you can restart the Ranger Usersync using the following playbook:
ansible-playbook ansible_collections/tosit/tdp/playbooks/ranger_usersync_restart.yml
Creates, updates, or removes Ranger policies.
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
Additional policies can be added to the Ansible variable ranger_policies if required.