Hadoop Installation Pre-Requisites

Below are the pre-requisites for installing Hadoop:

  • Linux
  • JDK 1.8 installed
  • IPv6 disabled
  • SELinux disabled

Linux

Minimum machine configuration:

  • vCPU: 8 vcores
  • RAM: 32 GB
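
The minimums above can be checked with a short script before installation begins. This is a minimal sketch, not part of the official procedure: the function name meets_minimum is hypothetical, and the commented host check assumes a Linux machine where nproc and /proc/meminfo are available.

```shell
# Minimums taken from the list above.
MIN_VCORES=8
MIN_RAM_GB=32

# meets_minimum VCORES RAM_GB: succeeds when both values reach the minimums.
meets_minimum() {
    [ "$1" -ge "$MIN_VCORES" ] && [ "$2" -ge "$MIN_RAM_GB" ]
}

# Example on a live host. MemTotal is reported in kB and is slightly below
# the nominal size because the kernel reserves some memory, so round up:
#   vcores=$(nproc)
#   ram_gb=$(awk '/^MemTotal:/ {print int(($2 + 1048575) / 1048576)}' /proc/meminfo)
#   meets_minimum "$vcores" "$ram_gb" && echo "OK" || echo "Below minimum"
```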

JDK

JDK 8 is required on the Linux machine. After installing Oracle Java, verify that Java 8 is installed correctly:

  • java -version

Set the following in .bash_profile:

  • export JAVA_HOME=/usr/java/jdk1.8.0_201-amd64/
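
The Java check can also be scripted. The sketch below assumes the usual version banner format (for example java version "1.8.0_201"); the function name is_java8 is hypothetical and introduced only for illustration.

```shell
# is_java8 VERSION_OUTPUT: succeeds when the text reports a 1.8.x runtime.
is_java8() {
    printf '%s\n' "$1" | grep -q 'version "1\.8\.'
}

# On a live host, java -version writes to stderr, so redirect it:
#   is_java8 "$(java -version 2>&1)" && echo "Java 8 found"
# After sourcing .bash_profile, JAVA_HOME should point at a real JDK:
#   [ -x "$JAVA_HOME/bin/java" ] && echo "JAVA_HOME looks valid"
```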

Disable IPv6

  • Edit the file /etc/sysctl.conf: vi /etc/sysctl.conf

Add the following lines:

  • net.ipv6.conf.all.disable_ipv6 = 1
  • net.ipv6.conf.default.disable_ipv6 = 1

Run the following command to apply the changes:

  • sysctl -p
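
If the edit has to be repeated across several machines, it may help to script it so the lines are added only once. The following is a sketch under that assumption; ensure_line and disable_ipv6_in are hypothetical helper names, and the target file is passed as an argument so the function can be tried on a copy first.

```shell
# ensure_line FILE LINE: append LINE to FILE unless an identical line exists.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# disable_ipv6_in FILE: add both settings listed above to FILE.
disable_ipv6_in() {
    ensure_line "$1" 'net.ipv6.conf.all.disable_ipv6 = 1'
    ensure_line "$1" 'net.ipv6.conf.default.disable_ipv6 = 1'
}

# On a live host, as root:
#   disable_ipv6_in /etc/sysctl.conf && sysctl -p
```

Running the function a second time leaves the file unchanged, which keeps /etc/sysctl.conf free of duplicate entries.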

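SELinux

Ensure that SELinux is disabled so that it cannot impact Hadoop performance. The following command stops enforcement immediately by switching SELinux to permissive mode until the next reboot: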

  • sudo setenforce 0

To disable it permanently:

  • Edit /etc/selinux/config and set:

SELINUX=disabled

  • Reboot
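
The permanent change can be scripted as well. This is a minimal sketch, assuming the config file already contains a SELINUX= line; set_selinux_disabled is a hypothetical helper name, and it keeps a backup copy next to the file.

```shell
# set_selinux_disabled FILE: rewrite the SELINUX= line in FILE to "disabled".
# The original content is kept as FILE.bak. The pattern does not touch the
# SELINUXTYPE= line because it requires "=" directly after "SELINUX".
set_selinux_disabled() {
    cp "$1" "$1.bak" &&
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$1.bak" > "$1"
}

# On a live host, as root:
#   set_selinux_disabled /etc/selinux/config
#   reboot
# After the reboot, getenforce should print "Disabled".
```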

Steps Involved in Installing Hadoop

After Installing Cloudera Manager

  • Go to http://host-ip:7180/

    • Log in with admin/admin
    • Select Cloudera Express Installation
    • For the host, give the machine's private IP address
    • Install using Parcels
    • Include the Kafka parcels
    • User: sparkflows (the user set up when the Linux machine was created)
    • Supply the private key
    • Install Core with Spark
    • Update the default configurations as needed.

Add proxy user in HDFS

Create HDFS directory

Create an HDFS home directory for the sparkflows user (adjust the path as required):

  • sudo su
  • su hdfs
  • hadoop fs -mkdir /user/sparkflows
  • hadoop fs -chown sparkflows:sparkflows /user/sparkflows
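
The same steps can be generated for any user name. The sketch below only prints the commands; hdfs_home_commands is a hypothetical helper, and the -p flag on mkdir is an addition not present in the steps above (it makes the command succeed when the directory already exists).

```shell
# User that will own the HDFS home directory; "sparkflows" follows the text above.
HDFS_USER=sparkflows

# hdfs_home_commands USER: print the commands that create /user/USER and
# hand ownership to USER. Nothing is executed here.
hdfs_home_commands() {
    printf 'hadoop fs -mkdir -p /user/%s\n' "$1"
    printf 'hadoop fs -chown %s:%s /user/%s\n' "$1" "$1" "$1"
}

# To run them as the hdfs superuser on a cluster node:
#   hdfs_home_commands "$HDFS_USER" | sudo -u hdfs sh
```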

Install Spark2

Spark2 is installed using a CSD or Parcels.

Log in to Cloudera Manager again

  • In Cloudera Manager:
    • Go to Hosts/Parcels
    • Download Spark2
    • Distribute Spark2
    • Activate Spark2
  • Add Spark2 service in Cloudera Manager
    • Go to Cluster/Add Service
    • Add Spark2 Service
    • For dependencies, select the option that includes Hive, etc.
    • Select the host

In YARN, increase the container memory to 8 GB by setting the following properties:

  • yarn.scheduler.maximum-allocation-mb
  • yarn.nodemanager.resource.memory-mb
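
Both properties take a value in megabytes, so 8 GB has to be converted before it is entered. A small sketch of the arithmetic, with gb_to_mb as a hypothetical helper name:

```shell
# gb_to_mb GB: convert gigabytes to the megabyte value the YARN properties expect.
gb_to_mb() {
    echo $(( $1 * 1024 ))
}

CONTAINER_MB=$(gb_to_mb 8)   # 8 * 1024 = 8192

# Print the two settings in key=value form for reference.
echo "yarn.scheduler.maximum-allocation-mb=$CONTAINER_MB"
echo "yarn.nodemanager.resource.memory-mb=$CONTAINER_MB"
```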

After Installation, Get CDH to Use Java 8

  • In the Spark configuration in Cloudera Manager, set the following for spark-defaults.conf:
    • spark.executorEnv.JAVA_HOME=/usr/java/jdk1.8.0_201-amd64/
    • Redeploy the client configurations
    • Restart the cluster service

Install Sparkflows

Upload the Fire Insights example data directory onto HDFS

  • As the sparkflows user:
  • cd fire-x.y.z
  • hadoop fs -put data

Log in to Fire Insights

  • http://host-ip:8080/#/dashboard
    • Log in with admin/admin
    • Create the user sparkflows in Sparkflows, give it admin rights, add it to the group default, and save.
    • Log in again as the sparkflows user.
    • Go to Configurations under Administration, click Infer Hadoop Cluster Config, and save.
    • Open the Spark section, set “spark.spark-submit” to spark2-submit, and save.
    • Create a workflow and execute it.