HADOOP AutoInstall & Word Count Example

Requires Python 3.5 or newer.
For reference, PROJECT_DIR is the directory that contains the main.py file.
To stop the script from downloading Hadoop, download hadoop-3.2.1.tar.gz yourself and place it in PROJECT_DIR.
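
As an illustration of the tarball rule above, here is a minimal sketch of how a script could decide whether a download is needed. The function name `needs_download` is hypothetical and is not taken from hadoopInstall.py; only the file name comes from this README.

```python
from pathlib import Path

# File name as given in this README.
HADOOP_TARBALL = "hadoop-3.2.1.tar.gz"


def needs_download(project_dir):
    """Return True when the Hadoop tarball is missing from project_dir.

    Hypothetical helper: the real script may use a different check.
    """
    return not (Path(project_dir) / HADOOP_TARBALL).is_file()


if __name__ == "__main__":
    # "." stands in for PROJECT_DIR here.
    print("download needed:", needs_download("."))
```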

Before Installation

Please make sure of the following:

The script is for Linux only (tested on Ubuntu 18.04.4 LTS).
Java (JDK) version 8 or newer is installed; the script throws an error if it is not found. Tested on OpenJDK 9, OpenJDK 10, and Oracle JDK 10. To install a JDK from a jdk_x.x.x.tar.gz file, follow the instructions in Java_Install_Instruction.txt.
JAVA_HOME is set in the environment variables and points to the JDK; the script throws an error if it is not found.
ssh and pdsh are installed and working correctly; the script throws an error if they are not found.
- If the current installation of ssh is not working, install it with the following command:
```
$ sudo apt-get install openssh-client openssh-server
```
- Make sure the following lines are present in your .bashrc file at /home/user/.bashrc (replace user with your username):
```
# Set pdsh rcmd type to ssh
export PDSH_RCMD_TYPE=ssh
```
  And source your .bashrc file.
```
$ source /home/user/.bashrc
```
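
The Java, JAVA_HOME, ssh, and pdsh requirements above can also be checked by hand before running the script. The following is only a sketch of such a check, not the code in reqCheck.py or javaCheck.py; the function name `missing_requirements` and its messages are assumptions made for illustration, and it does not verify the Java version.

```python
import os
import shutil


def missing_requirements(commands=("java", "ssh", "pdsh"), env=None):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: checks that each command is on PATH, that JAVA_HOME
    points to a directory, and that PDSH_RCMD_TYPE is set to ssh.
    """
    env = os.environ if env is None else env
    problems = []
    for cmd in commands:
        if shutil.which(cmd) is None:
            problems.append("{} not found on PATH".format(cmd))
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append("JAVA_HOME is not a directory: {}".format(java_home))
    if env.get("PDSH_RCMD_TYPE") != "ssh":
        problems.append("PDSH_RCMD_TYPE is not set to ssh")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```
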
Make sure no process is using port 9870.
- To list the processes using port 9870:
```
$ sudo lsof -i :9870
```
- If any processes are listed, kill them using the following command:
```
$ kill -9 <pid>
```
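
If lsof is not available, the same port check can be approximated from Python. This is a rough sketch under the assumption that the process of interest listens on the loopback address; the function name `port_in_use` is hypothetical, and a listener bound only to another interface would not be detected.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 9870 in use:", port_in_use(9870))
```
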
If you have run a Hadoop executable recently, or if the program fails, clear the /tmp directory with the following command and try running the script again.
```
    $ sudo rm -R /tmp/*
```
Make sure the packages listed in requirements.txt are installed in your Python environment, or install them using the following command.
```
    $ pip install -r PROJECT_DIR/requirements.txt
```

Installation

Change your directory to PROJECT_DIR
```
    $ cd PROJECT_DIR
```
Run the main.py file
```
    $ python main.py
```

Results

The output of the word count is saved in the file output.txt in PROJECT_DIR.
pyLog.log contains all the logs.
The output of the MapReduce command (run with Hadoop's mapred streaming) is stored in mapredOutput.txt.
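
For readers new to streaming word count, the sketch below shows the general mapper and reducer logic in one file. It is not the code in word_count/mapper.py or word_count/reducer.py; the function names are hypothetical, and `sorted()` stands in for the shuffle-and-sort phase that Hadoop performs between the two steps. In a real streaming job each step reads records from standard input and writes them to standard output.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "{}\t1".format(word)


def reduce_counts(sorted_records):
    """Reducer step: sum the counts of consecutive records that share a key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "{}\t{}".format(word, sum(int(count) for _, count in group))


if __name__ == "__main__":
    text = ["the quick fox", "the lazy dog"]
    # sorted() plays the role of Hadoop's sort between map and reduce.
    for record in reduce_counts(sorted(map_words(text))):
        print(record)
```

The reducer relies on its input being sorted by key, which is why it only needs to compare neighbouring records.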

File Structure

| File | Objective |
| --- | --- |
| requirements.txt | Lists the Python packages required. |
| logger.py | Initializes the logger. |
| systemData.py | Stores system variables such as the HOME directory. |
| fileOperations.py | Functions for easy access to file I/O operations. |
| javaCheck.py | Checks whether Java is installed correctly and whether JAVA_HOME is configured in the environment. |
| reqCheck.py | Checks all requirements (calls functions from javaCheck) and also checks whether ssh and pdsh are installed. |
| passphraselessSSH.py | Sets up passphraseless ssh. |
| hadoopInstall.py | Downloads and extracts Hadoop 3.2.1. Provides the variable HADOOP_HOME. |
| configureHadoop.py | Adds the lines `export JAVA_HOME={JAVA_HOME}` and `export HADOOP_OPTS="--add-modules java.activation"` to /etc/hadoop/hadoop-env.sh. |
| standalone.py | Runs the standalone operation. |
| backupFile.py | Provides functions to back up and restore the config file in case of a rerun. |
| pyLog.log | Debug log file. |
| pseudoDistributedConfig.py | Configures Hadoop for pseudo-distributed operation. |
| pseudoDistributedExecution.py | Contains the code for the word count MapReduce function. |
| word_count/mapper.py | Python script, the mapper for the MapReduce word count program. |
| word_count/reducer.py | Python script, the reducer for the MapReduce word count program. |
| texts | Directory containing the text files for the word count. |
| main.py | Imports from all the above files and runs them in sequence. |
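
The backup-and-restore idea behind backupFile.py matters because configureHadoop.py appends lines to a config file, so a rerun would otherwise append them twice. The sketch below shows one way such helpers could look; it is not the repository's implementation, and the names `backup`, `restore`, and the `.bak` suffix are assumptions.

```python
import os
import shutil


def backup(path, suffix=".bak"):
    """Copy path to path+suffix once; later calls keep the first, pristine copy."""
    backup_path = path + suffix
    if not os.path.exists(backup_path):
        shutil.copy2(path, backup_path)
    return backup_path


def restore(path, suffix=".bak"):
    """Overwrite path with its backup if one exists. Return True on restore."""
    backup_path = path + suffix
    if os.path.exists(backup_path):
        shutil.copy2(backup_path, path)
        return True
    return False
```

A rerun would call `restore` on the config file first and then apply its edits again, so the file always starts from the same contents.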

ayush1120/Hadoop_AutoInstall
