Apache Oozie Tutorial: Introduction, Workflow & Easy Examples

In this article, we look at what a scheduler system is and why the Hadoop ecosystem needs one. We then cover Apache Oozie and its main concepts: workflows, coordinators, and bundles, followed by a word count workflow job and a time-based coordinator job.

Scheduler System

Many jobs depend on one another: one job can start only after another has completed. For example, in the Hadoop ecosystem, a Hive job may take its input from the output of a MapReduce job. Many other processes consume the output of earlier processes in the same way. A scheduler system organizes these interdependent jobs and runs them in the right order.


Apache Oozie handles these scheduling scenarios and is an essential part of the Hadoop ecosystem.

Apache Oozie: Introduction

Apache Oozie is a scheduler system that executes and manages jobs in the distributed Hadoop environment. It can combine many kinds of jobs into a single pipeline, and it can schedule MapReduce, Sqoop, Pig, and Hive tasks.

While defining a sequence of tasks, one can also have two or more jobs run in parallel. Apache Oozie is extensible, reliable, and scalable.

Oozie is an open-source Java web application that triggers the actions in a workflow. It is responsible for starting the jobs in the workflow, and it uses the Hadoop execution engine to run the tasks.

Oozie detects task completion through callback and polling. When Oozie starts a task, it gives the task a unique callback HTTP URL, and the task notifies that URL when it finishes. If the task fails to invoke the callback URL, Oozie polls the task for completion.

Apache Oozie has three kinds of jobs:

Oozie Bundles – Packages of multiple coordinator and workflow jobs.

Oozie Coordinator Jobs – Workflow jobs that are triggered by time and by the availability of data.

Oozie Workflow Jobs – Directed Acyclic Graphs (DAGs) that specify the sequence of actions to execute.

Oozie Workflow

A workflow arranges its actions in a Directed Acyclic Graph (DAG). Each action depends on the previous one: the next action can start only after the previous action has finished, because it takes the output of that action as its input.


Java, shell, MapReduce, Hive, and Pig actions are some of the workflow actions that Oozie can schedule and execute. One can also specify a condition for an action, so that the scheduler runs it only if an earlier output meets a given requirement.

Different kinds of actions can be created depending on the job, and each type of action has its own kind of tags. Before the workflow is executed, the workflow file and its scripts or jars must be placed in the HDFS path.

Command: oozie job -oozie http://localhost:11000/oozie -config job.properties -run

The Oozie web console at http://host_name:11000 shows the job status. Open the console and click on a job to check its status.

A fork is used to run several jobs at the same time. Every fork must be closed by a join, which acts as the end node of the fork, so a join is needed for each fork. The join assumes that all the nodes executing in parallel are children of a single fork. For example, two tables can be created in parallel.
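The fork and join pairing described above can be sketched as follows. This is a minimal illustration, not taken from the original tutorial: the node names, the two .hql script names, and the ${jobTracker} and ${nameNode} parameters are assumptions.

```xml
<!-- Sketch only: node names and script names are hypothetical. -->
<workflow-app xmlns="uri:oozie:workflow:0.1" name="ForkJoinSketch">
    <start to="fork-tables"/>

    <!-- The fork starts both table-creation actions at the same time. -->
    <fork name="fork-tables">
        <path start="create-table-a"/>
        <path start="create-table-b"/>
    </fork>

    <action name="create-table-a">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>create_table_a.hql</script>
        </hive>
        <ok to="join-tables"/>
        <error to="fail"/>
    </action>

    <action name="create-table-b">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>create_table_b.hql</script>
        </hive>
        <ok to="join-tables"/>
        <error to="fail"/>
    </action>

    <!-- The join waits for every path of the fork before moving on. -->
    <join name="join-tables" to="end"/>

    <kill name="fail">
        <message>Table creation failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Both parallel actions point their ok transition at the same join, which is what makes them children of a single fork.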

Decision tags are useful when an action should run only depending on an earlier output. For instance, there is no need to create a Hive table that already exists, so a decision tag can skip the table creation steps when the table is present. A decision node contains switch cases, similar to a switch-case statement.

The property file is also known as the config file. It comes in handy when values such as parameters, script names, the name-node, and the job-tracker become difficult to manage in the workflow itself.
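A decision node for the table example might look like the fragment below. It is a sketch under assumptions: tablePath is a hypothetical property holding the HDFS location of the table, and create-table is a hypothetical action defined elsewhere in the same workflow. The fs:exists function is one of Oozie's built-in EL functions.

```xml
<!-- Sketch only: tablePath and create-table are hypothetical names. -->
<decision name="check-table">
    <switch>
        <!-- If the table directory already exists, skip creation and finish. -->
        <case to="end">${fs:exists(tablePath)}</case>
        <!-- Otherwise fall through to the action that creates the table. -->
        <default to="create-table"/>
    </switch>
</decision>
```

The cases are evaluated in order and the first one that evaluates to true is taken, with the default transition used when none match.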
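As an illustration of how the property file keeps such values out of the workflow, the sketch below refers to them through parameters instead of hard-coded values. The property names nameNode, jobTracker, and queueName are assumed to be defined in job.properties; they are a common convention, not a requirement.

```xml
<!-- Sketch only: ${...} values are resolved from job.properties when the job is submitted. -->
<workflow-app xmlns="uri:oozie:workflow:0.1" name="ParameterSketch">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

With this layout, moving the job to another cluster only requires editing the property file.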

Apache Oozie Coordinator

Some workflows need to be scheduled regularly, and some are complex to schedule. Both kinds can be scheduled with the Oozie Coordinator, which triggers workflows based on time, data, and event predicates. The coordinator starts a workflow once its condition is satisfied.

Here are some definitions one needs to understand for coordinator jobs:

  1. frequency – How often the job is executed, in minutes.
  2. timezone – The timezone of the coordinator application.
  3. end – The end datetime of the job.
  4. start – The start datetime of the job.
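To make these attributes concrete, here is a sketch of a coordinator that is triggered by both time and data availability. The dataset name, the directory layout under /data, the dates, and the ${workflowPath} parameter are all assumptions chosen for illustration.

```xml
<!-- Sketch only: dataset name, paths, and dates are hypothetical. -->
<coordinator-app name="data-triggered-coord" frequency="${coord:days(1)}"
                 start="2017-12-19T00:00Z" end="2017-12-26T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <!-- One dataset instance is expected per day under a dated directory. -->
        <dataset name="dailyInput" frequency="${coord:days(1)}"
                 initial-instance="2017-12-19T00:00Z" timezone="UTC">
            <uri-template>hdfs://localhost:8020/data/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- The workflow starts only once the current day's instance is available. -->
        <data-in name="input" dataset="dailyInput">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

A purely time-based coordinator simply omits the datasets and input-events elements.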

Let us now look at the properties of the control information:

  • execution – The order in which jobs are executed when multiple instances of the coordinator job have met their execution criteria. It can be FIFO (the default), LIFO, or LAST_ONLY.
  • concurrency – The maximum number of actions of a job that can run in parallel. The default value is 1, which means only one action runs at a time.
  • timeout – How long an action waits before it is discarded. With a value of 0, the action times out immediately if its input conditions are not all satisfied when it is materialized. With a value of -1, the action waits forever and is never discarded. The default value is -1.
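These control properties are set in a controls element of the coordinator application. The sketch below uses illustrative values; the coordinator name and the ${...} parameters are assumptions expected to come from the properties file.

```xml
<!-- Sketch only: values are illustrative. -->
<coordinator-app name="controls-sketch" frequency="${frequency}"
                 start="${startTime}" end="${endTime}" timezone="${timezone}"
                 xmlns="uri:oozie:coordinator:0.1">
    <controls>
        <!-- Discard an action that has waited this long for its inputs. -->
        <timeout>10</timeout>
        <!-- Run at most one action of this job at a time. -->
        <concurrency>1</concurrency>
        <!-- Run waiting actions oldest first. -->
        <execution>FIFO</execution>
    </controls>
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```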

Command: oozie job -oozie http://localhost:11000/oozie -config <path to coordinator.properties file> -run

Apache Oozie Bundle

A set of coordinator applications is often called a data pipeline. The Oozie Bundle system allows one to define and execute such a pipeline. There is no explicit dependency among the coordinator applications in a bundle, but one can use the data dependencies of the coordinator applications to create an implicit data application pipeline. A bundle can be started, stopped, suspended, resumed, and rerun, which gives better and easier operational control.

Kick-off-time – The time at which a bundle should start and submit its coordinator applications.

Next, let us see how a workflow job is created:
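A bundle definition with a kick-off time could be sketched as follows. The bundle and coordinator names, the kick-off time, and the application path are assumptions for illustration; a real bundle would usually list several coordinators.

```xml
<!-- Sketch only: names, time, and paths are hypothetical. -->
<bundle-app name="wordcount-bundle" xmlns="uri:oozie:bundle:0.1">
    <controls>
        <!-- The bundle submits and starts its coordinators at this time. -->
        <kick-off-time>2017-12-19T13:29Z</kick-off-time>
    </controls>
    <coordinator name="wordcount-coord">
        <app-path>${nameNode}/WordCountTest_TimeBased</app-path>
        <configuration>
            <property>
                <name>workflowPath</name>
                <value>${nameNode}/WordCountTest_TimeBased</value>
            </property>
        </configuration>
    </coordinator>
</bundle-app>
```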

Apache Oozie Word Count Workflow Job

Apache Oozie can run a word count job. First, create a directory named WordCountTest to hold all the files. Inside it, create a lib directory and place the word count jar there.

Then create the job.properties and workflow.xml files, which specify the job and its associated parameters.

1. job.properties

The job.properties file defines the NameNode and ResourceManager. The NameNode path is used to resolve the workflow directory path, and the JobTracker path is used to submit jobs to YARN. The user also provides the HDFS path where the workflow.xml file is stored.

2. workflow.xml

The workflow.xml file defines all the actions and their execution. The user first specifies the name of the workflow-app, WorkflowRunnerTest, and then the start node.

The start node is the entry point of a workflow job: it points to the first node from which the job should begin. Here, the job starts from the next node, intersection0.

Next, the action node specifies the task to perform. Here it runs a MapReduce WordCount job, so the user specifies the configuration that the job requires, starting with the addresses of the JobTracker and the NameNode.

Next comes the prepare element, which is used for directory cleanup before the action executes. Prepare tags can delete or create a folder before the job runs. Here, a delete operation in HDFS removes the out1 folder if it already exists. The user then specifies the MapReduce properties: job queue name, mapper class, reducer class, output key class, and output value class.

Here is the workflow.xml file:

<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="intersection0"/>
    <action name="intersection0">
        <map-reduce>
            <job-tracker>localhost:8032</job-tracker>
            <name-node>hdfs://localhost:8020</name-node>
            <prepare>
                <delete path="hdfs://localhost:8020/oozieout/out1"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>MapperClass</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>ReducerClass</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/oozieout/out1</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message</message>
    </kill>
    <end name="end"/>
</workflow-app>

The last MapReduce configuration properties are the input and output directories in HDFS. The input directory is the data directory, which is stored in the root path of the NameNode. Finally, the kill element specifies that the job is killed if it fails.


The data directory holds the input file, whose contents begin as follows:

Below are important points about this VM, please go through it without fail.

1) Hadoop and all other components are present in /usr/lib/
JDK : /usr/lib/jvm/jdk1.7.0_67
Eclipse : /home/upgrad/Desktop/eclipse
Hadoop : /usr/lib/hadoop-2.2.0
Pig : /usr/lib/pig-0.12.0
Hive : /usr/lib/hive-0.13.1-bin
Hbase : /usr/lib/hbase-0.96.2-hadoop2
Oozie : /usr/lib/oozie-4.0.0
Sqoop : /usr/lib/sqoop-1.4.4
Flume-ng : /usr/lib/flume-ng

2) The paths of all the components are set.
JDK : .bashrc
Hadoop : .bashrc
Pig : /etc/profile.d/pig.sh
Hive : /etc/profile.d/hive.sh

The WordCountTest folder needs to be moved to HDFS, at the location specified in the oozie.wf.application.path property of the job.properties file. So we copy the WordCountTest folder into the Hadoop root directory.

Command: hadoop fs -put WordCountTest /

[upgrad@localhost oozie-4.0.0]$ hadoop fs -put WordCountTest /
17/12/19 18:11:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

To verify that the folder has been uploaded to the HDFS root directory, check it in the NameNode Web UI.

Now run the following command to execute the workflow job:

Command: oozie job -oozie http://localhost:11000/oozie -config job.properties -run

[upgrad@localhost oozie-4.0.0]$ cd WordCountTest
[upgrad@localhost WordCountTest]$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
job: 0000009-171219160449620-oozie-edur-W
[upgrad@localhost WordCountTest]$

Time-Based Word Count Coordinator Job

We are going to create a coordinator that runs a word count job at a specified time interval. With Apache Oozie, one can create a scheduled job that runs periodically.

Let us now create an Oozie coordinator job. It needs three files: coordinator.properties, coordinator.xml, and workflow.xml. As before, the word count jar file is placed inside the lib directory.

Let us look at the coordinator.properties file:

frequency=60
startTime=2017-12-19T13\:29Z
endTime=2017-12-19T13\:34Z
timezone=UTC
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
workflowPath=${nameNode}/WordCountTest_TimeBased
oozie.coord.application.path=${nameNode}/WordCountTest_TimeBased

First, we specify the frequency at which the workflow should be executed. The unit of frequency is minutes, so in our case the coordinator job runs every 60 minutes. The frequency captures the periodic intervals at which data sets are produced and coordinator applications are scheduled to run.

Use the following EL functions to specify the frequency:

EL function               Value (minutes)   Example
${coord:minutes(int n)}   n                 ${coord:minutes(45)} -> 45
${coord:hours(int n)}     n * 60            ${coord:hours(3)} -> 180
${coord:days(int n)}      variable          ${coord:days(2)} -> minutes in 2 full days from the current date
${coord:months(int n)}    variable          ${coord:months(1)} -> minutes in 1 full month from the current date

Next, we specify the startTime and endTime of the job: startTime is the start datetime and endTime is the end datetime of the coordinator job.

Finally, we specify the application path where all the files are stored.

All these properties are defined in the coordinator.properties file. The coordinator.xml file then refers to them when it specifies the name, frequency, and timezone of the coordinator application:

<coordinator-app name="coordinator1" frequency="${frequency}" start="${startTime}"
                 end="${endTime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>

 Now let us create workflow.xml for our job.

<workflow-app xmlns="uri:oozie:workflow:0.1" name="WorkflowRunnerTest">
    <start to="intersection0"/>
    <action name="intersection0">
        <map-reduce>
            <job-tracker>localhost:8032</job-tracker>
            <name-node>hdfs://localhost:8020</name-node>
            <prepare>
                <delete path="hdfs://localhost:8020/oozieTimeBasedout/out1"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>MapperClass</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>ReducerClass</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/data</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/oozieTimeBasedout/out1</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message</message>
    </kill>
    <end name="end"/>
</workflow-app>

Now let us move these files to the HDFS directory.

Finally, we have to configure the coordinator job:

<configuration>
    <property>
        <name>startTime</name>
        <value>2017-12-19T13:29Z</value>
    </property>
    <property>
        <name>workflowPath</name>
        <value>hdfs://localhost:8020/WordCountTest_TimeBased</value>
    </property>
    <property>
        <name>oozie.coord.application.path</name>
        <value>hdfs://localhost:8020/WordCountTest_TimeBased</value>
    </property>
    <property>
        <name>timezone</name>
        <value>UTC</value>
    </property>
    <property>
        <name>user.name</name>
        <value>upgrad</value>
    </property>
    <property>
        <name>mapreduce.job.user.name</name>
        <value>upgrad</value>
    </property>
    <property>
        <name>queueName</name>
        <value>default</value>
    </property>
</configuration>

Now let's see the output created:

!! 1
“Save” 1
“reboot” 1
(Just 1
+ 1
-C 1
-f 1
-n 1
-r 2
safemode 1
. 1
./bin/oozie-start.sh 1
./bin/start-hbase.sh 2
./flume-ng 1
.bashrc 2
/etc/profile.d/hbase.sh 1
/etc/profile.d/hive.sh 1
/etc/profile.d/oozie.sh 1
/etc/profile.d/pig.sh 1
/etc/profile.d/sqoop.sh 1

 

 Conclusion

This brings us to the end of the tutorial. We hope you liked it and learned something about Apache Oozie. This article is a good starting point for any beginner interested in the basic concepts of Apache Oozie.

