Task Scheduler for Warehouse Reporting

A task scheduler in a Hadoop cluster schedules jobs, each consisting of tasks, and allocates resources to every job running in the cluster. By default, the task scheduler allocates an equal share of resources to all jobs. For example, if 10 jobs are running, they share the cluster's resources equally. However, you can configure the task scheduler to control job execution so that one job runs faster than others by allocating more resources (pools or queues) to it. This lets you prioritize some reports over others.

Features

NetWitness supports two task schedulers:

  • Fair Scheduler (org.apache.hadoop.mapred.FairScheduler)
  • Capacity Scheduler (org.apache.hadoop.mapred.CapacityTaskScheduler)
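
The active scheduler is selected on the JobTracker in mapred-site.xml. The snippet below is a minimal sketch assuming the classic MRv1 property names (the allocation file path is an example only); verify both against your NetWitness Warehouse deployment:

    <!-- mapred-site.xml: select the task scheduler. Fair Scheduler is shown;
         use org.apache.hadoop.mapred.CapacityTaskScheduler for the Capacity Scheduler. -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

    <!-- Fair Scheduler only: point to the allocation file that defines the pools.
         The path below is illustrative. -->
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>
    </property>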

Fair Scheduler

This scheduler divides the total capacity of the cluster into logical pools. You can submit a job to any of these pools. All jobs submitted to a pool share only the resources allocated to that pool. When a pool has free resources, they are given to other pools that have jobs running. For example, suppose a fair scheduler has two pools, Pool A and Pool B, which share the total resources at 40% and 60% respectively. If Pool A has four jobs running, each job gets 10% of the cluster's resources. When those four jobs complete, the freed resources are allocated to Pool B.

Note: You can configure a pool to run more than one job in parallel.
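
Pool shares and the number of jobs a pool may run in parallel are defined in the Fair Scheduler allocation file. The snippet below is a minimal sketch of the classic fair-scheduler.xml format applied to the Pool A/Pool B example above; the pool names and limits are illustrative:

    <?xml version="1.0"?>
    <!-- Illustrative allocation file (fair-scheduler.xml). -->
    <allocations>
      <!-- Pool A: a weight of 4.0 against Pool B's 6.0 yields a 40%/60% split. -->
      <pool name="poolA">
        <weight>4.0</weight>
        <!-- Allow up to four jobs to run in this pool at the same time. -->
        <maxRunningJobs>4</maxRunningJobs>
      </pool>
      <!-- Pool B: receives the remaining 60% share. -->
      <pool name="poolB">
        <weight>6.0</weight>
        <maxRunningJobs>2</maxRunningJobs>
      </pool>
    </allocations>

Depending on the Hadoop version, a job can be directed to a specific pool by setting mapred.fairscheduler.pool in its job configuration.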

Capacity Scheduler

This scheduler divides the total capacity of the cluster into queues. Each queue is allocated a pre-configured share of the total capacity. A job may be submitted to any of these queues. If more than one job is submitted to the same queue, the jobs are executed sequentially. For example, suppose a capacity scheduler has three queues, Default, Low, and High, which share the total resources at 20%, 30%, and 50% respectively. If Default has two jobs D1 and D2, Low has three jobs L1, L2, and L3, and High has four jobs H1, H2, H3, and H4, the jobs in each queue are executed sequentially. When the jobs in a queue are completed, the freed resources are not distributed to other queues.
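
Queue names and their capacities for the Capacity Scheduler are defined in mapred-site.xml and capacity-scheduler.xml. The snippet below is a minimal sketch assuming the classic MRv1 property names and the three example queues above (the queue names are illustrative); verify the exact properties against your deployment:

    <!-- mapred-site.xml: declare the queues (illustrative names). -->
    <property>
      <name>mapred.queue.names</name>
      <value>default,low,high</value>
    </property>

    <!-- capacity-scheduler.xml: assign each queue its share of the cluster. -->
    <property>
      <name>mapred.capacity-scheduler.queue.default.capacity</name>
      <value>20</value>
    </property>
    <property>
      <name>mapred.capacity-scheduler.queue.low.capacity</name>
      <value>30</value>
    </property>
    <property>
      <name>mapred.capacity-scheduler.queue.high.capacity</name>
      <value>50</value>
    </property>

A job is then submitted to a particular queue by setting mapred.job.queue.name in its job configuration.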