If you’re considering adding an Apache™ Hadoop® workflow to your EMC® Isilon® cluster, you’re probably wondering how to set it up. The new white paper “EMC Isilon Best Practices for Hadoop Data Storage” provides useful information for deploying Hadoop in your Isilon cluster environment.

The white paper also introduces the unique approach that Isilon took to Hadoop deployments. In a typical Hadoop deployment, large unstructured data sets are ingested from storage repositories to a Hadoop cluster based on the Hadoop distributed file system (HDFS). Data is mapped to the Hadoop DataNodes of the cluster and a single NameNode controls the metadata. The MapReduce software framework manages jobs for data analysis. MapReduce and HDFS use the same hardware resources for both data analysis and storage. Analysis results are then stored in HDFS or exported to other infrastructures.

Traditional Hadoop Deployment

In an EMC Isilon Hadoop deployment, HDFS is integrated as a protocol into the Isilon distributed OneFS® operating system. With this approach, Hadoop users get direct HDFS access to data that was written to the Isilon cluster over standard protocols such as SMB, NFS, HTTP, and FTP. MapReduce processing and data storage are separated, so you can scale compute and storage resources independently as needed.

EMC Isilon Hadoop Deployment

Every node in the Isilon cluster acts as both a NameNode and a DataNode. Compute clients running MapReduce jobs can connect to any node in the cluster. Hadoop users can access data analysis results through standard protocols without exporting them first.
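Because any Isilon node can answer NameNode requests, a compute client only needs one HDFS address for the whole cluster. The sketch below shows how a client's core-site.xml might be generated under that assumption. The host name isilon.example.com and port 8020 are placeholders, not values from the white paper, and fs.defaultFS is the Hadoop 2 property name (older releases used fs.default.name). Check the Hadoop Starter Kit guide for your distribution for the exact settings.

```python
import xml.etree.ElementTree as ET


def core_site_xml(namenode_host: str, port: int = 8020) -> str:
    """Build a minimal core-site.xml that points a client at one HDFS endpoint.

    namenode_host and port are hypothetical values for illustration only.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Placeholder address standing in for the Isilon cluster name.
    print(core_site_xml("isilon.example.com"))
```

Note that the client configuration names no local NameNode or DataNode; it only points at the cluster.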

To learn more about the benefits of Hadoop on Isilon scale-out network attached storage (NAS), read “Hadoop on EMC Isilon Scale-Out NAS” and “EMC Isilon Scale-Out NAS for In-Place Hadoop Data Analytics.”

Best practices for deploying Hadoop to your Isilon cluster

You can connect Apache Hadoop or an enterprise-friendly Hadoop distribution, such as Pivotal HD or Cloudera, to your Isilon cluster.

First, you’ll need to turn on the HDFS protocol in OneFS. Contact your account representative to complete this step. Next, follow these best practices:

  1. Review the EMC Hadoop Starter Kit 2.0. The EMC Hadoop Starter Kit (HSK) 2.0 provides step-by-step guides on how to connect a Hadoop distribution to your Isilon cluster. HSK guides are available for Apache Hadoop, Pivotal HD, Cloudera, and Hortonworks. A video demonstration for Pivotal HD is also available.
  2. Find your Isilon cluster’s optimal point to help determine the number of nodes that will best serve your Hadoop workflow and compute grid. The optimal point is the cluster size at which MapReduce processing scales well and run times are shorter than on other systems running the same workload. Contact your account representative to help you determine this information.
  3. Create directories and set permissions. OneFS controls access to directories and files with POSIX mode bits and access control lists (ACLs). Make sure directories and files are set up with the correct permissions to ensure that your Hadoop users can access their files.
  4. Don’t run NameNode and DataNode services on clients. Because the Isilon cluster acts as the NameNode and DataNodes for HDFS, these services should run only on the cluster and not on compute clients. On compute clients, you should run only MapReduce processes.
  5. Increase the HDFS block size from the default 64 MB to 128 MB to optimize performance. Boosting the block size lets Isilon nodes read and write HDFS data in larger blocks. The result is an increase in performance of MapReduce jobs.
  6. Store intermediate job results on the Isilon cluster. A Hadoop client typically stores its intermediate map results locally, so the amount of local storage available on a client limits its ability to run jobs. Storing map results on the cluster can help performance and scalability.
  7. Consult the Isilon best practices white paper for additional tips. You can find more details about some of these best practices in “EMC Isilon Best Practices for Hadoop Data Storage.” You can also find additional tips for tuning OneFS for HDFS operations, using EMC Isilon SmartConnect™ for HDFS, aligning datasets with storage pools, and securing HDFS connections with Kerberos.
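To see why the larger block size in best practice 5 matters, it helps to count blocks. The sketch below is simple arithmetic, not a benchmark: it shows how many HDFS blocks a file of a given size occupies at 64 MB and at 128 MB. The 1 TiB file size is an arbitrary example, and the link between block count and the number of map tasks assumes the common default of one input split per block.

```python
MB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return the number of HDFS blocks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    file_size = 1024 * 1024 * MB  # hypothetical 1 TiB data set
    for block_mb in (64, 128):
        blocks = block_count(file_size, block_mb * MB)
        print(f"{block_mb} MB blocks: {blocks}")
    # 64 MB blocks: 16384
    # 128 MB blocks: 8192
```

Doubling the block size halves the number of block requests for the same data, so each request to an Isilon node moves more data per round trip.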

 

If you have questions related to Hadoop and your Isilon environment, contact your account representative. If you have documentation feedback or want to request new content, email isicontent@emc.com.


Kirsten Gantenbein


Principal Content Strategist at EMC Isilon Storage Division

4 Comments

  1. Jafar Hosseinzadeh says:

    Hello, thank you for your post. I have a question. In your EMC Isilon Hadoop deployment diagram, you have MapReduce on 4 servers. Are these masters or slaves? If I run NameNode on the Isilon nodes, why would I need workers? I should be able to set up a cluster with 2 masters and 1 worker/slave.
    Please let me know what you think.

    • Todd Jolley says:

      Hi Jafar-
      The diagram in this article represents a typical Isilon Hadoop environment, but is not a specific architecture to deploy. In particular, it is showing the separation of compute from data, and how certain services (like map and reduce) run on the compute layer, while data resides on OneFS.

      All master/worker architectures, as well as the sizing and layout of compute, should follow the best practices set out by the Hadoop vendor (Hortonworks, Cloudera, IBM, etc.) to match your requirements.

      I hope this helps!

      Thanks!

  2. Hadoop is, in its most basic form, the Hadoop File System (HDFS), which redundantly stores data on commodity servers (cheap) to mitigate node and disk drive failure, and data locality for performing processing where the blocks of data are stored. Hadoop v.1 implemented the Map-Reduce paradigm and Hadoop v.2 implements the more generic YARN scheduler/resource manager, but bottom line, Hadoop is about moving and distributing the program logic where the data resides. The Name Node process is simply a process that “knows” where blocks of data are stored and the Data Node process is simply a process (on a generic Hadoop cluster) that supplies blocks from HDFS to the processing logic. How does the Isilon/Hadoop implementation implement data locality, if the NN and DN processes are split from the compute nodes?

    • Risa Galant says:

      Hi Robert,

      Thanks for this interesting question! I checked in with our HDFS folks and they explained that one of the major departures from traditional Hadoop implementations when running on Isilon is the separation of compute and storage. On Isilon, we don’t maintain node locality. However, rack locality can be virtually defined. In our context, we can virtually define a relationship between a number of clients and Isilon nodes. The concept of locality is not completely lost. With respect to node locality though, the slight latency increase is incredibly small when you consider the larger block size requests made by Hadoop clients. This would be a much different concern if the access pattern was small block size and random, at which point the access time plays a larger role in overall performance.

      What Isilon adds to the equation is significantly higher namenode fault tolerance and space efficient data fault tolerance. In the case of the former, every Isilon node in an Isilon cluster is available for namenode requests. In the case of the latter, Isilon uses a Reed-Solomon-based Erasure Coding scheme in order to provide as high a level of fault tolerance as possible. While mirroring options are available, the real savings is achieved via one of several hybrid data protection schemes. These data protection schemes allow for single drive or node failures up through the loss of multiple drives and nodes. At volume, this allows for significantly reduced overhead for environments that contend with a massive amount of data; to such a degree that adding compute in line takes more physical space, can lead to higher operating cost, and where the increase in compute or memory may not be necessary for the workload.

      As you may know, there is an effort underway to add erasure coding to native Hadoop, for exactly the same reasons. As HDFS begins to offer forward error correction (FEC) as an option, loss of node locality is a consequence.

      Hope this answers your question.

      Best,

      Risa
