Archive for the ‘Advanced Topics’ Category

Manage your data with the new EMC Isilon SDK

Gwen Zierdt

Principal Technical Content Developer at EMC
Award-winning technical writer for the IT pro and web service developer.

The Isilon SDK is now available! The Isilon SDK includes documentation and code samples to help you to develop a customized interface to your Isilon OneFS cluster. The OneFS API is a REST-based HTTP interface that allows automation, orchestration, and provisioning of an Isilon cluster. Using the OneFS API, third-party applications can leverage the capabilities of the OneFS operating system to simplify management, data protection, and provisioning.
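As a quick illustration of the REST interface, the sketch below uses only the Python standard library to authenticate with HTTP Basic auth and fetch the cluster configuration. The hostname and credentials are placeholders; /platform/1/cluster/config is a common OneFS 7.x endpoint, but check the API reference for your release.

```python
import base64
import json
import ssl
import urllib.request

def platform_url(cluster, endpoint, version=1):
    """Build a OneFS Platform API URL; the API listens on port 8080."""
    return "https://{0}:8080/platform/{1}/{2}".format(cluster, version, endpoint)

def basic_auth_header(user, password):
    """HTTP Basic authentication header for the API session."""
    token = base64.b64encode("{0}:{1}".format(user, password).encode()).decode()
    return {"Authorization": "Basic " + token}

def get_cluster_config(cluster, user, password):
    """Fetch and parse /platform/1/cluster/config as JSON."""
    req = urllib.request.Request(platform_url(cluster, "cluster/config"),
                                 headers=basic_auth_header(user, password))
    # Lab clusters often use self-signed certificates; never skip
    # verification against a production cluster.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read().decode())
```

Calling `get_cluster_config("isilon.example.com", "root", "password")` against a reachable cluster or OneFS Simulator returns the cluster configuration as a Python dictionary.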

Code samples are available through the Isilon community on EMC {code} and from the Isilon GitHub repository. Immediately available for download are the Python Language Bindings for the OneFS API and the Statistics Browser tool. The Isilon SDK Info Hub connects you to a central source of information and provides additional links to resources and documentation.

To use the Isilon SDK, you need either a physical Isilon cluster, or a OneFS simulator. See the video Technical Demo: EMC Isilon OneFS Simulator on the EMC YouTube channel to learn how to install the OneFS Simulator.

The Isilon SDK is open source and available for free under the MIT license.

Sign up for the EMC {code} community on Slack and join the #isilon channel. Be part of the conversation about the Isilon SDK and receive news of future updates.

Let us know!

Let us know what you think. If you have feedback for us about this or any other Isilon technical content, email us at isicontent@emc.com. And thank you!

Data lakes for data science

Steve Hoenisch

Solutions Architect and White Paper Writer at EMC Emerging Technologies Division

Big data can be challenging for an enterprise organization, because big data affects data scientists, application developers, and infrastructure managers differently. Each of these specialists has different needs when it comes to analytic frameworks and storage infrastructure.

A data lake is a storage strategy in which data is collected in its native format in a shared storage infrastructure, making it available to different analytics applications, teams, and devices over common protocols. The notion of an EMC Isilon data lake sets the stage for a discussion of the kind of architecture that best supports the enterprise data science program pipeline and the newer, highly scalable big data tools. You can find this discussion in a new white paper, Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.

This blog post highlights the impact that data science has on an enterprise organization, and the considerations for decision makers to keep in mind about analytics frameworks and storage infrastructure. For details about data lake solutions and examples, refer to the white paper.

 

The impact of data science on the enterprise

Implementing an enterprise data science program to analyze big data involves two overarching, interrelated requirements:

  1. The flexibility to use the analytics tool that works best for the dataset on hand.
  2. The flexibility to use the analytics tool that best serves your analytical objectives.

Several aspects of the data science pipeline highlight these requirements:

  1. When you begin to collect data to solve a problem, you might not know the characteristics of the dataset, and those characteristics might influence the analytics framework that you select.
  2. When you have a dataset, but have not yet identified a problem to solve or an objective to fulfill, you might not know which analytics tool or method will best serve your purpose.

 

Analytics frameworks

With the traditional solution of the data warehouse and business intelligence system (DW/BI), these requirements are well known, as the following passage from Margy Ross and Ralph Kimball’s book, “The Data Warehouse Toolkit,” illustrates:

“The DW/BI system must adapt to change. User needs, business conditions, data, and technology are all subject to change. The DW/BI system must be designed to handle this inevitable change gracefully so that it doesn’t invalidate existing data or applications. Existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse.”

However, the fact of the matter is that unknown business problems and varying datasets demand a flexible approach to choosing the analytics framework that will work best for a given project or situation.

In particular, one change that DW/BI systems have difficulty adapting to is the demands of big data. In the face of new business requirements to collect and analyze large sets of unstructured data, DW/BI systems have become barriers to change. Why? Because a data warehouse or relational database management system (RDBMS) cannot scale to handle the volume and velocity of big data, and it does not satisfy some key requirements of a big data program, such as handling unstructured data. The schema-on-write requirements of an RDBMS impede the storage of a variety of data.

Indeed, the sheer variety of data requires a variety of tools—and different tools are likely to be used during the different phases of the data science pipeline. Common tools include Python, the statistical computing language R, and visualization software, such as Tableau. But the framework that many businesses are rapidly adopting is Apache Hadoop.

Analytics tools such as Apache Hadoop, Apache Hive, and Apache Spark underpin the data science pipeline. At each stage of the workflow, data scientists clean their data, extract aspects of it, aggregate it, explore it, model it, sample it, test it, and analyze it. Such work spans many use cases, and each use case demands the tool that best fits the task, so different tools may be put to use during different stages of the pipeline.

Storage infrastructure and the data lake

The infrastructure of any data storage system must support data access over multiple protocols so that many tools running on different operating systems, whether on a compute cluster or a user’s workstation, can access the stored data.

The flexibility of a data lake empowers the IT infrastructure to serve the rapidly changing needs of the business, the data scientists, and the big data tools. If the storage solution is flexible enough to support many big data activities, it can yield a sizable return on the investment.

For more information, including examples of data science studies conducted in enterprise environments, read the white paper, “Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.”

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, or comments about the video specifically, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


Introducing the EMC Isilon External Network Connectivity Guide

Risa Galant

Principal Technical Writer at EMC Isilon Storage Division

Ever wonder about the best way to set up communication between Isilon clusters and external client applications? Maybe you’d like to learn about Isilon network topology and how IP routing works in OneFS 7.1, or what the best practices are for using source-based routing in OneFS 7.2. Perhaps you’re curious about considerations around Isilon technology refreshes, or what to tell your client system administrators about DNS settings.

We’ve got just the content for you! Check out the EMC Isilon External Network Connectivity Guide: Routing, Network Topologies, and Best Practices for SmartConnect. (You’ll need to log in to the EMC Online Support site to view it.) Developed as a collaborative effort between Isilon Information Development and Isilon Professional Services, the Isilon External Network Connectivity Guide’s scenario-driven content covers your favorite Isilon external networking topics.  Be aware, though, that it isn’t a tutorial. The guide reviews Isilon networking basics, but assumes that as a network or storage architect or administrator, you’re already familiar with general networking concepts and terms.

Here are some highlights from the guide:

  • An easy-to-consume table to help you choose the best load balancing policy for your environment
  • Guidelines for keeping your Isilon cluster running efficiently
  • DNS setting recommendations to pass along to your client system administrators to help ensure that client connections stay fresh
  • Common questions and answers about Isilon in-band network management
  • Guidelines for calculating the number of IP addresses you’ll need for planning your network architecture
  • An illustrated, scenario-based walkthrough that introduces you to the wonders of dynamic SmartConnect zones and IP addresses
  • Recommended strategies for network design for specialized workloads from different industries, such as media and entertainment
  • Best practices for ensuring cluster stability, data integrity, and optimal network performance
  • Another easy-to-consume table describing common causes of data unavailability and the preventive actions you can take
  • Planning guidelines for technology refresh cycles
  • Recommended IP allocation strategies for SmartConnect Advanced listed by protocol
  • An illustrated discussion of network routing in OneFS 7.1
  • Another illustrated discussion of source-based routing (SBR) in OneFS 7.2, with a bonus discussion of destination-based routing just for comparison

SBR diagram by Andrew Chung

The guide also covers how best to use SyncIQ and SmartConnect Advanced for backup and disaster recovery planning. In fact, there’s a whole section covering SmartConnect best practices. Learn how SmartConnect works, what to check if you have firewalls, and what practices to avoid.  And if you’ve ever wondered how Isilon and SmartConnect handle DNS delegation, the Isilon External Network Connectivity Guide is the guide for you.

As if that weren’t enough information about SmartConnect, there are more scenario-based descriptions of hot networking topics such as where the SmartConnect service runs, what happens when you replace nodes while SmartConnect is active, and how to use SmartConnect in an isolated network environment.

Pretty comprehensive, huh? That’s the idea: to provide an all-inclusive guide to Isilon external network connectivity. We hope that this will be your go-to guide for getting answers to your Isilon external networking questions. It’s sort of a “how to hook up with an Isilon cluster” guide.

You’ll find the guide on EMC Online Support here: EMC Isilon External Network Connectivity Guide. Note that you’ll need to log in to the support site to access it. Let us know what you think!

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog or comments about the video specifically, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


Cluster capacity advice from an EMC Isilon expert

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

Avoiding scenarios where your cluster reaches maximum capacity is crucial for making sure it runs properly. Our Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide contains information to help Isilon customers keep their clusters running smoothly.

However, there are common misperceptions about cluster capacity, such as the notion that it's easy to delete data from a cluster that is 100 percent full, or the idea that reserving space with Virtual Hot Spare (VHS) for smartfailing a drive is unnecessary.

To clarify these issues and other concerns about cluster capacity, I interviewed one of Isilon’s top experts on this topic, Bernie Case. Bernie is a Technical Support Engineer V in Global Services at Isilon, with many years of experience working with customers who experience maximum cluster capacity scenarios. He is also a contributing author to the Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide. In this blog post, Bernie answers questions about cluster capacity and provides advice and solutions.

Q: What are common scenarios in the field that lead to a cluster reaching capacity?

A: The typical scenarios are when there’s an increased data ingest, which can come from either a normal or an unexpected workflow. If you’re adding a new node or replacing nodes to add capacity, and it takes longer than expected, a normal workflow will continue to write data into the cluster—possibly causing the cluster to reach capacity. Or there is a drive or node failure on an already fairly full cluster, which necessitates a FlexProtect (or FlexProtectLin) job from the Job Engine to run to re-protect data, therefore interrupting normal SnapshotDelete jobs. [See EMC Isilon Job Engine to learn more about these jobs.] Finally, I’ve seen snapshot policies that create a volume of snapshots that takes a long time to delete even after snapshot expiration. [See Best Practices for Working with Snapshots for snapshot schedule tips.]

Q: What are common misperceptions about cluster capacity?

A: Some common misconceptions include:

  • Filling 95 percent of a 1 PiB cluster still leaves about 50 TiB of free space. That's plenty for our workflow; we won't fill that up.
  • Filling up one tier and relying on spillover to another tier won’t affect performance.
  • The SnapshotDelete job should be able to keep up with our snapshot creation rate.
  • Virtual Hot Spare (VHS) is not necessary in our workflow; we need that space for our workflow.
  • It’s still very easy to delete data when the cluster is 100 percent full.
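The first misperception above is worth unpacking with a quick calculation: 5 percent of a 1 PiB cluster is indeed roughly 50 TiB, which sounds like a lot until you compare it with a plausible ingest rate (the 5 TiB/day figure below is purely illustrative).

```python
PIB_IN_TIB = 1024  # 1 PiB = 1024 TiB

# Free space left when a 1 PiB cluster is 95 percent full
free_tib = 0.05 * 1 * PIB_IN_TIB
print(free_tib)  # 51.2

# At a hypothetical ingest rate of 5 TiB/day, that margin disappears fast
days_until_full = free_tib / 5.0
print(days_until_full)  # 10.24
```

In other words, "plenty of space" can be less than two weeks of headroom for a busy workflow.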

Q: What are the ramifications of a full cluster?

A: When a cluster reaches full capacity, you're dealing primarily with data unavailability situations, where data might be readable but not writable. For example, a customer might be unable to run SyncIQ policies, because those policies write data into the root file system (/ifs). Cluster configuration changes also become impossible, because those configurations are stored within /ifs.

Finally, a remove (rm) command for deleting files may not function when a cluster is completely full, requiring support intervention.

Q: What should a customer do immediately if their cluster is approaching 90-95 percent capacity?

A: Do whatever you can to slow down the ingesting or retention of data, including moving data to other storage tiers or other clusters, or adjusting snapshot policies. To gain a little bit of temporary space, make sure that VHS is not disabled.

Call your EMC account team to prepare for more storage capacity. You should do this at around 80-85 percent capacity.  It does take time to get those nodes on-site, and you don’t want any downtime.
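Bernie's thresholds translate naturally into a monitoring check. The sketch below encodes them; the function and its status labels are our own illustration, not an Isilon tool.

```python
def capacity_status(used_bytes, total_bytes):
    """Classify cluster fullness using the thresholds discussed above:
    take immediate action at 90 percent, and start the process of adding
    capacity at 80 percent."""
    pct = used_bytes / float(total_bytes)
    if pct >= 0.90:
        return "critical"   # slow ingest, check VHS, adjust snapshot policies
    if pct >= 0.80:
        return "warning"    # engage your account team to add nodes
    return "ok"
```

For example, `capacity_status(920, 1000)` returns "critical", while `capacity_status(820, 1000)` returns "warning".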

VHS in SmartPools settings should always be enabled so that space is set aside for a drive failure. Keep at least 1 virtual drive (the default value) and leave the reserved space setting at 0% of total storage. For more information about these default values, see KB 88964 on the EMC Online Support site.

Q: What are the most effective short-term solutions for managing or monitoring cluster capacity?

A: Quotas are an effective way to see real-time storage usage within a directory, particularly if you put directories in specific storage tiers or node pools. Leverage quotas wherever you can.
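Quota data is also available programmatically through the Platform API (the SmartQuotas endpoint is /platform/1/quota/quotas in OneFS 7.x). The helper below flags directories nearing their hard limit; the field names (`thresholds.hard`, `usage.logical`, `path`) reflect the quota JSON schema as we understand it and should be verified against your cluster's API documentation.

```python
def quotas_near_capacity(quotas, threshold=0.85):
    """Return the paths of quota domains whose logical usage meets or
    exceeds `threshold` of their hard limit. `quotas` is the parsed
    'quotas' list from GET /platform/1/quota/quotas."""
    flagged = []
    for q in quotas:
        hard = q.get("thresholds", {}).get("hard")
        used = q.get("usage", {}).get("logical", 0)
        if hard and float(used) / hard >= threshold:
            flagged.append(q["path"])
    return flagged

# Illustrative data in the assumed response shape
sample = [
    {"path": "/ifs/data/media", "thresholds": {"hard": 100}, "usage": {"logical": 90}},
    {"path": "/ifs/data/home",  "thresholds": {"hard": 100}, "usage": {"logical": 40}},
]
print(quotas_near_capacity(sample))  # ['/ifs/data/media']
```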

The TreeDelete job [in the Job Engine] can quickly delete data, but make sure that the data you’re deleting isn’t just going into a snapshot!

Q: What are the most effective long-term solutions to implement from the best practices guide?

A: Make sure you have event notifications properly configured so that when jobs fail or drives fail, you'll know it and can take immediate action. In addition to notifications and alerts, you can use Simple Network Management Protocol (SNMP) to monitor cluster space for an additional layer of protection.

InsightIQ and the FSAnalyze job [which the system runs to create data for InsightIQ’s file system analytics tools] can give great views into storage usage and change rate, over time, particularly in terms of daily, monthly, or weekly data ingest.

Q: Is there anything you would like to add?

A: Cluster-full situations where the rm command doesn’t work are sometimes alarming. In a file system such as OneFS, a file deletion often requires a read-modify-write cycle for metadata structures, in addition to the usual unlinking and garbage collection that occurs within the file system. Getting out of that situation can be challenging and sometimes time-consuming. Resolving it requires a support call—and a remote session, which can be a big problem for private clusters.

Sometimes accidents happen or a node can fail, which can push a cluster to the limit of capacity thresholds. Incidents such as these can occasionally lead to data unavailability situations that can halt a customer’s workflow. Being ready to add capacity at 80-85 percent can prevent just this sort of situation.

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, or comments about the video specifically, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


Multitenancy for Hadoop data on an EMC Isilon cluster

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

The process of analyzing big data within big organizations can be complicated. There can be many data sets to analyze, some of which are stored in silos or contain secure information. And there can be many different Hadoop users accessing these data sets, each with different permissions and credentials. So how can organizations effectively manage multiple data sets and Hadoop users?

In EMC® Isilon® OneFS®, you can take advantage of multitenancy to tackle this issue. Multitenancy creates secure, separate namespaces on a shared infrastructure so that different Hadoop users (or tenants) can connect to an Isilon cluster, run Hadoop jobs concurrently, and consolidate their Hadoop workflows onto a single cluster. OneFS 7.2 supports several Hadoop distributions and HDFS 2.2, 2.3, and 2.4. The OneFS HDFS implementation also works with Ambari for management and monitoring, Kerberos authentication, and Kerberos impersonation.

The white paper, “EMC Isilon Multitenancy for Hadoop Big Data Analytics,” highlights how to set up access zones for multitenancy and manage Hadoop data in an Isilon cluster.

How Hadoop works in Isilon

The Apache Hadoop analytics platform comprises the Hadoop Distributed File System (HDFS), a storage system for vast amounts of data, and MapReduce, a processing paradigm for data-intensive computational analysis.

EMC Isilon serves as the file system for Hadoop clients. This enables Hadoop clients to directly access their datasets on the Isilon storage system and run data analysis jobs on their compute clients. OneFS implements server-side operations of the HDFS protocol on each node in the Isilon cluster to handle calls to the NameNode and to manage read/write requests to DataNodes.

EMC Isilon Hadoop Deployment

To configure an Isilon cluster for Hadoop, you first need to activate an HDFS license in OneFS. Contact your account team for more information. Then visit our EMC Hadoop Starter Kits to learn how to deploy multiple Hadoop distributions, such as Pivotal, Cloudera, or Hortonworks, on your Isilon cluster.
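On the compute side, pointing a Hadoop client at the cluster mostly comes down to setting fs.defaultFS to the cluster's SmartConnect zone name. The snippet below generates a minimal core-site.xml; the zone name is a placeholder, and port 8020 is the conventional HDFS NameNode RPC port, so confirm the port your deployment uses.

```python
def core_site_xml(smartconnect_zone, port=8020):
    """Render a minimal core-site.xml that points Hadoop clients at the
    Isilon cluster's SmartConnect zone name."""
    return (
        '<?xml version="1.0"?>\n'
        "<configuration>\n"
        "  <property>\n"
        "    <name>fs.defaultFS</name>\n"
        "    <value>hdfs://{0}:{1}</value>\n"
        "  </property>\n"
        "</configuration>\n"
    ).format(smartconnect_zone, port)

print(core_site_xml("hadoop.isilon.example.com"))
```

A real deployment sets additional properties per the Hadoop Starter Kit for your distribution; this sketch shows only the storage endpoint.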

Access zones for multitenancy

Access zones lay the foundation for multitenancy in OneFS. An access zone provides a virtual security context that segregates tenants and isolates their data sets. Each access zone encapsulates a namespace, an HDFS directory, directory services, authentication, and auditing. An access zone also isolates system connections for further security.
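Because each access zone carries its own HDFS root, a management script can map tenants to zones. The sketch below assumes the parsed output of GET /platform/1/zones, with `name` and `hdfs_root_directory` fields; both the response shape and those field names are assumptions to verify against your OneFS API reference.

```python
def zones_for_hdfs_root(zones, hdfs_root):
    """Return the names of access zones whose HDFS root directory is hdfs_root."""
    return [z["name"] for z in zones if z.get("hdfs_root_directory") == hdfs_root]

# Illustrative data in the assumed response shape
sample_zones = [
    {"name": "System",  "hdfs_root_directory": "/ifs"},
    {"name": "tenant1", "hdfs_root_directory": "/ifs/tenant1/hadoop"},
]
print(zones_for_hdfs_root(sample_zones, "/ifs/tenant1/hadoop"))  # ['tenant1']
```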

The following procedures for managing and securing data sets are covered in “EMC Isilon Multitenancy for Hadoop Big Data Analytics.”

  • Provide multiprotocol support – Learn how you can store data by using existing workflows on your Isilon cluster and access it through SMB, NFS, OpenStack Swift, and HDFS protocols, instead of running HDFS copy operations to move data to Hadoop clients.
  • Manage different data sets – Learn how you can use SmartPools for managing different data sets based on customized policies.
  • Associate network resources with access zones – Understand how virtual racking works in Isilon and how you can configure SmartConnect in OneFS to manage connections to data on your Isilon cluster.
  • Secure access zones – Review how role-based access control and directory services with access zones in OneFS are used to authenticate users assigned to each zone.

Hadoop information hubs

You can find a rich array of information about Isilon and Hadoop. Visit our online Isilon Community on the EMC Community Network for InfoHubs, which serve as a single location for all of our Hadoop-related content. The Hadoop InfoHub contains links to general information about Isilon and Hadoop. The Cloudera with Isilon InfoHub contains links to information about deploying the Cloudera distribution for Isilon.

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


Object storage in EMC Isilon Swift

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

Next-generation applications for cloud, analytics, social media, and mobile devices rely on object storage to store and access data. Object storage is an efficient way to store large amounts of data: it flattens the data hierarchy and enables automated API access between storage and applications. You can integrate object storage into your EMC® Isilon® cluster by using the open source OpenStack™ Swift API. OneFS® 7.2 exposes the OpenStack Object Storage API as a set of Representational State Transfer (REST) web services over HTTP. This way, you can direct applications that use the Swift API to store content and metadata as objects on your Isilon cluster.

What are the specific benefits of using Isilon Swift? First, the containers and objects that you save on an Isilon cluster can also be simultaneously accessed as directories and files by using other supported protocols such as NFS, SMB, HTTP, FTP, and HDFS. This interoperability between protocols can eliminate islands of storage and simplify management. Second, you can take advantage of Isilon authentication to secure the content saved through the Swift API.

How Isilon Swift works

Through the Swift API requests that you submit, you can store and manage containers, objects, and metadata in the OneFS file system. An instance of the Swift protocol driver runs on each node in the cluster and handles API requests.

The Swift API presents the home directories as accounts, directories as containers, and files as objects. Each home directory in the OneFS file system maps to a Swift account. The directories and subdirectories in a home directory map to containers and subcontainers. Files appear as objects. All objects have metadata.
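That mapping can be expressed directly. The helper below is illustrative only: it splits a home-directory path into the account, container, and object that Isilon Swift would present, assuming home directories live under /ifs/home; see the Isilon Swift Tech Note for the authoritative layout.

```python
def swift_names_for(path, home_root="/ifs/home"):
    """Split an /ifs file path into (account, container, object) following
    the home directory -> account, directory -> container, file -> object
    mapping described above."""
    if not path.startswith(home_root + "/"):
        raise ValueError("path is not under " + home_root)
    parts = path[len(home_root) + 1:].split("/")
    if len(parts) < 3:
        raise ValueError("need at least home/container/object components")
    account, container, obj = parts[0], parts[1], "/".join(parts[2:])
    return account, container, obj

print(swift_names_for("/ifs/home/alice/photos/2015/img001.jpg"))
# ('alice', 'photos', '2015/img001.jpg')
```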

How OneFS interoperates between object and file. See the OpenStack Swift Object Storage on EMC Isilon Scale-Out NAS white paper for more information.

Authentication in OneFS

When a Swift client connects to an Isilon cluster, the connection must be authenticated. Authentication takes place in a OneFS access zone. Access zones are virtual contexts that you can set up to control access to an Isilon cluster through an incoming IP address. When a Swift user submits an authentication request to the cluster, OneFS creates an access token for the user. This token contains the user’s full identity and security credentials for the access zone that the user is assigned to. For more information about how authentication works in OneFS, see the following white papers: OneFS Multiprotocol Security Untangled and OpenStack Swift Object Storage on EMC Isilon Scale-Out NAS.

Client libraries and HTTP requests

Isilon Swift supports two client libraries: the Python-Swift client library and Apache Libcloud. Isilon Swift supports the following HTTP requests for use with these libraries: GET, PUT, DELETE, POST, HEAD, and COPY. Isilon Swift does not support certain object store features, such as the HTTPS protocol and conditional GET and PUT calls based on ETag matching. For more information, see the Isilon Swift Tech Note (login to the EMC Online Support site is required).
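As a sketch of what those requests look like on the wire, the helper below builds a HEAD request for an object's metadata using only the standard library. The storage URL layout (/v1/&lt;account&gt;/&lt;container&gt;/&lt;object&gt;) and the X-Auth-Token header follow the general OpenStack Swift convention; the host, port, and token handling here are placeholders, so consult the Isilon Swift Tech Note for the cluster-specific details.

```python
import urllib.request

def object_url(host, port, account, container, obj):
    """Storage URL for an object, following the OpenStack Swift /v1 layout."""
    return "http://{0}:{1}/v1/{2}/{3}/{4}".format(host, port, account, container, obj)

def head_request(url, auth_token):
    """Build a HEAD request carrying the Swift auth token; pass the result to
    urllib.request.urlopen to read the object's metadata headers."""
    return urllib.request.Request(url,
                                  headers={"X-Auth-Token": auth_token},
                                  method="HEAD")
```

Against a live cluster, `urllib.request.urlopen(head_request(url, token))` returns a response whose headers carry the object's metadata.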

For more information

If you want to learn more about Isilon Swift, read OpenStack Swift Object Storage on EMC Isilon Scale-Out NAS. If you have OneFS 7.2 and need commands and procedures for using Isilon Swift, refer to the Isilon Swift Tech Note on the EMC Online Support site.

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


The top 3 operational differences between EMC Isilon OneFS 6.5 and OneFS 7.0

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

Attention all current EMC® Isilon® OneFS 6.5 users: OneFS 6.5 will reach its end of service life (EOSL) on June 30, 2015. OneFS 7.0 introduces several new features, enhancements, and operational changes. If you need to upgrade to OneFS 7.0, you might be wondering what’s different about this version and how these differences will affect your day-to-day administrative tasks. You can learn more by looking at the Administrative Differences in OneFS 7.0 white paper.

The top three changes that OneFS 6.5 users should prepare for are:

  • Administration using role-based access control (RBAC)
  • Authentication using access zones
  • Managing groups of nodes in SmartPools

Role-based access control

In OneFS 6.5, you can grant web and SSH login and configuration access to non-root users by adding them to the admin group. In OneFS 7.0, the admin group is replaced with the administrator role using RBAC. A role is a collection of OneFS privileges, usually associated with a configuration subsystem, that is granted to members of that role when they log in to the cluster.

For information about role-based access, including a description of roles and privileges, see Isilon OneFS 7.0: Role-Based Access Control.

An important note!

After you upgrade to OneFS 7.0, make sure you add existing administrators to an administrator role.
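A quick audit after the upgrade can confirm that migration. The helper below assumes the parsed output of GET /platform/1/auth/roles: a list of role objects, each with a `name` and a `members` list of `{"name": ...}` entries. That response shape is our assumption, so check it against your cluster's API documentation.

```python
def roles_for_user(roles, username):
    """Return the names of roles that list `username` as a member."""
    return [r["name"] for r in roles
            if any(m.get("name") == username for m in r.get("members", []))]

# Illustrative data in the assumed response shape
sample_roles = [
    {"name": "SystemAdmin", "members": [{"name": "admin"}, {"name": "ops1"}]},
    {"name": "AuditAdmin",  "members": [{"name": "ops1"}]},
]
print(roles_for_user(sample_roles, "ops1"))  # ['SystemAdmin', 'AuditAdmin']
```

An existing administrator who turns up in no roles still needs to be added to an administrator role.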

Access Zones

In OneFS 7.0, all user access to the cluster is controlled through access zones. With access zones, you can partition the cluster configuration into self-contained units and configure a subset of parameters as a virtual cluster with its own set of authentication providers, user mapping rules, and SMB shares. The built-in access zone is the “System” zone, which by default provides the same behavior as OneFS 6.5, using all available authentication providers, NFS exports, and SMB shares.

For information about access zones, see the OneFS 7.0.2 Administration Guide.

SmartPools

In OneFS 6.5, a group of nodes is called a disk pool. In OneFS 7.0, a group of nodes is called a node pool, and a group of disks in a node pool is called a disk pool. Also, Isilon nodes are automatically assigned to node pools in the cluster based on the node type. This is called autoprovisioning. Disk pools can no longer be viewed or targeted directly through the OneFS 7.0 web administration interface or the command-line interface. Instead, the smallest unit of storage that can be administered in OneFS 7.0 is a node pool. Disk pools are managed exclusively by the system through autoprovisioning.

An important note!

Before you upgrade to OneFS 7.0, you must configure disk pools into a supported node pool configuration. Disk pools must contain nodes of the same type, according to their node equivalence class. Disk pools that contain a mixture of node types must be reconfigured.

For information about how to prepare your Isilon cluster for upgrade to OneFS 7.0, see the Isilon OneFS 7.0.1 – 7.0.2 Upgrade Readiness Checklist.

For more information about OneFS 7.0

Visit these links for more information about:

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


How to secure a Hadoop data lake with EMC Isilon

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

Apache™ Hadoop®, open-source software for analyzing huge amounts of data, is a powerful tool for companies that want to analyze information for valuable insights.

Hadoop redefines how data is stored and processed. A key advantage of Hadoop is that it enables analytics on any type of data. Some organizations are beginning to build data lakes—essentially large repositories for unstructured data—on the Hadoop Distributed File System (HDFS) so they can easily store data collected from a variety of sources, and then run compute jobs on data in its original file format. There’s no need to load data into HDFS for analysis, saving data scientists time and money. They can then survey their Hadoop data lake and discover big data intelligence to drive their business.

However, the Hadoop data lake also presents challenges for organizations that want to protect sensitive information stored in these data repositories. For example, organizations might need to follow internal enterprise security policies or external compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) or the Sarbanes-Oxley Act (SOX). A Hadoop data lake is difficult to secure because HDFS was neither designed nor intended to be an enterprise-class file system. It is a complex, distributed file system of many client computers with a dual purpose: data storage and computational analysis. HDFS has many nodes, each of which presents a point of access to the entire system. Layers of security can be added to a Hadoop data lake, but managing each layer adds to complexity and overhead.

Best of both worlds

The EMC® Isilon® scale-out data lake offers the best of both worlds for organizations using Hadoop: enterprise-level security and easy implementation of Hadoop for data analytics.

The new white paper, Security and Compliance for Scale-Out Hadoop Data Lakes, describes how Hadoop data is stored on Isilon scale-out network-attached storage (NAS), and how the OneFS® operating system helps to secure that data.

An Isilon cluster separates data from compute clients in which the Isilon cluster becomes the HDFS file system. All data is stored on an Isilon cluster and secured by using access control lists, access zones, self-encrypting drives, and other security features. OneFS implements the server-side operations of HDFS as a native protocol. Therefore, Hadoop clients access data on the cluster through HDFS and standard protocols such as SMB and NFS.
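Because OneFS serves the same files over HDFS, SMB, and NFS, a client can reach cluster data without a Hadoop-specific load step. As a rough illustration, the sketch below builds a request URL for the OneFS RESTful namespace interface; the hostname, port, path, and endpoint layout here are assumptions for illustration and should be checked against your OneFS API documentation:

```python
# Sketch: addressing a file stored under /ifs on an Isilon cluster through
# the OneFS REST namespace interface. Hostname, port, and path are hypothetical.

def namespace_url(host, path, port=8080):
    """Build a URL for a file or directory under /ifs on an Isilon cluster."""
    return "https://{0}:{1}/namespace{2}".format(host, port, path)

if __name__ == "__main__":
    url = namespace_url("isilon.example.com", "/ifs/data/hadoop/input.csv")
    # A Hadoop client would reach the same file via its HDFS path, and an SMB
    # or NFS client through a share or mount of /ifs/data/hadoop. An HTTPS
    # client (for example, the requests library) could fetch `url` directly
    # with cluster credentials.
```

The point of the sketch is only that one copy of the data is addressable through every protocol; no per-protocol copies are made.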

For more information about how Hadoop is implemented on an Isilon cluster, see EMC Isilon Scale-Out NAS for In-Place Hadoop Data Analytics.

Isilon security capabilities

OneFS can facilitate your efforts to comply with regulations such as HIPAA, SOX, SEC 17a-4, the Federal Information Security Management Act (FISMA), and the Payment Card Industry Data Security Standard (PCI DSS). The table below summarizes some of the challenges of securing a Hadoop data lake, and how the capabilities of an Isilon cluster can help to address these issues. For full descriptions of these capabilities, see Security and Compliance for Scale-Out Hadoop Data Lakes.

Hadoop data lakes: security challenges and Isilon capabilities

Challenge: A Hadoop data lake can contain sensitive data—intellectual property, confidential customer information, and company records. Any client connected to the data lake can access or alter this sensitive data.
Isilon capabilities:
  • Compliance mode and write-once, read-many (WORM) storage
  • Auditing
Description: The SEC 17a-4 regulation requires that data is protected from malicious, accidental, or premature alteration. Isilon SmartLock™ is a OneFS feature that locks down directories through WORM storage. Use compliance mode only for scenarios where you need to comply with SEC 17a-4 regulations. In addition, auditing can help detect fraud, unauthorized access attempts, or other threats to security.

Challenge: ACL policies help to ensure compliance. However, clients may be connecting to the Hadoop cluster by using different protocols, such as NFS or HTTP.
Isilon capabilities:
  • Authentication and cross-protocol permissions
Description: OneFS authenticates users and groups connecting to the cluster through different protocols by using POSIX mode bits, NTFS permissions, and ACL policies. By managing ACL policies in OneFS, you can address compliance requirements for environments that mix NFS, SMB, and HDFS.

Challenge: Applying restricted access to directories and files in HDFS requires adding layers to your file system.
Isilon capabilities:
  • Role-based access control (RBAC) for system administration
  • Identity management
  • User mapping
  • Access zones
Description: PCI DSS Requirement 7.1.2 specifies that access must be restricted to privileged user IDs. RBAC, a OneFS feature, lets you manage administrative access by role and assign privileges to a role. You can associate one user with one ID through identity management and user mapping, and then assign that ID to a role. In OneFS, access zones are a virtual security context in which OneFS connects to directory services, authenticates users, and controls access to a segment of the file system.

Challenge: FISMA, HIPAA, and other compliance regulations might require protection for data at rest.
Isilon capabilities:
  • Encryption of data at rest
Description: Isilon self-encrypting drives are FIPS 140-2 Level 3 validated. The drives automatically apply AES-256 encryption to all stored data without requiring additional equipment. You can also enable a WORM state on directories for data at rest.
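The challenge-to-capability mapping above can also be captured programmatically, for example as the seed of a compliance-checklist script. The dictionary below merely restates the table; the keys and helper function are illustrative, not part of any official OneFS API:

```python
# Security challenges mapped to the OneFS capabilities that address them,
# restated from the table above for use in a simple checklist script.
ISILON_CAPABILITIES = {
    "sensitive data exposed to any connected client": [
        "Compliance mode and WORM storage (SmartLock)",
        "Auditing",
    ],
    "mixed-protocol access (NFS, SMB, HDFS, HTTP)": [
        "Authentication and cross-protocol permissions",
    ],
    "restricting access to directories and files": [
        "Role-based access control (RBAC)",
        "Identity management",
        "User mapping",
        "Access zones",
    ],
    "protecting data at rest": [
        "Self-encrypting drives (AES-256, FIPS 140-2 Level 3 validated)",
    ],
}

def capabilities_for(challenge):
    """Return the OneFS capabilities that address a given challenge."""
    return ISILON_CAPABILITIES.get(challenge, [])
```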

To learn how to implement Hadoop on your Isilon cluster, see 7 best practices for setting up Hadoop on an EMC Isilon cluster.



How EMC Isilon storage improves performance for EDA workflows

Kirsten Gantenbein

Principal Content Strategist at EMC Isilon Storage Division

To develop the chips that go inside advanced technologies, such as smartphones and personal computers, engineers often rely on electronic design automation (EDA) software tools for chip design and testing.

As EDA projects and designs increase in complexity, the amount of project data increases as well. Similar to most industries, the EDA industry is facing challenges with managing the exponential growth of unstructured data while optimizing performance and storage efficiency.

The new technical white paper, “EMC Isilon NAS: Performance at Scale for Electronic Design Automation,” highlights how Isilon scale-out network-attached storage (NAS) can alleviate the bottlenecks and inefficient use of storage space that EDA workflows encounter on traditional storage systems. The primary audience for this white paper is engineers and executives working in the EDA industry. However, anyone whose workflows require high levels of concurrent jobs may also find it useful.

For example, during the frontend phase of the EDA digital design workflow, EDA applications read and compile millions of small source files to build and simulate chip designs. Jobs are typically run concurrently against a deep and wide directory structure, which creates a large amount of metadata overhead and high CPU usage on the storage system. This white paper illustrates how Isilon scale-out storage is more effective than traditional data storage at alleviating workflow performance issues, such as:

  • Metadata access: Using a centralized metadata server can become a bottleneck. Average metadata operations for a typical EDA workflow include 65 percent metadata access, 20 percent writes, and 15 percent data reads. Isilon uses a distributed metadata architecture and can store all metadata on solid-state drives (SSDs), reducing the latency for metadata operations when running concurrent jobs. For more information about EMC® Isilon® OneFS® SSD caching, refer to the white paper, “EMC Isilon OneFS SmartFlash: File System Caching Infrastructure.”
  • Run times for concurrent jobs: All nodes in an Isilon cluster work in parallel. OneFS uses SmartConnect™ to automatically distribute jobs across nodes instead of running all jobs against a single controller or requiring manual distribution of jobs to controllers. Isilon recommends that you work with an Isilon representative to determine the number of nodes that will best serve your workflow.
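The impact of the 65/20/15 operation mix cited above can be sketched with a simple weighted-average model. The per-operation latencies below are made-up placeholders, not measured Isilon figures; only the mix percentages come from the white paper summary:

```python
# Rough model of an EDA frontend workload: 65% metadata access,
# 20% writes, 15% data reads (percentages from the text above).
OP_MIX = {"metadata": 0.65, "write": 0.20, "read": 0.15}

def mean_latency_ms(latencies_ms, mix=OP_MIX):
    """Weighted average operation latency for the workload mix."""
    return sum(mix[op] * latencies_ms[op] for op in mix)

# Hypothetical example: if moving metadata to SSDs cuts metadata latency
# from 4 ms to 1 ms while write and read latencies stay the same, the
# workload-wide average latency drops substantially, because metadata
# operations dominate the mix.
hdd_avg = mean_latency_ms({"metadata": 4.0, "write": 6.0, "read": 5.0})
ssd_avg = mean_latency_ms({"metadata": 1.0, "write": 6.0, "read": 5.0})
```

The model is deliberately naive (it ignores queuing and concurrency), but it shows why a metadata-heavy mix benefits disproportionately from SSD-backed metadata.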

You can learn more about Isilon scale-out NAS architecture, storage efficiency, and data management by referring to “EMC Isilon NAS: Performance at Scale for Electronic Design Automation.”



Understanding Global Namespace Acceleration (GNA)

Colin Torretta

Senior Technical Writer

With the proliferation of solid-state drives (SSDs) in data centers across the world, companies are finding more and more ways to take advantage of the high speed and low latency of SSDs. Within the EMC® Isilon® OneFS® operating system, one innovative use of SSDs is Global Namespace Acceleration (GNA). GNA is a feature of OneFS that increases performance across your entire cluster by using SSDs to store file metadata for read acceleration, even for node pools that don’t contain dedicated SSDs.

GNA is managed through the SmartPools™ software module of the OneFS web administration interface. SmartPools enables storage tiering and the ability to aggregate different types of drives (such as SSDs and HDDs) into node pools. When GNA is enabled, all SSDs in the cluster are used to accelerate metadata reads across the entire cluster. Isilon recommends at least one SSD per node, with two SSDs per node preferred. However, customers with a mix of drive types can benefit from GNA’s metadata read acceleration regardless of how SSDs are distributed across the cluster. When possible, GNA stores metadata in the same node pool that contains the associated data. If that node pool has no SSDs, however, GNA places the metadata in a randomly selected node pool that does contain SSDs. This means that as long as SSDs are available somewhere in the cluster, any node pool can benefit from GNA.
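The placement behavior described above can be sketched roughly as follows. This is a simplification for illustration, with hypothetical pool names; OneFS’s actual selection logic is internal to the file system:

```python
import random

def pick_metadata_pool(data_pool, pools_with_ssds):
    """Sketch of GNA metadata placement: prefer the node pool that holds
    the data; otherwise fall back to any node pool that contains SSDs."""
    if data_pool in pools_with_ssds:
        return data_pool
    if pools_with_ssds:
        # No SSDs in the data's own pool: pick another SSD-equipped pool.
        return random.choice(sorted(pools_with_ssds))
    return None  # no SSDs anywhere in the cluster: no GNA acceleration
```

For example, data living in an HDD-only pool still gets an SSD metadata copy as long as some other pool in the cluster has SSDs.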

For more information about GNA, see the “Storage Pools” section of the OneFS web administration and CLI administration guides.

Important considerations when using GNA

Here are some important considerations to keep in mind when determining whether GNA can benefit your workflow.

  • Use GNA for cold data workflows. Certain workflows benefit more than others from the performance gains that GNA provides. For example, workflows that require heavy indexing of “cold data”—archive data stored on disk and left unmodified for extended periods of time—benefit the most from metadata read acceleration. GNA provides no additional benefit to clusters that contain only SSDs, because all metadata is already stored on SSDs.
  • SSDs must account for a minimum of 1.5% of the total space on your cluster. To use GNA, 20% of the nodes in your cluster must contain SSDs, and SSDs must account for a minimum of 1.5% of the total space on your cluster, with 2% being strongly recommended. This ensures that GNA does not overwhelm the SSDs on your cluster. Failure to maintain these requirements will result in GNA being disabled and metadata read acceleration being lost. To enable GNA again, metadata copies will have to be rebuilt, which can take time.
  • Consider how new nodes affect the total cluster space. Adding new nodes to your cluster changes the percentage of nodes with SSDs and the total available space on SSDs. Keep the 20 percent node and 1.5 percent capacity requirements in mind whenever you add new nodes; otherwise GNA is disabled and the metadata copy is immediately deleted.
  • Do not remove the extra metadata mirror. When GNA is enabled, an SSD is set aside as an additional metadata mirror, in addition to the existing mirrors set by your requested protection, which is determined in SmartPools settings. A common misunderstanding is that the SSD is an “extra” mirror and it can be safely removed without affecting your cluster. In reality, this extra metadata mirror is critical to the functionality of GNA, and removing it causes OneFS to rebuild the mirror on another drive. See the graphic below for information on the number of metadata mirrors per requested protection when using GNA. For more information about requested protection, see the “Storage Pools” section of the OneFS Web Administration Guide.
The number of metadata copies required by GNA to achieve read acceleration per requested protection level in OneFS.
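The sizing rules in the considerations above are easy to check before enabling GNA. A minimal sketch, with thresholds taken directly from the bullets (20 percent of nodes with SSDs, 1.5 percent of total capacity on SSDs, 2 percent recommended):

```python
def gna_requirements_met(nodes_with_ssds, total_nodes, ssd_bytes, total_bytes):
    """Check the GNA sizing rules: at least 20% of nodes contain SSDs, and
    SSDs provide at least 1.5% of total cluster capacity (2% recommended)."""
    node_ratio_ok = nodes_with_ssds / total_nodes >= 0.20
    capacity_ratio_ok = ssd_bytes / total_bytes >= 0.015
    return node_ratio_ok and capacity_ratio_ok
```

Running such a check whenever nodes are added helps avoid the scenario above where GNA is silently disabled and its metadata copies must later be rebuilt.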
