
Data lakes for data science

Steve Hoenisch
Solutions Architect and White Paper Writer at EMC Emerging Technologies Division

Big data can be challenging for an enterprise organization because it affects data scientists, application developers, and infrastructure managers differently. Each of these specialists has different needs when it comes to analytics frameworks and storage infrastructure.

A data lake is a storage strategy for collecting data in its native format in a shared storage infrastructure, making the data available to different analytics applications, teams, and devices over common protocols. The notion of an EMC Isilon data lake sets the stage for a discussion of the kind of architecture that best supports the enterprise data science pipeline and the newer, highly scalable big data tools. You can find this discussion in a new white paper, Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.

This blog post highlights the impact that data science has on an enterprise organization, and the considerations for decision makers to keep in mind about analytics frameworks and storage infrastructure. For details about data lake solutions and examples, refer to the white paper.

 

The impact of data science on the enterprise

Implementing an enterprise data science program to analyze big data involves two overarching, interrelated requirements:

  1. The flexibility to use the analytics tool that works best for the dataset on hand.
  2. The flexibility to use the analytics tool that best serves your analytical objectives.

Several aspects of the data science pipeline highlight these requirements:

  1. When you begin to collect data to solve a problem, you might not know the characteristics of the dataset, and those characteristics might influence the analytics framework that you select.
  2. When you have a dataset, but have not yet identified a problem to solve or an objective to fulfill, you might not know which analytics tool or method will best serve your purpose.

 

Analytics frameworks

With the traditional data warehouse and business intelligence (DW/BI) system, these requirements are well known, as the following passage from Margy Ross and Ralph Kimball's book, "The Data Warehouse Toolkit," illustrates:

“The DW/BI system must adapt to change. User needs, business conditions, data, and technology are all subject to change. The DW/BI system must be designed to handle this inevitable change gracefully so that it doesn’t invalidate existing data or applications. Existing data and applications should not be changed or disrupted when the business community asks new questions or new data is added to the warehouse.”

In practice, however, unknown business problems and varying datasets demand a flexible approach to choosing the analytics framework that will work best for a given project or situation.

In particular, one change that DW/BI systems have difficulty adapting to is the demands of big data. In the face of new business requirements to collect and analyze large sets of unstructured data, DW/BI systems have become barriers to change. Why?

Because a data warehouse or relational database management system (RDBMS) cannot scale to handle the volume and velocity of big data, and it does not satisfy some key requirements of a big data program, such as handling unstructured data. The schema-on-write requirements of an RDBMS also impede the storage of a variety of data.
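The contrast is essentially schema-on-write versus schema-on-read. The sketch below, which assumes a running Spark installation and a hypothetical directory of JSON files, shows the schema-on-read approach: no table definition is declared before the data is stored or queried.

```python
# Schema-on-read sketch with PySpark. The path and field names are hypothetical;
# the point is that no schema is declared before the data is written or read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Semi-structured JSON records sit in the lake in their original form.
# Spark infers a schema at read time instead of requiring CREATE TABLE up front.
events = spark.read.json("/data/raw/clickstream/*.json")
events.printSchema()

# Files with new or missing fields simply widen the inferred schema, so
# collecting a variety of data does not require changing a fixed schema first.
events.filter(events.event_type == "purchase").count()
```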

Indeed, the sheer variety of data requires a variety of tools—and different tools are likely to be used during the different phases of the data science pipeline. Common tools include Python, the statistical computing language R, and visualization software, such as Tableau. But the framework that many businesses are rapidly adopting is Apache Hadoop.

Analytics tools such as Apache Hadoop, Apache Hive, and Apache Spark underpin the data science pipeline. At each stage of the workflow, data scientists are working to clean their data, extract aspects of it, aggregate it, explore it, model it, sample it, test it, and analyze it. Such work brings many use cases, and each use case demands the tool that best fits the task. During the stages of the pipeline, different tools, such as Apache Hive and Apache Spark, may be put to use.
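As an illustration, the following sketch runs a few of those stages (clean, aggregate, explore) in PySpark. The file path, column names, and Hive table are hypothetical placeholders, and the session is assumed to have Hive support available.

```python
# A sketch of a few pipeline stages (clean, aggregate, explore) in PySpark.
# Paths, column names, and the Hive table are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("pipeline-stages-demo")
         .enableHiveSupport()          # lets the same session query Hive tables
         .getOrCreate())

# Clean: read raw CSV from shared storage and drop malformed rows.
raw = spark.read.csv("/data/raw/sensors.csv", header=True, inferSchema=True)
clean = raw.dropna(subset=["device_id", "reading"])

# Aggregate: summarize readings per device.
summary = clean.groupBy("device_id").agg(F.avg("reading").alias("avg_reading"))

# Explore: the aggregated result is small enough to inspect directly,
# or to hand off to R or Tableau for visualization.
summary.orderBy(F.desc("avg_reading")).show(10)

# The same session can also query data already modeled in Hive.
spark.sql("SELECT COUNT(*) FROM default.sensor_readings").show()  # hypothetical table
```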

Storage infrastructure and the data lake

The infrastructure of any data storage system must support data access over multiple protocols so that many tools running on different operating systems, whether on a compute cluster or a user’s workstation, can access the stored data.
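For example, a data scientist's workstation might reach a dataset over an NFS or SMB mount while a cluster job reaches the same files over HDFS. The sketch below assumes a hypothetical mount point, HDFS URI, and CSV dataset; it illustrates multiprotocol access to shared storage rather than any particular product's layout.

```python
# The same shared dataset reached over two protocols. The mount point and
# HDFS URI are hypothetical; the dataset is assumed to be a CSV file.
import pandas as pd

# A workstation tool reads the file over an NFS (or SMB) mount of the shared storage.
df = pd.read_csv("/mnt/datalake/projects/churn/customers.csv")
print(df.describe())

# A compute-cluster job would reach the very same file over HDFS, for example:
#   spark.read.csv("hdfs://datalake.example.com:8020/projects/churn/customers.csv",
#                  header=True)
# Because both paths resolve to the same stored data, no copy is required.
```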

The flexibility of a data lake empowers the IT infrastructure to serve the rapidly changing needs of the business, the data scientists, and the big data tools. If the storage solution is flexible enough to support many big data activities, it can yield a sizable return on the investment.

For more information, including examples of data science studies conducted in enterprise environments, read the white paper, “Data Lakes for Data Science: Integrating Analytics Tools with Shared Infrastructure for Big Data.”

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, contact us at isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.


How to secure a Hadoop data lake with EMC Isilon

Kirsten Gantenbein
Principal Content Strategist at EMC Isilon Storage Division

Apache™ Hadoop®, open-source software for analyzing huge amounts of data, is a powerful tool for companies that want to mine their information for valuable insights.

Hadoop redefines how data is stored and processed. A key advantage of Hadoop is that it enables analytics on any type of data. Some organizations are beginning to build data lakes—essentially large repositories for unstructured data—on the Hadoop Distributed File System (HDFS) so they can easily store data collected from a variety of sources and then run compute jobs on the data in its original file format. There is no need to transform and load the data into a predefined schema before analysis, which saves data scientists time and money. They can then survey their Hadoop data lake and discover big data intelligence to drive their business.

However, the Hadoop data lake also presents challenges for organizations that want to protect sensitive information stored in these data repositories. For example, organizations might need to follow internal enterprise security policies or external compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) or the Sarbanes-Oxley Act (SOX). A Hadoop data lake is difficult to secure because HDFS was neither designed nor intended to be an enterprise-class file system. It is a complex, distributed file system of many client computers with a dual purpose: data storage and computational analysis. HDFS has many nodes, each of which presents a point of access to the entire system. Layers of security can be added to a Hadoop data lake, but managing each layer adds to complexity and overhead.

Best of both worlds

The EMC® Isilon® scale-out data lake offers the best of both worlds for organizations using Hadoop: enterprise-level security and easy implementation of Hadoop for data analytics.

The new white paper, Security and Compliance for Scale-Out Hadoop Data Lakes, describes how Hadoop data is stored on Isilon scale-out network-attached storage (NAS), and how the OneFS® operating system helps to secure that data.

An Isilon cluster separates data storage from compute clients: the Isilon cluster itself serves as the HDFS file system. All data is stored on the Isilon cluster and secured by using access control lists, access zones, self-encrypting drives, and other security features. OneFS implements the server-side operations of HDFS as a native protocol, so clients can access data on the cluster through HDFS as well as through standard protocols such as SMB and NFS.
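As a rough illustration, the following sketch points a Spark job's default file system at an Isilon cluster. The host name, port, column name, and paths under /ifs are assumptions for illustration; fs.defaultFS is the standard Hadoop setting used here to direct HDFS traffic to the cluster.

```python
# A minimal sketch of a Spark job that uses an Isilon cluster as its HDFS
# file system. The host name, port, column name, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("isilon-hdfs-demo")
         # Pass the Hadoop setting fs.defaultFS through Spark so that HDFS
         # requests are sent to the cluster's SmartConnect name.
         .config("spark.hadoop.fs.defaultFS", "hdfs://isilon.example.com:8020")
         .getOrCreate())

# The job reads and writes over HDFS, which OneFS serves natively, while the
# same files remain reachable over SMB or NFS for other tools and users.
logs = spark.read.parquet("/ifs/analytics/weblogs")
logs.groupBy("status").count().write.parquet("/ifs/analytics/weblogs_by_status")
```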

For more information about how Hadoop is implemented on an Isilon cluster, see EMC Isilon Scale-Out NAS for In-Place Hadoop Data Analytics.

Isilon security capabilities

OneFS can facilitate your efforts to comply with regulations such as HIPAA, SOX, SEC 17a-4, the Federal Information Security Management Act (FISMA), and the Payment Card Industry Data Security Standard (PCI DSS). The table below summarizes some of the challenges of securing a Hadoop data lake and how the capabilities of an Isilon cluster can help to address them. For full descriptions of these capabilities, see Security and Compliance for Scale-Out Hadoop Data Lakes.

Hadoop data lakes: security challenges and Isilon capabilities

Security challenge: A Hadoop data lake can contain sensitive data—intellectual property, confidential customer information, and company records. Any client connected to the data lake can access or alter this sensitive data.
Isilon capabilities:
  • Compliance mode and write-once, read-many (WORM) storage
  • Auditing
Description: The SEC 17a-4 regulation requires that data be protected from malicious, accidental, or premature alteration. Isilon SmartLock™ is a OneFS feature that locks down directories through WORM storage. Use compliance mode only for scenarios in which you need to comply with SEC 17a-4 regulations. In addition, auditing can help detect fraud, unauthorized access attempts, or other threats to security.

Security challenge: ACL policies help to ensure compliance. However, clients may connect to the Hadoop cluster by using different protocols, such as NFS or HTTP.
Isilon capabilities:
  • Authentication and cross-protocol permissions
Description: OneFS authenticates users and groups connecting to the cluster through different protocols by using POSIX mode bits, NTFS ACLs, and ACL policies. By managing ACL policies in OneFS, you can address compliance requirements for environments that mix NFS, SMB, and HDFS.

Security challenge: Applying restricted access to directories and files in HDFS requires adding layers to your file system.
Isilon capabilities:
  • Role-based access control (RBAC) for system administration
  • Identity management
  • User mapping
  • Access zones
Description: PCI DSS Requirement 7.1.2 specifies that access must be restricted to privileged user IDs. RBAC, a OneFS feature, lets you manage administrative access by role and assign privileges to a role. You can associate one user with one ID through identity management and user mapping, and then assign that ID to a role. In OneFS, access zones are a virtual security context in which OneFS connects to directory services, authenticates users, and controls access to a segment of the file system.

Security challenge: FISMA, HIPAA, and other compliance regulations might require protection for data at rest.
Isilon capabilities:
  • Encryption of data at rest
Description: Isilon self-encrypting drives are FIPS 140-2 Level 3 validated. The drives automatically apply AES-256 encryption to all data stored on them without requiring additional equipment. You can also enable a WORM state on directories for data at rest.

To learn how to implement Hadoop on your Isilon cluster, see 7 best practices for setting up Hadoop on an EMC Isilon cluster.

Start a conversation about Isilon content

Have a question or feedback about Isilon content? Visit the online EMC Isilon Community to start a discussion. If you have questions or feedback about this blog, contact isi.knowledge@emc.com. To provide documentation feedback or request new content, contact isicontent@emc.com.

 
