Avoiding scenarios where your cluster reaches maximum capacity is crucial for making sure it runs properly. Our Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide contains information to help Isilon customers keep their clusters running smoothly.
However, there are common misperceptions about cluster capacity, such as the notion that it’s easy to delete data from a cluster that is 100 percent full. Another is the belief that using Virtual Hot Spare (VHS) to reserve space for smartfailing a drive is not always necessary.
To clarify these issues and other concerns about cluster capacity, I interviewed one of Isilon’s top experts on this topic, Bernie Case. Bernie is a Technical Support Engineer V in Global Services at Isilon, with many years of experience working with customers who experience maximum cluster capacity scenarios. He is also a contributing author to the Best Practices for Maintaining Enough Free Space on Isilon Clusters and Pools guide. In this blog post, Bernie answers questions about cluster capacity and provides advice and solutions.
Q: What are common scenarios in the field that lead to a cluster reaching capacity?
A: The most typical scenario is increased data ingest, which can come from either a normal or an unexpected workflow. If you’re adding or replacing nodes to add capacity, and it takes longer than expected, a normal workflow will continue to write data into the cluster, possibly causing the cluster to reach capacity. Another scenario is a drive or node failure on an already fairly full cluster: the Job Engine then has to run a FlexProtect (or FlexProtectLin) job to re-protect data, which interrupts normal SnapshotDelete jobs. [See EMC Isilon Job Engine to learn more about these jobs.] Finally, I’ve seen snapshot policies that create so many snapshots that they take a long time to delete, even after the snapshots expire. [See Best Practices for Working with Snapshots for snapshot schedule tips.]
Q: What are common misperceptions about cluster capacity?
A: Some common misconceptions include:
- 95 percent of a 1 PiB cluster still leaves about 50 TiB of space. That’s plenty for our workflow. We won’t fill that up.
- Filling up one tier and relying on spillover to another tier won’t affect performance.
- The SnapshotDelete job should be able to keep up with our snapshot creation rate.
- Virtual Hot Spare (VHS) is not necessary in our workflow; we need that space for our workflow.
- It’s still very easy to delete data when the cluster is 100 percent full.
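The arithmetic behind the first misconception can be sketched as follows. This is only an illustration: the ingest rates are hypothetical values chosen for the example, not figures from the interview, and the function names are made up here.

```python
PIB_IN_TIB = 1024  # 1 PiB = 1024 TiB


def free_tib(total_tib: float, used_fraction: float) -> float:
    """Return the free space, in TiB, of a cluster at the given fullness."""
    return total_tib * (1.0 - used_fraction)


def days_until_full(free_space_tib: float, ingest_tib_per_day: float) -> float:
    """Return how many days the free space lasts at a constant net ingest rate."""
    if ingest_tib_per_day <= 0:
        raise ValueError("ingest rate must be positive")
    return free_space_tib / ingest_tib_per_day


remaining = free_tib(PIB_IN_TIB, 0.95)
print(f"Free space at 95 percent full: {remaining:.1f} TiB")

# Hypothetical net ingest rates (TiB per day), for illustration only.
for rate in (1.0, 5.0, 10.0):
    print(f"At {rate:.0f} TiB/day: {days_until_full(remaining, rate):.1f} days until full")
```

The remaining space comes out at roughly 51 TiB, which matches the "about 50 TiB" in the first bullet. The point of the sketch is how few days that headroom lasts once ingest picks up or a failed drive needs to be re-protected.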
Q: What are the ramifications of a full cluster?
A: When a cluster reaches full capacity, you’re dealing primarily with data-unavailable situations, where data might be readable but not writable. For example, a customer may be unable to run SyncIQ policies, because those policies write data into the root file system (/ifs). It also becomes impossible to make cluster configuration changes, because those configurations are stored within /ifs.
Finally, a remove (rm) command for deleting files may not function when a cluster is completely full, requiring support intervention.
Q: What should a customer do immediately if their cluster is approaching 90-95 percent capacity?
A: Do whatever you can to slow down data ingest or reduce data retention, including moving data to other storage tiers or other clusters, or adjusting snapshot policies. To gain a little bit of temporary space, make sure that VHS is not disabled.
Call your EMC account team to prepare for more storage capacity. You should do this at around 80-85 percent capacity. It does take time to get those nodes on-site, and you don’t want any downtime.
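The thresholds mentioned in this answer can be turned into a simple check. The sketch below is a generic stand-in, not an Isilon tool: it uses Python's standard `shutil.disk_usage` on a local path, and it assumes that running it against a client-side mount of the cluster would report the cluster's usage. The action labels are paraphrased from the interview, and the function names are hypothetical.

```python
import shutil

# (minimum used fraction, suggested action), checked from highest to lowest.
# The 80 and 90 percent thresholds come from the interview; the labels are paraphrased.
THRESHOLDS = [
    (0.90, "urgent: slow ingest, move data, adjust snapshot policies"),
    (0.80, "plan: contact your account team about adding capacity"),
]


def classify(used_fraction: float) -> str:
    """Map a used-capacity fraction (0.0 to 1.0) to a suggested action."""
    for limit, action in THRESHOLDS:
        if used_fraction >= limit:
            return action
    return "ok"


def used_fraction(path: str) -> float:
    """Return the used fraction of the file system that contains path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


# Demo on the current directory; substitute a mount point of interest (assumption).
fraction = used_fraction(".")
print(f"{fraction:.0%} used -> {classify(fraction)}")
```

A check like this only complements the event notifications, quotas, and SNMP monitoring discussed elsewhere in this post; it does not replace them.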
Q: What are the most effective short-term solutions for managing or monitoring cluster capacity?
A: Quotas are an effective way to see real-time storage usage within a directory, particularly if you put directories in specific storage tiers or node pools. Leverage quotas wherever you can.
The TreeDelete job [in the Job Engine] can quickly delete data, but make sure that the data you’re deleting isn’t just going into a snapshot!
Q: What are the most effective long-term solutions to implement from the best practices guide?
A: Make sure you have event notifications properly configured, so that when jobs or drives fail, you’ll know it and can take immediate action. In addition to notifications and alerts, you can use Simple Network Management Protocol (SNMP) to monitor cluster space, for an additional layer of protection.
InsightIQ and the FSAnalyze job [which the system runs to create data for InsightIQ’s file system analytics tools] can give great views into storage usage and change rate over time, particularly in terms of daily, weekly, or monthly data ingest.
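To make "change rate" concrete, here is a minimal sketch of how a daily, weekly, and monthly ingest figure could be derived from periodic usage readings. It does not use InsightIQ or FSAnalyze output; the sample readings are invented for illustration, and the function names are hypothetical.

```python
def daily_growth_tib(samples):
    """Least-squares slope of used TiB versus day number (TiB per day).

    samples is a list of (day_number, used_tib) pairs with at least two
    distinct day numbers.
    """
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    numerator = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    denominator = sum((day - mean_day) ** 2 for day, _ in samples)
    return numerator / denominator


def ingest_summary(samples):
    """Return the average net ingest per day, per week, and per 30 days."""
    daily = daily_growth_tib(samples)
    return {"daily": daily, "weekly": daily * 7, "monthly": daily * 30}


# Invented weekly readings: (day number, used capacity in TiB).
SAMPLES = [(0, 700.0), (7, 714.0), (14, 728.0), (21, 742.0)]

for period, tib in ingest_summary(SAMPLES).items():
    print(f"{period}: {tib:.1f} TiB")
```

A fitted slope smooths out day-to-day noise better than comparing only the first and last readings, though it still assumes growth is roughly linear over the sampled period.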
Q: Is there anything you would like to add?
A: Cluster-full situations where the rm command doesn’t work are sometimes alarming. In a file system such as OneFS, a file deletion often requires a read-modify-write cycle for metadata structures, in addition to the usual unlinking and garbage collection that occurs within the file system. Getting out of that situation can be challenging and sometimes time-consuming. Resolving it requires a support call—and a remote session, which can be a big problem for private clusters.
Sometimes accidents happen or a node can fail, which can push a cluster to the limit of capacity thresholds. Incidents such as these can occasionally lead to data unavailability situations that can halt a customer’s workflow. Being ready to add capacity at 80-85 percent can prevent just this sort of situation.