From Chaos to Clarity: 6 Best Practices for Organizing Big Data

There’s no doubt we’re living in the era of big data. And, as the amount of data we generate grows exponentially, organizing it becomes all the more challenging. If you don’t organize the data well, especially if it resides in cloud storage, it becomes complex to track, manage, and process.

That’s why I’m sharing six strategies you can use to efficiently organize big data in the cloud so things don’t spiral out of control. You can consider how to organize data from different angles, including within a bucket, at the bucket level, and so on. In this article, I’ll primarily focus on how you can efficiently organize data on Backblaze B2 Cloud Storage within a bucket. With the strategies described here, you can consider what information you need about each object you store and how to logically structure an object or file name, which should hopefully equip you to better organize your data.

Before we delve into the topic, let me give a super quick primer on some basics of object storage. Feel free to skip this section if you’re familiar.

First: A Word About Object Storage

Unlike traditional file systems, when you’re using object storage, you have a simple, flat structure with buckets and objects to store your data. It’s designed as a key-value store so that it can scale to the internet.

There are no real folders in an object store, so data is not separated into a hierarchical structure. That said, there are times when you want to limit what you’re querying. In those cases, prefixes provide a folder-like look and feel, which means that you can get the benefits of having folders without any major drawbacks. From here onwards, I’ll generally refer to folders as prefixes and files as objects.

With all that out of the way, let’s dive into the ways you can efficiently organize your data within a bucket. You probably don’t have to employ all these guidelines. Rather, you can pick and choose what best fits your requirements.

1. Standardize Object Naming Conventions

Naming conventions, simply put, are rules about how you and others within your organization name your files. For example, you might decide it’s important that the file name describes the type of file, the date created, and the subject. You can combine that information in different ways and even format pieces of information differently. For example, one employee may think it makes more sense to call a file Blog Post_Object Storage_May 6, 2023, while another might think it makes sense to call that same file Object Storage.Blog Post.05062023.

These decisions do have an impact. For instance, that second date format would confuse the majority of the world, which uses the day/month/year format rather than the month/day/year format common in the United States. And what if you take a different kind of object as your example, one for which versioning becomes important? When do code fixes for version 1.1.3 actually become version 1.2.0?

Simply put, a consistent and well thought out naming convention for your objects makes life easier when it comes to organizing data. Derive a pattern based on your requirements and follow it whenever you name objects, so that files stay easy to find and sort.
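To make that concrete, here is a minimal sketch of one possible convention, not a recommendation of a specific format. The function name `build_object_name`, the underscore separator, and the choice of an ISO 8601 date (year-month-day, which sorts chronologically and avoids the day/month ambiguity described above) are all assumptions made for illustration.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lowercase the text and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_object_name(doc_type: str, subject: str, created: date, version: str = "") -> str:
    """Build a name of the form type_subject_YYYY-MM-DD[_vX.Y.Z] (hypothetical pattern)."""
    parts = [slug(doc_type), slug(subject), created.isoformat()]
    if version:
        parts.append("v" + version)
    return "_".join(parts)


print(build_object_name("Blog Post", "Object Storage", date(2023, 5, 6)))
# prints: blog-post_object-storage_2023-05-06
```

Routing every upload through one helper like this keeps two employees from inventing two different formats for the same file.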

2. Harness The Power of Prefixes

Prefixes provide a folder-like look and feel on object stores (as there are no real folders). They are a powerful way to organize your data, and they let you make good use of the wildcard function in your command line interface (CLI). A good way to think about a prefix is that it creates hierarchical categories in your object name. So, if you were creating a prefix about locations and using slashes as a delimiter, you’d create something like this:

North America/Canada/British Columbia/Vancouver

Let’s imagine a scenario where you generate multiple objects per day. You can structure your data per year, per month, and per day. An example prefix would be year=2022/month=12/day=17/ for the multiple objects generated on December 17, 2022. If you queried for all objects created on that day, you might get results that look like this:

year=2022/month=12/day=17/Object001
year=2022/month=12/day=17/Object002
year=2022/month=12/day=17/Object003

On the Backblaze B2 secure web application, you will notice these prefixes create “folders” three levels deep: year=2022, month=12, and day=17. The folder day=17 will contain all the objects with the example prefix in their names. Partitioning data this way makes it easier to track, and it also helps the processing workflows that use your data after it is stored on Backblaze B2.
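The date partitioning above can be sketched in a few lines. This is an illustration only: `date_prefix` and `filter_by_prefix` are hypothetical helpers, and the filtering is done on an in-memory list to show what a prefix query selects. A real listing call would be made against the storage service.

```python
from datetime import date


def date_prefix(day: date) -> str:
    """Return the year=/month=/day= prefix for a given date."""
    return f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def filter_by_prefix(object_names, prefix):
    """Keep only the names that start with the prefix, as a prefix query would."""
    return [name for name in object_names if name.startswith(prefix)]


names = [
    date_prefix(date(2022, 12, 16)) + "Object001",
    date_prefix(date(2022, 12, 17)) + "Object001",
    date_prefix(date(2022, 12, 17)) + "Object002",
]

for name in filter_by_prefix(names, date_prefix(date(2022, 12, 17))):
    print(name)
```

Zero-padding the month and day keeps the names in chronological order when they are sorted as plain strings.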

3. Programmatically Separate Data

After ingesting data into B2 Cloud Storage, you may have multiple workflows to make use of data. These workflows are often tied to specific environments and in turn generate more new data. Production, staging, and test are some examples of environments.

We recommend keeping the raw copy of the data separate from the new data generated by a specific environment. This lets you keep track of when and how changes were made to your datasets, which in turn means you can roll back to the original state if you need to or replicate the change if it’s producing the results you want. If something goes wrong, such as a bug in your processing workflow, you can rerun the workflow with a fix in place on the raw copy of the data. For the production environment, for example, the prefixes could be /data/env=prod/type=raw and /data/env=prod/type=new.
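One way to enforce that separation is to build every object name through a single helper, so a workflow cannot write into the raw area by accident. The sketch below is hypothetical: the `dataset_key` function and the allowed environment and type values are assumptions based on the examples in this section, and the leading slash is dropped on the assumption that object names should not begin with the delimiter.

```python
# Allowed values are assumptions taken from the examples in the text.
ENVIRONMENTS = {"prod", "staging", "test"}
DATA_TYPES = {"raw", "new"}


def dataset_key(env: str, data_type: str, object_name: str) -> str:
    """Return data/env=<env>/type=<type>/<object_name>, rejecting unknown values."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return f"data/env={env}/type={data_type}/{object_name}"


print(dataset_key("prod", "raw", "Object001"))
print(dataset_key("prod", "new", "Object001"))
```

Rejecting unknown values early means a typo such as `env=prd` fails loudly instead of silently creating a new prefix.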

4. Leverage Lifecycle Rules

While your data volume is ever increasing, we recommend reviewing and cleaning up unwanted data from time to time. Doing that manually is very cumbersome, especially when you have large amounts of data. Never fear: lifecycle rules to the rescue. You can set up lifecycle rules to automatically hide or delete data based on criteria that you configure on Backblaze B2.

For example, some workflows create temporary objects during processing. It’s useful to briefly retain these temporary objects to diagnose issues, but they have no long-term value. A lifecycle rule could specify that objects with the /tmp prefix are to be deleted two days after they are created.
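As a rough illustration of the temporary-object example, the sketch below builds a rule as a plain dictionary and prints it as JSON. The field names (`fileNamePrefix`, `daysFromUploadingToHide`, `daysFromHidingToDelete`) reflect my understanding of the B2 lifecycle rule format and should be checked against the current documentation. The `tmp/` prefix and the one-day-plus-one-day split are assumptions chosen so that objects are removed about two days after upload.

```python
import json


def tmp_cleanup_rule(prefix: str = "tmp/", days_to_hide: int = 1, days_to_delete: int = 1) -> dict:
    """Build a lifecycle rule: hide matching files, then delete the hidden versions."""
    if days_to_hide < 1 or days_to_delete < 1:
        raise ValueError("day counts must be positive integers")
    return {
        "fileNamePrefix": prefix,                 # which objects the rule applies to
        "daysFromUploadingToHide": days_to_hide,  # upload -> hidden
        "daysFromHidingToDelete": days_to_delete, # hidden -> deleted
    }


rule = tmp_cleanup_rule()
print(json.dumps([rule], indent=2))  # rules are supplied as a list on the bucket
```

The rule itself is only data. It takes effect once it is attached to the bucket, through the web application or the API.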

5. Enable Object Lock

Object Lock makes your data immutable for a specified period of time. Once you set that period of time, even the data owner can’t modify or delete the data. This helps to prevent an accidental overwrite of your data, creates trusted backups, and so on.

Let’s return to our production, staging, and test example and imagine a scenario where you upload data to B2 Cloud Storage and run a workflow that processes the data and, in turn, generates new data. Due to a bug, your workflow tries to overwrite your raw data. With Object Lock set, the overwrite won’t happen, and your workflow will likely error out.
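If you upload through an S3-compatible client, a retention period can be attached at upload time. The sketch below only assembles the arguments and sends nothing. `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameter names from boto3's `put_object`, while the helper `locked_upload_args`, the bucket name, and the 30-day period are hypothetical. It also assumes the bucket was created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def locked_upload_args(bucket: str, key: str, body: bytes, retain_days: int, now=None) -> dict:
    """Assemble keyword arguments for an S3-style put_object call with a retention date."""
    if retain_days < 1:
        raise ValueError("retain_days must be at least 1")
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",  # assumed mode; a governance mode also exists in the S3 API
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


args = locked_upload_args(
    "example-bucket",  # hypothetical bucket name
    "data/env=prod/type=raw/Object001",
    b"raw bytes",
    retain_days=30,
)
print(args["ObjectLockRetainUntilDate"].isoformat())
```

With a configured client, these arguments would be passed as `client.put_object(**args)`. Until the retain-until date passes, that version of the object cannot be deleted or modified.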

6. Customize Access With Application Keys

There are two types of application keys on B2 Cloud Storage:

Your master application key. This is the first key you have access to and is available on the web application. This key has all capabilities, access to all buckets, and has no file prefix restrictions or expiration. You only have one master application key—if you generate a new one, your old one becomes invalid.
Non-master application key(s). This is every other application key. These keys can be limited to a bucket, or even to files within that bucket using prefixes; can be set to read-only, read-write, or write-only access; and can expire.

That second type of key is the important one here. Using application keys, you can grant or restrict access to data programmatically. You can make as many application keys in Backblaze B2 as you need (the current limit is 100 million). In short: you can get detailed in customizing access control.

In any organization, it’s always best practice to only grant users and applications as much access as they need, also known as the principle of least privilege. That rule of thumb reduces risk in security situations (of course), but it also reduces the possibility for errors. Extend this logic to our accidental overwrite scenario above: if you only grant access to those who need to (or know how to) use your original dataset, you’re reducing the risk of data being deleted or modified inappropriately.

Conversely, you may be in a situation where you want to grant lots of people access, such as when you’re creating a cell phone app, and you want your customers to review it (read-only access). Or, you may want to create an application key that only allows someone to upload data, not modify existing data (write-only access), which is useful for things like log files.

And, importantly, this type of application key can be set to expire, which means that you will need to actively re-grant access to people. Making granting access your default (as opposed to taking away access) means that you’re forced to review and validate who has access to what at regular intervals, which in turn means you’re less likely to have legacy stakeholders with inappropriate access to your data.

Two great places to start here are restricting the access to specific data by tying application keys to buckets and prefixes and restricting the read and write permissions of your data. You should think carefully before creating an account-wide application key, as it will have access to all of your buckets, including those that you create in the future. Restrict each application key to a single bucket wherever possible.
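Here is a hedged sketch of what a narrowly scoped key request might look like. It only builds the request body and does not call any API. The field names (`capabilities`, `bucketId`, `namePrefix`, `validDurationInSeconds`) and capability strings follow my understanding of the B2 `b2_create_key` call and should be verified against the API reference. The helper name and all ID values are placeholders.

```python
READ_ONLY = ["listFiles", "readFiles"]  # assumed capability names for read-only access
WRITE_ONLY = ["writeFiles"]             # assumed capability name for upload-only access


def scoped_key_request(account_id, key_name, bucket_id, name_prefix, capabilities, valid_days):
    """Build a request body for a key limited to one bucket, one prefix, and a lifetime."""
    if valid_days < 1:
        raise ValueError("valid_days must be at least 1")
    return {
        "accountId": account_id,
        "keyName": key_name,
        "capabilities": list(capabilities),
        "bucketId": bucket_id,      # restricts the key to a single bucket
        "namePrefix": name_prefix,  # restricts the key to names under this prefix
        "validDurationInSeconds": valid_days * 24 * 60 * 60,  # forces periodic re-granting
    }


request = scoped_key_request(
    "ACCOUNT_ID_PLACEHOLDER",
    "raw-data-reader",
    "BUCKET_ID_PLACEHOLDER",
    "data/env=prod/type=raw/",
    READ_ONLY,
    valid_days=30,
)
print(request["capabilities"], request["validDurationInSeconds"])
```

Swapping `READ_ONLY` for `WRITE_ONLY` gives the upload-only key described above for things like log files.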

What’s Next?

Putting some of these guidelines into practice makes large volumes of data easier to store and manage. Pick and choose the ones that best fit your requirements. So far, we have talked about organizing data within a bucket. In the future, I’ll provide some guidance about organizing buckets on B2 Cloud Storage.

About Bala Krishna Gangisetty
