Welcome to a (De)-Duplicate Reality - Sizing SIR

Tuesday, November 6th, 2007

First I want to welcome you to my first blog. Chris Poelker (our other revered blogger) and I have known each other for many years, even before he followed me to FalconStor. Chris is clearly a technology visionary and big-picture guy. I sometimes tend to get mired in the details, but together we make a great team. It has been a challenge for me to step away from the keyboard and spend more time at a whiteboard. But with a dynamic company like FalconStor, I have enjoyed being able to work the whole life cycle of enterprise projects and still get my hands dirty in the details.

A major focus of mine this year has been around FalconStor’s de-duplication technology, Single Instance Repository (SIR). We will assume that you understand what de-duplication can do and how it does it. I thought we could jump into one of the more challenging questions that you have to answer when you are looking to deploy the technology.

How do I size the de-duplication repository for my environment?

Many people, including myself, have worked on creating tools to estimate the data reduction that a given environment will experience. The more complicated the calculations became, the more I realized that there are too many variables in an environment that can affect the result.

The short answer to the overall ratio question is 7:1. If you have 100 terabytes (TB) of data on your backup tapes, you will need about 15 TB of disk to store it.
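As a quick check on that arithmetic, here is a minimal sketch in Python. The function name `repository_size_tb` is made up for illustration and is not part of any FalconStor tool; it simply divides the amount of backup data by the overall ratio.

```python
def repository_size_tb(backup_tb: float, overall_ratio: float) -> float:
    """Disk needed to hold backup_tb of backup data at a given overall ratio.

    Hypothetical helper for illustration only.
    """
    if overall_ratio <= 0:
        raise ValueError("overall_ratio must be positive")
    return backup_tb / overall_ratio


# 100 TB on tape at an overall 7:1 ratio
size = repository_size_tb(100, 7)
print(f"{size:.1f} TB")  # 14.3 TB, which rounds up to "about 15 TB"
```

Rounding up rather than down keeps the estimate on the conservative side.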

Before any of you start banging your fist on the table asking where the 20:1 or 30:1 reduction ratios are: you will see those as daily averages after 4-8 weeks. Ratios that high are very effective at reducing the recurring bandwidth required to replicate. However, a copy of the unique data still needs to be stored. Any de-duplication product will start out with lower ratios in the first few weeks, and the ratios will rise over time.

The simple model assumes that you take one full backup of everything each week and have 30-day retention on most of the data. It does not matter if you take your full backups over the weekend or spread them out over the week. If you take 7 full backups a week, your reported ratio will be a lot higher, but in the end the amount of storage required to hold it for 30 days will not change much.
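To see why the reported ratio and the storage requirement can move independently, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the helper name `reported_ratio`, a 25 TB full backup (100 TB spread over four weekly fulls), about 15 TB of unique data in the repository, and roughly 30 retained fulls when a full is taken every day. It also takes the point above at face value and treats the unique data as unchanged between the two schedules.

```python
def reported_ratio(full_tb: float, fulls_retained: int, unique_tb: float) -> float:
    """Ratio of logical backup data retained to unique data actually stored.

    Hypothetical helper for illustration only.
    """
    return full_tb * fulls_retained / unique_tb


full_tb = 25.0     # assumed size of one full backup
unique_tb = 15.0   # assumed unique data held in the repository

weekly = reported_ratio(full_tb, 4, unique_tb)   # one full per week, ~30-day retention
daily = reported_ratio(full_tb, 30, unique_tb)   # one full per day, ~30-day retention

print(f"weekly fulls: {weekly:.1f}:1 reported, {unique_tb:.0f} TB stored")
print(f"daily fulls:  {daily:.1f}:1 reported, {unique_tb:.0f} TB stored")
```

The reported ratio jumps because far more logical data is pushed through the system, while the disk actually consumed stays about the same in this simplified model.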

Using these worst-case figures gives you a 7:1 overall repository ratio:

Week 1: 4:1

Week 2: 7:1

Week 3: 9:1

Week 4: 10:1

The weeks that follow climb to 20:1 or higher, and the size of the repository will start to plateau as data is deleted.
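The overall figure can be reproduced from the four weekly ratios above. The sketch below is in Python and assumes each week contributes the same amount of backup data (one full of everything per week); the function names are hypothetical. Each week stores 1/ratio of its data, so the overall ratio is the total data divided by the total stored.

```python
weekly_ratios = [4, 7, 9, 10]  # weeks 1-4, from the worst-case figures above


def overall_ratio(ratios):
    """Overall ratio, assuming every week backs up the same amount of data."""
    stored_fraction = sum(1 / r for r in ratios)  # disk used per unit of weekly data
    return len(ratios) / stored_fraction


def repository_tb(total_backup_tb, ratios):
    """Disk needed when total_backup_tb is split evenly across the weeks."""
    per_week_tb = total_backup_tb / len(ratios)
    return sum(per_week_tb / r for r in ratios)


print(f"overall ratio: {overall_ratio(weekly_ratios):.1f}:1")       # 6.6:1, roughly 7:1
print(f"disk for 100 TB: {repository_tb(100, weekly_ratios):.1f} TB")  # 15.1 TB
```

Note that the first week dominates the total: at 4:1 it accounts for about 6 of the 15 TB on its own.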

Some customers have seen first- and second-week de-dupe ratios of 17:1 or higher. Even if you were to double your de-duplication ratios, you would reduce your storage size by less than 20%. But I like to live on the conservative side, and this is a blog. With FalconStor, you can add storage to the SIR with little impact on your backup environment, thanks to the post-processing model. But that is a topic for another blog entry.

I hope this discussion has brought a little clarity to sizing a repository. Please contact FalconStor for a more detailed whitepaper or a formal discussion. The next logical blog entry will discuss sizing the VTL storage space (cache). Yes, it takes a little extra storage, but the benefits outweigh the costs.
