Data deduplication is a specialized data compression technique used mostly in high-end storage solutions, such as SANs and backup systems, to reduce storage space by eliminating duplicate copies of repeating data. It can also be applied to network transfers to reduce the number of bytes being sent. For example, let’s say I have a 100GB set of data that rarely changes but still needs to be included in the weekly backup. Normally, those backups take more than 400GB after a month, but with data deduplication they could take as little as 100GB, with only the changes made during the month taking extra space. See how much space is saved?
As you can see, it’s a very efficient approach that can save a lot of storage space. Unfortunately, you don’t see it much at the lower end of computing, such as on client systems like Windows 7. It would be very nice if we could adopt this technique in our backup solutions, or even have it built into the client operating system.
OpenDedup is an open source deduplication solution designed for enterprises with virtual environments that are looking for high performance, scalability, and low cost. It is basically a file system for Linux, known as SDFS, with deduplication built in. That isn’t much help if you are on Windows. But the good news is that it also has a Windows port that works on both Windows 7 and Windows Server 2008 R2, though it is not officially supported. Both 32-bit and 64-bit systems are supported.
The online Quick Start Guide no longer exists, so I am outlining a few steps here to help those who are interested in seeing how cool this technique is.
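The arithmetic behind that backup example can be sketched in a few lines of Python. This is only a rough model: the function name backup_storage_gb and the 2GB of weekly changes are assumptions made for illustration, and the metadata overhead that a real deduplicating store adds is ignored.

```python
def backup_storage_gb(full_size_gb, weekly_change_gb, weeks, dedup):
    """Estimate the space used by weekly full backups (rough model)."""
    if not dedup:
        # Every weekly backup stores a complete copy of the data set.
        return full_size_gb * weeks
    # With deduplication the unchanged data is stored once; each later
    # backup only adds the data that changed since the previous one.
    return full_size_gb + weekly_change_gb * (weeks - 1)


if __name__ == "__main__":
    # 100GB data set, 4 weekly backups, an assumed 2GB of changes per week.
    print(backup_storage_gb(100, 2, 4, dedup=False))
    print(backup_storage_gb(100, 2, 4, dedup=True))
```

The smaller the weekly change, the closer the deduplicated total stays to the size of a single full backup.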
Step 1: Download and Install OpenDedup
Go to the OpenDedup download page, scroll down to the bottom, and click to download the Windows binaries. You will need both OpenDedup and the Dokan library for this to work. Dokan is included in the download package and will be installed automatically.
Double-click the downloaded package to start the installation. It’s straightforward, and you can accept all the default settings along the way. But I do suggest changing the default location to the place where you will be storing the data, unless you are going to store the data on the Windows system volume. I find it makes setting up the volume easier later.
You will see a folder called sdfs created in the location you specified during the installation.
Step 2: Create a Dedup Volume
Next, it’s time to create a dedup volume, which is basically a data container on the hard drive where the deduplicated files will be stored.
Open a Command Prompt window as Administrator, navigate to the folder where you installed OpenDedup, and run the following command:
mksdfs --volume-name=dedup --volume-capacity=500GB
It creates a 500GB dedup volume in the default location, which is the c:\program files (x86)\sdfs\volume folder, and saves the configuration file in the etc folder. If you are going to stay with the default settings, you can skip to the next step and start the mounting process. If not, you will need to go into the c:\program files (x86)\sdfs\etc folder and modify the volume xml file to use a different data store location.
You can simply open the xml file in Notepad and replace all “c:\program files (x86)” with your real path to your volume. In my case, I replaced them with “e:\backup”.
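If you would rather not do the search-and-replace by hand, the same edit can be scripted. The following is a minimal sketch, not part of OpenDedup: the function name relocate_volume_config is made up for illustration, it assumes the xml file is plain UTF-8 text, and it keeps a .bak copy of the original file next to it.

```python
import re
from pathlib import Path


def relocate_volume_config(config_path, old_root, new_root):
    """Replace every occurrence of old_root in the config file with new_root."""
    config_file = Path(config_path)
    # Assumption: the volume xml file is UTF-8 text.
    text = config_file.read_text(encoding="utf-8")
    # Match the old path literally and ignore case, since Windows paths
    # are case-insensitive. The lambda keeps backslashes in new_root from
    # being interpreted as escape sequences in the replacement.
    pattern = re.compile(re.escape(old_root), re.IGNORECASE)
    new_text, count = pattern.subn(lambda match: new_root, text)
    # Keep a copy of the original file before overwriting it.
    backup_file = config_file.with_name(config_file.name + ".bak")
    backup_file.write_text(text, encoding="utf-8")
    config_file.write_text(new_text, encoding="utf-8")
    return count
```

You would call it with the path to your own volume xml file, r"c:\program files (x86)" as the old root, and your real location, such as r"e:\backup", as the new root. The return value is the number of replacements made, which is worth checking against what you see in Notepad.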
Step 3: Mount the volume
In the same Command Prompt window, run another command:
mountsdfs -v dedup -m z
It basically tells the system to mount the dedup volume called dedup to drive letter Z. You will see the new drive Z show up in Windows Explorer right away. The drive will stay active until the Command Prompt window is closed or you shut down the machine.
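Since the two commands are always typed the same way, they are easy to wrap in a small script. Here is a hedged sketch that only uses the options shown in the steps above. The helper names are made up, and it assumes the script is run from the OpenDedup install folder on Windows. By default it only prints the commands, so nothing is created or mounted by accident.

```python
import subprocess


def mksdfs_command(volume_name, capacity):
    """Build the volume-creation command used in Step 2."""
    return ["mksdfs", f"--volume-name={volume_name}", f"--volume-capacity={capacity}"]


def mountsdfs_command(volume_name, drive_letter):
    """Build the mount command used in Step 3."""
    return ["mountsdfs", "-v", volume_name, "-m", drive_letter]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        # Assumption: Windows only. shell=True hands the command to cmd.exe,
        # which resolves it the same way as typing it in a Command Prompt.
        # The mount command is expected to keep running for as long as the
        # drive is mounted, so this call will not return until then.
        subprocess.run(command, shell=True, check=True)


if __name__ == "__main__":
    run(mksdfs_command("dedup", "500GB"))
    run(mountsdfs_command("dedup", "z"))
```

Passing dry_run=False actually runs the commands, which still needs a prompt opened as Administrator.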
Seeing how it works
To demonstrate how deduplication works in OpenDedup, I first copied a folder called Data, containing about 4GB of data, to the Z drive. Then I renamed it to Data.1 and copied the same folder over again.
When it was all done, my dedup drive Z had two folders, Data and Data.1, with a total of 8GB of data.
But when you check the drive’s properties, here is how much space is actually used to store this 8GB of data.
Impressed? How fascinating.
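If you are wondering why the second copy costs almost nothing, the idea can be illustrated with a toy chunk store in Python. This is not how SDFS is implemented internally. The class name ChunkStore, the fixed 4096-byte chunk size, and the use of SHA-256 are all assumptions picked to keep the sketch short.

```python
import hashlib
import os


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk hash -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of chunk hashes

    def write(self, name, data):
        hashes = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this content has not been seen before.
            self.chunks.setdefault(digest, chunk)
            hashes.append(digest)
        self.files[name] = hashes

    def read(self, name):
        # Rebuild the file by following its list of chunk hashes.
        return b"".join(self.chunks[h] for h in self.files[name])

    def logical_size(self):
        # What the files add up to from the user's point of view.
        return sum(len(self.chunks[h]) for hs in self.files.values() for h in hs)

    def physical_size(self):
        # What is actually kept on disk: unique chunks only.
        return sum(len(chunk) for chunk in self.chunks.values())


if __name__ == "__main__":
    store = ChunkStore()
    payload = os.urandom(1024 * 1024)  # stand-in for the Data folder
    store.write("Data/file.bin", payload)
    store.write("Data.1/file.bin", payload)
    print("logical bytes: ", store.logical_size())
    print("physical bytes:", store.physical_size())
```

Both names can be read back in full, but the second write adds only a list of hashes, which mirrors what the drive properties show for the two copies of Data.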
A few last words
If you have followed the steps and are still reading, congratulations. You are really interested in this technology. As you can tell, even though you can write some scripts to make things a little more intuitive, the setup process is still tedious, even complicated. But the result is rewarding. I am not sure how many similar tools out there can also do data deduplication, but I am glad that I found this open source tool and am happy so far with what I see. It seems to me a perfect candidate for data backup and archiving purposes.