Getting Pydio to access a Ceph S3 backend

I’ve been experimenting with Ceph lately and wanted to hook up a web-based front end. A quick Google search yielded ownCloud and Pydio. Since ownCloud advertises S3 backend availability only for the Pro version, I decided to give Pydio a go.

Unfortunately this turned out to be a bit fraught with difficulties, so I wanted to document the various errors here in case someone else runs into them.
Note that these are just some quick steps to get up and running; you should review file permissions and access rights when installing this on a production server!

The following steps were all executed on an Ubuntu 16.04.1 installation, which ships with PHP 7.0 and Apache 2.4.18.
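
If you’re starting from a bare server, Apache and PHP need to go on first. I won’t claim this is the exact list Pydio requires (its install wizard has a diagnostics page that tells you what’s missing), but on Ubuntu 16.04 I’d expect something along these lines to cover it:

# Base web stack plus the PHP extensions Pydio's diagnostics commonly ask for --
# adjust according to what the wizard reports
sudo apt-get update
sudo apt-get install apache2 php libapache2-mod-php \
    php-xml php-mbstring php-curl php-gd php-intl php-sqlite3 php-mysql
sudo service apache2 restart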

Installing Pydio

Since the community edition only has packages for trusty as far as I could find, I downloaded the tar archive (version 6.4.2 was current at this point) and installed it as a new site in Apache:

wget https://download.pydio.com/pub/core/archives/pydio-core-6.4.2.tar.gz
tar -xzf pydio-core-6.4.2.tar.gz
sudo mkdir /var/www/pydio
sudo mv pydio-core-6.4.2/* /var/www/pydio
sudo chown -R root:root /var/www/pydio/
sudo chown -R www-data:www-data /var/www/pydio/data

Creating an Apache config

sudo vim /etc/apache2/sites-available/pydio.conf

Put this in as the content:

Alias /pydio "/var/www/pydio/"
<Directory "/var/www/pydio">
  Options +FollowSymLinks
  AllowOverride All

  # Apache 2.4 access control; Ubuntu's default /var/www block already grants
  # this, but being explicit doesn't hurt
  Require all granted

  SetEnv HOME /var/www/pydio
  SetEnv HTTP_HOME /var/www/pydio
</Directory>

Make the site available and restart Apache

cd /etc/apache2/sites-enabled
sudo ln -s ../sites-available/pydio.conf pydio.conf
sudo service apache2 restart

At this stage you should be able to access your Pydio install in a browser via http://serverip/pydio.
Pydio has an install wizard which will guide you through setting up an admin user and the database backend (for testing you could just go with SQLite, otherwise you will have to set up a Postgres or MySQL database and an associated pydio user).
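
If you go down the MySQL route, something along the following lines should do (assuming mysql-server is installed; the database name, user and password are just placeholders I picked for illustration):

# Create a database and user for Pydio -- replace the password with your own
mysql -u root -p -e "CREATE DATABASE pydio DEFAULT CHARACTER SET utf8; \
  CREATE USER 'pydio'@'localhost' IDENTIFIED BY 'choose-a-password'; \
  GRANT ALL PRIVILEGES ON pydio.* TO 'pydio'@'localhost'; \
  FLUSH PRIVILEGES;"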

Hooking up to the Ceph S3 backend

Pydio organizes files into workspaces, and a plugin for S3-backed workspaces ships out of the box.
So the next step is to log into Pydio as the admin user and make sure the access.s3 plugin is activated. You will probably see an error complaining about the AWS SDK not being installed, so that needs to happen first:

cd /var/www/pydio/plugins/access.s3
sudo -u www-data wget http://docs.aws.amazon.com/aws-sdk-php/v2/download/aws.phar
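
Just to double-check that the phar actually loads under your PHP version, here is a quick sanity check from the command line (not an official Pydio step, just something I’d run while debugging):

# Should print bool(true) if the v2 SDK's S3 client can be autoloaded from the phar
php -r "require '/var/www/pydio/plugins/access.s3/aws.phar'; var_dump(class_exists('Aws\S3\S3Client'));"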

Since radosgw (the S3 interface into Ceph) only supports v2 signatures (10.2.2 Jewel was current at this time), you cannot use the v3 SDK.
Now the plugin should be showing status OK. Double-click it and make sure it uses SDK Version 2.
The next step is to create a new workspace, selecting the S3 backend as the storage driver.

  • For Key and Secret Key, use the ones created for your radosgw user (how to create radosgw users for S3 can be looked up on the internet)
  • For Region, use US Standard (not sure if it really matters)
  • Container is the bucket you want all the files for this workspace to be stored in. Pydio won’t create the bucket for you, so you’ll have to create it with another S3-capable client (see the example after this list)
  • Set Signature Version to Version 2 and API Version to 2006-03-01
  • Custom Storage is where you point Pydio at your local radosgw instance; Storage URL is the setting you need for that. Put in the full URL including the protocol, e.g. http://radosgw-server-ip:7480/ (assuming you’re running radosgw on the default port, which is 7480 with the Jewel release)
  • I’ve disabled the Virtual Host Syntax as well, since I’m not sure yet how to make this work.
  • Everything else I’ve left at the default settings.
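
To create the bucket I’d reach for s3cmd, but any S3-capable client will do; the endpoint, credentials and bucket name below are placeholders for your own setup:

# Create the workspace bucket directly on radosgw (path-style, v2 signatures)
s3cmd --access_key=YOUR_KEY --secret_key=YOUR_SECRET \
      --host=radosgw-server-ip:7480 --host-bucket=radosgw-server-ip:7480 \
      --no-ssl --signature-v2 mb s3://pydio-workspace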

Now the fun begins. Here is the first error message I encountered when trying to access the new workspace:

Argument 1 passed to Aws\S3\S3Md5Listener::__construct() must implement interface Aws\Common\Signature\SignatureInterface, string given

Some quick googling seemed to suggest a client written for SDK v3 was trying to use SDK v2, so I started trialing all the combinations of plugin settings and SDKs, but mostly I just got HTTP 500 errors which left no trace in any of the logfiles I could find.
Another error I encountered during my experiments was:

Missing required client configuration options:   version: (string)
A "version" configuration value is required. Specifying a version constraint
ensures that your code will not be affected by a breaking change made to the
service. For example, when using Amazon S3, you can lock your API version to
"2006-03-01".
Your build of the SDK has the following version(s) of "s3": * "2006-03-01"
You may provide "latest" to the "version" configuration value to utilize the
most recent available API version that your client's API provider can find.
Note: Using 'latest' in a production application is not recommended.
A list of available API versions can be found on each client's API documentation
page: http://docs.aws.amazon.com/aws-sdk-php/v3/api/index.html.
If you are unable to load a specific API version, then you may need to update
your copy of the SDK

I downgraded to PHP 5.6 to rule out any weird 7.0 incompatibilities, which got me a little bit further, so for a while I thought that was the problem, but ultimately it boiled down to the way the backend configures the S3 client. In /var/www/pydio/plugins/access.s3/class.s3AccessWrapper.php, changing

if (!empty($signatureVersion)) {
    $options['signature'] = $signatureVersion;
}

to

if (!empty($signatureVersion)) {
    $options['signature_version'] = $signatureVersion;
}

kicked everything into life. Not sure if that’s due to a recent change in the v2 SDK (current at this point was 2.8.31) or something else. Looking through the Pydio forums it seems like they tested access to a Ceph S3 backend successfully – so who knows.

Next is trying to make it connect to a self-signed SSL gateway.

Thoughts on RAID and NAS – Part 1

I’m currently looking into building my own NAS: basically a standard PC with a whole bunch of disks running Ubuntu or some other Linux distribution. The first thing which comes to mind: “Of course I’m going to run RAID 5 on there. A lot of mainboards these days support it out of the box and I get redundancy.” Well, so I went on to start looking for hardware.

I like to keep things separate, so my idea was to have a dedicated system drive, and I decided to try an SSD for it. A 60GB SSD from OCZ is about NZ$100, which is big enough for a system drive. Also, I know the mantra that “RAID is no backup”, so I thought I’d better put in another separate disk where I could mirror some of the more critical data. Not an ideal backup solution (the backup medium resides in the same environment, connected to the same controller on the same mainboard and the same PSU) but oh well – can’t have everything, can we?

Ok, now that we have 2 disks in the system already, let’s see how many data disks we can fit in. This is constrained by the case (mounting slots), the mainboard (number of SATA connectors) and the PSU (number of power connectors). With the Coolermaster Elite 371 I found a nice case for about NZ$140 which offers six 3.5″ bays and three 5.25″ bays. Assuming that I’ll fit in a DVD drive or something similar, this leaves us with up to 8 slots where HDDs can be mounted.

Then let’s move on to the mainboard. I had a look at various Intel and AMD CPU/mainboard combinations, and the Asus M5A97 Evo plus an Athlon II X2 270 seemed a nice combination. The Asus offers 6x 6Gb/s SATA ports plus integrated RAID 5, and the Athlon should be up to the tasks required of a NAS box. For cheaper Intel CPUs, which are still slightly ahead of the AMD, the mainboards tend to offer fewer features, so the AMD package seemed the best in total. That’s about NZ$270 for board + CPU.

Sweet, so this leaves us with 4 spare ports on the board for data disks. Now, 4 disks at 2TB each gives you approx. 6TB of available capacity in a RAID 5, which is what I was aiming for. All sorted then. As data disks I opted for the Western Digital Green Power 2TB model, at about NZ$120 each.

Together with 4GB of RAM, some case fans, a CPU cooler, some decent wireless gear, a cold spare HDD and 5.25″ -> 3.5″ mounting brackets, the total price of the system clocked in at just under NZ$1900 – not bad. While an off-the-shelf 4-bay NAS would have been about NZ$400-500 cheaper, this solution gives me quite a bit more flexibility.

All sorted then – right? Hmm, not quite. A colleague at work mentioned the bad words “Unrecoverable Read Error” (URE for short) to me and I thought “Well, better check what that’s all about”. Now, as it turns out, this means that approximately every 12TB of data you read off a disk, an “Unrecoverable Read Error” will be reported – in other words, “a bad sector”. This will cause the disk to get dropped from the RAID, which then needs to be rebuilt after the bad sector has been mapped out. Does not sound so bad – right? Well, what happens when you actually have a full disk failure (let’s say a head crash), you replace the drive, and the array gets rebuilt? Now imagine you get a URE during the rebuild – not so nice. It will very likely end up in some data corruption. So I decided to ask the big gooracle and came across this article on ZDNet which gave me some things to think about (and led me to write this post).

The author makes the implicit assumption that, for a 7-disk RAID 5 array with 2TB per disk, in case of a disk failure you will have to read approx. 12TB of data from the other disks and will thus encounter a URE with a probability close to 1 (based on an average 12TB URE rate). I think this is invalid because the URE rate is per disk, and you still only need to read 2TB from each disk. Hmm, let’s see if we can come up with some calculations here.

Let’s define a set of events called URE[x] which means “a URE is encountered after x TB have been read from a single disk”. Then we define the following probabilities:
P(URE[x]) = x/12 for 0 <= x <= 12
P(URE[x]) = 0 for x <= 0 (nothing read yet, extremely unlikely that we get a URE)
P(URE[x]) = 1 for x >= 12 (probability of encountering a URE after 12TB or more have been read)

This assumes that the probability of getting a URE is linear in the amount of data read, which is probably not the case but makes the calculations easier.
Further, let:
n – total number of disks in the array
c – capacity per disk in TB
d – total amount of data read from the array at the point of the rebuild
FAIL – the event that we get a URE while we are trying to rebuild an array which had a total disk failure

P(FAIL) is then the probability that at least one of the remaining (n – 1) disks has a URE while rebuilding. This is equal to one minus the probability that no drive has a URE. The event that a single drive has a URE at that point is URE[d/n + c] (assuming the data read so far is equally distributed across all disks, and that each surviving disk has to be read in full during the rebuild). Therefore P(URE[d/n + c]) = ((d/n) + c) / 12, and the probability that a single drive won’t fail is P(!URE[d/n + c]) = 1 – ((d/n) + c) / 12. Assuming that those events are independent, the probability that none of the (n – 1) drives will have a URE is P(!URE[d/n+c])^(n-1), which means P(FAIL) = 1 – P(!URE[d/n+c])^(n-1) = 1 – (1 – ((d/n) + c) / 12)^(n-1).

Looks a bit dry, so let’s run it with some numbers. The ZDNet article stated that approximately 3% of all drives fail in the first 3 years. Let’s make some assumptions:

I plan to have 4 2TB disks in the array, prime it with about 3TB of data and then cause maybe 5GB/day of read/write traffic on the array. For simplicity’s sake we assume that writes affect the URE rate the same way as reads. So that leaves us with:
n = 4
c = 2 (TB)
d = 3 (TB) + 3 * 365 * 5 / 1000 (TB) = 8.475 (TB)
Therefore P(FAIL) = 1 – (1 – (d/n + c) / 12)^(n-1) = 1 – (1 – ((2.12 + 2) / 12))^3 ≈ 71.7%
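
If you want to play with the parameters without opening a spreadsheet, a quick one-liner reproduces the formula above (my numbers are plugged in as defaults; ure is the assumed 12TB URE rate):

# P(FAIL) = 1 - (1 - (d/n + c)/ure)^(n-1), prints ~71.7% for these values
awk 'BEGIN { n=4; c=2; d=8.475; ure=12;
             p = (d/n + c) / ure;
             printf "P(FAIL) = %.1f%%\n", 100 * (1 - (1 - p)^(n - 1)) }'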

So, if I have a drive failure after 3 years with the above-mentioned setup and usage, the probability of encountering a URE during the rebuild is approximately 72%. I have made a little spreadsheet to calculate the probabilities based on the main parameters: RAID 5 Probability Calculations. Playing around with the numbers shows that increasing the number of disks (like using seven 1.5TB disks) doesn’t help: although P(URE[x]) decreases per disk (as the load is spread), overall P(FAIL) increases due to the larger number of disks.

Only when you move to enterprise drives with a URE rate of about 120TB do you start dropping down to around 10% probability of a failure during a rebuild. However, a 600GB enterprise SAS drive currently costs about NZ$350 and you would need 14 of those to make your 8TB array.

Let’s define an event CRASH which means “a drive has a major crash and is gone for good”. Assuming that CRASH is independent for all disks in an array (which it is not, but again let’s make that assumption for simplicity’s sake), the probability that at least one drive in the array fails is 1 minus the probability that no drive fails, which is 1 – P(!CRASH)^n (with n being the number of disks in the array). Assuming P(CRASH) = 0.03, then P(!CRASH) = 0.97 and for a 4-disk array 1 – 0.97^4 = 11.5%. Again assuming that a crash and a URE during the subsequent rebuild are independent, the probability of having both is roughly 0.115 * 0.72 ≈ 8%. So with the above setup there is an 8% chance of some kind of data loss during the first 3 years.
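
The same kind of quick check works for the combined numbers (again just a sketch using the per-drive crash probability and the P(FAIL) value from above):

# Chance of at least one crash in 3 years, and of a crash followed by a URE during the rebuild
awk 'BEGIN { p_crash=0.03; n=4; p_fail=0.717;
             p_any_crash = 1 - (1 - p_crash)^n;
             printf "P(any crash) = %.1f%%  P(crash then URE) = %.1f%%\n",
                    100 * p_any_crash, 100 * p_any_crash * p_fail }'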

Does that mean RAID 5 is useless? Well – not quite. Just because you have a URE during a rebuild doesn’t mean that all your data is gone. However, it is very likely that some of your data is now corrupted, but that might be only 1 file instead of everything. It depends on your controller and OS how much pain it will be to recover from that and get your array rebuilt. I think it’s potentially more trouble than it’s worth, so I’ll be looking into other alternatives to see what the odds are there.