Automatic database backups to Amazon S3

When you’re in the publishing industry, regular database backups are crucial: one deleted server or accidental drop of a database can mean hours, days, or more of lost content. This is compounded by platforms like WordPress, which encourage users to author content directly in the content management system; if the database ever disappears, you’ve lost all of your content!

To protect that content, it’s crucial that publishers create backups on a regular schedule. If you’re using a managed database service like Amazon RDS, these may be handled for you, but you can never have too many backups of your production database.

One simple approach that we’re using as part of our more comprehensive disaster recovery plan is an automated, daily backup of all of our databases to Amazon S3.

Setting up Amazon S3

If you haven’t worked with Amazon Simple Storage Service (S3) before, it’s a cloud-based object storage service that serves as one of the cornerstones of Amazon Web Services. We can use it to store all sorts of content, including images, videos, log files, database dumps, and more.

To securely store our database backups, we’ve created a dedicated S3 bucket:

Modal for creating a new Amazon S3 bucket

Don’t worry about setting the permissions just yet; we’ll be doing that in the next section. Give your bucket a unique name, select a region, then click “Create”.
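If you’d rather script this step, the bucket can also be created with the AWS CLI. This is a minimal sketch, assuming the CLI is already configured with credentials that can create buckets; “my-backup-bucket” and “us-east-2” are placeholders for your own name and region (note that us-east-1 is the one region that rejects a LocationConstraint):

$ aws s3api create-bucket --bucket my-backup-bucket \
  --region us-east-2 \
  --create-bucket-configuration LocationConstraint=us-east-2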

Creating the backup IAM group

To prevent unauthorized access and keep compromised AWS credentials from sinking an organization, Amazon offers (and strongly encourages the use of) its Identity and Access Management (IAM) tool. Using IAM, separate users can be created for different tasks, each with very specific permissions. Thanks to IAM, a compromised S3 user for web-facing assets won’t put your database backups at risk (or vice versa).

To keep our backups secure, we’ll create a new IAM group called “Backup-S3”. Members of this group will be able to PUT objects into our backup S3 bucket, but will be unable to do anything else (or access any other buckets we might have).

Amazon Identity and Access Management (IAM) group screen

Under “Inline Policies”, we’ll create a new policy with the following contents (replacing “$BUCKET” with the name of the bucket you created earlier):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::$BUCKET/*"
    }
  ]
}

Note: in the screenshot above, we have “Backup-S3” as a managed policy, since we’re re-using it across a few different groups and services within AWS. The policy code is exactly the same; use whichever approach makes the most sense for your team.
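For teams that prefer the AWS CLI over the console, the same group and inline policy could be set up along these lines. This is a sketch that assumes the policy JSON above has been saved locally as backup-policy.json; “AllowBackupPuts” is just a placeholder policy name:

$ aws iam create-group --group-name Backup-S3
$ aws iam put-group-policy --group-name Backup-S3 \
  --policy-name AllowBackupPuts \
  --policy-document file://backup-policy.json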

Once you’ve established your IAM group policy, we need to create one or more members. You’ll most likely want to create a separate IAM user for each server (for example, engineering.growella.com and growella.com each use a separate IAM user in our setup) to reduce the risk that one compromised server could bury the backup S3 bucket in bad data (remember, the only thing our backup users are able to do is PUT objects).

Within the IAM tool, create a new user with only programmatic access to AWS (the backup user shouldn’t need to, nor be able to, log into the AWS Management Console) and assign it to the IAM group created above. Take care to copy the access and secret keys, as you’ll need those shortly!
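The console works fine for this, but as a rough CLI equivalent (the user name “growella-backup” is a placeholder, and create-access-key prints the access and secret keys exactly once, so save them somewhere safe):

$ aws iam create-user --user-name growella-backup
$ aws iam add-user-to-group --user-name growella-backup --group-name Backup-S3
$ aws iam create-access-key --user-name growella-backup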

Automating database backups and sending them to Amazon S3

Now that we have an S3 bucket to store our backups and a user who can write to it, it’s time to automate our database backup process.

First, we’ll establish a connection with Amazon S3 using s3cmd, a command-line tool for communicating with S3 buckets. On Ubuntu (and other Debian-based installations), this can be done with the following command:

$ sudo apt-get install s3cmd

Once s3cmd is installed, we need to configure it:

$ s3cmd --configure

This will walk you through authenticating as the database backup user we created earlier, using the access and secret keys. The default settings should otherwise be fine. Don’t be alarmed if you see an error warning that s3cmd can’t read the list of buckets; this is expected behavior, since our policy only grants access to our backup bucket and does not include the ListAllMyBuckets permission.
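Before going any further, it’s worth a quick sanity check that the new credentials can actually write to the bucket. Something like the following should succeed (replacing “my-backup-bucket” with your bucket name), even though listing the bucket will not:

$ echo "connectivity test" > /tmp/s3-test.txt
$ s3cmd put /tmp/s3-test.txt s3://my-backup-bucket/s3-test.txt

Feel free to delete the test object from the S3 console afterwards.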

Daily database dumps

Next, we want to ensure that our server is automatically sending fresh database backups to S3 every morning.

The Principle of Least Authority dictates that our users should have the lowest level of access necessary to accomplish their jobs, so we’ll create a new database user within MySQL specifically for database backups:

# Create the new database user, "backup".
# "{PASSWORD}" should be replaced w/ something secure!
CREATE USER 'backup'@'localhost'
IDENTIFIED BY '{PASSWORD}'; # Set a secure password

# Grant "backup" read-only privileges on all databases.
GRANT SELECT, SHOW VIEW, RELOAD,
REPLICATION CLIENT, EVENT, TRIGGER, LOCK TABLES
ON *.* TO 'backup'@'localhost';
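You can double-check the new account’s privileges before moving on; running the following as an administrative MySQL user should list exactly the read-only grants above:

# Confirm the "backup" user's privileges.
SHOW GRANTS FOR 'backup'@'localhost';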

Even with a read-only database user, we don’t want to be storing these credentials somewhere unsafe. To prevent plaintext passwords from being stored, we’ll take advantage of MySQL’s configuration utility, mysql_config_editor:

$ mysql_config_editor set --login-path=database-backup \
  --host=localhost --user=backup --password

This command will prompt for the “backup” user’s password, then store the credentials under a login path named “database-backup” (in the encrypted ~/.mylogin.cnf file in your home directory). From now on, we can connect to the database as the backup user by running:

$ mysql --login-path=database-backup
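A quick way to confirm the stored credentials work is to run a harmless query through the login path; the exact list of databases will, of course, vary by server:

$ mysql --login-path=database-backup -e "SHOW DATABASES;"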

The next piece in this puzzle is creating the database backups and sending them to Amazon S3 (finally!). Fortunately for you, we’ve already done the legwork and made our automatic database backup script available on GitHub; either download the script and upload it to your server, or run the following command on your server:

$ git clone https://gist.github.com/e71a4ebdd0bada440cbe95b222161d1d.git database-backup \
  && chmod +x database-backup/database-backup.sh

The backup script works by connecting to your server’s MySQL instance using the “database-backup” login path we created via mysql_config_editor, then looping through each database (omitting standard system databases), dumping and gzipping its contents to /tmp/{DATABASE_NAME}.sql.gz, and pushing that file to S3 via s3cmd.
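If you’d rather roll your own (or just want to understand what the script is doing before running it), the core loop looks roughly like the sketch below. This is a simplified illustration rather than the published Gist; the list of excluded system schemas and the use of --single-transaction are assumptions you may want to adjust:

#!/usr/bin/env bash
# Usage: ./database-backup.sh {bucket}/{environment}
S3_PATH="$1"

# List databases via the stored "database-backup" login path,
# skipping MySQL's own system schemas.
DATABASES=$(mysql --login-path=database-backup -N -e "SHOW DATABASES;" \
  | grep -Ev '^(information_schema|performance_schema|mysql|sys)$')

for DB in $DATABASES; do
  # Dump and gzip each database to a temporary file...
  mysqldump --login-path=database-backup --single-transaction "$DB" \
    | gzip > "/tmp/${DB}.sql.gz"

  # ...then push it to the backup bucket and clean up.
  s3cmd put "/tmp/${DB}.sql.gz" "s3://${S3_PATH}/database/${DB}.sql.gz"
  rm "/tmp/${DB}.sql.gz"
done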

You can verify this behavior by running the command manually (replacing “my-backup-bucket” with the name of the S3 bucket you created at the start of this tutorial):

$ database-backup/database-backup.sh my-backup-bucket/test

If everything goes according to plan, you should be able to see a new folder in your S3 bucket, named “test”, with a structure that looks something like this:

/my-backup-bucket
  /test
    /database
      - mydatabase.sql.gz

Each database from your production server should be represented under /test/database/, with the database backup named to match the database name.
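If you have s3cmd (or the AWS CLI) configured elsewhere with credentials that are allowed to list the bucket, you can also confirm the uploads from the command line; remember that the backup user itself only has PutObject, so this check won’t work with its keys:

$ s3cmd ls s3://my-backup-bucket/test/database/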

Note: once you’ve run your test successfully, feel free to delete the /test/ directory from your S3 bucket; when we run this script on a schedule, we’ll replace “test” with the environment name.

Now that we’ve verified our script is backing up all of our databases to Amazon S3, it’s time to automate the process. To do this, we’ll turn to our dear friend, Cron. Run crontab -e as the user you’d like to be responsible for backups, then add the following line to the user’s crontab:

# Daily database backups at 7am UTC
0 7 * * * /path/to/database-backup/database-backup.sh {bucket}/{environment} >> /var/log/database-backup.log

Feel free to adjust the times as appropriate for your server, but 7am UTC is 2am for our Ohio-based team, which works out nicely. The “{bucket}” placeholder should also be replaced with your bucket name, and “{environment}” replaced with an identifier for your site (e.g. “production”, “staging”, “engineering-blog”, etc.); this lets us use a single S3 bucket to back up any number of web properties.
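As a concrete example, a filled-in entry might look like the following; the script path, bucket, and environment are placeholders for whatever you used above, and appending 2>&1 is an optional tweak that captures error output in the same log:

# Daily database backups at 7am UTC
0 7 * * * /home/deploy/database-backup/database-backup.sh my-backup-bucket/production >> /var/log/database-backup.log 2>&1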

Rotating database backups using versioning and lifecycle

You may have noticed that the database backups always carry the same filename: {database}.sql.gz. Isn’t this just overwriting the backup every time we run our script?

You’re absolutely correct; or rather, you would be, were it not for S3’s versioning capabilities. Using versioning, Amazon S3 can keep multiple versions of the same file, allowing you to “roll back” to a previous version if necessary. Turning this on is as simple as enabling versioning in the bucket’s “Properties” tab.

Versioning setting for Amazon S3 buckets
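If you manage your buckets from the command line instead, the equivalent AWS CLI call is below (run with credentials that are allowed to administer the bucket; the bucket name is a placeholder):

$ aws s3api put-bucket-versioning --bucket my-backup-bucket \
  --versioning-configuration Status=Enabled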

Once versioning is active, we also gain access to S3’s lifecycle configuration. With lifecycle, we can determine how long we want to retain database backups before they’re deleted; in the case of Growella, we hold daily production database backups for 30 days, and staging backups for a week. This gives us enough time to react should anything go wrong (or we need to reference old content that’s since been removed) without hoarding gigabytes upon gigabytes of old backups.

To configure your bucket’s lifecycle, visit the “Lifecycle” tab, then click “Add Lifecycle Rule”:

This is another benefit of how we’ve organized our backups into /{environment}/database/: we can specify that we only want to rotate database backups (you could use this same bucket to store an archive of an old project indefinitely, for example). In the rule pictured below, we’ve specified that we only want to operate on /production/database/.

Configuring the lifecycle of an S3 bucket, specifying the rule's name and scope

After giving our lifecycle rule a name and a corresponding scope (path prefix), we’ll skip ahead to the “Expiration” step (no configuration is necessary on the “Transitions” step), where we’ll define how long we want to retain backups:

The "Expiration" step of the S3 bucket lifecycle rule wizard

In this step, we want to expire previous versions of our files 30 days after they’ve been replaced as the current version. Depending on your data retention needs, you might increase or decrease this value, but 30 days seems like a reasonable default.
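For reference, the same rule expressed as a lifecycle configuration document looks roughly like this (the rule ID and prefix are placeholders matching the structure described above); it could be applied with aws s3api put-bucket-lifecycle-configuration rather than the console, if you prefer:

{
  "Rules": [
    {
      "ID": "expire-old-database-backups",
      "Filter": { "Prefix": "production/database/" },
      "Status": "Enabled",
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    }
  ]
}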

Finally, save your new lifecycle rule. If everything works as it’s supposed to, Amazon S3 will hold 30 days’ worth of database backups, with a new one being generated every morning!

30 days of database backups, available to use through Amazon S3

A word on backups

As recent history can attest, having a backup policy in place is only half the battle; it’s crucial that disaster recovery plans are regularly audited and tested. The worst time to find out that backups weren’t working as you had expected is during an outage, so please make it a habit to regularly verify your team’s disaster recovery procedure.
