Cleaning up garbage files in an AWS S3 bucket

I did not ask for this.

I manage a very large project called State RegData for my employer, the Mercatus Center. This will eventually involve one hundred repositories, two for each state, and a ginormous amount of data. We back up the data in an S3 bucket, and sometimes the sync process with the AWS CLI is… slow. I accidentally messed up one of these very slow syncs, and since then my bucket has been 99% junk like what you see above, on top of the 13 neat directories that actually should be in there.
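For context, the backup itself is nothing fancy, just the CLI's sync command pointed at the bucket (the local path here is a placeholder, not our actual layout):

```
# Mirror a local data directory up to the bucket.
# ./data is a stand-in for wherever the state repos keep their output.
aws s3 sync ./data s3://state-regdata/
```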

I’ve been putting off cleaning this up until now. I found some pretty good documentation and an even better blog post to guide me.

The main problem here is that the AWS CLI doesn’t allow wildcards in object paths, so you can’t just say aws s3 rm s3://state-regdata/s3-state*. However, the --include and --exclude flags let you exclude everything (--exclude '*') and then include what you want (--include 's3-state*'), which accomplishes the same thing.

At first I ran the command aws s3 rm s3://state-regdata --exclude '*' --include 's3-state*', and my command line sat there for a while, pretending to work, and then finished. Turns out, it did nothing, so I compared notes with the blog post linked above and noticed that the author ended his bucket address with a trailing / and used the --recursive flag.
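For the record, the version that finally did the job looked roughly like this. I've added --dryrun, the CLI's preview option, in case you'd rather see what matches before anything actually gets deleted:

```
# The trailing slash plus --recursive makes the filters apply to every key in the bucket.
# Filters are evaluated in order, and later ones take precedence,
# so exclude everything first, then re-include the garbage prefix.
aws s3 rm s3://state-regdata/ --recursive \
    --exclude '*' \
    --include 's3-state*' \
    --dryrun   # drop this flag to actually delete
```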

I tried that, and after streaming through 600,000 garbage file names, I am back in business.
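If you want to confirm the cleanup actually took, a quick listing shows what's left at the top level of the bucket; only the real directories should remain:

```
# List the top-level prefixes in the bucket.
aws s3 ls s3://state-regdata/
```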