Watch Out For Unexpected S3 Cost When Using AWS Athena
Following up on my last blog post (Using Parquet on Athena to Save Money on AWS), I wanted to share another thought about AWS Athena, specifically how the S3 bucket is being used by Athena to store query results.
On the first use of Amazon Athena, AWS will automatically create a new bucket to store the query results (bucket name
aws-athena-query-results-<ACCOUNTID>-<REGION>). Athena will store a raw result file (
QueryId.csv) and a metadata file (
By storing the data using the
QueryId, it allows you to access previous query’s result without re-running them (saving you money since you don’t need to rescan the data).
However, you are the owner of the bucket and therefore responsible for the storage on this bucket and here is a couple of reasons why it could cost you a LOT of money:
#1 All the queries are being stored! ALL OF THEM!
AWS Athena store every query results in the bucket. Query data will just accumulate forever costing more and more money on AWS.
#2 Your data may be compressed but the results are not
AWS S3 bucket is storing the results in raw CSV. Your data may be compressed (GZIP, Snappy, …) but the results will be in raw CSV. As an example, I ran an accidental
SELECT * FROM flights.parquet_snappy_data on a 84M dataset using Apache Parquet which resulted on a 977MB file on S3.
How to fix this?
It’s actually pretty easy. If (and only if) you don’t plan to re-use old query results, make sure to setup Lifecycle on your bucket using a Transition or Expiration actions. For example, you could delete query results after 1 or 7 days. At CloudForecast, we actually don’t persist
QueryIds since it’s not useful to us so we expire the AWS S3 files after 1 day.
Want to try CloudForecast? Sign up today and get started with a risk-free 30 day free trial. No credit card required.