Watch Out For Unexpected S3 Cost When Using AWS Athena

aws-athena-query-results--). Athena will store a raw result file (QueryId.csv) and a metadata file (QueryId.csv.metadata).
Because results are stored under the QueryId, you can access a previous query’s results without re-running it (saving you money since you don’t need to rescan the data).
However, you are the owner of the bucket and therefore responsible for its storage costs. Here are a couple of reasons why it could cost you a LOT of money:
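As a rough illustration of that naming scheme, here is a minimal Python sketch. The helper names and the `prefix` parameter are hypothetical; only the `QueryId.csv` and `QueryId.csv.metadata` naming comes from the description above. The second function assumes boto3 is installed and AWS credentials are configured, and it reads only the first page of results.

```python
from __future__ import annotations


def result_object_keys(query_id: str, prefix: str = "") -> tuple[str, str]:
    """Return the (results, metadata) object keys for one Athena query.

    `prefix` is a hypothetical key prefix inside the results bucket.
    """
    base = f"{prefix}{query_id}.csv"
    return base, f"{base}.metadata"


def fetch_previous_rows(query_id: str) -> list[list[str]]:
    """Read the first page of a finished query's rows without re-running it."""
    import boto3  # imported lazily so the helper above works without boto3

    athena = boto3.client("athena")
    page = athena.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; a NULL cell has no "VarCharValue" key.
    return [
        [cell.get("VarCharValue", "") for cell in row["Data"]]
        for row in page["ResultSet"]["Rows"]
    ]
```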
#1 All the queries are being stored! ALL OF THEM!
AWS Athena stores every query’s results in the bucket. Unless you remove them, the results will just accumulate forever, costing more and more money on AWS.
#2 Your data may be compressed but the results are not
Athena stores the results on S3 as raw CSV. Your data may be compressed (GZIP, Snappy, …) but the results will not be. As an example, I accidentally ran SELECT * FROM flights.parquet_snappy_data on an 84M dataset stored as Apache Parquet, which resulted in a 977MB file on S3.
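To get a feel for how quickly this adds up, here is a back-of-the-envelope sketch in Python. The workload (100 queries a day, each as large as the 977MB result above) is hypothetical, and the default price of $0.023 per GB-month is an assumption based on S3 Standard pricing; check the current price for your region and storage class.

```python
def monthly_storage_cost_usd(
    queries_per_day: float,
    avg_result_mb: float,
    days_accumulated: int,
    usd_per_gb_month: float = 0.023,  # assumed S3 Standard price, not from the article
) -> float:
    """Estimate the monthly S3 bill for query results kept for `days_accumulated` days."""
    stored_gb = queries_per_day * days_accumulated * avg_result_mb / 1024
    return stored_gb * usd_per_gb_month


# Hypothetical workload: 100 queries/day, 977 MB per result.
after_one_year = monthly_storage_cost_usd(100, 977, days_accumulated=365)
with_one_day_expiry = monthly_storage_cost_usd(100, 977, days_accumulated=1)
print(f"kept for a year: ${after_one_year:,.2f}/month")
print(f"expired after 1 day: ${with_one_day_expiry:,.2f}/month")
```

Under these assumptions, a year of accumulated results costs roughly $800 a month to store, against about $2 a month when results expire after a day.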
How to fix this?
It’s actually pretty easy. If (and only if) you don’t plan to re-use old query results, make sure to set up a Lifecycle rule on your bucket using a Transition or Expiration action. For example, you could delete query results after 1 or 7 days. At CloudForecast, we actually don’t persist QueryIds since they’re not useful to us, so we expire the AWS S3 files after 1 day.
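One possible way to set up such an Expiration rule from Python is sketched below. It is not necessarily how we configure it at CloudForecast: the function names, the rule ID, and the `prefix` parameter are hypothetical, and `apply_lifecycle` assumes boto3 is installed with credentials allowed to change the bucket's lifecycle configuration.

```python
def expiration_lifecycle(days: int = 1, prefix: str = "") -> dict:
    """Build a lifecycle configuration that deletes objects `days` days after creation.

    An empty `prefix` makes the rule apply to every object in the bucket.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"expire-athena-results-after-{days}d",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle(bucket: str, days: int = 1, prefix: str = "") -> None:
    """Apply the rule to `bucket`; this replaces any lifecycle rules already set on it."""
    import boto3  # imported lazily so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expiration_lifecycle(days, prefix),
    )
```

Because `put_bucket_lifecycle_configuration` overwrites the bucket's existing lifecycle configuration, merge in any rules you already rely on before calling `apply_lifecycle`.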
Feel free to reach out if you have any questions at [email protected] or on Twitter: @francoislagier. Also, follow our journey @cloudforecast.