Francois Lagier
Francois is our CFO (Chief *French* Officer) and a co-founder. He runs the tech side of CloudForecast with Kacy and is always asking Tony for customer feedback.

Using Athena to transform a CSV file to Parquet

Athena is a powerful product that lets anyone with SQL skills analyze large-scale datasets in seconds, without setting up complex extract, transform, and load (ETL) processes.

I wrote about Athena in my last two blog posts, and I wanted to follow up on a lesser-known feature of Athena: the ability to transform a CSV file to Parquet, really cheaply!

Transforming a CSV file to Parquet is not a new challenge, and it's well documented (here, here, or even here). All these options are great and can be used in production, but they all require tools like AWS EMR, Spark, or AWS Glue. If you just want to try out Parquet or have a one-off task, Athena can speed up the process.

Thanks to the CREATE TABLE AS feature, it takes a single query to transform an existing table into a table backed by Parquet.

To demonstrate this feature, I'll use an Athena table querying an S3 bucket with ~666 MB of raw CSV files (see Using Parquet on Athena to Save Money on AWS for how to create the table and to learn the benefits of using Parquet).
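For reference, here is roughly what the raw CSV table behind raw_data looks like. This is a sketch only: the column list matches the Parquet table shown later in this post, but the delimiter, header setting, and S3 path are assumptions on my part; the exact DDL is in the post linked above.

CREATE EXTERNAL TABLE raw_data(
  `year` smallint,
  `month` smallint,
  `day_of_month` smallint,
  `flight_date` string,
  `op_unique_carrier` string,
  `flight_num` string,
  `origin` string,
  `destination` string,
  `crs_dep_time` string,
  `dep_time` string,
  `dep_delay` double,
  `taxi_out` double,
  `wheels_off` string,
  `arr_delay` double,
  `cancelled` double,
  `cancellation_code` string,
  `diverted` double,
  `air_time` double,
  `carrier_delay` double,
  `weather_delay` double,
  `nas_delay` double,
  `security_delay` double,
  `late_aircraft_delay` double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','      -- assumes comma-delimited files
LOCATION
  's3://{INSERT_BUCKET}/raw-csv-data/'          -- assumed path, use your own bucket
TBLPROPERTIES ('skip.header.line.count'='1')    -- assumes the CSV files have a header row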

Here is the query to convert the raw CSV data to Parquet:

    -- Create a new table backed by Snappy-compressed Parquet files,
    -- populated from the existing raw CSV table in a single query.
    CREATE TABLE flights.athena_created_parquet_snappy_data
    WITH (
          format = 'PARQUET',                  -- output file format
          parquet_compression = 'SNAPPY',      -- compression codec for the Parquet files
          external_location = 's3://{INSERT_BUCKET}/athena-export-to-parquet'  -- where Athena writes the files
    ) AS SELECT * FROM raw_data

Since Athena charges only for the data scanned (in this case ~666 MB), this query is remarkably cheap: at Athena's rate of $5 per TB scanned, it works out to about $0.0031.

The data is now available in my new table flights.athena_created_parquet_snappy_data:

CREATE EXTERNAL TABLE `athena_created_parquet_snappy_data`(
  `year` smallint, 
  `month` smallint, 
  `day_of_month` smallint, 
  `flight_date` string, 
  `op_unique_carrier` string, 
  `flight_num` string, 
  `origin` string, 
  `destination` string, 
  `crs_dep_time` string, 
  `dep_time` string, 
  `dep_delay` double, 
  `taxi_out` double, 
  `wheels_off` string, 
  `arr_delay` double, 
  `cancelled` double, 
  `cancellation_code` string, 
  `diverted` double, 
  `air_time` double, 
  `carrier_delay` double, 
  `weather_delay` double, 
  `nas_delay` double, 
  `security_delay` double, 
  `late_aircraft_delay` double)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://cf-flight-data-2018/athena-export-to-parquet'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'parquet.compression'='SNAPPY')

But also in S3:

[Screenshot: the Parquet files written to s3://cf-flight-data-2018/athena-export-to-parquet]
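Any query should now run against the Parquet-backed table. Here is a quick illustrative example (the exact results and scan sizes will depend on your data):

SELECT origin, ROUND(AVG(dep_delay), 2) AS avg_dep_delay
FROM flights.athena_created_parquet_snappy_data
GROUP BY origin
ORDER BY avg_dep_delay DESC
LIMIT 10

Because Parquet is columnar, this query scans only the origin and dep_delay columns rather than the full dataset, which is where the cost savings come from.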

This is just the tip of the iceberg: the CREATE TABLE AS command also supports the ORC file format and partitioning the data, as sketched below.
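For example, here is a minimal sketch of a partitioned ORC export. The table name, compression codec, and column selection are illustrative, not from the original post; note that partition columns must come last in the SELECT list:

CREATE TABLE flights.athena_created_orc_partitioned
WITH (
      format = 'ORC',
      orc_compression = 'ZLIB',              -- illustrative codec choice
      external_location = 's3://{INSERT_BUCKET}/athena-export-to-orc',
      partitioned_by = ARRAY['year']         -- one subdirectory per year
) AS SELECT flight_date, origin, destination, dep_delay, year
     FROM raw_data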

Obviously, Athena wasn’t designed to replace Glue or EMR, but if you need to execute a one-off job or you plan to query the same data over and over on Athena, then you may want to use this trick.

If you have questions about CloudForecast and how it can help you monitor your AWS costs, or questions about this post, feel free to reach out via email at francois@cloudforecast.io or on Twitter: @francoislagier. Also, follow our journey @cloudforecast.