This article compares the features of Azure Data Lake Storage Gen2 and AWS S3 and looks at how each can be used as a data store or data lake in a modern data architecture. If you are interested in modern data architecture, I discuss it in another article.
Whereas Azure provides 2 types of object storage, Blob storage and Azure Data Lake Storage (ADLS), AWS has a single object storage service called Simple Storage Service (S3). In modern data platforms, these object storage services are used as data lake storage. This article gives a quick comparison of the Azure and AWS object stores (ADLS Gen2 and S3).
Note that prior to the introduction of ADLS Gen2, when we wanted cloud storage in Azure for a data lake implementation, we needed to decide between Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) and Azure Storage (specifically blob storage). The new ADLS Gen2 service is built upon Azure Storage as its foundation. For this reason, you will not see ADLS Gen2 listed in Azure as its own service – since ADLS Gen1 is its own service, this shift has been confusing for many people.
2. Common features
Both ADLS and S3 are fault-tolerant storage services that offer high durability, high availability, versioning, multi-region replication, and high performance.
ADLS is designed specifically for analytics data platforms, whereas S3 is widely used for many purposes (e.g. log storage, database backups, volume snapshot backups, data lakes, streaming storage).
4. Storage classes/tiers
ADLS provides 3 tiers: HOT, COOL and ARCHIVE
There are 6 AWS storage classes:
- S3 Standard: high durability, availability, and performance object storage for frequently accessed data
- S3 Intelligent-Tiering: designed to optimize costs by automatically moving data to the most cost-effective access tier, without performance impact or operational overhead
- S3 Standard-Infrequent Access: for data that is accessed less frequently, but requires rapid access when needed
- S3 One Zone-Infrequent Access: for data that is accessed less frequently but requires rapid access when needed, stored in a single Availability Zone (AZ)
- S3 Glacier: Secure, durable, and low-cost storage class for data archiving
- S3 Glacier Deep Archive: lowest-cost storage class and supports long-term retention
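As a rough illustration of how the six classes above map onto access patterns, here is a minimal Python sketch. The dictionary values are the storage class identifiers that the S3 API uses; the `suggest_storage_class` helper and its access-pattern labels are hypothetical and only encode the descriptions in the list, not any AWS recommendation.

```python
# Storage class identifiers as used by the S3 API, keyed by the names above.
S3_STORAGE_CLASSES = {
    "S3 Standard": "STANDARD",
    "S3 Intelligent-Tiering": "INTELLIGENT_TIERING",
    "S3 Standard-Infrequent Access": "STANDARD_IA",
    "S3 One Zone-Infrequent Access": "ONEZONE_IA",
    "S3 Glacier": "GLACIER",
    "S3 Glacier Deep Archive": "DEEP_ARCHIVE",
}


def suggest_storage_class(access, multi_az=True):
    """Hypothetical helper: pick a class from a coarse access-pattern label.

    access is one of: 'frequent', 'infrequent', 'archive',
    'long_term_archive', 'unknown'.
    """
    if access == "unknown":
        # Let S3 move the data between access tiers automatically.
        return "INTELLIGENT_TIERING"
    if access == "frequent":
        return "STANDARD"
    if access == "infrequent":
        # One Zone-IA keeps the data in a single Availability Zone.
        return "STANDARD_IA" if multi_az else "ONEZONE_IA"
    if access == "archive":
        return "GLACIER"
    if access == "long_term_archive":
        return "DEEP_ARCHIVE"
    raise ValueError(f"unknown access pattern: {access!r}")
```

The returned string is the value you would pass as the storage class when writing an object.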
5. Size (Winner: ADLS)
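Both give unlimited total storage and scale automatically. However, ADLS places no limit on file size, whereas S3 limits a single upload request to 5 GB and, at the time of writing, a single object to 5 TB (objects larger than 5 GB must be uploaded with multipart upload).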
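Because a single S3 upload request is capped at 5 GB, larger objects are written with multipart upload. The sketch below shows how a part size could be planned. The limits used (5 MiB minimum part, 5 GiB maximum part, 10,000 parts per upload) are the documented S3 multipart limits; the `plan_multipart` function and its 64 MiB default are assumptions made for illustration.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

MIN_PART = 5 * MIB   # minimum part size (the last part may be smaller)
MAX_PART = 5 * GIB   # maximum part size
MAX_PARTS = 10_000   # maximum number of parts in one multipart upload


def plan_multipart(object_size, preferred_part=64 * MIB):
    """Return (part_size, part_count) in bytes for an object of object_size bytes."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size when the preferred size would exceed 10,000 parts.
    part_size = max(preferred_part, MIN_PART, math.ceil(object_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object is too large for one multipart upload")
    return part_size, math.ceil(object_size / part_size)
```

For example, `plan_multipart(1 * GIB)` keeps the 64 MiB default and yields 16 parts, while a 1 TiB object forces a larger part size so that the upload stays within 10,000 parts.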
6. HDFS Compatibility (Winner: ADLS)
ADLS is truly HDFS-compatible and designed for parallel processing, whereas S3 is not.
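In practice, Hadoop-based engines address each store through a URI scheme: ADLS Gen2 through the ABFS driver (`abfss://`) and S3 through the S3A connector (`s3a://`). The following sketch only builds those URIs; the helper names and the container, account and bucket names in the usage note are hypothetical.

```python
def adls_uri(container, account, path=""):
    """Build an ABFS (TLS) URI for a path in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def s3a_uri(bucket, key=""):
    """Build an S3A URI for an object key in an S3 bucket."""
    return f"s3a://{bucket}/{key.lstrip('/')}"
```

With a hypothetical container `raw` in account `mylake`, `adls_uri("raw", "mylake", "sales/2019/")` gives `abfss://raw@mylake.dfs.core.windows.net/sales/2019/`, and `s3a_uri("my-bucket", "sales/2019/")` gives `s3a://my-bucket/sales/2019/`.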
7. Bind schema and query directly from storage
- Microsoft introduced a language called U-SQL for querying ADLS directly. Its syntax is similar to SQL.
- As ADLS is HDFS-compatible, you can bind Hadoop HCatalog to ADLS and use SQL to query ADLS (see the related article).
- One of the interesting features of AWS is that the AWS Glue Crawler can automatically collect the metadata of S3 data and create a Data Catalog. Once Glue’s Data Catalog is built, either AWS Athena or Redshift Spectrum can be used to query S3 directly.
- If you are interested in how to use the Glue Crawler to build a Data Catalog, read the AWS Glue Crawler, Athena, and Redshift Spectrum documentation.
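To make the Athena path a little more concrete, here is a minimal sketch of the request you would send once a Glue Data Catalog table exists. The keys (`QueryString`, `QueryExecutionContext`, `ResultConfiguration`) are the parameters of the boto3 Athena `start_query_execution` call; the helper function, database, table and bucket names are hypothetical, and the sketch stops short of calling AWS so that it runs without credentials.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for boto3's Athena start_query_execution."""
    return {
        "QueryString": sql,
        # Database registered in the Glue Data Catalog (for example by a crawler).
        "QueryExecutionContext": {"Database": database},
        # S3 location where Athena writes the query results.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    sql="SELECT * FROM sales LIMIT 10",           # hypothetical table
    database="datalake_catalog",                  # hypothetical database
    output_location="s3://my-athena-results/",    # hypothetical bucket
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```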
8. Storage pricing
Both vendors’ prices appear to be very competitive and closely match each other.
Below are ADLS prices for the HOT, COOL and ARCHIVE tiers in the Sydney region, with hierarchical file structure and LRS (locally redundant storage), in US dollars (effective 8 June 2019)
AWS S3 prices depend on region, storage class and storage size. Below are example S3 prices for the Asia Pacific (Sydney) region:
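Since the per-GB rate depends on how much is stored, a monthly storage bill can be estimated by walking through the size tiers. The sketch below shows the arithmetic only: the `monthly_storage_cost` function is hypothetical, and the tier sizes and rates in `EXAMPLE_TIERS` are placeholders, not prices quoted by either vendor.

```python
def monthly_storage_cost(size_gb, tiers):
    """Estimate a monthly storage cost from tiered per-GB prices.

    tiers is a list of (tier_size_gb, price_per_gb) pairs in billing order;
    a tier_size_gb of None means the tier is unbounded.
    """
    remaining = size_gb
    cost = 0.0
    for tier_size, price in tiers:
        if remaining <= 0:
            break
        billed = remaining if tier_size is None else min(remaining, tier_size)
        cost += billed * price
        remaining -= billed
    if remaining > 0:
        raise ValueError("tiers do not cover the requested size")
    return cost


# Placeholder tiers for illustration only (NOT real prices):
# the first 1,000 GB at 0.03 per GB, everything above at 0.02 per GB.
EXAMPLE_TIERS = [(1000, 0.03), (None, 0.02)]
```

With these placeholder tiers, 1,500 GB would be billed as 1,000 GB at the first rate plus 500 GB at the second.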
9. Folder structure and files
Whereas ADLS provides a folder structure and a file system, AWS S3 stores objects in a ‘bucket’. At the time of this article, both Azure ADLS and AWS S3 support folder structures.
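In S3, the folder structure is expressed through key prefixes: a bucket holds a flat set of object keys, and a listing with a delimiter groups them into folder-like levels. The following sketch mimics that behaviour in plain Python on a list of keys. It is a simplified model for illustration, not the S3 API itself, and the function name and sample keys are hypothetical.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (folders, files) found directly under prefix in a flat key list."""
    folders, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter acts as a sub-folder.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            files.append(key)
    return sorted(folders), sorted(files)


# Hypothetical object keys in one bucket.
keys = [
    "raw/2019/06/a.csv",
    "raw/2019/07/b.csv",
    "raw/readme.txt",
    "curated/c.parquet",
]
```

Here `list_level(keys)` reports the top-level folders `curated/` and `raw/`, and `list_level(keys, "raw/")` reports the sub-folder `raw/2019/` together with the file `raw/readme.txt`.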