In the previous two posts (see Part 1 and Part 2), we compared the two most popular cloud platforms, Microsoft's Azure and Amazon's AWS, on their offerings across the end-to-end data analytics ecosystem, both large-scale and real-time.
In this final post, we will compare Azure's Data Factory with its AWS equivalent, AWS Data Pipeline. The two are fairly similar in their abilities and offerings, but they are pitched differently. AWS positions Data Pipeline as a platform for data migration between different AWS compute and storage services, and also between on-premises sources and AWS instances. Azure pitches Data Factory more as an integration service for orchestrating and automating the movement and transformation of data.
In terms of quality attributes, both services are very capable with respect to scalability, reliability, flexibility, and of course, cost of operations. Data Pipeline is backed by the highly available and fault-tolerant infrastructure of AWS and hence is extremely reliable. It is also very easy to create a pipeline using the drag-and-drop console in AWS. It offers a host of features, such as scheduling, dependency tracking, and error handling. Pipelines can be run not only serially but also in parallel. Usage is also transparent: you keep control over the computational resources assigned to execute the business logic. Azure Data Factory, on the other hand, provides features such as visualizing the data lineage.
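To make dependency tracking and serial versus parallel execution a little more concrete, here is a minimal sketch in plain Python. It does not use the AWS Data Pipeline or Azure Data Factory APIs; the activity names and the `execution_waves` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever scheduler the services use internally.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group activities into waves.

    Activities in the same wave have all their dependencies satisfied,
    so they could run in parallel; the waves themselves run serially.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock the next one
    return waves


# Hypothetical pipeline: each activity maps to the activities it depends on.
pipeline = {
    "copy_logs": [],
    "copy_orders": [],
    "clean_logs": ["copy_logs"],
    "join": ["clean_logs", "copy_orders"],
    "report": ["join"],
}

waves = execution_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this made-up pipeline the two copy activities have no dependencies and land in the same wave, while everything downstream has to wait its turn. A real service layers scheduling and error handling on top of this ordering.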
In terms of pricing, Azure charges by the frequency of activities and where they run. A low-frequency activity in the cloud is charged at $0.60, and the same activity on premises is charged at $1.50. High-frequency activities are charged at correspondingly higher rates. Note that you are also charged for data movement, at separate rates for cloud and on premises. In addition, pipelines that are left inactive are also charged.
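As a rough illustration of how the low-frequency rates add up, here is a small sketch that uses only the two figures quoted above. The function name and the activity counts are hypothetical. The estimate deliberately leaves out high-frequency activities, data movement, and inactive-pipeline charges, since those rates are not given here, and it applies to whatever billing period the quoted rates refer to.

```python
from decimal import Decimal

# Low-frequency activity rates quoted in this post, in USD per activity.
LOW_FREQUENCY_RATE = {
    "cloud": Decimal("0.60"),
    "on_premises": Decimal("1.50"),
}


def low_frequency_activity_cost(cloud_activities, on_premises_activities):
    """Estimate the activity charge for low-frequency activities only.

    Excludes data movement, high-frequency activities, and inactive
    pipelines, which are billed separately.
    """
    return (
        cloud_activities * LOW_FREQUENCY_RATE["cloud"]
        + on_premises_activities * LOW_FREQUENCY_RATE["on_premises"]
    )


# Hypothetical workload: 10 activities in the cloud, 4 on premises.
estimate = low_frequency_activity_cost(10, 4)
print(f"Estimated activity charge: ${estimate}")
```

`Decimal` is used instead of floats so that the cents stay exact. Treat the result as a back-of-the-envelope number and check the current Azure pricing page before budgeting.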