Building Modern Data Lakes

The volume of data required for Machine Learning projects keeps growing. Data scientists and data engineers need timely access to huge amounts of data. A number of technologies have emerged for building data platforms that can handle massive quantities of data ingestion and processing. This article will…

Building resilient Distributed Systems at scale

In this brave new world of distributed systems, we are entrusted with keeping the infrastructure up and running. The challenge is to monitor both the services themselves and the space in between them. We face non-determinism: sometimes we can’t tell if our system is up, down, or partially working, and every failure is a task…

Data consistency across Microservices

We were told a monolith is evil and microservices are the answer. What nobody told us is that microservices come with many pain points deriving from their distributed nature. In the past, we built an application connected to one database where normalized data was queried using “joins”. Then came big data, big traffic and with…

Querying our Data Lake in S3 using Zeppelin and Spark SQL

Until recently, most companies took the traditional approach of storing all their data in a Data Warehouse. The growth of the internet increased both the number of data sources and the quantity of data to be stored, so these Data Warehouses had to be scaled constantly. They were not designed to handle petabytes of data, so companies were…

Why Big Data is pushing us towards Machine learning

As an Engineering Manager with more than 20 years of experience, I have seen many changes that completely disrupted different areas: “Web 2.0”, “Cloud computing”, “Mobile-first”, “Big Data”, etc. The new kid on the block is “Machine learning”, and it is definitely at its peak. One example is perfectly described by CB Insights on how startups coined…

Building a data lake in S3 using an event-driven serverless architecture

This article is about the journey from a data warehouse to a data lake, from batch to near real-time processing, and toward being able to query all data (raw and processed) from the same repository. Let’s start by understanding some terms. Wikipedia defines Data Warehouses as: “…central repositories of integrated data from one or more disparate sources. They store current and historical…
