Cloud Data Platform
Courses
Software Stacks
- Amazon EMR
- AWS
- Azure
- BDAS, the Berkeley Data Analytics Stack
- ChromaDB
- Databricks
- Docker
- dvc (Data Version Control)
- Iguazio
- JupyterHub
- Kubernetes
- LangChain
- LlamaIndex
- Lyft
- MongoDB
- Neptune.ai
- Onehouse
- OpenAI
- Ray
- Uber
- Weaviate
ETL (Extract, Transform, Load)
- Airbyte
- Fivetran
- Stitch
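Tools like Airbyte, Fivetran, and Stitch automate the extract-transform-load pattern; the core idea can be sketched in a few lines of stdlib Python (the CSV sample and table name are illustrative, not from any of these tools):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source (here an in-memory sample).
raw = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,bad\n")
rows = list(csv.DictReader(raw))

# Transform: coerce types and drop malformed records.
clean = []
for r in rows:
    try:
        clean.append((int(r["id"]), float(r["amount"])))
    except ValueError:
        continue  # skip rows that fail type coercion

# Load: write the cleaned records into a SQLite table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 30.5
```

The managed ETL services above add connectors, scheduling, and incremental sync on top of this same extract/transform/load skeleton.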
Dashboarding
- Metabase (open source, connects to Databricks UC)
- PowerBI and OLAP Cubes
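An OLAP cube precomputes aggregates over every combination of dimensions so dashboards can slice and roll up instantly. A minimal sketch of the CUBE operation in stdlib Python (the fact rows and dimensions are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Fact rows: (region, product, revenue)
facts = [
    ("EU", "books", 100),
    ("EU", "games", 50),
    ("US", "books", 70),
]

# Build a tiny cube: aggregate revenue over every subset of the
# dimensions (SQL's CUBE operator). '*' marks a rolled-up axis.
dims = ("region", "product")
cube = defaultdict(int)
for region, product, revenue in facts:
    values = {"region": region, "product": product}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            key = tuple(values[d] if d in subset else "*" for d in dims)
            cube[key] += revenue

print(cube[("*", "*")])      # grand total: 220
print(cube[("EU", "*")])     # EU subtotal: 150
print(cube[("*", "books")])  # books subtotal: 170
```

Real OLAP engines store these aggregates in indexed structures and answer drill-down queries by key lookup rather than rescanning facts.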
Tutorials
Videos
- Ali Ghodsi
- Realizing the Vision of the Data Lakehouse, Keynote Spark + AI Summit 2020
- How to Build a Cloud Data Platform (2020)
Tools
- Delta Lake
- rclone
- Parquet
- Docs
- Boudewijn Braams: The Parquet Format and Performance Optimization Opportunities (2019)
- rosbag2parquet
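The talks above center on Parquet's columnar layout. A stdlib-only sketch of the idea (this is the concept, not the actual Parquet encoding; the sample records are invented):

```python
# Row-oriented storage keeps whole records together; columnar formats
# like Parquet store each column contiguously, so a query touching one
# column reads only that column's bytes (and compresses better).
rows = [
    {"id": 1, "city": "Lund", "temp": 14.2},
    {"id": 2, "city": "Lund", "temp": 13.9},
    {"id": 3, "city": "Oslo", "temp": 9.1},
]

# "Shred" the records into columns, as a Parquet writer would.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A column scan (e.g. AVG(temp)) now touches a single list:
avg_temp = sum(columns["temp"]) / len(columns["temp"])

# Repeated values within a column (the run "Lund", "Lund") are what
# Parquet's dictionary and run-length encodings exploit.
print(columns["city"])  # ['Lund', 'Lund', 'Oslo']
print(round(avg_temp, 2))
```

Braams's talk covers the real machinery layered on this: row groups, column chunks, page-level statistics, and predicate pushdown.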
- Hadoop
- W. Crowell: Spark vs. Hadoop vs. Hive: Key Differences and Use Cases (2021)
- HDF5
- B. Holländer: HDF5 Datasets For PyTorch (2019)
- MLflow
- Spark
- Learning Spark: Lightning-Fast Data Analytics, 2nd edition, by J. Damji et al. (2020)
- Spark ML
- Jupyter lab on Spark
- A. Perez: Apache Spark Cluster on Docker (ft. a JupyterLab Interface) (2020)
- Working with large ROS bag files on Hadoop and Spark (2019)
- Kubernetes
- NetworkChuck: You need to learn Kubernetes RIGHT NOW!!
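For a first hands-on step after the video, a minimal Deployment plus Service is the canonical starting manifest. A hedged sketch (image, names, and ports are placeholders, not from the source):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app
  ports:
    - port: 80
      targetPort: 80
```

Apply with `kubectl apply -f demo.yaml`; the Deployment keeps two replicas running and the Service load-balances across them.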
- Kubeflow
- Metaflow
- Netflix Tech Blog: Open-Sourcing Metaflow, a Human-Centric Framework for Data Science (2019)
- A. Goblet: A Review of Netflix’s Metaflow (2019)
- Ricardo Raspini Motta: Kedro vs ZenML vs Metaflow: Which Pipeline Orchestration Tool Should You Choose? (2024)
- Deploying Infrastructure for Metaflow
- Ville Tuulos: Metaflow: The ML Infrastructure at Netflix
- Petastorm
- Docs
- Petastorm: A Light-Weight Approach to Building ML Pipelines
- Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning (2018)
- Databricks docs: Load data using Petastorm
- Yevgeni Litvin: Petastorm: A Light-Weight Approach to Building ML Pipelines (2019), slides, 2018 video
- L. Zhang, Databricks: Simplify Data Conversion from Spark to Deep Learning, slides (2021). An example of using Petastorm and Horovod with TensorFlow and PyTorch.
- Snowflake
- Using Snowpark As Part Of Your Machine Learning Workflow (2022)
- Large-Scale Machine Learning with Snowflake and RAPIDS, by M. Adkins et al (2022)
- Databricks vs Snowflake: 9 Critical Differences, A. Phaujdar (2021)
- Databricks vs Snowflake: The Definitive Guide (2021)
- Terraform
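Terraform describes cloud resources declaratively in HCL. A minimal hedged sketch of provisioning an S3 bucket for a data platform's raw zone (bucket name and region are illustrative placeholders):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-north-1"
}

# Object storage for raw ingested data.
resource "aws_s3_bucket" "raw_zone" {
  bucket = "my-platform-raw-zone"
}

# Versioning protects raw data against accidental overwrites.
resource "aws_s3_bucket_versioning" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id
  versioning_configuration {
    status = "Enabled"
  }
}
```

`terraform plan` previews the changes and `terraform apply` creates the resources; the same pattern scales to clusters, networks, and the managed services listed above.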
Data ingestion
Surveys
- Emerging Architectures for Modern Data Infrastructure
- A. Pandhi: Modern Data Stack: Looking into the Crystal Ball (2022)
Posts
- Spark, Dask, and Ray: A History, N. Manchev (2021)
- Jupyter Notebooks, Spark
- How to connect Jupyter Notebook to remote spark clusters and run spark jobs every day?, Teng Peng (2020). This uses Bayesnote to orchestrate notebooks on Spark clusters.
- Orchestrate Jupyter Notebooks in 5 minutes, Teng Peng (2020)
- Reddit: Simple workflow orchestration tool with Jupyter Notebook support (2022)
- Deploy Application from Jupyter Lab to a Spark Standalone Cluster, D. Lin (2020)
- Towards Data Science: The Fundamentals of Data Warehouse + Data Lake = Lake House, by G.R. Peternel (2021)
- byteflow: How to choose between Parquet, ORC and AVRO for S3, Redshift and Snowflake?
- Martin Kleppmann: Schema evolution in Avro, Protocol Buffers and Thrift (2012)
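Kleppmann's post is about how these formats evolve schemas without breaking old data: the reader fills in defaults for fields the writer never knew about and ignores fields it no longer declares. A stdlib sketch of that resolution rule (illustrative only, not the Avro wire format):

```python
# Avro-style reader schema: (field name, default). A default of None
# here marks the field as required (a simplification for the sketch).
READER_SCHEMA = [
    ("user_id", None),       # required: no default
    ("name", ""),            # field with a default
    ("country", "unknown"),  # new field that old writers never emitted
]

def decode(record: dict) -> dict:
    out = {}
    for field, default in READER_SCHEMA:
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return out  # fields absent from READER_SCHEMA are ignored

old = {"user_id": 7, "name": "Ada", "legacy_flag": True}
print(decode(old))  # {'user_id': 7, 'name': 'Ada', 'country': 'unknown'}
```

This is why adding a field with a default (or removing one the reader no longer lists) is a compatible change in Avro, Protocol Buffers, and Thrift alike.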
- Working with large ROS bag files on Hadoop and Spark (2017)
- Jamin Ball, Clouded Judgement: The Modern Data Cloud: Warehouse vs Lakehouse (2021)
- Reviews
- firebolt.io: Snowflake vs Databricks vs Firebolt
- M. Schmitt:
- StackOverflow: Difference in usecases for AWS Sagemaker vs Databricks?
  - Databricks is the better platform for big-data development (Scala, PySpark); its notebook environment is unbeatable.
  - SageMaker is better for deployment, and if you are not working with big data it is a perfect choice (Jupyter notebooks + scikit-learn + mature containers + very easy deployment).
  - SageMaker's "real-time inference" is very easy to build and deploy and very impressive; see the official SageMaker GitHub.
Data formats
- Jim Dowling: Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store (2019)
- Chaim Rand: Data Formats for Training in TensorFlow: Parquet, Petastorm, Feather, and More (2021)
- Thomas Gamauf: Tensorflow Records? What they are and how to use them (2018)
- Lunds U ML Course: [Chapter 13 - Loading and Preprocessing Data with TensorFlow](https://canvas.education.lu.se/courses/3766/pages/chapter-13-loading-and-preprocessing-data-with-tensorflow)
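Several of the training formats above (TFRecord in particular) are at heart a stream of length-prefixed binary records, so a reader can iterate samples without parsing payloads. A stdlib sketch of that framing (simplified: real TFRecord framing also carries per-record CRC32C checksums):

```python
import io
import struct

def write_records(stream, payloads):
    """Write each payload prefixed by an 8-byte little-endian length."""
    for p in payloads:
        stream.write(struct.pack("<Q", len(p)))
        stream.write(p)

def read_records(stream):
    """Yield payloads back by reading length headers until EOF."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        yield stream.read(length)

buf = io.BytesIO()
write_records(buf, [b"sample-1", b"sample-2"])
buf.seek(0)
records = list(read_records(buf))
print(records)  # [b'sample-1', b'sample-2']
```

Sequential framing like this is why such formats stream well from object storage, at the cost of no random access by key, which columnar formats such as Parquet trade differently.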
People
- Ali Ghodsi, Berkeley