Cloud Data Platform
Courses
Software Stacks
- Amazon EMR
- AWS
- Azure
- BDAS, the Berkeley Data Analytics Stack
- ChromaDB
- Databricks
- Docker
- dvc (Data Version Control)
- Iguazio
- JupyterHub
- Kubernetes
- LangChain
- LlamaIndex
- Lyft
- MongoDB
- Neptune.ai
- Onehouse
- OpenAI
- Ray
- Uber
- Weaviate
ETL (Extract, Transform, Load)
- Airbyte
- Fivetran
- Stitch
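Tools like Airbyte, Fivetran, and Stitch automate the extract-transform-load pattern; the core idea can be sketched in a few lines of stdlib Python (the CSV sample and table name are illustrative, not from any of these tools):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a CSV source (here an in-memory sample).
raw = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,bad\n")
rows = list(csv.DictReader(raw))

# Transform: coerce types and drop malformed records.
clean = []
for r in rows:
    try:
        clean.append((int(r["id"]), float(r["amount"])))
    except ValueError:
        continue  # skip rows that fail type coercion

# Load: write the cleaned records into a SQLite table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 30.5
```

The managed ETL services above add connectors, scheduling, and incremental sync on top of this same extract/transform/load skeleton.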
Dashboarding
- Metabase (open source, connects to Databricks UC)
- PowerBI and OLAP Cubes
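An OLAP cube precomputes aggregates over every combination of dimensions so dashboards can slice and roll up instantly. A minimal sketch of the CUBE operation in stdlib Python (the fact rows and dimensions are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Fact rows: (region, product, revenue)
facts = [
    ("EU", "books", 100),
    ("EU", "games", 50),
    ("US", "books", 70),
]

# Build a tiny cube: aggregate revenue over every subset of the
# dimensions (SQL's CUBE operator). '*' marks a rolled-up axis.
dims = ("region", "product")
cube = defaultdict(int)
for region, product, revenue in facts:
    values = {"region": region, "product": product}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            key = tuple(values[d] if d in subset else "*" for d in dims)
            cube[key] += revenue

print(cube[("*", "*")])      # grand total: 220
print(cube[("EU", "*")])     # EU subtotal: 150
print(cube[("*", "books")])  # books subtotal: 170
```

Real OLAP engines store these aggregates in indexed structures and answer drill-down queries by key lookup rather than rescanning facts.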
Tutorials
Videos
- Ali Ghodsi
- Realizing the Vision of the Data Lakehouse, Keynote Spark + AI Summit 2020
- How to Build a Cloud Data Platform (2020)
Tools
- Delta Lake
- rclone
- Parquet
- Docs
- Boudewijn Braams: The Parquet Format and Performance Optimization Opportunities (2019)
- rosbag2parquet
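The talks above center on Parquet's columnar layout. A stdlib-only sketch of the idea (this is the concept, not the actual Parquet encoding; the sample records are invented):

```python
# Row-oriented storage keeps whole records together; columnar formats
# like Parquet store each column contiguously, so a query touching one
# column reads only that column's bytes (and compresses better).
rows = [
    {"id": 1, "city": "Lund", "temp": 14.2},
    {"id": 2, "city": "Lund", "temp": 13.9},
    {"id": 3, "city": "Oslo", "temp": 9.1},
]

# "Shred" the records into columns, as a Parquet writer would.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A column scan (e.g. AVG(temp)) now touches a single list:
avg_temp = sum(columns["temp"]) / len(columns["temp"])

# Repeated values within a column (the run "Lund", "Lund") are what
# Parquet's dictionary and run-length encodings exploit.
print(columns["city"])  # ['Lund', 'Lund', 'Oslo']
print(round(avg_temp, 2))
```

Braams's talk covers the real machinery layered on this: row groups, column chunks, page-level statistics, and predicate pushdown.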
- Hadoop
- W. Crowell: Spark vs. Hadoop vs. Hive: Key Differences and Use Cases (2021)
- HDF5
- B. Holländer: HDF5 Datasets For PyTorch (2019)
- MLflow
- Spark
- Learning Spark: Lightning-Fast Data Analytics, 2nd edition, by J. Damji et al. (2020)
- Spark ML
- Jupyter lab on Spark
- A. Perez: Apache Spark Cluster on Docker (ft. a JupyterLab Interface) (2020)
- Working with large ROS bag files on Hadoop and Spark (2019)
- Kubernetes
- NetworkChuck: You need to learn Kubernetes RIGHT NOW!!
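For a first hands-on step after the video, a minimal Deployment plus Service is the canonical starting manifest. A hedged sketch (image, names, and ports are placeholders, not from the source):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app
  ports:
    - port: 80
      targetPort: 80
```

Apply with `kubectl apply -f demo.yaml`; the Deployment keeps two replicas running and the Service load-balances across them.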
- Kubeflow
- Metaflow
- Netflix Tech Blog: Open-Sourcing Metaflow, a Human-Centric Framework for Data Science (2019)
- A. Goblet: A Review of Netflix’s Metaflow (2019)
- Ricardo Raspini Motta: Kedro vs ZenML vs Metaflow: Which Pipeline Orchestration Tool Should You Choose? (2024)
- Deploying Infrastructure for Metaflow
- Ville Tuulos: Metaflow: The ML Infrastructure at Netflix
- Petastorm
- Docs
- Petastorm: A Light-Weight Approach to Building ML Pipelines
- Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning (2018)
- Databricks docs: Load data using Petastorm
- Yevgeni Litvin: Petastorm: A Light-Weight Approach to Building ML Pipelines (2019), slides, 2018 video
- L. Zhang, Databricks: Simplify Data Conversion from Spark to Deep Learning, slides (2021). An example of using Petastorm and Horovod with TensorFlow and PyTorch.
- Snowflake
- Using Snowpark As Part Of Your Machine Learning Workflow (2022)
- Large-Scale Machine Learning with Snowflake and RAPIDS, by M. Adkins et al (2022)
- Databricks vs Snowflake: 9 Critical Differences, A. Phaujdar (2021)
- Databricks vs Snowflake: The Definitive Guide (2021)
- Terraform
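Terraform describes cloud resources declaratively in HCL. A minimal hedged sketch of provisioning an S3 bucket for a data platform's raw zone (bucket name and region are illustrative placeholders):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-north-1"
}

# Object storage for raw ingested data.
resource "aws_s3_bucket" "raw_zone" {
  bucket = "my-platform-raw-zone"
}

# Versioning protects raw data against accidental overwrites.
resource "aws_s3_bucket_versioning" "raw_zone" {
  bucket = aws_s3_bucket.raw_zone.id
  versioning_configuration {
    status = "Enabled"
  }
}
```

`terraform plan` previews the changes and `terraform apply` creates the resources; the same pattern scales to clusters, networks, and the managed services listed above.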
Data ingestion
Surveys
- Emerging Architectures for Modern Data Infrastructure
- A. Pandhi: Modern Data Stack: Looking into the Crystal Ball (2022)
Posts
- Spark, Dask, and Ray: A History, N. Manchev (2021)
- Jupyter Notebooks, Spark
- How to connect Jupyter Notebook to remote spark clusters and run spark jobs every day?, Teng Peng (2020). This uses Bayesnote to orchestrate notebooks on Spark clusters.
- Orchestrate Jupyter Notebooks in 5 minutes, Teng Peng (2020)
- Reddit: Simple workflow orchestration tool with Jupyter Notebook support (2022)
- Deploy Application from Jupyter Lab to a Spark Standalone Cluster, D. Lin (2020)
- Towards Data Science: The Fundamentals of Data Warehouse + Data Lake = Lake House, by G.R. Peternel (2021)
- byteflow: How to choose between Parquet, ORC and AVRO for S3, Redshift and Snowflake?
- Martin Kleppmann: Schema evolution in Avro, Protocol Buffers and Thrift (2012)
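Kleppmann's post is about how these formats evolve schemas without breaking old data: the reader fills in defaults for fields the writer never knew about and ignores fields it no longer declares. A stdlib sketch of that resolution rule (illustrative only, not the Avro wire format):

```python
# Avro-style reader schema: (field name, default). A default of None
# here marks the field as required (a simplification for the sketch).
READER_SCHEMA = [
    ("user_id", None),       # required: no default
    ("name", ""),            # field with a default
    ("country", "unknown"),  # new field that old writers never emitted
]

def decode(record: dict) -> dict:
    out = {}
    for field, default in READER_SCHEMA:
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return out  # fields absent from READER_SCHEMA are ignored

old = {"user_id": 7, "name": "Ada", "legacy_flag": True}
print(decode(old))  # {'user_id': 7, 'name': 'Ada', 'country': 'unknown'}
```

This is why adding a field with a default (or removing one the reader no longer lists) is a compatible change in Avro, Protocol Buffers, and Thrift alike.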
- Working with large ROS bag files on Hadoop and Spark (2017)
- Jamin Ball, Clouded Judgement: The Modern Data Cloud: Warehouse vs Lakehouse (2021)
- Reviews
- firebolt.io: Snowflake vs Databricks vs Firebolt
- M. Schmitt:
- StackOverflow: Difference in usecases for AWS Sagemaker vs Databricks?
  - Databricks is the better platform for big-data development (Scala, PySpark); its notebook environment is unbeatable.
  - SageMaker is better for deployment, and if you are not working with big data it is a perfect choice (Jupyter notebooks + scikit-learn + mature containers + very easy deployment).
  - SageMaker's "real-time inference" is very easy to build and deploy and very impressive; see the official SageMaker GitHub.
Data formats
- Jim Dowling: Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store (2019)
- Chaim Rand: Data Formats for Training in TensorFlow: Parquet, Petastorm, Feather, and More (2021)
- Thomas Gamauf: Tensorflow Records? What they are and how to use them (2018)
- Lunds U ML Course: [Chapter 13 - Loading and Preprocessing Data with TensorFlow](https://canvas.education.lu.se/courses/3766/pages/chapter-13-loading-and-preprocessing-data-with-tensorflow)
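Several of the training formats above (TFRecord in particular) are at heart a stream of length-prefixed binary records, so a reader can iterate samples without parsing payloads. A stdlib sketch of that framing (simplified: real TFRecord framing also carries per-record CRC32C checksums):

```python
import io
import struct

def write_records(stream, payloads):
    """Write each payload prefixed by an 8-byte little-endian length."""
    for p in payloads:
        stream.write(struct.pack("<Q", len(p)))
        stream.write(p)

def read_records(stream):
    """Yield payloads back by reading length headers until EOF."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        yield stream.read(length)

buf = io.BytesIO()
write_records(buf, [b"sample-1", b"sample-2"])
buf.seek(0)
records = list(read_records(buf))
print(records)  # [b'sample-1', b'sample-2']
```

Sequential framing like this is why such formats stream well from object storage, at the cost of no random access by key, which columnar formats such as Parquet trade differently.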
People
- Ali Ghodsi, Berkeley