Week1: Course Overview
Cloud
develop locally with scalability in mind
Disaggregated Economic Accounts - database - Big Data
What kind of big? Memory-intensive or compute-intensive?
counterfactual - hit a wall
Self-intro
Apache Airflow & Prefect
TA - to email and get help
Course targets
learn Linux twice
one on local machine; one on virtual machine (computing clusters)
sandboxed server - doesn’t have the same architecture as your laptop
Midterm proj
single assignment
Final proj
3-4 people group assignment
Big data
what is big data?
a lot is happening online - telemetry
All the tiny things you do online are recorded. Mouse tracking, eye tracking …
side-channel attack - differential power analysis
where does it come from?
user-generated content
health and scientific data - for archival purposes
How is it being created?
graph structure - relationship
Apache (web server) log files
never delete - always appending
A log of every data transaction
log -> data lake (sys admin - raw storage) -> data warehouse (data scientist - higher/aggregated level)
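A minimal sketch of parsing one Apache access-log line in Common Log Format with a regex (the sample line is hypothetical, purely illustrative):

```python
import re

# One Common Log Format line, as an Apache web server would append it
# (hypothetical example record)
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# One named group per field: host, identity, user, timestamp, request, status, bytes
pattern = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

record = pattern.match(line).groupdict()
print(record["host"], record["status"], record["size"])
```

Each parsed record would then be appended downstream (log → lake → warehouse), never edited in place.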
RAG and Context Engineering: The New Data Pipeline
Raw Data -> Data Lake -> Processing -> Vector DB -> LLM Context
Key Components:
- Data Lakes (S3, Azure Data Lake): Store massive unstructured data
- Data Warehouses (Snowflake, BigQuery): Structured data for context
- Vector Databases (Pinecone, Weaviate, Qdrant): Semantic search
- Embedding Models: Convert data to vectors
- Orchestration (Airflow, Prefect): Manage the pipeline
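The retrieval step of that pipeline boils down to nearest-neighbor search over embedding vectors. A toy sketch with hand-made 3-d vectors (in a real system these come from an embedding model, and a vector DB like Pinecone/Weaviate/Qdrant does the search with approximate-nearest-neighbor indexes):

```python
import math

# Toy "embeddings": hand-made vectors, purely illustrative.
docs = {
    "doc_cats":   [0.9, 0.1, 0.0],
    "doc_dogs":   [0.2, 0.9, 0.1],
    "doc_stocks": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend this is the embedded user question

def cosine(a, b):
    # Cosine similarity: dot product over the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Retrieve the most similar document; its text would become LLM context
best = max(docs, key=lambda k: cosine(query, docs[k]))
print(best)
```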
Relative Data Sizes
This framework classifies data sizes based on two key questions:
- Can the data be processed on a single machine?
- Can the data be stored on a single machine?
Depending on the answers, data falls into four categories:
| Can be processed on single machine? | Can be stored on single machine? | Category | Processing Approach | Explanation | Examples |
|---|---|---|---|---|---|
| Yes | Yes | Small | Single-machine processing | The dataset is small enough to fit on a single laptop/PC and can be handled by one machine without advanced tools. | An Excel sheet, a few million rows in CSV |
| Yes | No | Medium | Data streaming | The dataset is too large to store fully on one machine, but can be processed sequentially in chunks (streaming). | Real-time logs, stock market tick data |
| No | Yes | Medium | Parallel processing | The dataset fits on one machine’s storage, but computation is too heavy for one CPU; requires multiple cores or parallel computing. | A few terabytes of data that need machine learning or heavy analytics |
| No | No | Big! | Parallel + distributed processing | The dataset is so large it cannot be stored or processed on a single machine; requires distributed storage and distributed computing. | Hadoop, Spark, social network data, clickstream logs |
Summary:
- Small → Stored and processed on one laptop.
- Medium (Streaming) → Cannot be stored fully, but can be processed sequentially.
- Medium (Parallel) → Can be stored, but needs multiple cores or nodes to process.
- Big → Cannot be stored or processed on one machine; requires distributed systems.
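The medium/streaming case can be sketched in a few lines: process the source in fixed-size chunks so only one chunk ever sits in RAM. The generator below is a stand-in for a file too big to load:

```python
# Streaming sketch: process a "too big for RAM" source chunk by chunk.
# A generator stands in for reading a huge CSV/log file line by line.

def fake_big_source(n_rows):
    for i in range(n_rows):
        yield i  # pretend each i is a parsed row

def streamed_sum(rows, chunk_size=1000):
    total = 0
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            total += sum(chunk)   # only one chunk in memory at a time
            chunk.clear()
    total += sum(chunk)           # leftover partial chunk
    return total

result = streamed_sum(fake_big_source(10_000))
print(result)  # same answer as summing everything at once
```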
Corpus
webkinz
yin and yang balance
terabytes
tombstone
data lake
data warehouse
Week2: Cloud Computing
Questions for Jeff
- no access to two of the assignments
- Lab01 Question 5-4: new files not in the deliverables
- no output/ folder could be pushed
- more expensive - usually it is
Walls for data-related work
- process on a single machine? (computation - Medium - Parallel Processing - RAM would explode)
- store on a single machine? (storage - Medium - Data Streaming)
! Multiprocessing is not multithreading
| Feature | Multiprocessing | Multithreading |
|---|---|---|
| Memory space | Each process is independent and isolated from the others | Threads share the same process's memory |
| Resource overhead | Higher cost to create and switch | Lower overhead, fast switching |
| Safety | Safer: one process crashing doesn't affect the others | Riskier: one thread crashing can take down the whole process |
| Parallelism | Suits multi-core CPUs; truly parallel | In some languages (e.g. Python) limited by the GIL, so not necessarily truly parallel |
| Communication | Inter-process communication (IPC), e.g. pipes, message queues | Shared variables directly; more convenient but error-prone (needs locks) |
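A minimal sketch of the distinction in Python, using a thread pool on an I/O-bound task; the timing shows the waits overlapping even under the GIL:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# I/O-bound task: mostly waiting, so threads overlap nicely despite the GIL.
def fetch(i):
    time.sleep(0.05)  # stand-in for a network/disk wait
    return i * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, range(5)))
elapsed = time.perf_counter() - start

print(results, round(elapsed, 2))
# 5 tasks x 0.05s finish in roughly 0.05s total, not 0.25s: the waits overlap.
# For CPU-bound work in Python you'd swap in ProcessPoolExecutor instead:
# separate processes sidestep the GIL, at the cost of higher startup/IPC
# overhead (exactly the trade-off in the table above).
```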
Pre-cloud approach: Make one computer faster and faster (Moore’s Law - computer engineering)
transistors and chips
Configuration & Set-up
The 3 service categories you use most frequently as an AWS account manager: compute, database, storage
Note: storage maps to the data lake, while the database maps to the data warehouse.
- EC2 - Elastic Cloud Computing
- S3 - Simple Storage Service
Debug
- When using the Linux command line to connect to EC2, the dsan_6000 folder can't be opened in WSL at the same time.
- When using Ctrl+Shift+P to deploy the config file, the three code blocks shown on Jeff's machine don't show up automatically.
perm: You can use a temporary Linux shell (CLI) directly via the AWS web page. This won't let you run notebooks (GUI), so think of it as a CLI-only version of the EC2 server (computer) we are connecting to. That instance lives in an AWS EC2 data center somewhere in Virginia.
S3, by contrast, is just a service you call from your computer - there is no remote machine to connect to.
Corpus
buzzword - fancy words
shrug
“Dog’s breakfast” is a chiefly British informal idiom referring to a confused, messy, or poorly executed situation or thing
Jeff is a genius lecturer; the online format really limits how much gets across. Watching Jeff teach with animated gestures, I suddenly felt the kind of love a mother has for her baby. Whose kid is this smart? It's our child Jeff!
Week 3: Parallelization Concepts
AWS structure
EC2 (data center) - Lambda; S3; Lightsail; SageMaker (services surrounding EC2)
Parallelization in General
Quick Survey Question, for Intuition-Building
- Are humans capable of “true” multi-tasking?
- As in, doing two things at the exact same time?
- (Or, do we instead rapidly switch back and forth between tasks?)
The Answer
- (From what we understand, at the moment, by way of studies in neuroscience/cognitive science/etc…)
- Humans are not capable of true multitasking! In CS terms, this would be called multiprocessing (more on this later)
- We are capable, however, of various modes of concurrency!
| | Multithreading | Asynchronous Execution |
|---|---|---|
| Unconsciously (you do it already, “naturally”) | Focus on one speaker within a loud room, with tons of other conversations entering your ears | Put something in oven, set alarm, go do something else, take out of oven once alarm goes off |
| Consciously (you can do it with effort/practice) | Pat head (up and down) and rub stomach (circular motion) “simultaneously” | Throw a ball in the air, clap 3 times, catch ball |
- You need to prepare training data for LLMs by cleaning and deduplicating 100TB of web-scraped text (cleaning is parallelizable; deduplicate is not)
- You are building a RAG system that requires embedding and indexing millions of documents in parallel
- You need to extract structured data from millions of PDFs using vision models for document AI (fine-tune the model? - embarrassingly parallel / not embarrassingly parallel)
- You are preprocessing multimodal datasets with billions of image-text pairs for foundation model training
- You need to run quality filtering on petabytes of Common Crawl data for training dataset
- You are generating synthetic training data using LLMs to augment limited real-world datasets
- You need to transform and tokenize text across 100+ languages for multilingual AI
- You are building real-time data pipelines that process streaming data for online learning
not embarrassingly parallel - EC2: 8 computers? - 2 vCPUs - cores?
Even if we only have 2 vCPUs, they can be split into 8 independent spaces (virtual computers); re-virtualized VMs
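The "embarrassingly parallel" shape from the scenarios above can be sketched as: split the input into independent partitions, process each with no shared state, then merge. It runs serially here, but each partition could equally go to its own vCPU, worker, or machine (the data and the clean step are made up for illustration):

```python
# Embarrassingly parallel shape: partition -> independent map -> merge.
# No partition needs to talk to any other, so this scales out trivially.

def partition(items, n_parts):
    # Round-robin split into n_parts independent slices
    return [items[i::n_parts] for i in range(n_parts)]

def clean(doc):
    return doc.strip().lower()   # stand-in per-record work (no shared state)

docs = ["  Hello ", "WORLD", " Big ", "DATA  ", "  spark", "notes  ", "EC2", " s3 "]
parts = partition(docs, 4)                      # e.g. 4 "virtual computers"
processed = [[clean(d) for d in p] for p in parts]
merged = [d for p in processed for d in p]
print(sorted(merged))
```

Deduplication, by contrast, is not embarrassingly parallel: every partition would need to compare against every other, which is why it needs coordination.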
Week 4: DuckDB, Polars, File Formats
separating computation from the data it's performed on
Polars helps you immediately when you hit one of those walls
Q for Jeff
- I didn't find the folders I created in CloudShell (the temporary shell)
- Why does the public DNS change all the time? Should I update my config every time I try to connect?
Storage
S3 - storage (object storage); no hierarchy
All files in S3 are identified only by a key (like a hash)
Data Format
Tabular - MySQL, SQLite, PostgreSQL - plain text (CSV, TSV, FWF)
Document-based: MongoDB - JSON, XML
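Both families have stdlib stand-ins, which makes the contrast easy to see: sqlite3 for the tabular/relational side and json for the document side (toy data, illustrative only):

```python
import json
import sqlite3

# Tabular: fixed columns, rows in a relational table (sqlite3 standing in
# for MySQL/PostgreSQL here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")
names = [row[0] for row in con.execute("SELECT name FROM users ORDER BY id")]

# Document-based: nested, schema-flexible records (JSON standing in for
# MongoDB documents). Note the list field, which has no tabular equivalent
# without an extra table.
doc = json.loads('{"id": 1, "name": "ada", "tags": ["math", "cs"]}')

print(names, doc["tags"])
```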
Polars DataFrames always print the shape and the data type of each column
Week 5: Data Engineering
What is data engineering?
- collecting data
- managing storage
- cataloging
- making it available for application like analytics & ML
- security
- lifecycle management
Telemetry Log
- time1 user5 hearted photo133
- time2 user1 followed user2
- ……
tombstone for each blob of data
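The append-only + tombstone idea in a few lines: a "delete" is just one more appended event, and readers reconstruct the current state by replaying the log from the start (toy sketch, not a real storage engine):

```python
# Append-only log: nothing is ever edited in place. A "delete" is just one
# more appended event -- a tombstone -- and readers skip tombstoned keys.

log = []  # in a real system this is a file/stream that only grows

def append(event):
    log.append(event)

append({"op": "put", "key": "photo133", "value": "cat.jpg"})
append({"op": "put", "key": "photo134", "value": "dog.jpg"})
append({"op": "tombstone", "key": "photo133"})   # the "delete"

def current_state(log):
    state = {}
    for ev in log:                 # replay the log from the beginning
        if ev["op"] == "put":
            state[ev["key"]] = ev["value"]
        elif ev["op"] == "tombstone":
            state.pop(ev["key"], None)
    return state

print(current_state(log))   # photo133 is gone, but its history survives
print(len(log))             # all 3 events are still in the log
```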
OLTP & OLAP
OLTP (Online Transaction Processing) systems handle fast, high-volume transactional data for day-to-day operations, while OLAP (Online Analytical Processing) systems are designed for analyzing large datasets to identify trends and support decision-making
Analytical data is higher-level/aggregated; transactional data is event-based.
Database administrators therefore closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions
Telemetry -> Analytics
~ Glue Crawler x Data Catalog x Athena - OLAP (Read-only copy of the Telemetry Log)
Fact table - Dim table 1, Dim table 2, Dim table 3, Dim table 4…
Star schema with the fact table at the center - illustration
Athena transforms the transactional data into ready-to-analyze data (the results can then feed ML like clustering or PCA).
Lab
for urgent shell
tmux new -s <name>
tmux ls
exit
tmux a -t <name>
Question:
- why is the Data Catalog not included in the Athena part? - solved in the second hour
Week 6: Spark
Q for Jeff
description in A03 - Configure security group for SSH access? what does this mean?
Use Spark rather than Hadoop
L1-L4 cache + registers - attached to the processor
Spark is faster because it works closer to the processor (in main memory / RAM).
Hadoop (MapReduce) streams to and from SSD/HDD.
Week 7/8: Spark 2
Spark MLlib
Use Spark to do text analysis and sentiment analysis
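A toy lexicon-based sentiment scorer, just to show the per-document transform that Spark/MLlib would parallelize across a cluster (the word lists are made up; real MLlib pipelines use tokenizers, TF-IDF features, and trained models rather than a lexicon):

```python
# Minimal lexicon-based sentiment sketch -- not the MLlib API, only the kind
# of independent per-document work Spark distributes. Toy word lists.
POS = {"good", "great", "love", "happy"}
NEG = {"bad", "awful", "hate", "sad"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

docs = ["I love this great course", "awful sad day", "it is a day"]
labels = [sentiment(d) for d in docs]
print(labels)
```

In Spark this per-document function would run as a map over a distributed DataFrame/RDD, one partition per executor.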
Week 10: ETL Pipeline Orchestration with Airflow
learn to use triggers
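At its core, a trigger fires a task once everything upstream of it in the DAG has finished. A pure-Python sketch of that dependency-ordered execution (this is not the Airflow API; the task names are made up, and it assumes the graph has no cycles):

```python
# What Airflow does at its core: run tasks in dependency (DAG) order, only
# "triggering" a task once everything upstream has finished.
deps = {
    "extract":   [],
    "transform": ["extract"],
    "load":      ["transform"],
    "report":    ["load", "transform"],
}

def run_order(deps):
    done, order = set(), []
    while len(done) < len(deps):          # assumes no cycles
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)        # "trigger" the task
                done.add(task)
    return order

print(run_order(deps))
```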
Week 11&12: Vec DB & Inclass OH
Airflow is for batch processing - more OLAP - more analytics; for OLTP/streaming - Kafka
Airflow is batch-oriented, so it isn't suitable for real-time systems like ChatGPT
PostgreSQL (descended from the Ingres DB): Postgres is good at full CRUD, not just inserting data.
Dev tip
Error-driven development
On ./setup-airflow.sh and ./run-airflow.sh
requirements:
- opentelemetry
- the airflow env is pretty heavy - because all the requirements are pinned with ==
airflow standalone runs more like an async/background process - the opposite of a JS await-style blocking call; a blocking wait is the rare case [if you ran a blocking wait directly in your Linux session, everything would stop, because even VS Code is running on it]
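The async contrast can be shown with stdlib asyncio: two awaited waits run concurrently under gather, so total time is roughly the max of the delays, not their sum (toy delays; a blocking sleep in the same place would stall everything, which is the note's point):

```python
import asyncio
import time

# Two fake I/O waits run concurrently with asyncio.gather.
async def job(name, delay):
    await asyncio.sleep(delay)   # a blocking time.sleep here would stall all tasks
    return name

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(job("a", 0.05), job("b", 0.05))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))  # roughly 0.05s, not 0.10s
```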
Locally run Airflow
go to Vocareum -
ls -lah
cd .aws
cat credentials
then you can configure those credentials on your local machine and shut down EC2 so it doesn't cost you money
OH question
Why is my new AWS account so costly - almost $1/day? I can't figure it out from the bill details.
Final Proj
- Which subreddit types correspond to Chinese-internet "小作文" (xiaozuowen, "little essay") posts on Reddit?
In the Chinese context, the "小作文" (especially in online public-opinion spaces like Weibo, Douban, and Xiaohongshu) is a text type with very distinctive socio-pragmatic features.
To find its Reddit counterpart, first break down the xiaozuowen's content structure + communicative pragmatics + emotional function, then match those to Reddit's genre ecosystem.
🧩 1. What is a "小作文" in the Chinese context?
| Dimension | Characteristics |
|---|---|
| Narrative form | First-person long-form text (usually 500+ characters) recounting a specific event or emotional experience. |
| Language style | High emotional intensity, narration mixed with commentary, often sarcastic, accusatory, or reflective. |
| Pragmatic function | Uses "storytelling" to express an attitude, stance, or emotion (moral judgment, venting, self-defense, etc.). |
| Social context | Common around social-media hot-button events or in emotional-discourse spaces (openers like "I just want to say a few words from the heart"). |
| Interaction expectation | Not plain narration: the author hopes for resonance, comfort, support, solidarity, or a verdict. |
In other words,
a "小作文" is not a writing genre but an emotionalized, narrativized act of self-expression.
🧠 2. The Reddit content ecosystem corresponding to "小作文"
On Reddit, the closest matches are three families of subreddits:
| Reddit category | Representative subreddits | Corresponding xiaozuowen function |
|---|---|---|
| 1️⃣ Confessional Communities | 🔸 r/offmychest 🔸 r/TrueOffMyChest 🔸 r/confession | Venting / self-disclosure posts. Like "I can't hold it in anymore, let me tell you what happened today." |
| 2️⃣ Moral Judgment Communities | 🔸 r/AmItheAsshole (AITA) 🔸 r/AITAH 🔸 r/AmIwrong | Self-defense posts. Like "Here's what happened - do you think I was wrong?" |
| 3️⃣ Relational Advice Communities | 🔸 r/relationship_advice 🔸 r/relationships 🔸 r/TwoHotTakes | Emotional-dilemma posts. Like "My boyfriend and I had a fight and I don't know what to do." |
🔍 3. Semantic alignment (Chinese "小作文" vs. Reddit post types)
| Xiaozuowen tone type | Reddit counterpart | Example post style |
|---|---|---|
| 💔 Venting ("I'm so exhausted") | r/offmychest / r/TrueOffMyChest | "I've been bottling this up for months, but I need to tell someone…" |
| ⚖️ Self-defense ("Please judge this for me") | r/AmItheAsshole / r/AITAH | "My friend said I'm selfish for not attending his wedding. Was I wrong?" |
| 🧩 Help-seeking ("What should I do") | r/relationship_advice / r/relationships | "My partner hides things from me — should I confront them or let it go?" |
| 🔥 Accusatory ("I'm fed up with X") | r/TwoXChromosomes / r/offmychest / r/TrueOffMyChest | "I'm done being polite about how people treat me at work." |
| 🎭 Performative / ironic ("I don't want to say this, but I must") | r/unpopularopinion / r/TrueAskReddit | "People pretend to care about mental health, but they really don't." |
💡 4. Pragmatic alignment (xiaozuowen and Reddit)
| Feature | Chinese "小作文" | Reddit equivalent |
|---|---|---|
| Text form | First-person long-form | Likewise dominated by first-person storytelling |
| Emotional intensity | High (anger, sadness, accusation, helplessness) | High, especially in offmychest and AITA |
| Interaction mechanics | The comment section delivers verdicts, empathy, and support | Reddit comments work the same way (upvote/downvote + comments) |
| Amplification | Social algorithms boost emotional content | Reddit's front-page ranking (Hot/Best) likewise favors high-emotion, high-engagement content |
| Social roles | "narrator" vs. "rubbernecking jury" | "OP (Original Poster)" vs. "Reddit Jury" |
🧭 5. Summary: the layers where "小作文" maps onto Reddit
| Level | English definition | Chinese-side counterpart |
|---|---|---|
| Macro discourse type | Personal Emotional Narrative Posts | personal emotional-narrative posts |
| Mid-level community type | Confessional / Moral / Relationship Communities | confessional / moral-judgment / relationship communities |
| Representative subreddits | r/offmychest, r/AITA, r/relationship_advice | the hubs where xiaozuowen-style long posts concentrate |
| Pragmatic definition | "Storytelling for empathy, validation, or moral judgment" | narration seeking empathy, validation, or a verdict |
✅ One-sentence takeaway:
The Reddit communities equivalent to the Chinese-internet "小作文" are:
r/offmychest, r/AITA, r/confession, r/relationship_advice
Together they constitute the "Personal Narrative & Emotional Disclosure Posts" niche,
i.e. Reddit's xiaozuowen ecosystem.
1. Comments table (comment data)
The schema of all user comments:
| Column | Type | Meaning | Possible uses |
|---|---|---|---|
| id | string | Unique comment ID | Joining datasets, tracking user behavior |
| subreddit | string | Subreddit the comment belongs to | Distinguish community category (e.g. r/confession, r/advice, r/amitheasshole) |
| author | string | Comment author | User-distribution analysis, finding highly active users |
| body | string | Comment text | NLP (sentiment analysis, topic analysis, etc.) |
| score | long | Comment score (upvotes − downvotes) | Measures how much the community endorses the comment |
| created_utc | long | Creation time (Unix timestamp) | Time-series trends, daily/weekly activity analysis |
| parent_id | string | Parent comment or post ID | Build thread hierarchy, measure discussion depth |
| link_id | string | ID of the post the comment belongs to | Comment ↔ submission joins |
| controversiality | long | "Controversiality" index | Gauge polarization in the comment section |
| gilded | long | Number of Reddit awards received | Measures community "recognition" or emotional resonance |
| date | date | Converted date (YYYY-MM-DD) | Daily/weekly/monthly aggregation |
Summary:
→ The comments table is the more "emotional" layer, suited to NLP / sentiment analysis / community-response-pattern analysis.
→ Key fields: body, score, created_utc, controversiality, subreddit.
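As the parent_id row suggests, comment depth falls out of following parent_id back up to the submission. A pure-Python sketch on a hypothetical mini-dataset (in real Reddit dumps the prefixes are t3_ for submissions and t1_ for comments):

```python
# Rebuilding comment depth from parent_id (hypothetical mini-dataset).
# parent_id points at either the submission ("t3_...") or another comment
# ("t1_..."); depth = how many hops a comment is from the post.
comments = [
    {"id": "c1", "parent_id": "t3_post1"},   # top-level comment
    {"id": "c2", "parent_id": "t1_c1"},      # reply to c1
    {"id": "c3", "parent_id": "t1_c2"},      # reply to c2
    {"id": "c4", "parent_id": "t3_post1"},   # another top-level comment
]

def depths(comments):
    by_id = {c["id"]: c for c in comments}
    def depth(c):
        parent = c["parent_id"]
        if parent.startswith("t3_"):          # parent is the submission itself
            return 1
        return 1 + depth(by_id[parent[3:]])   # strip the "t1_" prefix, recurse
    return {c["id"]: depth(c) for c in comments}

print(depths(comments))
```

The same walk gives discussion depth per post once the comments are grouped by link_id.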
2. Submissions table (post data)
The schema of the posts themselves, per subreddit:
| Column | Type | Meaning | Possible uses |
|---|---|---|---|
| id | string | Post ID | Primary key |
| subreddit | string | Subreddit posted to | Distinguish community topics |
| author | string | Post author | Join with the comments table to analyze posting/commenting overlap |
| title | string | Post title | NLP topic-word analysis |
| selftext | string | Post body (empty for link posts) | Text analysis or emotion detection |
| score | long | Post score | Measures popularity |
| created_utc | long | Posting time | Time-trend analysis |
| num_comments | long | Number of comments | Measures community engagement |
| url | string | External URL for link posts | Distinguish text posts from link posts |
| over_18 | boolean | NSFW flag | Content-safety filtering |
| date | date | Converted date (YYYY-MM-DD) | Time aggregation |
NLP section: verified again in a PySpark session on the master node:
>>> df_comments.printSchema()
root
|-- id: string (nullable = true)
|-- subreddit: string (nullable = true)
|-- author: string (nullable = true)
|-- body: string (nullable = true)
|-- score: long (nullable = true)
|-- created_utc: long (nullable = true)
|-- parent_id: string (nullable = true)
|-- link_id: string (nullable = true)
|-- controversiality: long (nullable = true)
|-- gilded: long (nullable = true)
|-- date: date (nullable = true)
>>> df_submissions.printSchema()
root
|-- id: string (nullable = true)
|-- subreddit: string (nullable = true)
|-- author: string (nullable = true)
|-- title: string (nullable = true)
|-- selftext: string (nullable = true)
|-- score: long (nullable = true)
|-- created_utc: long (nullable = true)
|-- num_comments: long (nullable = true)
|-- url: string (nullable = true)
|-- over_18: boolean (nullable = true)
|-- date: date (nullable = true)
>>>
Summary:
→ The submissions table is the "information publishing" layer, suited to EDA, behavioral analysis, posting trends, and linked-comment analysis.
→ Key fields: title, selftext, score, num_comments, created_utc.