Week1: Course Overview
Cloud
develop locally with scalability in mind
Disaggregated Economic Accounts - database - Big Data
What kind of big? Memory-intensive or compute-intensive?
counterfactual - hit a wall
Self-intro
Apache Airflow & Prefect
TA - to email and get help
Course targets
learn Linux twice
one on local machine; one on virtual machine (computing clusters)
sandboxed server - doesn’t have the same architecture as your laptop
Midterm proj
single assignment
Final proj
3-4 people group assignment
Big data
what is big data?
a lot is happening online - telemetry
All the tiny things you do online are recorded. Mouse tracking, eye tracking …
side-channel attack - differential power analysis
where does it come from?
user-generated content
health and scientific data - for archival purposes
How is it being created?
graph structure - relationship
Apache (web server) log files
never delete - always appending
A log of every data transaction
log -> data lake (sys admin - raw storage) -> data warehouse (data scientist - higher/aggregated level)
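A minimal sketch of parsing one Apache access-log line in Common Log Format with a regex (the sample line is hypothetical, purely illustrative):

```python
import re

# One Common Log Format line, as an Apache web server would append it
# (hypothetical example record)
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# One named group per field: host, identity, user, timestamp, request, status, bytes
pattern = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

record = pattern.match(line).groupdict()
print(record["host"], record["status"], record["size"])
```

Each parsed record would then be appended downstream (log → lake → warehouse), never edited in place.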
RAG and Context Engineering: The New Data Pipeline
Raw Data -> Data Lake -> Processing -> Vector DB -> LLM Context
Key Components:
- Data Lakes (S3, Azure Data Lake): Store massive unstructured data
- Data Warehouses (Snowflake, BigQuery): Structured data for context
- Vector Databases (Pinecone, Weaviate, Qdrant): Semantic search
- Embedding Models: Convert data to vectors
- Orchestration (Airflow, Prefect): Manage the pipeline
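The retrieval step of that pipeline boils down to nearest-neighbor search over embedding vectors. A toy sketch with hand-made 3-d vectors (in a real system these come from an embedding model, and a vector DB like Pinecone/Weaviate/Qdrant does the search with approximate-nearest-neighbor indexes):

```python
import math

# Toy "embeddings": hand-made vectors, purely illustrative.
docs = {
    "doc_cats":   [0.9, 0.1, 0.0],
    "doc_dogs":   [0.2, 0.9, 0.1],
    "doc_stocks": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend this is the embedded user question

def cosine(a, b):
    # Cosine similarity: dot product over the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Retrieve the most similar document; its text would become LLM context
best = max(docs, key=lambda k: cosine(query, docs[k]))
print(best)
```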
Relative Data Sizes
This framework classifies data sizes based on two key questions:
- Can the data be processed on a single machine?
- Can the data be stored on a single machine?
Depending on the answers, data falls into four categories:
| Can be processed on single machine? | Can be stored on single machine? | Category | Processing Approach | Explanation | Examples |
|---|---|---|---|---|---|
| Yes | Yes | Small | Single-machine processing | The dataset is small enough to fit on a single laptop/PC and can be handled by one machine without advanced tools. | An Excel sheet, a few million rows in CSV |
| Yes | No | Medium | Data streaming | The dataset is too large to store fully on one machine, but can be processed sequentially in chunks (streaming). | Real-time logs, stock market tick data |
| No | Yes | Medium | Parallel processing | The dataset fits on one machine’s storage, but computation is too heavy for one CPU; requires multiple cores or parallel computing. | A few terabytes of data that need machine learning or heavy analytics |
| No | No | Big! | Parallel + distributed processing | The dataset is so large it cannot be stored or processed on a single machine; requires distributed storage and distributed computing. | Hadoop, Spark, social network data, clickstream logs |
Summary:
- Small → Stored and processed on one laptop.
- Medium (Streaming) → Cannot be stored fully, but can be processed sequentially.
- Medium (Parallel) → Can be stored, but needs multiple cores or nodes to process.
- Big → Cannot be stored or processed on one machine; requires distributed systems.
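The medium/streaming case can be sketched in a few lines: process the source in fixed-size chunks so only one chunk ever sits in RAM. The generator below is a stand-in for a file too big to load:

```python
# Streaming sketch: process a "too big for RAM" source chunk by chunk.
# A generator stands in for reading a huge CSV/log file line by line.

def fake_big_source(n_rows):
    for i in range(n_rows):
        yield i  # pretend each i is a parsed row

def streamed_sum(rows, chunk_size=1000):
    total = 0
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            total += sum(chunk)   # only one chunk in memory at a time
            chunk.clear()
    total += sum(chunk)           # leftover partial chunk
    return total

result = streamed_sum(fake_big_source(10_000))
print(result)  # same answer as summing everything at once
```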
Corpus
webkinz
yin and yang balance
terabytes
tombstone
data lake
data warehouse
Week2: Cloud Computing
Questions for Jeff
- no access to two of the assignments
- Lab01 Question 5-4: new files not in the deliverables
- no output/ folder could be pushed
- more expensive - usually it is
Walls for data-related work
- process on a single machine? (computation - Medium - Parallel Processing - RAM would explode)
- store on a single machine? (storage - Medium - Data Streaming)
! Multiprocessing is not multithreading
| Feature | Multiprocessing | Multithreading |
|---|---|---|
| Memory space | Each process is independent and isolated from the others | Threads share the same process's memory |
| Resource overhead | Higher cost to create and switch | Lower overhead, fast switching |
| Safety | Safer: one process crashing doesn't affect the others | Riskier: one thread crashing can take down the whole process |
| Parallelism | Suits multi-core CPUs; truly parallel | In some languages (e.g. Python) limited by the GIL, so not necessarily truly parallel |
| Communication | Inter-process communication (IPC), e.g. pipes, message queues | Shared variables directly; more convenient but error-prone (needs locks) |
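A minimal sketch of the distinction in Python, using a thread pool on an I/O-bound task; the timing shows the waits overlapping even under the GIL:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# I/O-bound task: mostly waiting, so threads overlap nicely despite the GIL.
def fetch(i):
    time.sleep(0.05)  # stand-in for a network/disk wait
    return i * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, range(5)))
elapsed = time.perf_counter() - start

print(results, round(elapsed, 2))
# 5 tasks x 0.05s finish in roughly 0.05s total, not 0.25s: the waits overlap.
# For CPU-bound work in Python you'd swap in ProcessPoolExecutor instead:
# separate processes sidestep the GIL, at the cost of higher startup/IPC
# overhead (exactly the trade-off in the table above).
```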
Pre-cloud approach: Make one computer faster and faster (Moore’s Law - computer engineering)
transistors and chips
Configuration & Set-up
The 3 service categories you use most frequently as an AWS account manager: compute, database, storage
Note: storage maps to the data lake, while the database maps to the data warehouse.
- EC2 - Elastic Cloud Computing
- S3 - Simple Storage Service
Debug
- When using the Linux command line to connect to EC2, the dsan_6000 folder can't be opened in WSL at the same time.
- When using Ctrl+Shift+P to deploy the config file, the three code blocks shown on Jeff's machine don't show up automatically.
perm: You can use a temporary Linux shell (CLI) directly via the AWS web page. This won't let you run notebooks (GUI), so think of it as a CLI-only version of the EC2 server (computer) we are connecting to. That instance lives in an AWS EC2 data center somewhere in Virginia.
S3, by contrast, is just a service you call from your computer - there is no remote machine to connect to.
Corpus
buzzword - fancy words
shrug
“Dog’s breakfast” is a chiefly British informal idiom referring to a confused, messy, or poorly executed situation or thing
Jeff is a genius lecturer; the online format really limits how much gets across. Watching Jeff teach with animated gestures, I suddenly felt the kind of love a mother has for her baby. Whose kid is this smart? It's our child Jeff!
Week 3: Parallelization Concepts
AWS structure
EC2 (data center) - Lambda; S3; Lightsail; SageMaker (services surrounding EC2)
Parallelization in General
Quick Survey Question, for Intuition-Building
- Are humans capable of “true” multi-tasking?
- As in, doing two things at the exact same time?
- (Or, do we instead rapidly switch back and forth between tasks?)
The Answer
- (From what we understand, at the moment, by way of studies in neuroscience/cognitive science/etc…)
- Humans are not capable of true multitasking! In CS terms, this would be called multiprocessing (more on this later)
- We are capable, however, of various modes of concurrency!
| | Multithreading | Asynchronous Execution |
|---|---|---|
| Unconsciously (you do it already, “naturally”) | Focus on one speaker within a loud room, with tons of other conversations entering your ears | Put something in oven, set alarm, go do something else, take out of oven once alarm goes off |
| Consciously (you can do it with effort/practice) | Pat head (up and down) and rub stomach (circular motion) “simultaneously” | Throw a ball in the air, clap 3 times, catch ball |
- You need to prepare training data for LLMs by cleaning and deduplicating 100TB of web-scraped text (cleaning is parallelizable; deduplicate is not)
- You are building a RAG system that requires embedding and indexing millions of documents in parallel
- You need to extract structured data from millions of PDFs using vision models for document AI (fine-tune the model? - embarrassingly parallel / not embarrassingly parallel)
- You are preprocessing multimodal datasets with billions of image-text pairs for foundation model training
- You need to run quality filtering on petabytes of Common Crawl data for training dataset
- You are generating synthetic training data using LLMs to augment limited real-world datasets
- You need to transform and tokenize text across 100+ languages for multilingual AI
- You are building real-time data pipelines that process streaming data for online learning
not embarrassingly parallel - EC2: 8 computers? - 2 vCPUs - cores?
Even if we only have 2 vCPUs, they can be split into 8 independent spaces (virtual computers); re-virtualized VMs
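The "embarrassingly parallel" shape from the scenarios above can be sketched as: split the input into independent partitions, process each with no shared state, then merge. It runs serially here, but each partition could equally go to its own vCPU, worker, or machine (the data and the clean step are made up for illustration):

```python
# Embarrassingly parallel shape: partition -> independent map -> merge.
# No partition needs to talk to any other, so this scales out trivially.

def partition(items, n_parts):
    # Round-robin split into n_parts independent slices
    return [items[i::n_parts] for i in range(n_parts)]

def clean(doc):
    return doc.strip().lower()   # stand-in per-record work (no shared state)

docs = ["  Hello ", "WORLD", " Big ", "DATA  ", "  spark", "notes  ", "EC2", " s3 "]
parts = partition(docs, 4)                      # e.g. 4 "virtual computers"
processed = [[clean(d) for d in p] for p in parts]
merged = [d for p in processed for d in p]
print(sorted(merged))
```

Deduplication, by contrast, is not embarrassingly parallel: every partition would need to compare against every other, which is why it needs coordination.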
Week 4: DuckDB, Polars, File Formats
separating computation from the data it's performed on
Polars helps you immediately when you hit one of those walls
Q for Jeff
- I didn't find the folders I created in CloudShell (the temporary shell)
- Why does the public DNS change all the time? Should I update my config every time I try to connect?
Storage
S3 - storage (object storage); no hierarchy
All files in S3 are identified only by a key (like a hash)
Data Format
Tabular - MySQL, SQLite, PostgreSQL - plain text (CSV, TSV, FWF)
Document-based: MongoDB - JSON, XML
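Both families have stdlib stand-ins, which makes the contrast easy to see: sqlite3 for the tabular/relational side and json for the document side (toy data, illustrative only):

```python
import json
import sqlite3

# Tabular: fixed columns, rows in a relational table (sqlite3 standing in
# for MySQL/PostgreSQL here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")
names = [row[0] for row in con.execute("SELECT name FROM users ORDER BY id")]

# Document-based: nested, schema-flexible records (JSON standing in for
# MongoDB documents). Note the list field, which has no tabular equivalent
# without an extra table.
doc = json.loads('{"id": 1, "name": "ada", "tags": ["math", "cs"]}')

print(names, doc["tags"])
```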
Polars DataFrames always print the shape and the data type of each column
Week 5: Data Engineering
What is data engineering?
- collecting data
- managing storage
- cataloging
- making it available for application like analytics & ML
- security
- lifecycle management
Telemetry Log
- time1 user5 hearted photo133
- time2 user1 followed user2
- ……
tombstone for each blob of data
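The append-only + tombstone idea in a few lines: a "delete" is just one more appended event, and readers reconstruct the current state by replaying the log from the start (toy sketch, not a real storage engine):

```python
# Append-only log: nothing is ever edited in place. A "delete" is just one
# more appended event -- a tombstone -- and readers skip tombstoned keys.

log = []  # in a real system this is a file/stream that only grows

def append(event):
    log.append(event)

append({"op": "put", "key": "photo133", "value": "cat.jpg"})
append({"op": "put", "key": "photo134", "value": "dog.jpg"})
append({"op": "tombstone", "key": "photo133"})   # the "delete"

def current_state(log):
    state = {}
    for ev in log:                 # replay the log from the beginning
        if ev["op"] == "put":
            state[ev["key"]] = ev["value"]
        elif ev["op"] == "tombstone":
            state.pop(ev["key"], None)
    return state

print(current_state(log))   # photo133 is gone, but its history survives
print(len(log))             # all 3 events are still in the log
```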
OLTP & OLAP
OLTP (Online Transaction Processing) systems handle fast, high-volume transactional data for day-to-day operations, while OLAP (Online Analytical Processing) systems are designed for analyzing large datasets to identify trends and support decision-making
Analytical data is higher-level/aggregated; transactional data is event-based.
Database administrators therefore closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions
Telemetry -> Analytics
~ Glue Crawler x Data Catalog x Athena - OLAP (Read-only copy of the Telemetry Log)
Fact table - Dim table 1, Dim table 2, Dim table 3, Dim table 4…
Star schema with the fact table at the center - illustration
Athena transforms the transactional data into ready-to-analyze data (the results can then feed ML like clustering or PCA).
Lab
for urgent shell
tmux new -s <name>
tmux ls
exit
tmux a -t <name>
Question:
- why is the Data Catalog not included in the Athena part? - solved in the second hour
Week 6: Spark
Q for Jeff
description in A03 - Configure security group for SSH access? what does this mean?
Use Spark rather than Hadoop
L1-L4 cache + registers - attached to the processor
Spark is faster because it works closer to the processor (in main memory / RAM).
Hadoop (MapReduce) streams to and from SSD/HDD.
Week 7/8: Spark 2
Spark MLlib
Use Spark to do text analysis and sentiment analysis
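A toy lexicon-based sentiment scorer, just to show the per-document transform that Spark/MLlib would parallelize across a cluster (the word lists are made up; real MLlib pipelines use tokenizers, TF-IDF features, and trained models rather than a lexicon):

```python
# Minimal lexicon-based sentiment sketch -- not the MLlib API, only the kind
# of independent per-document work Spark distributes. Toy word lists.
POS = {"good", "great", "love", "happy"}
NEG = {"bad", "awful", "hate", "sad"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

docs = ["I love this great course", "awful sad day", "it is a day"]
labels = [sentiment(d) for d in docs]
print(labels)
```

In Spark this per-document function would run as a map over a distributed DataFrame/RDD, one partition per executor.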
Week 10: ETL Pipeline Orchestration with Airflow
learn to use triggers
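At its core, a trigger fires a task once everything upstream of it in the DAG has finished. A pure-Python sketch of that dependency-ordered execution (this is not the Airflow API; the task names are made up, and it assumes the graph has no cycles):

```python
# What Airflow does at its core: run tasks in dependency (DAG) order, only
# "triggering" a task once everything upstream has finished.
deps = {
    "extract":   [],
    "transform": ["extract"],
    "load":      ["transform"],
    "report":    ["load", "transform"],
}

def run_order(deps):
    done, order = set(), []
    while len(done) < len(deps):          # assumes no cycles
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)        # "trigger" the task
                done.add(task)
    return order

print(run_order(deps))
```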
Week 11&12: Vec DB & Inclass OH
Airflow is for batch processing - more OLAP - more analytics; for OLTP/streaming - Kafka
Airflow is batch-oriented, so it isn't suitable for real-time systems like ChatGPT
PostgreSQL (descended from the Ingres DB): Postgres is good at full CRUD, not just inserting data.
Dev tip
Error-driven development
On ./setup-airflow.sh and ./run-airflow.sh
requirements:
- opentelemetry
- the airflow env is pretty heavy - because all the requirements are pinned with ==
airflow standalone runs more like an async/background process - the opposite of a JS await-style blocking call; a blocking wait is the rare case [if you ran a blocking wait directly in your Linux session, everything would stop, because even VS Code is running on it]
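The async contrast can be shown with stdlib asyncio: two awaited waits run concurrently under gather, so total time is roughly the max of the delays, not their sum (toy delays; a blocking sleep in the same place would stall everything, which is the note's point):

```python
import asyncio
import time

# Two fake I/O waits run concurrently with asyncio.gather.
async def job(name, delay):
    await asyncio.sleep(delay)   # a blocking time.sleep here would stall all tasks
    return name

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(job("a", 0.05), job("b", 0.05))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))  # roughly 0.05s, not 0.10s
```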
Locally run Airflow
go to Vocareum -
ls -lah
cd .aws
cat credentials
then you can configure those credentials on your local machine and shut down EC2 so it doesn't cost you money
OH question
Why is my new AWS account so costly - almost $1/day? I can't figure it out from the bill details.
Final Proj
- Which subreddit types correspond to Chinese-internet "小作文" (xiaozuowen, "little essay") posts on Reddit?
In the Chinese context, the "小作文" (especially in online public-opinion spaces like Weibo, Douban, and Xiaohongshu) is a text type with very distinctive socio-pragmatic features.
To find its Reddit counterpart, first break down the xiaozuowen's content structure + communicative pragmatics + emotional function, then match those to Reddit's genre ecosystem.
🧩 1. What is a "小作文" in the Chinese context?
| Dimension | Characteristics |
|---|---|
| Narrative form | First-person long-form text (usually 500+ characters) recounting a specific event or emotional experience. |
| Language style | High emotional intensity, narration mixed with commentary, often sarcastic, accusatory, or reflective. |
| Pragmatic function | Uses "storytelling" to express an attitude, stance, or emotion (moral judgment, venting, self-defense, etc.). |
| Social context | Common around social-media hot-button events or in emotional-discourse spaces (openers like "I just want to say a few words from the heart"). |
| Interaction expectation | Not plain narration: the author hopes for resonance, comfort, support, solidarity, or a verdict. |
In other words,
a "小作文" is not a writing genre but an emotionalized, narrativized act of self-expression.
🧠 2. The Reddit content ecosystem corresponding to "小作文"
On Reddit, the closest matches are three families of subreddits:
| Reddit category | Representative subreddits | Corresponding xiaozuowen function |
|---|---|---|
| 1️⃣ Confessional Communities | 🔸 r/offmychest 🔸 r/TrueOffMyChest 🔸 r/confession | Venting / self-disclosure posts. Like "I can't hold it in anymore, let me tell you what happened today." |
| 2️⃣ Moral Judgment Communities | 🔸 r/AmItheAsshole (AITA) 🔸 r/AITAH 🔸 r/AmIwrong | Self-defense posts. Like "Here's what happened - do you think I was wrong?" |
| 3️⃣ Relational Advice Communities | 🔸 r/relationship_advice 🔸 r/relationships 🔸 r/TwoHotTakes | Emotional-dilemma posts. Like "My boyfriend and I had a fight and I don't know what to do." |
🔍 3. Semantic alignment (Chinese "小作文" vs. Reddit post types)
| Xiaozuowen tone type | Reddit counterpart | Example post style |
|---|---|---|
| 💔 Venting ("I'm so exhausted") | r/offmychest / r/TrueOffMyChest | "I've been bottling this up for months, but I need to tell someone…" |
| ⚖️ Self-defense ("Please judge this for me") | r/AmItheAsshole / r/AITAH | "My friend said I'm selfish for not attending his wedding. Was I wrong?" |
| 🧩 Help-seeking ("What should I do") | r/relationship_advice / r/relationships | "My partner hides things from me — should I confront them or let it go?" |
| 🔥 Accusatory ("I'm fed up with X") | r/TwoXChromosomes / r/offmychest / r/TrueOffMyChest | "I'm done being polite about how people treat me at work." |
| 🎭 Performative / ironic ("I don't want to say this, but I must") | r/unpopularopinion / r/TrueAskReddit | "People pretend to care about mental health, but they really don't." |
💡 4. Pragmatic alignment (xiaozuowen and Reddit)
| Feature | Chinese "小作文" | Reddit equivalent |
|---|---|---|
| Text form | First-person long-form | Likewise dominated by first-person storytelling |
| Emotional intensity | High (anger, sadness, accusation, helplessness) | High, especially in offmychest and AITA |
| Interaction mechanics | The comment section delivers verdicts, empathy, and support | Reddit comments work the same way (upvote/downvote + comments) |
| Amplification | Social algorithms boost emotional content | Reddit's front-page ranking (Hot/Best) likewise favors high-emotion, high-engagement content |
| Social roles | "narrator" vs. "rubbernecking jury" | "OP (Original Poster)" vs. "Reddit Jury" |
🧭 5. Summary: the layers where "小作文" maps onto Reddit
| Level | English definition | Chinese-side counterpart |
|---|---|---|
| Macro discourse type | Personal Emotional Narrative Posts | personal emotional-narrative posts |
| Mid-level community type | Confessional / Moral / Relationship Communities | confessional / moral-judgment / relationship communities |
| Representative subreddits | r/offmychest, r/AITA, r/relationship_advice | the hubs where xiaozuowen-style long posts concentrate |
| Pragmatic definition | "Storytelling for empathy, validation, or moral judgment" | narration seeking empathy, validation, or a verdict |
✅ One-sentence takeaway:
The Reddit communities equivalent to the Chinese-internet "小作文" are:
r/offmychest, r/AITA, r/confession, r/relationship_advice
Together they constitute the "Personal Narrative & Emotional Disclosure Posts" niche,
i.e. Reddit's xiaozuowen ecosystem.
1. Comments table (comment data)
The schema of all user comments:
| Column | Type | Meaning | Possible uses |
|---|---|---|---|
| id | string | Unique comment ID | Joining datasets, tracking user behavior |
| subreddit | string | Subreddit the comment belongs to | Distinguish community category (e.g. r/confession, r/advice, r/amitheasshole) |
| author | string | Comment author | User-distribution analysis, finding highly active users |
| body | string | Comment text | NLP (sentiment analysis, topic analysis, etc.) |
| score | long | Comment score (upvotes − downvotes) | Measures how much the community endorses the comment |
| created_utc | long | Creation time (Unix timestamp) | Time-series trends, daily/weekly activity analysis |
| parent_id | string | Parent comment or post ID | Build thread hierarchy, measure discussion depth |
| link_id | string | ID of the post the comment belongs to | Comment ↔ submission joins |
| controversiality | long | "Controversiality" index | Gauge polarization in the comment section |
| gilded | long | Number of Reddit awards received | Measures community "recognition" or emotional resonance |
| date | date | Converted date (YYYY-MM-DD) | Daily/weekly/monthly aggregation |
Summary:
→ The comments table is the more "emotional" layer, suited to NLP / sentiment analysis / community-response-pattern analysis.
→ Key fields: body, score, created_utc, controversiality, subreddit.
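As the parent_id row suggests, comment depth falls out of following parent_id back up to the submission. A pure-Python sketch on a hypothetical mini-dataset (in real Reddit dumps the prefixes are t3_ for submissions and t1_ for comments):

```python
# Rebuilding comment depth from parent_id (hypothetical mini-dataset).
# parent_id points at either the submission ("t3_...") or another comment
# ("t1_..."); depth = how many hops a comment is from the post.
comments = [
    {"id": "c1", "parent_id": "t3_post1"},   # top-level comment
    {"id": "c2", "parent_id": "t1_c1"},      # reply to c1
    {"id": "c3", "parent_id": "t1_c2"},      # reply to c2
    {"id": "c4", "parent_id": "t3_post1"},   # another top-level comment
]

def depths(comments):
    by_id = {c["id"]: c for c in comments}
    def depth(c):
        parent = c["parent_id"]
        if parent.startswith("t3_"):          # parent is the submission itself
            return 1
        return 1 + depth(by_id[parent[3:]])   # strip the "t1_" prefix, recurse
    return {c["id"]: depth(c) for c in comments}

print(depths(comments))
```

The same walk gives discussion depth per post once the comments are grouped by link_id.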
2. Submissions table (post data)
The schema of the posts themselves, per subreddit:
| Column | Type | Meaning | Possible uses |
|---|---|---|---|
| id | string | Post ID | Primary key |
| subreddit | string | Subreddit posted to | Distinguish community topics |
| author | string | Post author | Join with the comments table to analyze posting/commenting overlap |
| title | string | Post title | NLP topic-word analysis |
| selftext | string | Post body (empty for link posts) | Text analysis or emotion detection |
| score | long | Post score | Measures popularity |
| created_utc | long | Posting time | Time-trend analysis |
| num_comments | long | Number of comments | Measures community engagement |
| url | string | External URL for link posts | Distinguish text posts from link posts |
| over_18 | boolean | NSFW flag | Content-safety filtering |
| date | date | Converted date (YYYY-MM-DD) | Time aggregation |
NLP section: verified again in a PySpark session on the master node:
>>> df_comments.printSchema()
root
|-- id: string (nullable = true)
|-- subreddit: string (nullable = true)
|-- author: string (nullable = true)
|-- body: string (nullable = true)
|-- score: long (nullable = true)
|-- created_utc: long (nullable = true)
|-- parent_id: string (nullable = true)
|-- link_id: string (nullable = true)
|-- controversiality: long (nullable = true)
|-- gilded: long (nullable = true)
|-- date: date (nullable = true)
>>> df_submissions.printSchema()
root
|-- id: string (nullable = true)
|-- subreddit: string (nullable = true)
|-- author: string (nullable = true)
|-- title: string (nullable = true)
|-- selftext: string (nullable = true)
|-- score: long (nullable = true)
|-- created_utc: long (nullable = true)
|-- num_comments: long (nullable = true)
|-- url: string (nullable = true)
|-- over_18: boolean (nullable = true)
|-- date: date (nullable = true)
>>>
Summary:
→ The submissions table is the "information publishing" layer, suited to EDA, behavioral analysis, posting trends, and linked-comment analysis.
→ Key fields: title, selftext, score, num_comments, created_utc.