MongoDB 101

Name

The name "Mongo" comes from "humongous".

A type of NoSQL database

Structure

Document

Uses JSON as the document model: data is represented as JSON-like documents, which keeps the structure flexible

Different from relational schema

Collection

a collection for documents

Database

a collection for collections

Node

a server that hosts a group of databases

Cluster

a group of nodes

Data Modeling in MongoDB

  • Is your app read or write heavy?
  • What data is frequently accessed together?
  • What are your performance considerations?
  • How will your data set grow and scale?

Embedding vs referencing

            | One side                      | Many side
Embedding   | Embedding on the “one” side   | Embedding on the “many” side
Referencing | Referencing on the “one” side | Referencing on the “many” side
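
As a rough illustration of the two options, the sketch below models one author with two books twice: once with the books embedded in the author document, and once with the books stored separately and pointing back to the author. The field names (`author_id`, `books`) and the `_id` values are hypothetical, chosen only for this example.

```python
# Hypothetical sample data: one author ("one" side) with two books ("many" side).

# Embedding on the "one" side: the books live inside the author document.
author_embedded = {
    "_id": "author-1",
    "name": "Jane Austen",
    "books": [{"title": "Emma"}, {"title": "Persuasion"}],
}

# Referencing on the "many" side: each book stores the _id of its author.
author_referenced = {"_id": "author-1", "name": "Jane Austen"}
books_referenced = [
    {"_id": "book-1", "title": "Emma", "author_id": "author-1"},
    {"_id": "book-2", "title": "Persuasion", "author_id": "author-1"},
]

# Embedded: reading the author already gives access to the books.
titles_embedded = [book["title"] for book in author_embedded["books"]]

# Referenced: a second lookup (here, a list filter) is needed to find the books.
titles_referenced = [
    book["title"]
    for book in books_referenced
    if book["author_id"] == author_referenced["_id"]
]

print(titles_embedded)
print(titles_referenced)
```

Embedding tends to suit data that is read together; referencing tends to suit data that grows without bound or is shared between documents.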

MongoDB Atlas

  • Database-as-a-service (DBaaS)
  • Automatically manages MongoDB for you, including:
    • Deployment
    • Monitoring
    • Backup & restoration
    • Archiving
  • Security
  • Built-in Replication

CRUD

MongoDB Query API Syntax Structure

db (keyword) → collection (coll name) → operator (command) → query/filter (criteria) → options (settings)


Example MongoDB find command

// In mongosh, the second argument to find() is the projection;
// sort and limit are applied with cursor methods.
db.collection.find(
  { age: { $gte: 25 } }, // query/filter: Find documents where age is greater than or equal to 25
  { name: 1, age: 1 }    // projection: Include only the name and age fields (plus _id) in the results
)
  .sort({ age: -1 })     // options/settings: Sort the results by age in descending order
  .limit(10);            // options/settings: Limit the number of results to 10
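
To make the parts of the query concrete, here is a small in-memory Python sketch that applies the same logic to a list of dicts: filter, project, sort, then limit. The sample documents and the helper name `find_by_min_age` are hypothetical, and this is not how MongoDB executes the query internally.

```python
def find_by_min_age(documents, min_age=25, limit=10):
    """Mimic the find example: age >= min_age, keep name and age, sort by age descending."""
    # query/filter: keep documents that have an age of at least min_age
    matched = [doc for doc in documents if "age" in doc and doc["age"] >= min_age]
    # projection: keep only the name and age fields
    projected = [
        {key: doc[key] for key in ("name", "age") if key in doc} for doc in matched
    ]
    # sort: age in descending order
    projected.sort(key=lambda doc: doc["age"], reverse=True)
    # limit: at most `limit` results
    return projected[:limit]


people = [
    {"name": "Ana", "age": 31, "city": "Lisbon"},
    {"name": "Ben", "age": 19, "city": "Leeds"},
    {"name": "Chloe", "age": 25, "city": "Lyon"},
]

print(find_by_min_age(people))
```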

Create

db.authors.insertOne({
  "name": "Jane Austen",
  "books": [ "0141439688", "0375757813", "1551114798" ],
  "aliases": [ "Austen, Jane", "Jane Austen" ]
})

Read

db.authors.find({
  "name": "Jane Austen"
})

Update

db.authors.updateOne(
  { name: "Jane Austin" },
  { $set: { name: "Jane Austen" } }
)

Delete

db.authors.deleteOne(
  { name: "Jane Austin" }
)

Hands-on

MongoDB Songs Playlist

Title | Artist | Genre | Duration | Plays
Sucks To Be Ex-QL | $Avg $Max | Pop/Dance | 3:28 | 4
Changes Stream | 2.Pack | Hip Hop | 3:33 | 8
Primary’s Gonna Be Me | MSYNC | Pop | 2:40 | 4
I Haven’t Met AI Yet | MiCachel Bubléson | Jazz | 3:36 | 6
Where Is The Log? | Backed By DBs | Pop | 3:10 | 6
All About that Database | Schema Trainor | Pop | 2:34 | 20
What About $toDate | Doc-Tree | Rock | 3:48 | 18
NoSQL Paradise | CLI-io | Hip Hop | 2:57 | 26
NoS8L Boy | Docuesenece | Rock/Punk | 3:27 | 26
Back or Front | Michael JSON | Pop | 3:11 | 26
Free-Tier Styler | D-BomFunktion MCs | Hip Hop/Electro | 2:55 | 20
Thnks Fr th MdbMemrriz | Doc Out Boy | Rock/Punk | 3:15 | 18
Mocking Data Board | SlimShardy | Hip Hop | 3:31 | 20
My _id, My Ride | Spring Boots | Country | 3:48 | 24
S One | NULLy | Hip Hop | 3:12 | 16
$moreLikeThis | Dolla $earch | Hip Hop/Trap | 2:53 | 32
Let It Beat | The Bugs | 70s/Pop | 3:56 | 32
Cluster Busy | connecTsean Pool | Dancehall | 2:55 | 22
Index of Change (ESR) | Crystal Clusters | Techno | 3:20 | 22
Code Vibin' | Masked WiredTiger | Hip Hop/Trap | 3:01 | 52
URI Hero | IndexBack | Rock | 3:40 | 20
Tiering Up My App | MSYNC | Pop | 3:23 | 34
3 Little Nodes | DB Marley | Reggae | 3:45 | 50
Am I in the Void | Relational Migrators | Industrial Modern/Rock | 3:45 | 24
I Believe I Can Shard | R Klusterly | R&B/Soul | 3:38 | 38
Harder, Better, Faster, Secured | US_West_2 | Hip Hop/Trap | 3:12 | 34
No SCRAM | TLS | R&B | 3:07 | 52
Colores en la Cloud | J. Cloudin | Reggaeton | 3:09 | 40
No Sequel Needed | SaaS Girls | Pop | 3:11 | 40
Clave De Shard | Sharde Dezona | Reggaeton | 2:06 | 48

The badges on MLH are associated with the official MongoDB credentials - finish on


VectorSearch: Beginner to Pro

Do you ever find yourself…

  • Looking for something but you don’t quite have the words?
  • Remembering some characteristics of a book but not the title?
  • Trying to get another sweatshirt just like the one you had back in the day, but you don’t know how to search for it?

In these cases, lexical search (keyword search) no longer works.

Lexical search

What?

  • Keyword search

When?

  • Your text corpus closely matches how users search
  • First pass at text-based relevancy

Vector search

What?

  • Semantic similarities

When?

  • “Vocabulary gap” between corpus and how users search
  • Text, image, audio, video search

Vector example

Think of a store such as Home Depot.

An item's location can be written as [aisle, bin]: a 2-dimensional vector is enough to locate it.

Embeddings

Definition:
Numeric, multi-dimensional representation of a piece of information

Key Points:

  • Capture semantic qualities of data
  • Semantically similar data ends up close together in vector space

Example (Vector Space Representation):

  • dog → [0.243, 0.765, …]
  • cat → [0.293, 0.774, …]
  • apple → [0.443, 0.965, …]
  • orange → [0.493, 0.9774, …]
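
The "close together" idea can be checked with a few lines of Python. This is only a sketch: it uses just the two components shown for each word, while the real vectors continue past the "…", so the numbers illustrate the idea rather than any real embedding model.

```python
import math

# Only the two components shown in the example; the real vectors are longer.
vectors = {
    "dog": [0.243, 0.765],
    "cat": [0.293, 0.774],
    "apple": [0.443, 0.965],
    "orange": [0.493, 0.9774],
}


def distance(first, second):
    """Straight-line distance between two of the example vectors."""
    return math.dist(vectors[first], vectors[second])


# The two animals are closer to each other than an animal is to a fruit.
print(distance("dog", "cat") < distance("dog", "apple"))
# The two fruits are closer to each other than a fruit is to an animal.
print(distance("apple", "orange") < distance("cat", "orange"))
```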

How to embed data

Flow:

  1. Data → Raw input data (text, image, audio, etc.)
  2. Embedding model → Processed by an embedding model
  3. Vector → Converted into a vector representation, e.g. [0.3, 0.1, 0.2, ..., 0.4]

Adding embeddings to existing data

Before:

{
  "_id": "0028608488",
  "title": "David Copperfield's Tales of the Impossible",
  "cover": "https://images.isbndb.com/covers/22/86/9780061052286.jpg",
  "year": 1995,
  "pages": 385,
  "synopsis": "David Copperfield, Arguably The Greatest Illusionist–magician..."
}

After (with embeddings):

{
  "_id": "0028608488",
  "title": "David Copperfield's Tales of the Impossible",
  "cover": "https://images.isbndb.com/covers/22/86/9780061052286.jpg",
  "year": 1995,
  "pages": 385,
  "synopsis": "David Copperfield, Arguably The Greatest Illusionist–magician...",
  "embedding": [
    0.03898080065846443,
    -0.05879044095304909,
    0.04323239979442215,
    ...,
    0.034243063451233547
  ]
}
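
One way to add the field in place is a `$set` update on the existing document. The sketch below only builds the filter and update documents; the helper name `embedding_update` and the three-number vector are hypothetical placeholders, not output from a real embedding model.

```python
def embedding_update(doc_id, vector):
    """Build the filter and update documents that add an embedding in place."""
    query_filter = {"_id": doc_id}
    # $set adds the field if it is missing and leaves the other fields untouched.
    update = {"$set": {"embedding": list(vector)}}
    return query_filter, update


# Placeholder vector for illustration only.
query_filter, update = embedding_update("0028608488", [0.1, -0.2, 0.3])
print(query_filter)
print(update)
```

Assuming a PyMongo collection handle named `collection`, these two documents could be passed to `collection.update_one(query_filter, update)`.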

Recap

  • Embeddings are an array of numbers that capture semantic qualities of data
  • Embeddings are generated by specialized ML models
  • Embeddings can be added in-place into existing MongoDB documents

Vector search

Definition:
Search based on intent/meaning using embeddings

Process:

  1. User submits a query
  2. Query is processed by an Embedding Model
  3. The model converts the query into a Query vector
  4. Perform similarity search (e.g., k=3 nearest neighbors)
  5. Return the most relevant results in vector space
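
The steps above can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embed` is a hypothetical word counter over a made-up vocabulary so the example runs without any model, whereas a real embedding model would capture meaning rather than exact words. The search itself is a brute-force comparison against every document.

```python
import math

# Hypothetical vocabulary for the toy embedding.
VOCABULARY = ["magic", "illusion", "cooking", "recipe", "space", "rocket"]


def toy_embed(text):
    """Toy stand-in for an embedding model: count vocabulary words in the text."""
    words = text.lower().split()
    return [words.count(term) for term in VOCABULARY]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def vector_search(query, documents, k=3):
    query_vector = toy_embed(query)  # steps 2-3: query -> query vector
    scored = [(cosine(query_vector, toy_embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # step 4: rank by similarity
    return [doc for _, doc in scored[:k]]  # step 5: top k results


documents = [
    "magic and illusion on stage",
    "a cooking recipe for pasta",
    "rocket launch into space",
    "illusion tricks explained",
]

print(vector_search("illusion magic show", documents, k=2))
```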

How vector search works in MongoDB

Hierarchical Navigable Small Worlds (HNSW)

  • Creates layered, connected graphs with vectors as nodes, edges created based on distance in vector space
  • Coarse search at top layers, refinement at lower layers
  • Efficient way of searching through large datasets

Source: Towards Data Science

Calculating distance in vector space

Euclidean Distance

  • Measures absolute distance between vectors

Example:

  • dog → [0.243, 0.765, …]
  • cat → [0.293, 0.774, …]
  • Euclidean Distance = √((0.243 - 0.293)² + (0.765 - 0.774)² + …)
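
A minimal Python sketch of the same calculation, using made-up 2-dimensional vectors so the result is easy to check by hand (a 3-4-5 right triangle):

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between components."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```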

Dot product

  • Vector multiplication as a measure of alignment

Example:

  • Vectors: [4, 0, 1] and [3, 1, 2]
  • Calculation:
    (4 × 3) + (0 × 1) + (1 × 2)
    = 12 + 0 + 2
    = 14
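
The same calculation as a short Python sketch, using the two vectors from the example:

```python
def dot_product(a, b):
    """Multiply matching components and add up the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


print(dot_product([4, 0, 1], [3, 1, 2]))  # 14
```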

Cosine similarity

  • Measures the angle between vectors

Example:

  • dog → [0.243, 0.765, …]
  • cat → [0.293, 0.774, …]

Formula:
Cosine Similarity = (A · B) / (||A|| × ||B||)

Where:

  • A · B = dot product of vectors A and B
  • ||A||, ||B|| = magnitude (length) of each vector
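
The formula translates directly into Python. This sketch reuses the vectors from the dot product example, since the dog and cat vectors are truncated; the function name is my own.

```python
import math


def cosine_similarity(a, b):
    """(A · B) / (||A|| × ||B||); 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    if magnitude_a == 0 or magnitude_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (magnitude_a * magnitude_b)


# Dot product is 14, magnitudes are sqrt(17) and sqrt(14).
print(round(cosine_similarity([4, 0, 1], [3, 1, 2]), 4))
# Perpendicular vectors have a similarity of 0.
print(cosine_similarity([1, 0], [0, 1]))
```

Because the magnitudes are divided out, scaling a vector does not change its cosine similarity to another vector; only the direction matters.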

Recap

  • Vector search retrieves documents closest to the query embedding in vector space
  • Use the same embedding model to embed the data you want to search on, and the user queries
  • Distance in vector space is calculated using mathematical functions
  • Cosine similarity works well with most embedding models

Vector Search in MongoDB

Create a vector search index

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 512,
      "similarity": "cosine"
    }
  ]
}

Send a vector search query

pipeline = [
  {
    "$vectorSearch": {
      "index": "vector_index",
      "path": "embedding",
      "queryVector": [0.02421053, -0.022372592, ...],
      "numCandidates": 150,
      "limit": 10
    }
  },
  {
    "$project": {
      "_id": 0,
      "title": 1,
      "score": { "$meta": "vectorSearchScore" }
    }
  }
]
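
In application code the pipeline is usually built from a freshly embedded query. The sketch below wraps the pipeline above in a hypothetical helper, `build_vector_search_pipeline`; the index name and field names are taken from the example, and the two-number query vector is a placeholder. The check that `numCandidates` is at least `limit` reflects my understanding of the stage's requirements and is worth confirming against the current documentation.

```python
def build_vector_search_pipeline(query_vector, limit=10, num_candidates=150):
    """Return a $vectorSearch pipeline for the given query vector."""
    if num_candidates < limit:
        raise ValueError("numCandidates should be at least as large as limit")
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Placeholder query vector; a real one comes from the embedding model.
pipeline = build_vector_search_pipeline([0.1, 0.2], limit=5, num_candidates=50)
print(pipeline[0]["$vectorSearch"]["limit"])
```

Assuming a PyMongo collection handle named `collection`, the result could be run with `collection.aggregate(pipeline)`.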

Integrated platform that simplifies your application architecture

  • Data is automatically synchronized between the database and vector index
  • Developers work with database and vector search via the unified MongoDB Query API
  • Fully managed for you so you can focus on your application
  • Search nodes scale your search workloads independent of the operational database

Benefits

  • Vector search simplified
  • Avoid the synchronization tax
  • Remove operational heavy lifting

Recap

  • To perform vector search in MongoDB, you need to generate embeddings, create a vector search index and send a query
  • The number of dimensions in the vector search index depends on the embedding model used
  • The $vectorSearch stage must be the first stage in an aggregation pipeline that uses it
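
Since the index dimensions depend on the embedding model, a quick sanity check before inserting or querying can catch mismatches early. This is a sketch with a hypothetical helper name, `matches_index`, using an index definition in the same shape as the one in these notes.

```python
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 512,
            "similarity": "cosine",
        }
    ]
}


def matches_index(vector, definition):
    """True if the vector length equals numDimensions of every vector field."""
    vector_fields = [f for f in definition["fields"] if f["type"] == "vector"]
    return all(len(vector) == f["numDimensions"] for f in vector_fields)


print(matches_index([0.0] * 512, index_definition))   # True
print(matches_index([0.0] * 1536, index_definition))  # False
```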

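Pre-filtering (saves time)

Apply a filter condition first to narrow down the documents that the vector search has to consider. The field used in the filter must be indexed as a "filter" field alongside the vector field: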

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "pages"
    }
  ]
}
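
With `pages` indexed as a filter field, the query can pass a `filter` to `$vectorSearch`. The sketch below builds such a stage; the helper name `build_prefiltered_stage`, the index name `vector_index`, the page threshold and the placeholder query vector are all assumptions for illustration.

```python
def build_prefiltered_stage(query_vector, max_pages, limit=10, num_candidates=150):
    """Return a $vectorSearch stage that only considers books up to max_pages."""
    return {
        "$vectorSearch": {
            "index": "vector_index",  # assumed index name
            "path": "embedding",
            # Pre-filter: applied before the similarity search runs.
            "filter": {"pages": {"$lte": max_pages}},
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": limit,
        }
    }


# Placeholder query vector; a real one comes from the embedding model.
stage = build_prefiltered_stage([0.1, 0.2], max_pages=400)
print(stage["$vectorSearch"]["filter"])
```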
Last modified on 2025-12-04