Understanding the infrastructure giants powering modern technology
A ride-sharing app's stack in one line: Kafka streams live driver locations → MongoDB stores ride history → Spark calculates surge pricing → Elasticsearch powers "find my ride" search
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("feature-scoring").getOrCreate()
# Distributed processing across the cluster: parallelize the list,
# map the scoring function over each partition, collect to the driver
features_rdd = spark.sparkContext.parallelize(features)   # features: any Python list
results = features_rdd.map(score_feature).collect()       # score_feature: a plain function
# Hadoop MapReduce (writes to disk between stages): ~6 hours to process 1TB
# Spark (keeps intermediate data in memory): ~10 minutes for the same 1TB
Before Kafka: App → Database → consumers poll for changes (60-second delay)
After Kafka: App → Kafka → events pushed to consumers (millisecond latency)
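A minimal producer sketch using the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions, not part of any real deployment:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Subscribed consumers see this event within milliseconds,
# instead of discovering it on the next 60-second database poll
producer.send("driver-locations", {"driver_id": 42, "lat": 40.71, "lon": -74.01})
producer.flush()
```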
// No fixed schema required - just store JSON-like documents
db.users.insertOne({
  _id: 1,
  name: "John",
  preferences: { theme: "dark" },
  // Add new fields anytime - no migration needed
  devices: ["mobile", "web"]
})
SQL LIKE '%term%' query: can't use a regular index, so it scans the entire table - minutes on large data
Elasticsearch: inverted indexes map terms straight to matching documents - ~100ms even at 1TB
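A minimal full-text search sketch with the official elasticsearch Python client (v8-style API); the "rides" index and "destination" field are hypothetical:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
# A match query is answered from the inverted index: look up the term,
# get the IDs of matching documents - no row-by-row scan
resp = es.search(index="rides", query={"match": {"destination": "airport"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```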
| Strategy | Description | Example |
|---|---|---|
| Open Core | Free open-source core + paid enterprise features | MongoDB free → MongoDB Enterprise $$$ |
| Land & Expand | Developer uses free → Company pays millions | Year 1: $0 → Year 5: $1M+/year |
| Cloud Hosting | Managed service in the cloud | MongoDB Atlas, Databricks clusters |
| Usage-Based | Pay per compute/storage | Databricks: $1-2/machine-hour |
Build in-house: 5 engineers × $200K/year = $1M/year, plus infrastructure costs
Buy the service: $100-500K/year, minimal engineering overhead
Winner: at scale, buying is usually cheaper than building and maintaining your own
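The same back-of-envelope math in code; every figure is the illustrative estimate from above, not a vendor quote:

```python
# Build: fully-loaded engineer cost dominates; infrastructure is extra
engineers, cost_per_engineer = 5, 200_000
build_per_year = engineers * cost_per_engineer  # $1,000,000/year + infra

# Buy: the managed-service price band from the comparison above
buy_low, buy_high = 100_000, 500_000
print(f"Build: ${build_per_year:,}/yr + infra vs Buy: ${buy_low:,}-${buy_high:,}/yr")
```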
PySpark on your resume: ✅ you actually implemented distributed feature engineering
Databricks on your resume: 🤔 debatable - you ran PySpark, not the Databricks platform
Understanding the ecosystem: ✅ makes you stronger in architecture interviews
Your FX system (2GB of data): PySpark running locally is perfect
Enterprise workloads (100TB+ of data): Databricks makes sense
Real-time events: Kafka is the industry standard
Flexible schemas: MongoDB wins
Search and logs: Elasticsearch is the default choice
These four companies (Confluent for Kafka, MongoDB, Databricks for Spark, and Elastic for Elasticsearch) represent $73B in combined market value, powering the infrastructure behind nearly every major tech company and processing petabytes of data daily.