Horizontal vs. Vertical Scaling in Modern Data Engines
As the Automotive Dataset grew from tens of thousands of records to hundreds of millions, we inevitably reached the harsh physical limits of vertical scaling. Initially, our strategy was simple: when the database gets slow, rent a bigger server. However, throwing more RAM and CPU cores at a monolithic database eventually becomes an equation with severe diminishing returns and exorbitant cloud costs.
The Breaking Point of Monoliths
At a certain threshold, the hardware upgrades simply stop yielding performance gains. Lock contention on high-frequency write tables during our nightly NMVTIS ingestion runs began causing massive latency spikes for read queries on the public API. We realized that our monolithic Postgres instance was becoming a single point of failure that could cripple the entire platform.
The Transition to Horizontal Sharding
We abandoned monolithic structures in favor of horizontal scaling. By sharding our databases based on regional lookup frequencies and specific VIN hash ranges, we distributed the compute across a wider array of smaller, specialized nodes. This dramatically reduced lock-contention on write operations. A massive bulk insert in the "European Market" shard now has zero impact on the read performance of the "North American" shard.
Embracing Eventual Consistency
Distributed databases introduce the CAP theorem problem. We cannot have absolute consistency, high availability, and partition tolerance simultaneously. Not all data needs to be strictly ACID compliant across every node instantaneously. By embracing eventual consistency for non-critical reads (such as historical market averages), we massively accelerated the availability of our API layer. Critical financial transactions remain strictly consistent, but read-heavy analytics utilize asynchronous read-replicas.
Conclusion
Horizontal scaling is significantly harder to architect than vertical scaling, requiring complex orchestration and robust application-layer logic to handle routing. However, it is the only viable path to truly infinite scale and cost-effective database management.
