Scaling Your Database with Sharding

With 18+ years of experience building enterprise tech solutions, I know: as data volumes grow, scaling traditional relational databases can quickly become a bottleneck—you must get creative to manage them at scale. Here’s the concept Sharding comes in—an essential partitioning technique that horizontally scales databases by distributing data across multiple servers or partitions, known as shards. However, not all databases provide built-in sharding capabilities. This tech concept dives into how to implement sharding manually, examples of manual sharding in databases without native support, and a comparison of databases that offer built-in sharding along with their use cases.

What is Sharding?

Sharding is the process of breaking up a large dataset into smaller, more manageable parts, called shards, which are distributed across multiple databases or servers. Each shard holds a subset of the data, and together they form the complete dataset. By distributing the load across multiple machines, sharding improves performance and allows systems to scale horizontally.

Manual Sharding: How to Implement It

For databases without built-in sharding support (like MySQL, PostgreSQL), you can implement sharding manually. The core idea is to partition data across multiple databases based on a shard key.

Step-by-Step Process for Manual Sharding

Choose a Shard Key:
The shard key determines how data is distributed across shards. Common shard keys include:

User ID (e.g., users 1-1000 in shard 1, users 1001-2000 in shard 2)
Geographical location (e.g., users from the North region in one shard, West region in another and so on.)
Date ranges (e.g., transactions from 2020 in shard 1, from 2021 in shard 2) It’s critical to choose a shard key that evenly distributes the data across shards to avoid hotspots.

Design Shard Databases:
For each shard, create a separate database or schema. Each shard will hold a subset of your data. Example: Partitioning users table by user ID.

Database 1: users_1_to_1000
Database 2: users_1001_to_2000

   -- Creating shard tables manually
   CREATE TABLE users_1_to_1000 (
       user_id INT PRIMARY KEY,
       user_name VARCHAR(255),
       email VARCHAR(255)
   );

   CREATE TABLE users_1001_to_2000 (
       user_id INT PRIMARY KEY,
       user_name VARCHAR(255),
       email VARCHAR(255)
   );

Implement Application-Level Shard Routing:
The application needs logic to route read/write requests to the correct shard based on the shard key. Example (Python):

   def get_shard(user_id):
       if user_id <= 1000:
           return 'users_1_to_1000'
       else:
           return 'users_1001_to_2000'

   def get_user(user_id):
       shard = get_shard(user_id)
       query = f"SELECT * FROM {shard} WHERE user_id = {user_id};"
       # Execute the query and return result
       return execute_query(query)

Handle Cross-Shard Queries:
Queries that span multiple shards (like an aggregated report across all users) require querying all shards and combining the results at the application level. Example: Querying all users:

   def get_all_users():
       users_shard_1 = execute_query("SELECT * FROM users_1_to_1000;")
       users_shard_2 = execute_query("SELECT * FROM users_1001_to_2000;")
       return users_shard_1 + users_shard_2

Managing Shard Growth:
As data grows, you may need to add new shards or redistribute data across existing shards. This could involve migrating users from one shard to another, updating routing logic, or adding new database servers.

Example: Manual Sharding in MySQL

Suppose you have a large users table, and you decide to shard it by user_id.

Shard Setup:

Shard 1: Users with user_id 1-1000
Shard 2: Users with user_id 1001-2000

Create two separate databases for each shard:

-- Database for Shard 1
CREATE DATABASE users_shard_1;
USE users_shard_1;
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    user_name VARCHAR(255),
    email VARCHAR(255)
);

-- Database for Shard 2
CREATE DATABASE users_shard_2;
USE users_shard_2;
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    user_name VARCHAR(255),
    email VARCHAR(255)
);

The application routes queries to the correct database based on the user’s user_id.

Databases with Native Sharding Support

Several databases provide built-in sharding, automating the process and relieving developers from manually implementing and managing shards.

1. MongoDB

MongoDB has native sharding support (modern day DB marvel), making it a go-to NoSQL database for distributed data storage. With MongoDB, sharding is managed automatically by selecting a shard key and enabling sharding on collections.

Use Case:

Applications with large datasets that need to scale horizontally.
E-commerce, social networks, and content management systems.

Example:

// Enable sharding on a collection
sh.shardCollection("mydb.users", { user_id: 1 });

2. Cassandra

Cassandra is a distributed NoSQL database that automatically shards data across its nodes using consistent hashing. It’s ideal for high availability and horizontal scalability.

Use Case:

Systems requiring fast writes and high availability, such as IoT systems or messaging platforms.

Example:

CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    user_name TEXT,
    email TEXT
) WITH CLUSTERING ORDER BY (user_id ASC);

3. Google Cloud Spanner

Google Cloud Spanner provides automatic sharding and horizontal scaling for relational databases. It allows users to scale SQL databases across multiple regions without manual shard management.

Use Case:

Applications needing global scale with strong consistency, such as financial services or global inventory systems.

Example:

CREATE TABLE Users (
    UserId INT64 NOT NULL,
    UserName STRING(100),
    Email STRING(100)
) PRIMARY KEY(UserId);

4. CockroachDB

CockroachDB is a distributed SQL database that automatically handles sharding. It’s designed for global scalability and transactional consistency.

Use Case:

Applications requiring transactional consistency and horizontal scaling, such as SaaS platforms or financial systems.

Example:

CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    user_name STRING,
    email STRING
);

My TechAdvice: While databases like MySQL and PostgreSQL require manual sharding, several modern databases come with built-in sharding capabilities, simplifying the process of scaling large datasets. MongoDB, Cassandra, Google Cloud Spanner, and CockroachDB offer automated sharding, each suitable for different use cases depending on data access patterns, scale, and consistency requirements.
#AskDushyant

#TechConcept #DataBase #DataTech

What is Sharding?

Manual Sharding: How to Implement It

Step-by-Step Process for Manual Sharding

Example: Manual Sharding in MySQL

Shard Setup:

Databases with Native Sharding Support

1. MongoDB

2. Cassandra

3. Google Cloud Spanner

4. CockroachDB

Section

Tags

Leave a Reply Cancel reply