Handling Real-Time Data Stream Processing with Kafka, Hadoop, and NoSQL

Real-time data streaming is transforming how businesses process and analyze information. With technologies like Apache Kafka, Hadoop, and NoSQL databases, you can build powerful, scalable systems to handle real-time data streams. With 20 years of experience driving tech excellence, I’ve redefined what’s possible for organizations, unlocking innovation and building solutions that scale effortlessly. My guidance empowers businesses to embrace transformation and achieve lasting success. In this tech concept, we’ll walk through a use case, explain the technology stack, and guide you through building a real-time data processing pipeline.

What Is Real-Time Data Streaming?

Real-time data streaming refers to the continuous processing of data as it is generated. Unlike traditional batch processing, real-time streaming ensures immediate analysis and decision-making.

Why Use Kafka, Hadoop, and NoSQL for Real-Time Data?

  • Kafka: A distributed message broker that ingests real-time data with high throughput.
  • Hadoop: A framework for distributed storage and large-scale batch data processing.
  • NoSQL: Databases like MongoDB and Cassandra provide fast data storage and querying capabilities for real-time analytics.

Use Case: Real-Time Clickstream Data Analysis

Scenario

Imagine an e-commerce or financial tech platform that wants to monitor user behavior in real time. By analyzing clickstream data, the platform can:

  • Deliver personalized product recommendations.
  • Detect anomalies, such as fraudulent activity or bot behavior.
  • Instantly measure the success of marketing campaigns.

Building a Real-Time Data Processing Pipeline

Step 1: Ingest Real-Time Data with Apache Kafka

Kafka acts as the backbone of the pipeline by collecting and distributing real-time clickstream data.

Setting Up Kafka

  1. Create Kafka Topics
kafka-topics.sh --create --topic clickstream-data --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3
  2. Send Data to Kafka
    Use a Python producer to simulate sending real-time user data:
from kafka import KafkaProducer
import json

# Serialize each event dict to JSON bytes before sending it to the broker
producer = KafkaProducer(bootstrap_servers='localhost:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Example clickstream event for a product view
data = {"user_id": "1234", "action": "view_product", "timestamp": "2024-12-29T12:30:45Z"}
producer.send('clickstream-data', value=data)
producer.flush()  # make sure the event is delivered before closing
producer.close()
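
To confirm that events are reaching the topic, you can read them back with the console consumer that ships with Kafka (topic name and broker address match the local setup above):

kafka-console-consumer.sh --topic clickstream-data --bootstrap-server localhost:9092 --from-beginning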

Step 2: Process Data with Hadoop

Hadoop complements Kafka by analyzing historical clickstream data stored in HDFS. It’s ideal for generating insights from long-term trends.
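
The MapReduce job below expects the raw events to already be sitting in HDFS. A minimal sketch of that hand-off, assuming the kafka-python package and the Hadoop CLI are installed (a production pipeline would more likely use a connector such as Kafka Connect), drains the topic into a newline-delimited JSON file and then copies it into the job’s input directory:

from kafka import KafkaConsumer

# Read the raw clickstream events written by the producer above
consumer = KafkaConsumer(
    'clickstream-data',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=10000  # stop iterating after 10s with no new messages
)

# Append each event as one JSON line; the file is then copied into HDFS
with open('clickstream-batch.json', 'a') as out:
    for message in consumer:
        out.write(message.value.decode('utf-8') + "\n")

consumer.close()

# Copy the batch file into the HDFS input path used by the MapReduce job
hdfs dfs -put clickstream-batch.json /clickstream-data/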

Processing Data with Hadoop Streaming

  1. Run a MapReduce job to count clicks per user:
hadoop jar /path/to/hadoop-streaming.jar \
  -input hdfs:///clickstream-data/ \
  -output hdfs:///processed-data/ \
  -mapper "python3 /path/to/mapper.py" \
  -reducer "python3 /path/to/reducer.py" \
  -file /path/to/mapper.py \
  -file /path/to/reducer.py
  2. Mapper Script (mapper.py)
import sys
import json

def mapper():
    # Reading input from stdin (streaming input)
    for line in sys.stdin:
        try:
            # Parse each line as a JSON object
            data = json.loads(line.strip())

            # Extract user_id and action from the Kafka event
            user_id = data.get('user_id')
            action = data.get('action')

            if user_id and action:
                # Emit the user_id as the key and action as the value
                print(f"{user_id}\t{action}")
        
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON
            continue

if __name__ == "__main__":
    mapper()
  3. Reducer Script (reducer.py)
import sys

def reducer():
    current_user = None
    action_count = {}

    for line in sys.stdin:
        # Split each input line into user_id and action
        user_id, action = line.strip().split("\t")

        if current_user == user_id:
            # If the same user, increase the count of the action
            if action in action_count:
                action_count[action] += 1
            else:
                action_count[action] = 1
        else:
            # New user, print the current user's action counts and reset
            if current_user:
                for action, count in action_count.items():
                    print(f"{current_user}\t{action}\t{count}")
            
            # Reset for new user
            current_user = user_id
            action_count = {action: 1}

    # Print the final user data
    if current_user:
        for action, count in action_count.items():
            print(f"{current_user}\t{action}\t{count}")

if __name__ == "__main__":
    reducer()
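
Because Hadoop Streaming pipes records through stdin and stdout, both scripts can be tested locally before submitting the job. Assuming a small sample file with one JSON event per line (sample-events.json is a hypothetical name), the shell’s sort stands in for Hadoop’s shuffle-and-sort phase:

cat sample-events.json | python3 mapper.py | sort | python3 reducer.py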

Step 3: Store and Query Data with NoSQL

Processed data is stored in a NoSQL database like MongoDB for fast querying and dashboard visualization.

Storing Data in MongoDB

  1. Insert Processed Data
from pymongo import MongoClient
import sys

# Connect to MongoDB
client = MongoClient('localhost', 27017)
db = client['clickstream_db']
collection = db['user_clicks']

# Reading output from Hadoop streaming (sys.stdin)
for line in sys.stdin:
    try:
        # Split the Hadoop output into user_id, action, and count
        user_id, action, count = line.strip().split("\t")
        
        # Convert count to an integer
        count = int(count)
        
        # Prepare the data to be inserted
        data = {
            "user_id": user_id,
            "click_count": count,
            "last_action": action
        }
        
        # Insert the data into MongoDB
        collection.insert_one(data)
        
    except ValueError:
        # Skip malformed lines or errors
        continue

Run the script above after the Hadoop job completes, piping in its output:

# After running the hadoop job, insert the output into MongoDB 
cat /path/to/hadoop-output/result/part-00000 | python /path/to/mongo_insert.py
  2. Query Insights
# Find user/action pairs with more than 10 recorded clicks
for user in collection.find({"click_count": {"$gt": 10}}):
    print(user)
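
Because each stored document holds a count per user and action, MongoDB’s aggregation framework can roll the results up for the dashboards mentioned earlier. A minimal sketch, reusing the collection handle from the insert script above, that totals clicks per action:

# Total clicks per action across all users, highest first
pipeline = [
    {"$group": {"_id": "$last_action", "total_clicks": {"$sum": "$click_count"}}},
    {"$sort": {"total_clicks": -1}}
]

for row in collection.aggregate(pipeline):
    print(row)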

Integration Workflow

  • Kafka: Captures and distributes real-time data.
  • Hadoop: Processes historical data for batch analysis.
  • NoSQL: Stores and serves processed data for fast querying.

Benefits of This Architecture

  • Scalability: Handle massive data volumes efficiently.
  • Low Latency: Process data with minimal delays.
  • Flexibility: Adapt the pipeline for advanced analytics or machine learning.

Challenges and Solutions

  • Data Duplication: Track Kafka consumer offsets and make downstream writes idempotent so replayed messages are not counted twice (see the sketch after this list).
  • Fault Tolerance: Enable replication across Kafka, Hadoop, and NoSQL for system reliability.
  • Query Performance: Use indexing in NoSQL for faster query response times.
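
A minimal MongoDB sketch of the last two points, reusing the user_clicks collection from Step 3 (the key and count values shown are illustrative): an index on click_count speeds up the range query above, and an upsert keyed on user and action makes re-runs of the insert script idempotent:

from pymongo import MongoClient, DESCENDING

client = MongoClient('localhost', 27017)
collection = client['clickstream_db']['user_clicks']

# Index the field used by the click_count range query
collection.create_index([("click_count", DESCENDING)])

# Idempotent write: re-processing the same Hadoop output updates the
# existing document instead of inserting a duplicate
collection.update_one(
    {"user_id": "1234", "last_action": "view_product"},  # example keys
    {"$set": {"click_count": 42}},  # illustrative count
    upsert=True
)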

My Tech Advice: As data scales with your applications, businesses must evolve and adopt new technologies to meet their changing requirements. Building a real-time data processing system with Kafka, Hadoop, and NoSQL enables businesses to unlock actionable insights from their data streams. By following this architecture, you can process data at scale, detect trends in real time, and drive data-informed decisions. Embrace this powerful technology stack to enhance your data capabilities and stay ahead in the digital era.

#AskDushyant
Note: The examples and pseudo code are for illustration only. You must modify and experiment with the concept to meet your specific needs.
#TechConcept #TechAdvice #Hadoop #NoSQL  #Kafka #Streaming #DataTech
