This project uses Kafka to retrieve and process AAPL (Apple) stock market data from the Yahoo Finance API in real time. The data is extracted, cleaned, processed, and consumed through a producer/consumer pipeline.
Screenshots of the captured data are in the screenshot folder. The code folder contains the following files:
- test_api.py → retrieves real-time stock data for Apple (AAPL) over a one-day period with a 1-minute interval, then displays the closing price and volume for each minute. If an error occurs, it shows an error message.
- extract_clean_data.py → fetches intraday stock data for Apple (AAPL) from Yahoo Finance, cleans the data (extracting timestamp, closing price, and volume), and sends the cleaned data to a Kafka topic for real-time streaming.
- data_producer.py → simulates real-time Apple (AAPL) stock data by generating random price and volume changes, then sends this data to a Kafka topic every second.
- visualize_data.py → creates visualizations of Apple (AAPL) stock data.
- data_visualizer.py → consumes stock data from Kafka, processes it, and visualizes closing prices and volumes in real-time.
- data_processor.py → consumes data from a Kafka topic, processes the closing prices to calculate averages, and handles errors with logging.
- sliding_window_processor.py → processes Kafka data with a sliding window, allowing the user to choose between two configurations: a 10-minute window with a 1-minute slide or a 30-minute window with a 5-minute slide. It calculates average closing prices and volumes, then visualizes the results.
'test_api.py'
- Fetches Data: Retrieves real-time stock data for Apple (AAPL) from Yahoo Finance with a 1-minute interval.
- Displays Data: Extracts and prints the closing price and volume for each minute.
- Error Handling: Catches and displays errors if no data is retrieved or if there's an issue with the API request.
'extract_clean_data.py'
- Fetch Intraday Data: It retrieves intraday stock data for Apple (AAPL) with a 1-minute interval from Yahoo Finance using yfinance.
- Clean Data: It processes the raw data, extracting the timestamp, closing price, and volume, and stores them in a clean format.
- Send to Kafka: It sends the cleaned data to a Kafka topic, using the Confluent Kafka Producer to push the data into the configured Kafka cluster.
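The fetch–clean–produce flow can be sketched as follows. This is a minimal sketch, not the project's actual code: the topic name `stock_data` and the placeholder bootstrap server are assumptions (the real values come from the .env configuration), and authentication settings are omitted.

```python
import json

import pandas as pd


def clean(df: pd.DataFrame) -> list[dict]:
    """Keep only the timestamp, closing price, and volume from raw yfinance data."""
    records = []
    for ts, row in df.iterrows():
        records.append({
            "timestamp": ts.isoformat(),
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
        })
    return records


def main() -> None:
    # Imported here so the cleaning helper stays usable without these packages.
    import yfinance as yf
    from confluent_kafka import Producer

    # Fetch one day of 1-minute AAPL bars (requires network access).
    # Depending on the yfinance version, columns may be a MultiIndex and
    # need flattening before clean() is applied.
    raw = yf.download("AAPL", period="1d", interval="1m", progress=False)

    # Placeholder address; real deployments also need the SASL credentials.
    producer = Producer({"bootstrap.servers": "<KAFKA_BOOTSTRAP_SERVERS>"})
    for record in clean(raw):
        producer.produce("stock_data", value=json.dumps(record).encode("utf-8"))
    producer.flush()  # block until all queued messages are delivered


if __name__ == "__main__":
    main()
```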
'data_producer.py'
- Simulates Data: It generates random variations for stock price and volume for Apple (AAPL) to simulate real-time market data.
- Sends Data to Kafka: The simulated data is sent to a Kafka topic every second, including stock details like open, close, high, low, and volume.
- Runs Continuously: It continuously produces data until interrupted, with each message sent to Kafka containing a timestamp and stock details.
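The simulation loop might look like the sketch below. The random-walk step sizes, the starting price, and the topic name `stock_data` are assumptions for illustration, not values from the actual script.

```python
import json
import random
import time
from datetime import datetime, timezone


def next_tick(last_close: float) -> dict:
    """Simulate one tick of AAPL trading as a small random walk around the last close."""
    open_price = last_close
    close = round(open_price + random.uniform(-0.5, 0.5), 2)
    high = round(max(open_price, close) + random.uniform(0, 0.2), 2)
    low = round(min(open_price, close) - random.uniform(0, 0.2), 2)
    return {
        "symbol": "AAPL",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "open": open_price,
        "close": close,
        "high": high,
        "low": low,
        "volume": random.randint(50_000, 250_000),
    }


def main() -> None:
    from confluent_kafka import Producer  # requires a reachable Kafka cluster

    producer = Producer({"bootstrap.servers": "<KAFKA_BOOTSTRAP_SERVERS>"})
    last_close = 256.0  # arbitrary starting price
    try:
        while True:  # runs until interrupted (Ctrl+C)
            tick = next_tick(last_close)
            last_close = tick["close"]
            producer.produce("stock_data", value=json.dumps(tick).encode("utf-8"))
            producer.poll(0)       # serve delivery callbacks
            time.sleep(1)          # one message per second
    except KeyboardInterrupt:
        producer.flush()


if __name__ == "__main__":
    main()
```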
'visualize_data.py'
- Extracts Data: It extracts average closing prices and volumes from the processed data.
- Plots Two Graphs: It generates two plots: one for the average closing price and another for the average volume over time.
- Displays the Graphs: The graphs are displayed with proper labels, titles, and legends, improving readability with rotated timestamps.
'data_visualizer.py'
- Consumes Real-Time Data: It consumes stock data from a Kafka topic in real time.
- Extracts Relevant Information: It extracts timestamps, closing prices, volumes, and moving averages from the data.
- Visualizes the Data: It generates two plots, one for the closing prices with a moving average and another for transaction volumes, with clear labels and titles.
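A two-panel plot of this kind can be sketched with Matplotlib as below. The function name, the 5-point moving-average default, and the column names `close`/`volume` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; interactive use can drop this line
import matplotlib.pyplot as plt
import pandas as pd


def plot_prices_and_volumes(df: pd.DataFrame, window: int = 5) -> plt.Figure:
    """Plot closing prices with a moving average, and volumes, on two subplots."""
    ma = df["close"].rolling(window).mean()

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)

    # Top panel: closing price plus its moving average.
    ax1.plot(df.index, df["close"], label="Close")
    ax1.plot(df.index, ma, label=f"{window}-point moving average")
    ax1.set_title("AAPL closing price")
    ax1.set_ylabel("USD")
    ax1.legend()

    # Bottom panel: transaction volume as bars (width is in days for date axes).
    ax2.bar(df.index, df["volume"], width=0.0005, label="Volume")
    ax2.set_title("AAPL transaction volume")
    ax2.set_ylabel("Shares")
    ax2.legend()

    fig.autofmt_xdate()  # rotate timestamps for readability
    return fig
```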
'sliding_window_processor.py'
Processes real-time data from Kafka by applying a sliding window. The user can choose between two window and slide configurations:
- Option 1: a 10-minute window with a 1-minute slide.
- Option 2: a 30-minute window with a 5-minute slide (smoothing the noise and focusing on trends, while balancing responsiveness with performance).
These window and slide lengths make the sliding-window approach more efficient for large-scale analysis of high-frequency, time-sensitive data such as stock market prices and financial data streams.
- Configures Window and Slide: The code offers two options for window length and slide size: 10-minute window with a 1-minute slide or 30-minute window with a 5-minute slide.
- Consumes Data from Kafka: It retrieves stock data from a Kafka topic, processes it in a sliding window, and calculates average closing prices and volumes.
- Visualizes Results: The processed data is visualized by plotting average closing prices and volumes over time using Matplotlib.
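The windowed aggregation itself can be sketched with Pandas as below: each window of `window_min` minutes is averaged, and the window start advances by `slide_min` minutes, so windows overlap whenever the slide is shorter than the window (as in both configurations above). The function name and column names are assumptions for illustration.

```python
import pandas as pd


def sliding_window_averages(df: pd.DataFrame, window_min: int, slide_min: int) -> pd.DataFrame:
    """Average close and volume over a window of `window_min` minutes,
    recomputed every `slide_min` minutes (overlapping when slide < window)."""
    window = pd.Timedelta(minutes=window_min)
    slide = pd.Timedelta(minutes=slide_min)

    rows = []
    start = df.index.min()
    # Advance the window until it would extend past the last observed minute.
    while start + window <= df.index.max() + pd.Timedelta(minutes=1):
        chunk = df[(df.index >= start) & (df.index < start + window)]
        if not chunk.empty:
            rows.append({
                "window_end": start + window,
                "avg_close": chunk["close"].mean(),
                "avg_volume": chunk["volume"].mean(),
            })
        start += slide
    return pd.DataFrame(rows).set_index("window_end")
```

For the project's two configurations this would be called as `sliding_window_averages(df, 10, 1)` or `sliding_window_averages(df, 30, 5)`.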
Create a virtual environment and install the required dependencies:
- Python 3.12.8
- Yahoo Finance API
- Python libraries: confluent_kafka, yfinance, pandas, matplotlib, etc.
- Confluent Cloud (Apache Kafka 3.9.0): Managed cloud service for real-time data ingestion and stream management, ensuring high availability and optimal scalability. Confluent Cloud: https://www.confluent.io/
- Python 3.x: Main environment for developing data processing and analysis scripts.
- Pandas: For data processing and analysis after ingestion from Kafka, particularly with sliding window techniques.
- Matplotlib: For visualizing the results of data stream processing.
- VS Code: Recommended development environment for writing and managing Python scripts.
- Java 11 or 17: Required for specific libraries and compatibility with backend components if needed.
- IntelliJ IDEA Community Edition (optional, if specific Java developments are planned).
The messages JSON file contains stock data messages sent via Kafka from Confluent Cloud. These messages pertain to AAPL (Apple) stock and include financial information such as the opening price, closing price, high and low prices, and trading volume. Each message also carries Kafka metadata such as the partition, offset, and creation timestamp. This data can be used in a real-time processing project to analyze or visualize the evolution of AAPL stock prices.
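An illustrative payload of the shape described above might look like the following (the field names and values are assumptions, not taken from the actual topic):

```json
{
  "symbol": "AAPL",
  "timestamp": "2025-01-02T14:30:00Z",
  "open": 256.50,
  "high": 256.90,
  "low": 256.20,
  "close": 256.74,
  "volume": 161144
}
```

The Kafka partition, offset, and creation timestamp are record metadata delivered alongside this payload by the consumer, not fields inside it.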
The .env.template file is used for storing and managing environment variables to configure connections and authentication in your scripts. It contains the necessary information to connect to a Kafka cluster on Confluent Cloud, including:
- The topic name (KAFKA_TOPIC) where data will be sent or read from.
- The Kafka server address (KAFKA_BOOTSTRAP_SERVERS), which allows connection to the Kafka service.
- The API key and API secret (KAFKA_API_KEY and KAFKA_API_SECRET), used to authenticate and establish the connection to the Kafka cluster.
This file helps centralize sensitive information in a separate file, making it easy to use in the scripts without exposing it directly in the source code. It also serves as a template for securely and flexibly configuring a development or production environment.
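In the scripts, these variables can be read from the environment and turned into a client configuration. The sketch below uses a hand-rolled loader so it has no extra dependency (in practice the python-dotenv package does the same job more robustly); the SASL settings shown are the standard ones for Confluent Cloud.

```python
import os


def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines become environment variables.
    (The python-dotenv package does this more robustly.)"""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())


def kafka_config() -> dict:
    """Build a confluent_kafka client configuration from environment variables."""
    return {
        "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP_SERVERS"],
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": os.environ["KAFKA_API_KEY"],
        "sasl.password": os.environ["KAFKA_API_SECRET"],
    }
```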
Adding the .env file (created from .env.template and filled with the real Kafka API keys and server address) to the .gitignore ensures that these credentials are never tracked or pushed to GitHub; the template itself, which contains only placeholder values, can safely be committed.
Open a terminal in the project's root folder and run python <script_name>.py for each script you want to execute, for example: python test_api.py
--> Note:
In this project, Confluent Cloud replaces a local installation of Apache Kafka, enabling more reliable ingestion and simplified stream management. Processing is primarily handled by Pandas for analysis and Matplotlib for result visualization.
In this project, Flink (which is often used for real-time stream processing) has been replaced by:
- Kafka: For real-time data ingestion. Kafka is used for publishing and consuming messages, as seen in the code with the Kafka consumer.
- Pandas: For data processing and applying business logic, as in the sliding window processing scripts (sliding_window_processor.py), where data is stored in DataFrames and analyzed using Pandas.
In summary, Flink has been replaced by a combination of Kafka for real-time data stream management and Pandas for data processing and analysis.
Data retrieved successfully:
AAPL: $258.45 Volume: 982,636
AAPL: $258.26 Volume: 125,413
AAPL: $257.37 Volume: 179,116
AAPL: $257.22 Volume: 164,265
AAPL: $256.98 Volume: 137,277
AAPL: $256.91 Volume: 122,334
AAPL: $256.74 Volume: 161,144
AAPL: $257.02 Volume: 134,728
AAPL: $256.20 Volume: 227,341
AAPL: $256.23 Volume: 165,086
AAPL: $256.62 Volume: 120,419
AAPL: $256.58 Volume: 110,368
AAPL: $256.65 Volume: 82,143
AAPL: $256.61 Volume: 108,719
AAPL: $256.44 Volume: 69,601
AAPL: $256.69 Volume: 96,514
AAPL: $256.88 Volume: 94,678
AAPL: $256.92 Volume: 80,673
AAPL: $256.85 Volume: 82,001
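Output like the listing above can be produced by a short fetch-and-print loop in the spirit of test_api.py. This is a sketch under assumptions: the exact formatting helper is invented here, and depending on the yfinance version the downloaded columns may be a MultiIndex that needs flattening first.

```python
def format_row(price: float, volume: int) -> str:
    """Render one minute of data in the 'AAPL: $<close> Volume: <volume>' form shown above."""
    return f"AAPL: ${price:.2f} Volume: {volume:,}"


def main() -> None:
    import yfinance as yf  # network access required

    # One day of 1-minute AAPL bars.
    data = yf.download("AAPL", period="1d", interval="1m", progress=False)
    if data.empty:
        print("Error: no data retrieved")
        return
    print("Data retrieved successfully:")
    for _, row in data.iterrows():
        print(format_row(float(row["Close"]), int(row["Volume"])))


if __name__ == "__main__":
    main()
```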
...

This project demonstrates a real-time stock data processing pipeline using Kafka (via Confluent Cloud) for data ingestion and Pandas for processing. We successfully streamed AAPL stock data from the Yahoo Finance API, processed it in real time with Pandas, and visualized the results.
Key highlights of the project include:
Real-Time Data Stream with Confluent Cloud: We successfully streamed real-time stock market data into Kafka hosted on Confluent Cloud, enabling a scalable and reliable flow of updated stock prices and trading volumes.
Efficient Data Processing with Pandas: Using Pandas, we applied sliding window techniques and aggregation to efficiently process incoming stock data.
Data Visualization: Processed data was visualized using Matplotlib, providing clear insights into stock trends and trading activity.
In conclusion, this project showcases how Kafka (via Confluent Cloud) and Pandas can work together to build a scalable, real-time financial data processing and visualization system.
Student: Samar Krimi