What is Sentiment Analysis?
Sentiment analysis is a fascinating tool that decodes the emotional undercurrents in written or spoken language. It’s like having a superpower that allows you to gauge whether people are feeling positive, negative, or neutral about a particular topic. By leveraging advanced technologies such as machine learning and natural language processing, sentiment analysis transforms raw text into valuable emotional insights.
Enter the game-changing world of Large Language Models (LLMs). These sophisticated AI tools have revolutionized sentiment analysis by rapidly processing and interpreting text, converting complex emotions into quantifiable data. This breakthrough has significantly enhanced the speed and efficiency of sentiment analysis, opening up new possibilities for its application.

In this article, we’ll explore cutting-edge libraries and tools, with a special focus on Llama3.1:8b. We’ll analyze recent financial news articles, extract their sentiment scores, and compare these insights with actual stock market data. Join us as we uncover how the collective sentiment surrounding stocks can provide valuable insights into market trends and potentially influence investment decisions.
Benchmarking Our Sentiment Analysis Model: Kaggle Sentiment Analysis for Finance Data
Before we unleash our sentiment analysis model on the wild world of financial news, it’s crucial to put it through its paces. To do this, we’re turning to a gold-standard dataset from Kaggle, specifically designed for financial sentiment analysis. This rigorous evaluation will give us a clear picture of our model’s capabilities and areas for improvement.
The dataset we’re using is a treasure trove of financial sentiment information. It contains 4,844 carefully curated texts, each categorized as positive, neutral, or negative. You can find this valuable resource here: Kaggle Financial Sentiment Analysis Dataset.
Our goal? To see how our LLM model stacks up against human-labeled sentiment. It’s like a financial literacy test for our AI!
First things first, let’s get our hands on that juicy data. We’ll download the dataset from Kaggle and convert it into a pandas DataFrame for easy manipulation. Here’s a pro tip: If you encounter the dreaded ‘utf-8’ codec error, don’t panic! Simply add encoding="ISO-8859-1" to your pandas read_csv function. It’s like a magic spell that wards off encoding gremlins.
import pandas as pd
# data from kaggle : https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news
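# the file has no header row, so we name the two columns ourselves; the encoding avoids the 'utf-8' codec error
df = pd.read_csv('./data/FinancialPhraseBank/all-data.csv', engine='python', encoding='ISO-8859-1', header=None, names=['label', 'text'])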
# Take a peek at our data
print(df.head())
print(f"Total samples: {len(df)}")
With our data locked and loaded, we’re ready to put our sentiment analysis model to the test. In the next section, we’ll dive into the evaluation process and see just how well our AI can read the financial room!
Now, let’s dive into the fascinating world of prompt engineering – a crucial aspect of our sentiment analysis project. I’ve crafted a specialized prompt that serves as the backbone of our analysis, and I’m excited to walk you through its intricacies.
This carefully designed prompt is the secret sauce that enables our AI model to extract meaningful insights from financial news articles. It’s a perfect blend of clear instructions and targeted questions that guide the model to focus on what matters most in the context of stock market sentiment.
Let’s take a closer look at this prompt. It combines few-shot examples, chain-of-thought steps, and output structuring. Don’t panic: I will explain each of these in detail in a later section.
# prompt for the sentiment analysis
prompt = """You are a sentiment analyzer specialized in classifying sentiment of short financial texts.
Your task is to analyze the sentiment of the provided financial text and convert it into string format. Never include any other information or strings but output format.
Follow these steps and respond only in the specified output format:
# Step 1: Read the provided financial text carefully.
# Step 2: Assign a sentiment score between 0 and 1 based on financial perspective.
# Step 3: Do a sentimental analysis and classify it into positive, negative or neutral category and get the reason why in the financial perspective.
# Step 4: Convert the classification into the specified output format.
#### output format:
<sentimental analysis>
### Example
# Text : The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported
# Output : negative
# Text : Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .
# Output : neutral
# Text : 'With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .'
# Output : positive
# Text : Rinkuskiai 's beer sales fell by 6.5 per cent to 4.16 million litres , while Kauno Alus ' beer sales jumped by 6.9 per cent to 2.48 million litres.
# Output : neutral
"""
The output from our sentiment analysis model will be a concise string containing the sentiment label for the input text. This streamlined approach allows for easy integration into our broader analysis pipeline. Let’s look at the code that turns raw text into a sentiment label; it is explained in more detail in a later section.
## llm model
from langchain_community.chat_models import ChatOllama
from openai import OpenAI
import keyring
import pandas as pd
# sentiment analysis
def sentiment_analysis(prompt=prompt, content=None, model='llama'):
    # append the text to classify to the prompt
    query = prompt + "\n\n#### Text:\n\n" + content
    # getting model's response
    if model == 'llama':
        # local Llama 3.1 served through Ollama
        llm = ChatOllama(model='llama3.1')
        response = llm.invoke(query)
        return response.content
    else:
        # any other value is passed to OpenAI as the model name
        llm = OpenAI(api_key=keyring.get_password('openai', 'key_for_windows'))
        response = llm.chat.completions.create(
            model=model,
            messages=[
                {'role':'system', 'content':'You are a helpful assistant.'},
                {'role':'user', 'content':query}
            ]
        )
        return response.choices[0].message.content
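The model’s reply is free text, so it will not always match the dataset labels character for character (think "Negative." versus "negative"). Here is a minimal sketch of one way to map a reply onto the three labels before scoring. The helper name normalize_label and the "unknown" fallback are assumptions for illustration and are not part of the original code.

```python
import re

def normalize_label(reply: str) -> str:
    """Map a free-text model reply onto one of the three dataset labels."""
    # Collect every label that appears as a whole word, ignoring case.
    found = set(re.findall(r"\b(positive|negative|neutral)\b", reply.lower()))
    # Accept the reply only when it names exactly one label.
    return found.pop() if len(found) == 1 else "unknown"

print(normalize_label(" Negative.\n"))
```

Replies that mention several labels, or none, come back as "unknown", so they count as misses rather than being guessed at.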
To rigorously evaluate our sentiment analysis model, we’ll employ a robust testing methodology. Here’s our approach:
- Random Sampling: We’ll randomly select 100 samples from our dataset, ensuring a diverse range of financial texts.
- Iterative Testing: To account for variability, we’ll repeat this process 10 times.
- Accuracy Measurement: For each iteration, we’ll use the sklearn.metrics.accuracy_score function to quantify our model’s performance.
- Statistical Analysis: After completing all iterations, we’ll calculate the average accuracy score, providing a robust measure of our model’s overall effectiveness.
This comprehensive evaluation strategy will give us valuable insights into our sentiment analysis model’s strengths and areas for improvement, paving the way for more accurate financial predictions.
# randomly select 100 samples, repeated for 10 iterations
import random
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score
import warnings
import numpy as np
warnings.filterwarnings('ignore') # suppress warning messages
accuracy_list = []
for i in tqdm(range(10)):
    # draw 100 distinct row positions from the whole dataset
    numbers = random.sample(range(len(df)), 100)
    y_true_list = []
    y_pred_list = []
    for n in tqdm(numbers):
        y_true = df.iloc[n, 0]  # first column: human label
        y_pred = sentiment_analysis(content=df.iloc[n, 1])  # second column: text
        y_true_list.append(y_true)
        y_pred_list.append(y_pred)
    # accuracy for this iteration
    accuracy = accuracy_score(y_true_list, y_pred_list)
    accuracy_list.append(accuracy)
# average accuracy over the 10 iterations
print(np.average(accuracy_list))
The accuracy rate of approximately 46% may seem modest at first glance, but our language model outperformed uniform random guessing across the three classes, which would score about 33%. This performance is encouraging, especially considering we haven’t yet fine-tuned the model for this specific task. In a future post, we’ll delve into the process of model fine-tuning, which has the potential to significantly enhance our sentiment analysis accuracy.
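The 33% figure assumes the three labels are equally likely. If the labels are unevenly distributed, always predicting the most frequent label is a stricter baseline to compare against. Here is a minimal sketch of that check; the function name majority_baseline and the toy labels are assumptions for illustration only.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration; with the real data, pass the label column instead,
# for example list(df.iloc[:, 0]).
toy_labels = ["neutral", "neutral", "neutral", "positive", "negative"]
print(majority_baseline(toy_labels))  # ('neutral', 0.6)
```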
Gathering the News Data for Sentiment Analysis
For sentiment analysis in the stock market, we need access to recent, relevant documents. News articles are the gold standard for this purpose, offering timely insights into market sentiment. While you could manually crawl news websites using libraries like bs4 or selenium, there are more efficient options available.
Let’s explore two powerful libraries that streamline the process of gathering financial news:
- GDELT DOC (gdeltdoc): This library offers a simple yet effective way to analyze news coverage on a smaller scale. It’s perfect for our sentiment analysis project, but keep in mind that commercial use may require additional permissions from content providers.
- Newspaper3k (newspaper): A robust tool for extracting and parsing article content from URLs. It’s an excellent complement to GDELT DOC for retrieving full article text.
While other options exist, such as the unofficial Yahoo Finance library (yfinance) for ticker-specific news, we’ll focus on these two libraries for their versatility and ease of use.
Let’s start by installing GDELT DOC:
pip install gdeltdoc
GDELT DOC supports two primary query modes: ArtList for article searches and Timeline* for temporal analysis. These modes allow you to retrieve relevant news articles and track trends over time, providing a comprehensive view of media coverage on your chosen topics.
from gdeltdoc import GdeltDoc, Filters
f = Filters(
    keyword = 'climate change',
    start_date = '2020-01-01',
    end_date = '2023-12-29'
)
gd = GdeltDoc()
# search for articles matching the filters
articles = gd.article_search(f)
# get a timeline of the number of articles matching the filters
timeline = gd.timeline_search('timelinevol', f)
Next, let’s explore Newspaper3k, a powerful library for extracting and curating articles. This versatile tool will be instrumental in our sentiment analysis project. To get started, you can easily install Newspaper3k using pip:
pip install newspaper3k
Newspaper3k is a powerful library that simplifies the process of downloading and parsing articles from URLs. With just a few lines of code, you can extract valuable information such as the article text, authors, publish date, and even associated media. This makes it an invaluable tool for our sentiment analysis project, allowing us to efficiently gather and process large volumes of news content.
from newspaper import Article
url = 'https://url'
article = Article(url)
article.download()
# raw HTML of the downloaded page
article.html
# parse the HTML
article.parse()
# you can get the authors
article.authors
# you can get the publish date
article.publish_date
# you can get the text
article.text
# you can get the media: top image URL and embedded videos
article.top_image
article.movies
In our sentiment analysis project, we’ll harness the power of two robust libraries to efficiently gather and process financial news articles. The gdeltdoc library will serve as our primary tool for retrieving links related to specific keywords, providing us with a wealth of relevant news sources. To complement this, we’ll employ the newspaper library, which excels at extracting and parsing article content from these URLs.
This powerful combination allows us to streamline our data collection process, enabling us to focus on the core task of sentiment analysis. By leveraging these libraries, we can quickly amass a comprehensive dataset of recent financial news, setting the stage for insightful analysis of market sentiment trends.
# import libraries
from gdeltdoc import GdeltDoc, Filters
from newspaper import Article
from datetime import datetime
# getting keyword related links function
def get_news_links(keyword='apple', start_date='2020-01-01', end_date='today'):
    if end_date == 'today':
        end_date = datetime.now().strftime('%Y-%m-%d')
    f = Filters(
        keyword=keyword,
        start_date=start_date,
        end_date=end_date
    )
    gd = GdeltDoc()
    links = gd.article_search(f)
    return links
With these powerful tools at our disposal, we’ve unlocked the ability to effortlessly gather relevant article content using just keywords. This streamlined process opens up a world of possibilities for sentiment analysis in the stock market.

Imagine being able to instantly access a wealth of information on any company or market trend you’re interested in. Whether you’re tracking established tech giants like Apple or keeping an eye on emerging players like OpenAI, our keyword-based content retrieval system puts the latest news and insights at your fingertips.
This efficient data collection method sets the stage for our next exciting step: applying sentiment analysis to uncover valuable market insights. Stay tuned as we dive deeper into the fascinating world of AI-powered financial analysis!
It’s important to note that while these tools provide powerful data gathering capabilities, using this information for commercial purposes requires careful consideration. To ensure legal compliance and respect for intellectual property rights, it’s crucial to obtain proper licensing or establish contracts with the original content providers before using their articles for any commercial applications. This step not only protects your business from potential legal issues but also supports the continued production of high-quality journalism.
Unveiling the Power of Sentiment Analysis in Stock Market Insights
Welcome to the fascinating world of sentiment analysis in the stock market! In this section, we’ll explore how cutting-edge AI technologies can decode the emotional undercurrents of financial news and transform them into valuable insights for investors.
Let’s dive into the heart of our sentiment analysis process, where Large Language Models (LLMs) take center stage. We’ll walk you through our carefully crafted approach, designed to extract meaningful sentiment data from a sea of financial news articles.
The Art of Prompt Engineering
At the core of our sentiment analysis lies the art of prompt engineering. We’ve developed a specialized prompt that guides our LLM to act as an information extractor, focusing on corporations listed on NASDAQ or Dow Jones. This targeted approach ensures that we capture relevant sentiment data for the stocks that matter most to investors.
Step-by-Step Analysis with Chain of Thought (CoT)
Our process employs a Chain of Thought (CoT) methodology, breaking down the sentiment analysis into clear, logical steps:
- Identify the main corporation from the article
- Analyze the sentiment (positive or negative) associated with the company
- Assign a sentiment score between 0 (negative) and 1 (positive)
- Convert the information into a structured JSON format
This step-by-step approach not only enhances the accuracy of our sentiment analysis but also provides transparency in how we arrive at our conclusions.
Structured Output for Easy Integration
To make our sentiment analysis results easily consumable and integrable into various financial analysis tools, we’ve designed a clear JSON output format. This structured data includes key information such as the company name, stock ticker, sentiment classification, sentiment score, and the article’s publication date.
By combining advanced prompt engineering, a methodical analysis process, and structured output, we’ve created a powerful tool for extracting valuable sentiment insights from financial news. In the next section, we’ll see this process in action and explore how it can inform investment decisions in the dynamic world of stock markets.
prompt = """You are an information extractor specialized in identifying corporations listed on NASDAQ or Dow Jones from news articles.
Your task is to extract key information from the given text and convert it into JSON format. Never include any other information or string but output format.
Never summarize the provided article.
Follow these steps and respond only in the specified output format:
# Step 1: Read the provided article carefully.
# Step 2: Identify the main corporation from the article.
# Step 3: Assign a sentiment score between 0 and 1 based on the article.
# Step 4: Do the sentimental analysis and classify it into positive, negative or neutral category in the sentimental perspective.
# Step 5: Convert the information into exact JSON output format without additional information or string.
#### output format:
{'entity':<company_name>, 'ticker':<stock_ticker>, 'sentiment':<positive/negative>, 'score':<float>, 'datetime':<provider publishing date -> YYYY-MM-DD>}
"""
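Since the prompt asks for a fixed set of fields, it can help to check each reply against that shape before using it. Here is a minimal sketch of such a check, assuming the five fields from the output format above. The SentimentRecord class is a hypothetical helper, and accepting "neutral" is an assumption based on Step 4 of the prompt.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SentimentRecord:
    entity: str
    ticker: Optional[str]
    sentiment: str
    score: float
    datetime: str

    def __post_init__(self):
        # The label must be one of the categories named in the prompt.
        if self.sentiment not in {"positive", "negative", "neutral"}:
            raise ValueError(f"unexpected sentiment: {self.sentiment!r}")
        # The score must be a number between 0 and 1.
        self.score = float(self.score)
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")
        # Raises ValueError if the string is not a valid ISO date such as 2024-09-03.
        date.fromisoformat(self.datetime)

# Illustrative values only, not taken from a real article.
record = SentimentRecord("ExampleCorp", "EXMP", "positive", 0.7, "2024-09-03")
print(record)
```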
Next, we define the heart of our sentiment analysis process: the sentiment_analysis function. It takes three parameters: the prompt, the article link, and the model to be used. It returns two values: the model’s reply, a string in the structured format requested by the prompt, and the article text.
One useful feature of this function is that it lets us compare results across different language models. It works with Llama3.1 through Ollama and with OpenAI models such as GPT-3.5-Turbo and GPT-4o, allowing us to compare their outputs on the same article.
Let’s take a closer look at how this function works:
from langchain_community.chat_models import ChatOllama
from openai import OpenAI
import keyring
import pandas as pd
# sentiment analysis
def sentiment_analysis(prompt=prompt, link=None, model='llama'):
    # getting contents from the link
    article = get_article_from_link(link)
    article_text = article.text if article.text else "No article text available."
    article_date = article.publish_date.strftime('%Y-%m-%d') if article.publish_date else "Unknown publish date"
    query = f"{prompt}\n\n#### article:\n\n{article_text}\n\n#### publish date:\n\n{article_date}"
    # getting model's response
    if model == 'llama':
        # local Llama 3.1 served through Ollama
        llm = ChatOllama(model='llama3.1')
        response = llm.invoke(query)
        return response.content, article.text
    else:
        # any other value is passed to OpenAI as the model name
        llm = OpenAI(api_key=keyring.get_password('openai', 'key_for_windows'))
        response = llm.chat.completions.create(
            model=model,
            messages=[
                {'role':'system', 'content':'You are a helpful assistant.'},
                {'role':'user', 'content':query}
            ]
        )
        return response.choices[0].message.content, article.text
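The function above calls get_article_from_link, which is not shown in this article. Here is a minimal sketch of what it might look like, assuming it simply follows the Newspaper3k download-and-parse steps from earlier. The article_cls parameter is an addition for illustration, so the helper can be tried with a stand-in class and no network access.

```python
def get_article_from_link(link, article_cls=None):
    """Download and parse one article, returning the parsed article object."""
    if article_cls is None:
        # Imported here so a stand-in class can be used without newspaper installed.
        from newspaper import Article
        article_cls = Article
    article = article_cls(link)
    article.download()  # fetch the page
    article.parse()     # fill in text, authors, publish_date, ...
    return article
```

The returned object exposes the text and publish_date attributes that sentiment_analysis reads. A failed download will raise an exception, so callers may want to wrap the call in try/except.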
Now, let’s put our code to the test and see how it performs in real-world scenarios. To begin, we’ll fetch a collection of news articles related to a hot topic in the tech world: OpenAI. This cutting-edge artificial intelligence company has been making waves in the industry, and its developments often have significant impacts on the stock market.
By using our get_news_links() function with the keyword ‘openai’, we can quickly gather a set of relevant articles for analysis. This step showcases the power and efficiency of our data collection process, setting the stage for our sentiment analysis adventure.
# example : links with keyword 'openai'
links = get_news_links(keyword='openai')
With the links in hand, we’ll analyze the sentiment of one of these articles using our Llama3.1 model, which is the default in sentiment_analysis.
Here’s how it works:
# sentiment_analysis returns (model reply, article text)
answer, text = sentiment_analysis(link=links['url'][2])
answer
"{'entity': 'OpenAI', 'ticker': 'None (Private Company)', 'sentiment': 'negative', 'score': 0.35, 'datetime': '2024-09-03'}"
To cross-check our results, we run the same article through multiple language models. We compare the outputs from Llama3.1, GPT-3.5-Turbo, and GPT-4o, which lets us spot disagreements or inconsistencies in the sentiment analysis. This multi-model approach also offers insight into the strengths and limitations of each model in the context of financial sentiment analysis.
answer, text = sentiment_analysis(link=links['url'][2], model='gpt-3.5-turbo')
answer
"{'entity':'OpenAI', 'ticker':'', 'sentiment':'negative', 'score':0.2, 'datetime':'2024-09-03'}"
answer, text = sentiment_analysis(link=links['url'][2], model='gpt-4o')
answer
'```json\n{\n "entity": "OpenAI",\n "ticker": null,\n "sentiment": "negative",\n "score": 0.3,\n "datetime": "2024-09-03"\n}\n```'
Our comparison yields interesting results. On this article, all three models classify the sentiment as negative, with scores between 0.2 and 0.35, so Llama3.1 is in line with the GPT models here, although a single article is not a benchmark. Note also that GPT-4o wrapped its answer in a Markdown code fence and used JSON null, while the other two models returned Python-style dictionaries. This highlights the importance of robust output parsing and error handling in production systems.
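The three replies above differ in format: two look like Python dictionaries with single quotes, and one is JSON wrapped in a Markdown code fence. Here is a minimal sketch of a parser that tolerates both, assuming the reply contains a single object. The name parse_model_output is hypothetical, and replies that are neither valid JSON nor a Python literal will still raise an exception.

```python
import ast
import json
import re

FENCE = "`" * 3  # three backticks, the Markdown code fence marker
FENCED = re.compile(rf"^{FENCE}[a-zA-Z]*\s*(.*?)\s*{FENCE}$", re.DOTALL)

def parse_model_output(raw: str) -> dict:
    """Turn a model reply into a dict, tolerating code fences and single quotes."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence if there is one.
    match = FENCED.match(text)
    if match:
        text = match.group(1)
    try:
        parsed = json.loads(text)        # strict JSON: double quotes, null
    except json.JSONDecodeError:
        parsed = ast.literal_eval(text)  # Python literal: single quotes, None
    if not isinstance(parsed, dict):
        raise ValueError("expected a single object in the reply")
    return parsed

llama_reply = "{'entity': 'OpenAI', 'ticker': 'None (Private Company)', 'sentiment': 'negative', 'score': 0.35, 'datetime': '2024-09-03'}"
print(parse_model_output(llama_reply)["score"])  # 0.35
```

Unlike eval, ast.literal_eval only accepts literals, so a reply cannot run arbitrary code.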
To enhance the robustness and accuracy of your sentiment analysis, consider implementing an ensemble approach. By leveraging multiple models and aggregating their results, you can create a more reliable and nuanced analysis system. This method not only improves overall performance but also helps mitigate individual model biases, providing a more comprehensive view of market sentiment.
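One simple form of this ensemble idea is to average the scores and take a majority vote on the label. The sketch below does this for the three replies shown earlier. The ensemble function and the agreement field are assumptions for illustration, and other aggregation rules, such as weighting by model, are equally possible.

```python
from collections import Counter
from statistics import mean

def ensemble(results):
    """Combine per-model results: average the scores, majority-vote the label."""
    scores = [float(r["score"]) for r in results]
    votes = Counter(r["sentiment"] for r in results)
    # On a tie, most_common keeps the label that was seen first.
    label, n_votes = votes.most_common(1)[0]
    return {
        "sentiment": label,
        "score": mean(scores),
        "agreement": n_votes / len(results),  # share of models that chose the label
    }

results = [
    {"sentiment": "negative", "score": 0.35},  # Llama3.1
    {"sentiment": "negative", "score": 0.2},   # GPT-3.5-Turbo
    {"sentiment": "negative", "score": 0.3},   # GPT-4o
]
print(ensemble(results))
```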
Diving into Sentiment Analysis: Extracting Insights from Financial News
In this section, we’ll explore the fascinating world of sentiment analysis applied to financial news. We’ll be focusing on three tech giants: Apple, Microsoft, and Tesla. Our goal is to create comprehensive dataframes that capture not just the news content, but also the sentiment behind it.
These dataframes will include a wealth of information:
- News article titles and links
- Publication dates and when we first encountered the articles
- The full text content of each article
- Identified entities (companies) and their stock tickers
- Sentiment classification (positive or negative)
- Numerical sentiment scores
To streamline this process, we’ll create a powerful function that takes a dataframe of news links as input. This function will iterate through each link, perform sentiment analysis, and return a comprehensive dataframe with all the information we need. Let’s dive into the code and see how it works!
from tqdm.notebook import tqdm
import ast
import pandas as pd
import numpy as np
# takes the links dataframe, returns a dataframe with the sentiment results
def get_combined_dataframe(df):
    links = []
    titles = []
    texts = []
    seen_dates = []
    entities = []
    tickers = []
    sentiments = []
    scores = []
    publish_dates = []
    for i in tqdm(range(len(df))):
        link = df['url'][i]
        links.append(link)
        title = df['title'][i]
        titles.append(title)
        seen_date = df['seendate'][i]
        seen_dates.append(seen_date)
        try:
            answer, text = sentiment_analysis(link=link)
            # parse the dict-style reply without executing arbitrary code
            answer = ast.literal_eval(answer)
            # read every field first, so a missing key cannot leave the lists with unequal lengths
            entity = answer['entity']
            ticker = answer['ticker']
            sentiment = answer['sentiment']
            score = answer['score']
            publish_date = answer['datetime']
            texts.append(text)
            entities.append(entity)
            tickers.append(ticker)
            sentiments.append(sentiment)
            scores.append(score)
            publish_dates.append(publish_date)
        except Exception as e:
            # on any failure, record placeholders so the row still lines up
            print(e)
            texts.append('')
            entities.append('')
            tickers.append('')
            sentiments.append('')
            scores.append(np.nan)
            publish_dates.append('')
    print(f'title: {len(titles)}, link: {len(links)}, seen_date: {len(seen_dates)}, text: {len(texts)}, entity: {len(entities)}, score: {len(scores)}, publish_date: {len(publish_dates)}')
    # combine all in the DataFrame
    articles = pd.DataFrame({
        'title': titles,
        'link': links,
        'seen_date': seen_dates,
        'texts': texts,
        'entity': entities,
        'ticker': tickers,
        'sentiment': sentiments,
        'score': scores,
        'publish_date': publish_dates
    })
    return articles
Next, we dive into the heart of our data collection process. We’ll gather news articles and sentiment data for three tech giants that consistently make waves in the stock market: Apple, Microsoft, and Tesla. To ensure we’re working with recent information, we’ll set our start date to August 1, 2024.
Here’s how we’ll proceed:
- Fetch news articles using our custom get_news_links() function
  - For each company, we’ll collect articles published since August 1, 2024
  - This gives us a comprehensive dataset of recent news coverage
- Process the collected data with our get_combined_dataframe() function
  - This function will perform sentiment analysis on each article
  - It will also extract key information like publication dates, entity names, and sentiment scores
By the end of this process, we’ll have rich, structured datasets for each company, providing a solid foundation for our subsequent analysis and machine learning tasks.
# get the apple's dataframe
links_apple = get_news_links(start_date='2024-08-01')
df_apple = get_combined_dataframe(links_apple)
# get the microsoft's dataframe
links_microsoft = get_news_links(keyword='microsoft', start_date='2024-08-01')
df_micosoft = get_combined_dataframe(links_microsoft)
# get the tesla's dataframe
links_tesla = get_news_links(keyword='tesla', start_date='2024-08-01')
df_tesla = get_combined_dataframe(links_tesla)
With our sentiment analysis pipeline in place, we’ve successfully gathered a treasure trove of data. Our dataset now includes not only the raw news articles but also valuable sentiment scores for each piece. This rich combination of textual content and quantified sentiment provides us with a powerful tool for understanding market dynamics.

Let’s take a moment to appreciate what we’ve accomplished:
- We’ve collected recent news articles for major tech companies like Apple, Microsoft, and Tesla
- Each article has been processed through our sophisticated sentiment analysis models
- We now have sentiment scores that quantify the emotional tone of each piece of news
The fusion of news content and sentiment scores opens up exciting possibilities for predicting market movements and understanding the complex interplay between media coverage and stock performance.
Bringing It All Together: Sentiment Analysis Meets Stock Prices
Ready to dive into the exciting world where sentiment meets stock prices? Let’s embark on a data-driven journey to uncover the hidden connections between public opinion and market movements. Our mission: to explore whether our sentiment analysis scores can be a crystal ball for stock price predictions.
Picture this: We’ve got a treasure trove of sentiment data on tech giants like Apple, Microsoft, and Tesla. Now, we’re going to see if we can connect the dots between these sentiment scores and the ups and downs of stock prices. It’s like being a financial detective, searching for clues in a sea of data!
But before we can start our investigation, we need to clean up our data. Think of it as decluttering our digital workspace:
- We’ll sweep away any rows with missing data (goodbye, null values!)
- We’ll transform our ‘seen_date’ into a proper datetime format (because time is of the essence)
- And finally, we’ll sort everything by date (chronology is key in our time-traveling analysis)
With our data polished and primed, we’re ready to uncover insights that could potentially power our future AI-driven trading strategies. Excited? Let’s roll up our sleeves and dive into the numbers!
# microsoft
# drop rows with missing values
df_microsoft_processed = df_micosoft.dropna()
# drop rows with no score (copy so the assignments below do not act on a view)
df_microsoft_processed = df_microsoft_processed[df_microsoft_processed['score'] != ''].copy()
# convert 'seen_date' from str to datetime
df_microsoft_processed['date'] = pd.to_datetime(df_microsoft_processed['seen_date'], format='%Y%m%dT%H%M%SZ')
# df_microsoft_processed['date'] = df_microsoft_processed['date'].dt.strftime('%Y-%m-%d')
# index by date and sort chronologically
df_microsoft_processed.set_index('date', inplace=True)
df_microsoft_processed.sort_index(inplace=True)
Upon analyzing the daily sentiment score plot, an intriguing pattern emerges. At first glance, the data points appear scattered, seemingly without a discernible trend. This randomness, however, is not unusual in sentiment analysis of financial news.
import pandas as pd
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.scatter(df_microsoft_processed.index, df_microsoft_processed['score'], marker='o', label='score')
plt.title('score')
plt.xlabel('date')
plt.ylabel('score')
plt.xticks(rotation=45)
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()

To smooth out daily fluctuations and capture broader trends, we calculate a 5-day rolling average of sentiment scores. Note that window='5D' is a time-based window covering the previous five calendar days, which is close to, but not the same as, five trading days. Let’s add this metric to our dataset:
# 5 days rolling
df_microsoft_processed['5d_avg'] = df_microsoft_processed['score'].rolling(window='5D').mean()
# plot scatter graph
plt.figure(figsize=(10, 5))
plt.scatter(df_microsoft_processed.index, df_microsoft_processed['5d_avg'], marker='o', label='5-day average score')
plt.title('5-day average score')
plt.xlabel('date')
plt.ylabel('score')
plt.xticks(rotation=45)
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()
This new column, ‘5d_avg’, offers a more stable representation of sentiment trends, potentially revealing patterns that might be obscured by day-to-day volatility.
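To make the behaviour of a time-based window concrete, here is a small sketch with made-up scores on an irregular set of dates. It compares rolling(window='5D'), which looks back five calendar days from each timestamp, with rolling(window=3), which always uses the last three rows. The values are illustrative and are not taken from our dataset.

```python
import pandas as pd

# Made-up scores with a one-week gap before the last observation.
scores = pd.Series(
    [0.2, 0.4, 0.6, 1.0],
    index=pd.to_datetime(["2024-08-01", "2024-08-02", "2024-08-03", "2024-08-10"]),
)

by_time = scores.rolling(window="5D").mean()  # calendar window ending at each timestamp
by_rows = scores.rolling(window=3).mean()     # fixed number of observations

print(pd.DataFrame({"score": scores, "5D": by_time, "3 rows": by_rows}))
```

For the last date, the time-based window contains only that day’s score, because the earlier articles are more than five days old, whereas the row-based window still averages in the three scores from a week before. A time-based window requires a sorted datetime index, which is why we sorted by date above.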

Now that we have our sentiment scores, it’s time to compare them with real-world stock prices. This comparison will allow us to explore potential correlations between public sentiment and market movements. To obtain stock data, we’ll use the yfinance library, which provides an easy and efficient way to fetch historical stock information.
Here’s why yfinance is useful for our analysis:
- Effortless data retrieval: With just a few lines of code, we can access comprehensive stock data for any publicly traded company.
- Up-to-date information: yfinance provides recent and historical data, so our analysis can cover the same period as our news articles.
- Rich dataset: Beyond just prices, we can access additional metrics like trading volume and dividend information, opening up possibilities for more nuanced analysis.
By combining our sentiment scores with this robust stock data, we’re setting the stage for a fascinating exploration of the interplay between public opinion and stock market performance. Let’s dive in and see what insights we can uncover!
Now, let’s dive into an exciting exploration of Microsoft stock prices during the same timeframe as our sentiment analysis. This comparison promises to unveil fascinating insights into the relationship between public sentiment and market performance.
# stock data
import yfinance as yf
msft = yf.Ticker('MSFT')
price_microsoft = msft.history(start='2024-08-01', end='2024-09-20')
# plot line graph of the daily closing price
plt.figure(figsize=(10, 5))
plt.plot(price_microsoft.index, price_microsoft['Close'], marker='o', label='Microsoft stock price')
plt.title('Microsoft stock price')
plt.xlabel('date')
plt.ylabel('stock price')
plt.xticks(rotation=45)
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()

The sentiment rolling-average graph and the stock price graph look similar, which suggests a potential correlation. To test this, we’ll compute Pearson correlation coefficients along with their p-values.
To capture delayed market reactions, we add columns of shifted stock prices, with shifts ranging from 1 to 30 trading days, so we can track how today’s sentiment might relate to tomorrow’s market movements, next week’s trends, or longer-term price shifts.
This time-shifted correlation analysis addresses a simple question: do investors react immediately to news, or does sentiment take time to show up in stock prices?
# shifted stock price
# Note: Series.shift(d) with d > 0 moves values d rows down, so each row receives
# the close from d rows earlier. If the frame is sorted oldest-first and the goal is
# the close d trading days after the sentiment date, use shift(-d) instead.
for d in [1, 2, 3, 5, 10, 15, 20, 30]:
    df_conbined_miscrosoft[f'Close_shift_{d}'] = df_conbined_miscrosoft['Close'].shift(d)
# simple correlations
corr_columns = ['score', '5d_avg', 'Close', 'Close_shift_1', 'Close_shift_2', 'Close_shift_3',
                'Close_shift_5', 'Close_shift_10', 'Close_shift_20', 'Close_shift_30']
corr_miscrosoft = df_conbined_miscrosoft[corr_columns].corr()
corr_miscrosoft

# correlation p-value
import pandas as pd
from scipy.stats import pearsonr
from tqdm.notebook import tqdm
days = [0, 1, 2, 3, 5, 10, 20, 30]
correlations = []
p_values = []
for d in tqdm(days):
    column = 'Close' if d == 0 else f'Close_shift_{d}'
    # keep only the rows where both the rolling score and the shifted price exist
    pair = df_conbined_miscrosoft[['5d_avg', column]].dropna()
    correlation, p_value = pearsonr(pair['5d_avg'], pair[column])
    correlations.append(correlation)
    p_values.append(p_value)
df_corr_microsoft = pd.DataFrame({
    'shifted_day': days,
    'correlation': correlations,
    'p-value': p_values
})
df_corr_microsoft.round(6)
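Because the whole analysis depends on which way the prices are shifted, it may help to check the direction on a tiny example. The sketch below uses a hypothetical five-row frame (`toy`, with made-up prices sorted oldest-first) to show what `shift(1)` and `shift(-1)` each place on a row. It is only an illustration of pandas behavior, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: five business days, oldest first (made-up prices).
toy = pd.DataFrame(
    {"Close": [100.0, 101.0, 102.0, 103.0, 104.0]},
    index=pd.date_range("2024-08-01", periods=5, freq="B"),
)

# shift(1) pushes values down: each row shows the previous row's close.
toy["Close_prev_1"] = toy["Close"].shift(1)

# shift(-1) pulls values up: each row shows the next row's close.
toy["Close_next_1"] = toy["Close"].shift(-1)

print(toy)
```

On an oldest-first frame, the `Close_next_1` style of column is the one that pairs a day's sentiment with a later price. The first row of `Close_prev_1` and the last row of `Close_next_1` are NaN, which is why rows with missing values need to be dropped before computing a correlation.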

The analysis shows a statistically significant correlation between the rolling-average sentiment score and the stock price. The correlation becomes more pronounced as the price shift grows, peaking at a shift of around 20 days.
To provide a more comprehensive view, we extended the analysis to two other major tech stocks: AAPL (Apple) and TSLA (Tesla). While the correlation wasn’t as strong as with Microsoft, we observed statistically significant relationships for stock prices shifted 10 to 20 days after the sentiment analysis. This finding suggests a potential delayed impact of public sentiment on stock performance across different tech companies.
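Repeating the same lagged correlation for several tickers is easier with a small helper. The following is a minimal sketch under stated assumptions: the function name `lagged_correlations` is hypothetical, it assumes each frame is sorted oldest-first with a `5d_avg` and a `Close` column, and it uses `shift(-lag)` so that each sentiment value is paired with the price `lag` rows later. The tickers `AAA` and `BBB` and their data are synthetic placeholders built so that the price echoes the score after a known delay. They are not real market data.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def lagged_correlations(df, score_col="5d_avg", price_col="Close",
                        lags=(0, 1, 2, 3, 5, 10, 20)):
    """Pearson r and p-value between the score and the price `lag` rows later."""
    rows = []
    for lag in lags:
        # shift(-lag) aligns each row with the price `lag` rows in the future
        pair = pd.DataFrame({
            "score": df[score_col],
            "price": df[price_col].shift(-lag),
        }).dropna()
        r, p = pearsonr(pair["score"], pair["price"])
        rows.append({"lag": lag, "correlation": r, "p_value": p, "n": len(pair)})
    return pd.DataFrame(rows)


# Synthetic stand-ins for per-ticker frames (NOT real market data)
rng = np.random.default_rng(0)
n = 60
score = pd.Series(rng.normal(size=n)).rolling(5, min_periods=1).mean()

frames = {}
for ticker, true_lag in {"AAA": 3, "BBB": 10}.items():
    # the price echoes the score `true_lag` rows later, plus a little noise
    price = 100 + 10 * score.shift(true_lag).fillna(0) + rng.normal(scale=0.1, size=n)
    frames[ticker] = pd.DataFrame({"5d_avg": score, "Close": price})

results = {ticker: lagged_correlations(frame) for ticker, frame in frames.items()}
for ticker, res in results.items():
    best = res.loc[res["correlation"].idxmax()]
    print(ticker, "strongest correlation at lag", int(best["lag"]))
```

The `n` column records how many rows remain at each lag. With a short sample, long lags leave few rows, so their p-values deserve extra caution.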
These results highlight the complexity of the relationship between sentiment and stock prices. They indicate that:
- The impact of public sentiment on stock prices may not be immediate, but rather unfold over days or weeks.
- Different companies may experience varying degrees of sentiment influence on their stock prices.
- The time lag between sentiment shifts and stock price movements could provide valuable insights for investors and analysts.
Further research is needed to explore these patterns across a broader range of companies and industries. This could lead to more robust models for predicting stock price movements based on sentiment analysis, potentially offering a competitive edge in the financial markets.

Summary
Our initial test provides a promising foundation for exploring the potential of sentiment analysis in stock market predictions. However, there are several avenues for improvement and refinement:
- Fine-tuning the large language model could enhance its reasoning capabilities and sentiment analysis accuracy.
- Experimenting with different parameters, such as the rolling average window size and the analysis period, could yield more insightful results.
- Exploring different LLMs might offer different perspectives and potentially more accurate sentiment scores.
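As one way to approach the rolling-window experiment mentioned in the list above, the sketch below sweeps a few window sizes and records the Pearson correlation with the closing price for each. The frame `daily`, its values, and the chosen window sizes are assumptions for illustration. The data is random, so the correlations it prints carry no meaning beyond showing the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic daily sentiment scores and closes (placeholders, NOT real data)
rng = np.random.default_rng(1)
n = 80
daily = pd.DataFrame({
    "score": rng.normal(size=n),
    "Close": 100 + rng.normal(size=n).cumsum(),
})

rows = []
for window in (3, 5, 10, 20):
    # min_periods=window leaves NaN until a full window of scores is available
    avg = daily["score"].rolling(window, min_periods=window).mean()
    # Series.corr computes Pearson r over the rows where both values are present
    r = avg.corr(daily["Close"])
    rows.append({
        "window": window,
        "valid_rows": int(avg.notna().sum()),
        "correlation": r,
    })

sweep = pd.DataFrame(rows)
print(sweep)
```

Larger windows smooth the score more but leave fewer usable rows, which is worth keeping in mind when the analysis period is only a few weeks long.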
While sentiment scores may not be the sole determinant of stock prices, they could serve as a valuable factor in training a stock trading agent using reinforcement learning. This exciting possibility will be explored in a future post, where we’ll delve deeper into the intersection of sentiment analysis and algorithmic trading strategies.
The Jupyter notebook code is available on GitHub.