In the wake of ChatGPT’s groundbreaking debut, the world of large language models has exploded with innovation. Tech giants like Google and Anthropic have introduced their own models, such as Gemini and Claude, respectively. Meanwhile, the open-source community has been making significant strides, with Meta’s Llama leading the charge. These open-source initiatives have democratized AI, making it possible for individuals and organizations to build their own chatbots with relative ease.
The appeal of on-premise small language models (sLLMs) is multifaceted:
- Cost-Effectiveness: On-premise models eliminate ongoing subscription fees, making them a budget-friendly option for many organizations.
- Customization: These models can be fine-tuned to suit specific needs, allowing for greater flexibility and specialization.
- Enhanced Knowledge Integration: Through Retrieval Augmented Generation (RAG) techniques, on-premise LLMs can tap into your organization’s proprietary databases, enriching their responses with domain-specific knowledge.
In this blog post series, we’ll explore how to harness the power of open-source tools like Llama, Chroma, and Hugging Face to create your own DIY on-premise chatbot with RAG capabilities.
The Architecture of Our DIY Chatbot
Before we dive into the nitty-gritty of our project, let’s take a bird’s-eye view of our chatbot’s architecture. Don’t worry if you’re not a coding wizard – basic Python knowledge is all you need to get started!
1. User Interface: Streamlit
We’ll use Streamlit, a user-friendly library that allows us to create a sleek chat interface with just a few lines of code. It’s simple, efficient, and perfect for our needs.
2. Knowledge Base: RAG and Vector Database
Our chatbot will use Retrieval Augmented Generation (RAG) with a vector database. This powerful combination allows for quick information retrieval by comparing vectorized data. We’ll use Hugging Face’s open-source embedder to convert text into vectors and Chroma as our vector database.
3. Brain of the Bot: Llama and Prompt Engineering
The final step involves some clever prompt engineering. We’ll combine the user’s question with our vector retrieval results to create a tailored prompt. This prompt is then fed to our large language model, Llama, which generates the response.
And voilà! The chatbot sends Llama’s response back to the user. It might sound complex, but don’t worry – we’ll break it down step by step. The diagram below illustrates this process in a simple, easy-to-understand format.

In this post, we’ll walk you through the setup process for creating a RAG chatbot. Let’s start with the foundation: downloading Llama 3.1.
Downloading Llama
There are several methods to download and install the Llama 3.1 model, but we’ll focus on using the Ollama library. Ollama is an open-source project that provides a user-friendly platform for running Large Language Models (LLMs) on your local machine. It’s versatile, allowing you to easily download and install various open-source LLM models, including Mistral and Gemma2.
Here’s how to get started:

- Visit the official Ollama website (https://ollama.com/) and install the software.
- Once installation is complete, open your terminal (Mac) or Command Prompt (Windows) and run the following command to download Llama 3.1:
ollama run llama3.1
It’s worth noting that Llama 3.1 is available in different sizes: 8B, 70B, and 405B parameters. For most local PC setups, the 8B model is recommended as it offers performance comparable to GPT-3.5 while being more manageable in terms of computational resources.
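If you want a specific size rather than the default, Ollama lets you append a size tag to the model name (the exact tags are listed in the Ollama model library, so double-check there before pulling a large variant):
ollama run llama3.1:8b
ollama run llama3.1:70b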
After the download is complete, you can start interacting with Llama 3.1. To exit the Llama interface, simply use the key combination Ctrl + D.

Installing Required Libraries
To build our RAG chatbot, we’ll need to install several key libraries. Here’s a list of the necessary packages along with their installation commands:
# langchain
pip install langchain
# pip install langchain-core # installed with langchain
pip install langchain-community
# sentence transformers
pip install sentence_transformers
# Vector database
pip install chromadb
# Hugging Face embeddings
pip install llama-index-embeddings-huggingface
pip install llama-index-embeddings-instructor
# LlamaIndex for advanced indexing and querying
pip install llama-index
# Streamlit for creating the user interface
pip install streamlit
With these steps completed, you’ve laid the groundwork for your RAG chatbot.
Having a Conversation with Llama
We can use the Llama model by importing the Ollama library and instantiating it as shown below. You can write this code in your Jupyter Notebook or Google Colab:
from langchain_community.chat_models import ChatOllama
# Instantiate the model
llm = ChatOllama(model='llama3.1')
# a prompt
question = "What is the currency of Thailand?"
# Get a response
response = llm.invoke(question)
# print answer
print(response.content)
With the Ollama library, getting a response is as simple as calling the invoke function. It’s that straightforward! Now you can have a conversation with Llama right on your PC.
content='The official currency of Thailand is the Thai Baht (THB). It\'s denoted by the symbol "฿" and comes in various denominations, including coins (1 baht to 5 baht) and banknotes (20 baht to 1,000 baht).\n\nSo, if you\'re planning a trip to Thailand or want to send money there, it\'s essential to know that THB is the currency used!' response_metadata={'model': 'llama3.1', 'created_at': '2024-09-15T02:42:52.421247Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 8683344042, 'load_duration': 31358833, 'prompt_eval_count': 17, 'prompt_eval_duration': 1302898000, 'eval_count': 89, 'eval_duration': 7345597000} id='run-58440fb2-7805-45ce-9f7e-3f9b9e74b42b-0'
Llama returns a response in a format that includes ‘content’ and ‘response_metadata’. Since we’re primarily interested in the answer, we can simply print the ‘content’ attribute using response.content.
Embedding: Transforming Documents into Vectors
Let’s dive into the fascinating world of embedding! Before we get into the nitty-gritty of the code, it’s important to understand two key concepts: chunking and embedding.
Chunking is like slicing a long text into small-sized pieces. Imagine you’re trying to eat a foot-long sandwich – it’s much easier to handle when cut into smaller portions, right? That’s exactly what chunking does for our text data. It breaks down lengthy documents into manageable chunks, making it easier for our language model to digest and understand.
Embedding, on the other hand, is the process of translating these text chunks into numbers – or more specifically, vectors. Think of it as creating a unique numerical “fingerprint” for each piece of text. This allows our AI to understand and compare different pieces of text more efficiently.

To make this magic happen, we’ll be using three powerful tools:
- RecursiveCharacterTextSplitter: Our expert text slicer for chunking
- HuggingFaceEmbeddings: Our numerical translator, using the ‘BAAI/bge-m3’ model
- PyPDF: Our PDF whisperer, helping us read and process PDF files
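Before we load the PDF, here is a quick, minimal sketch of what an embedding looks like in practice, using the same 'BAAI/bge-m3' model (the example sentence is arbitrary and chosen only for illustration):
from langchain.embeddings import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings(model_name='BAAI/bge-m3')

# Turn a single sentence into its numerical "fingerprint"
vector = embedder.embed_query("The encoder is composed of a stack of identical layers.")

print(len(vector))   # dimensionality of the embedding vector
print(vector[:5])    # first few values of the vector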
Let’s start by loading an example PDF. For this demonstration, we’ll use the famous paper “Attention is All You Need” – a true classic in the world of natural language processing. Don’t worry if you don’t have this specific paper; any PDF will do for our purposes.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# Load a PDF and split its pages into overlapping chunks
def chunk_docs(docs, chunk_size=200, chunk_overlap=20):
    loader = PyPDFLoader(docs)
    pages = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    data = splitter.split_documents(pages)
    return data
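To complete the picture, here is a minimal sketch of how the resulting chunks can be embedded and written to a persistent Chroma database. The PDF path matches the metadata shown in the retrieval output below; swap in your own file, and treat this as one possible way to build the store rather than the project's exact code.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Chunk the example paper (use the path of your own PDF here)
data = chunk_docs('./data/NIPS-2017-attention-is-all-you-need-Paper.pdf')

# Embed the chunks and persist them to a local Chroma database
embedding = HuggingFaceEmbeddings(model_name='BAAI/bge-m3')
vectorstore = Chroma.from_documents(
    documents=data,
    embedding=embedding,
    persist_directory='./vectorstore'
)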
Retrieving Information from Vector Databases
Now that we’ve stored our documents in the vector database, we can easily retrieve information related to any input query. Let’s walk through this process with a practical example.
Imagine we want to ask, “How many layers do the encoder and decoder have according to the paper?” Here’s what happens behind the scenes:
- Our embedder transforms the query into a vector.
- The Chroma database searches for the most similar documents based on vector values.
- It returns the top ‘k’ related documents (in our example, we’ve set k=3).
This process gives us the three most relevant documents from our database, which we can then pass to our LLM model for further processing. It’s like having a super-smart research assistant at your fingertips!
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
# settings
vectordb_path = './vectorstore'
embedding = HuggingFaceEmbeddings(model_name='BAAI/bge-m3')
vectorstore = Chroma(persist_directory=vectordb_path, embedding_function=embedding)
# retriever returning the 3 closest results
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
# query
query = "How many layers do the encoder and decoder have according to the paper?"
docs = retriever.invoke(query)
print(docs)
We’ve retrieved the three most relevant documents from our vector database.
[Document(
metadata=
{'page': 2, 'source': './data/NIPS-2017-attention-is-all-you-need-Paper.pdf'}, page_content='Decoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head'
), Document(
metadata=
{'page': 2, 'source': './data/NIPS-2017-attention-is-all-you-need-Paper.pdf'}, page_content='Decoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head'
), Document(
metadata=
{'page': 1, 'source': './data/NIPS-2017-attention-is-all-you-need-Paper.pdf'}, page_content='Encoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\n2'
)
]
Mastering the Art of Prompt Engineering: Unlocking the Full Potential of LLMs
In the ever-evolving world of AI, prompt engineering has emerged as a crucial skill for anyone working with Large Language Models (LLMs). But what exactly is prompt engineering, and why is it so important? Let’s dive in and explore this fascinating aspect of AI interaction.
What is Prompt Engineering?
Prompt engineering is the art and science of crafting inputs (prompts) that effectively communicate tasks to LLMs, resulting in accurate and useful outputs. It’s like learning to speak the language of AI, allowing us to unlock the full potential of these powerful models.
Key Techniques in Prompt Engineering
Let’s explore three essential techniques that can significantly enhance your interactions with LLMs:
1. Few-Shot Learning: Teaching by Example
When dealing with complex tasks, sometimes the best approach is to show, not tell. Few-shot learning involves providing the LLM with a few examples of the desired input-output pattern. This technique can dramatically improve the accuracy and consistency of the model’s responses.
# Few-shot Example
News : {news}
Corp : <name>
Sentiment : <positive/negative>
Score : <number>
###
News : 'OpenAI Unveils New ChatGPT That Can Reason Through Math and Science'
Corp : 'OpenAI'
Sentiment : 'positive'
Score : 0.9
###
News : 'U.S Argues Google Created Ad Tech Monopoly'
Corp : 'Google'
Sentiment : 'negative'
Score : 0.2
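As a rough sketch of how such a few-shot prompt can be sent to Llama with the ChatOllama class used earlier – the instruction wording, example ordering, and the final headline are illustrative assumptions, not the article’s exact prompt:
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model='llama3.1')

# Hypothetical headline used purely for illustration
news = "Example Corp Reports Record Quarterly Earnings"

few_shot_prompt = f"""Extract the company, the sentiment and a score from the news headline,
following the pattern of the examples below.
###
News : 'OpenAI Unveils New ChatGPT That Can Reason Through Math and Science'
Corp : 'OpenAI'
Sentiment : 'positive'
Score : 0.9
###
News : 'U.S Argues Google Created Ad Tech Monopoly'
Corp : 'Google'
Sentiment : 'negative'
Score : 0.2
###
News : '{news}'
Corp :"""

response = llm.invoke(few_shot_prompt)
print(response.content)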
2. Output Structuring: Shaping the Response
By specifying a particular output format, we can ensure that the LLM’s responses are consistent and easy to integrate into other systems. This approach reduces errors and streamlines the process of working with AI-generated content.
# Output structuring
Give the final answer as a valid JSON as below
###
Format :
[{
    'news': <news>,
    'corp': <corp name>,
    'sentiment': <positive or negative>,
    'score': <float>
}, ...]
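To show why structured output is useful, here is a minimal sketch that requests JSON and parses it. The prompt wording is an assumption, and real code should expect the model to occasionally deviate from the requested format:
import json
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model='llama3.1')

structured_prompt = """Analyse this news headline:
'U.S Argues Google Created Ad Tech Monopoly'
Give the final answer as a valid JSON list as below, with double quotes and no extra text.
###
Format :
[{"news": "<news>", "corp": "<corp name>", "sentiment": "<positive or negative>", "score": <float>}]
"""

response = llm.invoke(structured_prompt)

# Parsing may fail if the model wraps the JSON in commentary
try:
    results = json.loads(response.content)
    print(results)
except json.JSONDecodeError:
    print("Model did not return valid JSON:", response.content)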
3. Prompting Personas: Tailoring the AI’s Voice
The choice of words in our prompts can significantly influence the model’s output. By creating specific “personas” for the LLM, we can shape its tone and style to better suit our needs. This technique allows for more nuanced and context-appropriate responses.
By mastering these prompt engineering techniques, you’ll be well-equipped to harness the full power of LLMs in your projects. Remember, effective prompt engineering is as much an art as it is a science – don’t be afraid to experiment and find what works best for your specific use case!
{"role" : "system",
"content" : "you are a helpful assistant"}
Putting It All Together: Building Our Chatbot
Now that we’ve explored the individual components of our chatbot, it’s time to bring everything together. In this section, we’ll walk through the process of integrating the LLM, vector database, and retriever to create a functional question-answering system.
Step-by-Step Integration
Let’s break down the process into manageable steps:
- Initialize the LLM model, vector database, and retriever
- Define a helper function to format retrieved documents
- Formulate a question and retrieve relevant documents
- Construct a prompt using the retrieved information
- Generate a response using the LLM
Here’s the code that brings these steps to life:
from langchain_community.chat_models import ChatOllama
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
# Initialize LLM model
llm = ChatOllama(model='llama3.1')
embedder = HuggingFaceEmbeddings(model_name='BAAI/bge-m3')
# Set up retrieval system
vectordb_path = './vectorstore'
vectorstore = Chroma(persist_directory=vectordb_path, embedding_function=embedder)
retriever = vectorstore.as_retriever(search_kwargs={'k':3})
# Helper function to format documents
def format_docs(docs):
    return '\n\n'.join([doc.page_content for doc in docs])
# Example question
question = "How many layers does the decoder have?"
# Retrieve relevant documents
retrieved = retriever.invoke(question)
docs = format_docs(retrieved)
# Construct the prompt
prompt = f"""You are a helpful assistant chatbot. Answer only in English and based on the documents provided.
### Documents:
{docs}
### Question:
{question}
"""
# Generate response
response = llm.invoke(prompt)
print(response.content)
When we run this code, we get a concise and accurate response based on the information in our vector database:
According to the documents, the decoder has "a stack of N= 6 identical layers". Therefore, the decoder has 6 layers.
This example demonstrates how we can leverage the power of LLMs, vector databases, and retrieval systems to create a knowledgeable chatbot that can answer specific questions based on the information it has been trained on.
Creating a User-Friendly Interface with Streamlit
Now that we’ve built the core functionality of our chatbot, let’s make it accessible to a wider audience by creating an intuitive user interface. We’ll use Streamlit, a powerful library that simplifies the process of building web applications for machine learning and data science projects.
Setting Up the Chatbot Interface
Streamlit offers a range of components to quickly compose a sleek UI. Here’s how we can create a basic chatbot interface:
1. Set the chat title:
import streamlit as st
st.title('Llama with RAG')
2. Manage chat history:
if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []
3. Create an input field for user questions:
question = st.chat_input("Ask me anything!")
Putting It All Together
Here’s a simplified version of our Streamlit app that combines these elements with our RAG-powered chatbot:
import streamlit as st
from langchain_community.chat_models import ChatOllama
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
# Initialize components (LLM, embeddings, retriever)
# ... (code omitted for brevity)
st.title("Llama with RAG")
if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []

# Display chat history
for message in st.session_state.chat_history:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if question := st.chat_input("Ask me anything!"):
    # Add user question to chat history
    st.session_state.chat_history.append({"role": "user", "content": question})
    # Retrieve relevant documents
    docs = retriever.invoke(question)
    # Generate response using LLM
    response = generate_response(question, docs)
    # Add assistant response to chat history
    st.session_state.chat_history.append({"role": "assistant", "content": response})
    # Display the latest response
    with st.chat_message("assistant"):
        st.markdown(response)
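The generate_response helper is left undefined in the simplified app above. Here is a minimal sketch of what it might look like, reusing the prompt construction from the previous section; the exact implementation in the original project may differ:
def generate_response(question, docs):
    # Merge the retrieved chunks into a single context string
    context = '\n\n'.join([doc.page_content for doc in docs])
    # Same prompt pattern as in the notebook example above
    prompt = f"""You are a helpful assistant chatbot. Answer only in English and based on the documents provided.
### Documents:
{context}
### Question:
{question}
"""
    # llm is the ChatOllama instance initialized earlier in the app
    return llm.invoke(prompt).content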
Running Your Streamlit App
To launch your chatbot app, simply run the following command in your terminal:
streamlit run your_chatbot_app.py
Streamlit will provide a local URL where you can access your chatbot through a web browser.


Conclusion
By leveraging Streamlit, we’ve transformed our RAG-powered chatbot into a user-friendly web application. This interface makes it easy for users to interact with our AI assistant, ask questions, and receive informative responses based on the knowledge we’ve embedded in our vector database.
The complete source code for this project, including the step-by-step Jupyter notebook (llm_rag.ipynb) and the Streamlit app (llm_rag.py), is available on GitHub. Feel free to explore, modify, and build upon this foundation to create your own AI-powered applications!
Stay tuned for more exciting tutorials on AI and machine learning!