Text extraction from PDFs: Use case of embeddings in generative AI

Embeddings are numerical representations that make data meaningful to machines. They are typically used to represent text, images, or audio data.

In the case of documents, embeddings are typically represented as vectors of numbers. These numbers are learned from a large corpus of text, and each one captures some aspect of meaning rather than standing for a specific word. For example, words with related meanings, such as “king” and “queen”, end up with vectors that are close to each other.

Embeddings can be used to represent the relationships between different pieces of data. For example, if two words appear in similar contexts, their embeddings will be similar. This can be used to do things like find similar words, translate text from one language to another, or cluster documents together.
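To make this concrete, here is a minimal sketch (using toy three-dimensional vectors rather than real learned embeddings) of how cosine similarity measures how close two embeddings are:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: "cat" and "dog" point in similar directions, "car" does not
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))  # high, about 0.99
print(cosine_similarity(cat, car))  # lower, about 0.30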

Why are embeddings important?

Embeddings allow machines to understand the meaning of data in a way that is more natural and intuitive. They are used for generating text, translating languages, and understanding the meaning of images.

Embeddings are becoming increasingly important as machine learning and artificial intelligence become more sophisticated. They are a key enabler of many of the latest advances in these fields.

Different types of embeddings

  • Word embeddings: These embeddings represent words as vectors of numbers.
  • Sentence embeddings: These embeddings represent sentences as vectors of numbers.
  • Image embeddings: These embeddings represent images as vectors of numbers.
  • Audio embeddings: These embeddings represent audio as vectors of numbers.

In this blog, we will explore sentence embeddings.

Embeddings and generative AI

Both embeddings and generative models are created using neural networks. Embeddings can represent the input data for a generative AI model, which helps the model generate more accurate and relevant output.
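As a rough sketch of that flow (embed, nearest_chunks, and generate are hypothetical placeholders here, not real library calls), retrieval with embeddings followed by generation looks like this:

# Hypothetical helpers to illustrate the flow; real implementations follow below
chunk_vectors = [embed(chunk) for chunk in chunks]          # index the document once
query_vector = embed(user_question)                         # embed the question
context = nearest_chunks(chunk_vectors, query_vector, k=4)  # retrieve similar chunks
answer = generate(f"Context:\n{context}\n\nQuestion: {user_question}")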

Tools and frameworks

For our use case, we will utilize two key tools: Azure OpenAI and LangChain.

Azure OpenAI offers a powerful language model that enables natural language processing (NLP) capabilities. It allows you to generate human-like responses to queries, making it ideal for understanding and responding to user input.

LangChain is a library that simplifies the integration of AI models into various applications. It provides a set of tools and APIs that help in building conversational agents and executing AI-powered actions.

Implementing embedding with generative AI

Let’s dive into the technical steps involved in implementing embeddings along with generative AI for text extraction from a PDF using Azure OpenAI and LangChain.

1. Import libraries

First, we need to import the necessary libraries for our project. These include the LangChain modules for Azure OpenAI, the document loader, the text splitter, the embedding model, and Chroma, which will serve as the vector database for our project.



import os

from langchain.llms import AzureOpenAI
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

2. Define LLM

Next, we define the LLM using Azure OpenAI. We set up the required environment variables and instantiate the AzureOpenAI class.



os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-03-15-preview"
os.environ["OPENAI_API_BASE"] = "Your_Azure_openaiapi_base"
os.environ["OPENAI_API_KEY"] = "your_api_key"



llm = AzureOpenAI(
    deployment_name="your_azure_deployment",
    model_name="your_model_name",
    temperature=0.9)
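Optionally, you can sanity-check the deployment with a one-off completion before going further (this assumes the endpoint, key, and deployment above are valid):

# Quick connectivity check; remove once the setup is verified
print(llm("Explain what an embedding is in one sentence."))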

3. Define your document and split it into chunks

Now, we define the path to the documents we want to embed and query. This step reads the documents from the given path and divides each one into chunks. In this example, we use a PDF document; LangChain’s PyPDFLoader handles the PDF parsing.



directory = "your_document_physical_path"
pages = []
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f) and filename.endswith(".pdf"):
        loader = PyPDFLoader(f)
        pages += loader.load_and_split()  # one Document per page

# Split the pages into ~1000-character chunks for embedding
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(pages)
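Before embedding, it is worth verifying that the split produced sensible chunks, for example:

# Optional: inspect the result of loading and splitting
print(f"Loaded {len(pages)} pages and produced {len(texts)} chunks")
print(texts[0].page_content[:200])  # preview the first chunk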

4. Define embedding

Now, we define the embedding model, which lets us query related data from the given chunks. To store the data in vector form, we use Chroma DB in this example. It allows the embedding model to search for similar data based on the user query.



embeddings = OpenAIEmbeddings(
                deployment="your_embedding_deployment_name",
                model="your_embedding_model"                
            )

docsearch = Chroma.from_texts(
    [t.page_content for t in texts],
    embeddings,
    metadatas=[{"source": i} for i in range(len(texts))]
)
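At this point, you can run a quick retrieval check against the vector store (the query string here is just a placeholder; use a phrase you expect to find in your PDF):

# Optional: confirm retrieval works before wiring up the chain
for doc in docsearch.similarity_search("a phrase from your document", k=2):
    print(doc.metadata["source"], doc.page_content[:100])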

5. Define LangChain’s Q&A chain with an LLM prompt

LangChain provides Q&A chains for retrieving the user’s search content from the PDF. The prompt is an essential part of defining the chain: it instructs the LLM about its role, what to do when a user asks a question, and what form of result is expected. Because our prompt references {chat_history}, we also create a ConversationBufferMemory so the chain can carry context across questions.



from langchain.chains.question_answering import load_qa_chain

template = """You are an AI agent trained to answer questions about Avnet's guidelines.
Given the following extracted parts of a PDF document and a question, create a final answer.
{context}
{chat_history}
Human: {human_input}
Answer:
Chatbot:"""

prompt = PromptTemplate(
    input_variables=["chat_history", "human_input", "context"],
    template=template
)

# Memory that fills the {chat_history} variable in the prompt
memory = ConversationBufferMemory(memory_key="chat_history", input_key="human_input")

6. Document similarity search with LLM QA chain

Once the embedding model and vector store are in place, we find the relevant information using similarity search: we collect the chunks most related to the question the user has asked. The retrieved documents are then passed into the QA chain, and the LLM generates the final answer based on them.



query = "your query from the given document?"
docs = docsearch.similarity_search(query)  # retrieve the most similar chunks
chain = load_qa_chain(llm=llm, chain_type="stuff", memory=memory, prompt=prompt)
answer = chain({"input_documents": docs, "human_input": query}, return_only_outputs=True)
print(answer)
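If you want multi-turn Q&A, a simple loop works, since the ConversationBufferMemory defined earlier keeps prior turns in {chat_history} (a minimal sketch, assuming the chain above is already built):

# Interactive Q&A over the embedded PDF
while True:
    question = input("Ask a question (or type 'quit'): ")
    if question.lower() == "quit":
        break
    docs = docsearch.similarity_search(question)
    result = chain({"input_documents": docs, "human_input": question}, return_only_outputs=True)
    print(result["output_text"])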

Conclusion

In this blog post, we explained how embeddings can be used to extract text from your PDF in a simple question-and-answer form. This is especially helpful when you have large documents and want to filter out the content you are looking for. We demonstrated how LangChain and a vector DB help you answer queries from within your documents.
This approach simplifies the search process for large documents and enhances the user experience. With the code provided, you can integrate this functionality into your applications and give your users a seamless, intuitive search experience.

Softweb Solutions’ generative AI solutions are well-suited for this kind of question-and-answer text extraction from PDFs.

Our generative AI consultants are committed to developing innovative AI solutions that can help you address real-world problems. We are excited to see how our generative AI capabilities can be useful for your organization. Schedule a call now with our experts!
