Enhancing Python Projects with AI: A Claude 3 API Tutorial

Apr 03, 2024

Generative AI is becoming increasingly prevalent, akin to the internet revolution of the 2000s. As a programmer, it’s vital to grasp the opportunities AI offers and integrate them into your projects.

In this article series, you will learn to incorporate artificial intelligence into your Python projects, ranging from a simple API integration to more complex projects.

In this first article, you’ll create a Python script that uses the Claude 3 API to streamline invoice processing. At the time of writing, Claude 3 is among the leading AI models on various benchmarks, surpassing OpenAI models such as GPT-3.5.

Step 1: Import Libraries

import os
import re
import pandas as pd
from pdfminer.high_level import extract_text
import anthropic

Import essential libraries for your Python script:

os for interacting with the operating system
re for regular expressions
pandas for handling data
extract_text from pdfminer.high_level for extracting text from PDF files (the package is installed from PyPI as pdfminer.six)
anthropic for interfacing with the Claude 3 API

Step 2: Define Function to Extract Text from PDF

def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text

Create a function to extract text from a PDF file.
The pdfminer package is generally accurate at extracting text from PDF files and handles complex layouts and fonts well.
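Text extracted from a PDF can contain page-break characters, blank lines, and uneven spacing. The sketch below shows one possible way to tidy it before sending it to the model. The helper name normalize_pdf_text is hypothetical and not part of pdfminer, and the cleanup rules are an assumption about what is useful for invoices.

```python
import re


def normalize_pdf_text(raw_text: str) -> str:
    """Tidy extracted PDF text (hypothetical helper, not part of pdfminer)."""
    # str.splitlines() treats newlines and form feeds (page breaks) as line boundaries.
    lines = raw_text.splitlines()
    # Collapse runs of spaces and tabs inside each line and trim both ends.
    cleaned = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    # Drop lines that are empty after cleaning.
    return "\n".join(line for line in cleaned if line)
```

It could be applied to the return value of extract_text_from_pdf() before the text is passed on.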

Step 3: Feed The Text to The AI Model

def extract_invoice_data(pdf_text):
    # Setting up Claude 3 API
    client = anthropic.Anthropic(
        # Your API key here
        api_key="YOUR_API_KEY",
    )

    # Processing invoice text with Claude 3
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1000,
        temperature=0.0,
        system="extract invoice price, date and number",
        messages=[
            {"role": "user", "content": pdf_text},
            {"role": "assistant", "content": "Invoice price, date and number is:"}
        ]
    )

    # Extracting relevant information using regular expressions
    input_string = message.content[0].text.strip()
    invoice_number = re.search(r"Invoice number: (\d+)", input_string).group(1)
    invoice_date = re.search(r"Invoice date: (\d{2}-\d{2}-\d{2})", input_string).group(1)
    invoice_total_value = re.search(r'Invoice total value:\s*EUR\s*([\d\.,]+)', input_string).group(1)

    # Combining extracted information into a dictionary
    data = {
        "Invoice number": invoice_number,
        "Invoice date": invoice_date,
        "Invoice total value": invoice_total_value
    }

    return data

This function takes the text extracted from a PDF as input and uses the Claude 3 API to extract the key invoice details: the invoice number, date, and total value. You can obtain an API key from Anthropic, the company behind Claude 3; see the Claude 3 API documentation.
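Hard-coding the key in the script makes it easy to leak when the file is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper name load_api_key is hypothetical; ANTHROPIC_API_KEY is the variable name the anthropic client looks for by default when no api_key argument is given.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Read the API key from the environment (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a readable message instead of a failed API call later.
        raise RuntimeError(f"Set the {env_var} environment variable before running the script.")
    return key
```

The result could then be passed as api_key=load_api_key() when creating the client.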

Once you have your API key, you’ll need to configure the request as follows:

Model Selection: The model is set to claude-3-opus-20240229, the Opus model, which is the most capable of the Claude 3 family at the time of writing.
Max Tokens: This parameter caps the number of tokens (word pieces, not whole words) the model can generate in its reply.
Temperature: This parameter controls the randomness of the output. Setting it to zero makes responses as consistent as possible, though not necessarily identical on every run.
System Prompt: Here, you specify the fields you’re interested in (price, number, and date), directing the AI accordingly.
Message Interaction: The messages list uses two roles:
User: Provides the AI with the user input (the text extracted from the PDF).
Assistant: Prefills the start of the AI’s reply ("Invoice price, date and number is:"), nudging it to return only the extracted invoice number, date, and price.
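The parameters above can be gathered in one place, as in the sketch below. The helper name build_invoice_request is hypothetical, and the longer system prompt is an assumption: it spells out the reply format that the regular expressions in the function expect. The day-month-year order in the date placeholder is also an assumption, since the original pattern only requires three two-digit groups.

```python
def build_invoice_request(pdf_text: str) -> dict:
    """Build keyword arguments for client.messages.create (hypothetical helper)."""
    # Assumption: asking for fixed labels makes the reply easier to parse.
    system_prompt = (
        "Extract the invoice number, date and total value from the invoice text. "
        "Reply with exactly three lines:\n"
        "Invoice number: <digits>\n"
        "Invoice date: <DD-MM-YY>\n"
        "Invoice total value: EUR <amount>"
    )
    return {
        "model": "claude-3-opus-20240229",
        "max_tokens": 1000,   # upper bound on generated tokens
        "temperature": 0.0,   # reduce randomness
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": pdf_text},
            # Prefilled start of the reply, same text as in the article.
            {"role": "assistant", "content": "Invoice price, date and number is:"},
        ],
    }
```

With a client created as in the function above, the request would be sent with client.messages.create(**build_invoice_request(pdf_text)).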

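Regular expressions then pull the specific details out of the model’s reply. They assume the reply contains labels such as "Invoice number:", a date made of three two-digit groups separated by dashes, and a total given in EUR. Once extracted, the data is consolidated into a dictionary and returned for further processing.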
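When a pattern does not match, re.search returns None, so calling .group(1) directly raises an AttributeError. The sketch below applies the same three patterns but records None for any field it cannot find. The names FIELD_PATTERNS and parse_invoice_fields are hypothetical, and allowing optional whitespace after each colon is a small assumption beyond the original patterns.

```python
import re

# Same patterns as in the article, with optional whitespace after the colon.
FIELD_PATTERNS = {
    "Invoice number": r"Invoice number:\s*(\d+)",
    "Invoice date": r"Invoice date:\s*(\d{2}-\d{2}-\d{2})",
    "Invoice total value": r"Invoice total value:\s*EUR\s*([\d.,]+)",
}


def parse_invoice_fields(reply_text: str) -> dict:
    """Return the captured fields, using None where a pattern does not match."""
    data = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, reply_text)
        data[field] = match.group(1) if match else None
    return data
```

Rows with missing values can then be reviewed by hand instead of stopping the whole run.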

Step 4: Iterate Over PDF Files in Folder and Extract Data

folder_path = r"C:\YourFolder"
all_invoice_data = pd.DataFrame(columns=["Invoice number", "Invoice date", "Invoice total value"])
lst = []
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        pdf_path = os.path.join(folder_path, filename)
        pdf_text = extract_text_from_pdf(pdf_path)
        invoice_data = extract_invoice_data(pdf_text)
        lst.append(invoice_data)
all_invoice_data = pd.DataFrame(lst)

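You specify the path to the folder containing the PDF invoices and initialize an empty DataFrame with the expected columns. The loop collects each invoice’s data in a list, and the DataFrame is rebuilt from that list at the end, so the empty one serves only as a placeholder.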

Iterate Over Files: Go through each file in the specified folder.

Check if the file ends with “.pdf”.
If it does, move on to the next step.

Extract Text: Use the extract_text_from_pdf() function to get text from the PDF file.

Extract Invoice Data: Utilize the extract_invoice_data() function to extract invoice details from the text.

Append Data to List: Add the extracted invoice data to a list.

Convert to DataFrame: Transform the list of extracted invoice data into a DataFrame using pd.DataFrame().
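The steps above can also be sketched with pathlib, which makes it easy to match the .pdf extension regardless of case and to skip a file that fails without stopping the run. The function name collect_invoice_rows, the process_pdf callback, and the extra "File" column are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable


def collect_invoice_rows(folder: str, process_pdf: Callable[[Path], dict]) -> list:
    """Run process_pdf on every PDF in folder and collect the resulting rows."""
    rows = []
    for path in sorted(Path(folder).iterdir()):
        # Accept ".pdf" and ".PDF"; ignore subfolders and other file types.
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        try:
            row = process_pdf(path)
        except Exception as error:
            # Report the problem and continue with the next file.
            print(f"Skipping {path.name}: {error}")
            continue
        # Keep the file name so each row can be traced back to its invoice.
        rows.append({"File": path.name, **row})
    return rows
```

Here process_pdf could be lambda p: extract_invoice_data(extract_text_from_pdf(str(p))), and the returned list can be passed to pd.DataFrame() as before.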

If you enjoy this content, follow me for more. Remember, it’s not just about the outcome, but the journey itself!

Full Code:

# -*- coding: utf-8 -*-
import os
import re
import pandas as pd
from pdfminer.high_level import extract_text
import anthropic

# Function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text

# Function to extract invoice data
def extract_invoice_data(pdf_text):
    # Setting up Claude 3 API
    client = anthropic.Anthropic(
        # Your API key here
        api_key="YOUR_API_KEY",
    )

    # Processing invoice text with Claude 3
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1000,
        temperature=0.0,
        system="extract invoice price, date and number",
        messages=[
            {"role": "user", "content": pdf_text},
            {"role": "assistant", "content": "Invoice price, date and number is:"}
        ]
    )

    # Extracting relevant information
    input_string = message.content[0].text.strip()
    invoice_number = re.search(r"Invoice number: (\d+)", input_string).group(1)
    invoice_date = re.search(r"Invoice date: (\d{2}-\d{2}-\d{2})", input_string).group(1)
    invoice_total_value = re.search(r'Invoice total value:\s*EUR\s*([\d\.,]+)', input_string).group(1)

    # Putting it all together
    data = {
        "Invoice number": invoice_number,
        "Invoice date": invoice_date,
        "Invoice total value": invoice_total_value
    }

    return data

# Path to the folder containing PDF invoices
folder_path = r"C:\YourFolder"

# Initialize an empty DataFrame
all_invoice_data = pd.DataFrame(columns=["Invoice number", "Invoice date", "Invoice total value"])

# Iterating over each file in the folder
lst = []
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        pdf_path = os.path.join(folder_path, filename)
        pdf_text = extract_text_from_pdf(pdf_path)
        invoice_data = extract_invoice_data(pdf_text)
        lst.append(invoice_data)

# Putting all the data together
all_invoice_data = pd.DataFrame(lst)

# Showing the final result
print(all_invoice_data)
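The values in the printed DataFrame are strings, including the total, which may use either a comma or a period as the decimal separator. If numeric totals are needed, one possible conversion is sketched below. The helper name parse_amount is hypothetical, and it assumes the right-most separator is the decimal one, so a value such as "1.234" with a single separator stays ambiguous.

```python
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Convert a captured amount such as '1.234,56' to a Decimal (hypothetical helper)."""
    # The pattern [\d.,]+ can capture trailing punctuation, so remove it first.
    cleaned = raw.strip().rstrip(".,")
    if cleaned.rfind(",") > cleaned.rfind("."):
        # Comma is the right-most separator: treat periods as thousands marks.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Period is the right-most separator (or there is none): drop commas.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)
```

It could be applied to the "Invoice total value" column, for example with all_invoice_data["Invoice total value"].map(parse_amount).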

Edward’s Substack
