Processing PDF documents with R and AI models using tidychatmodels

AI
Author: Albert Rapp

Published: March 17, 2024

AI bots like ChatGPT can be used to extract information from all sorts of unstructured texts. Of course, no one wants to copy and paste the text into a chat interface and then copy the result back into a text file. That’s tedious. In this blog post I’ll show you how to do all of that automatically.

Specifically, we will use the tidychatmodels package to extract information from product reviews. We will first generate some fictitious product reviews and save them as pdf-files. Then we will extract the company, the product, the rating, ways to improve the product, and what is particularly helpful about the product from the pdf-files. And as always, if you want to see the video version of this blog post, you can find it on my YouTube channel.

Create fake product reviews

Alright, let’s start by creating some fictitious product reviews. To do so, we first have to set up a chat with an AI model. Here, let’s just use a simple model from mistral.ai.

Set up a chat

We will follow the steps from my last blog post, which introduced the tidychatmodels package. This means that we will load our API keys from a .env file and then create a chat with the mistral-small-latest model.

library(tidyverse)
library(tidychatmodels)
dotenv::load_dot_env('.env')

create_chat(
    vendor = 'mistral', 
    api_key = Sys.getenv('MISTRAL_DEV_KEY')
  ) |> 
    add_model('mistral-small-latest')
## Chat Engine: mistral 
## Messages: 0 
## Model: mistral-small-latest
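
Before sending anything to the API, it can help to confirm that the key was actually picked up from the .env file. The following is a minimal sketch and not part of tidychatmodels: `check_api_key()` is a hypothetical helper, and the demo sets a dummy value for a made-up variable `DEMO_API_KEY` so that the snippet runs without a real key.

```r
# Hypothetical helper: fail early if an environment variable is missing or empty
check_api_key <- function(var_name) {
  key <- Sys.getenv(var_name)  # returns '' if the variable is not set
  if (!nzchar(key)) {
    stop('Environment variable ', var_name, ' is not set.', call. = FALSE)
  }
  invisible(key)
}

# Demo with a dummy variable (made up for this example)
Sys.setenv(DEMO_API_KEY = 'not-a-real-key')
key_found <- nzchar(check_api_key('DEMO_API_KEY'))
key_found
```

In the real workflow you would call the helper with 'MISTRAL_DEV_KEY' right after loading the .env file.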

Afterwards, we can add some parameters to the chat. Here, let’s just throw in a temperature of 0.8. This should make the model a bit more creative in its responses.

create_chat(
    vendor = 'mistral', 
    api_key = Sys.getenv('MISTRAL_DEV_KEY')
  ) |> 
    add_model('mistral-small-latest') |> 
    add_params(temperature = 0.8)
## Chat Engine: mistral 
## Messages: 0 
## Model: mistral-small-latest 
## Parameters: 
##    temperature: 0.8
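
To get a feeling for what the temperature does, here is a small conceptual sketch. It assumes the common textbook formulation in which the model’s scores for the next token are divided by the temperature before they are turned into probabilities. How mistral.ai implements sampling internally is not covered here. The scores and the function name `softmax_with_temperature()` are made up for illustration.

```r
softmax_with_temperature <- function(scores, temperature = 1) {
  scaled <- scores / temperature
  # Subtract the maximum for numerical stability
  exp_scaled <- exp(scaled - max(scaled))
  exp_scaled / sum(exp_scaled)
}

scores <- c(2, 1, 0.5)  # made-up scores for three candidate tokens

probs_low  <- softmax_with_temperature(scores, temperature = 0.2)
probs_high <- softmax_with_temperature(scores, temperature = 0.8)

round(probs_low, 3)
round(probs_high, 3)
```

With the lower temperature, almost all of the probability sits on the top-scoring token. The higher temperature spreads it out more, which is what makes the responses more varied.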

Then, we add a system message to tell our AI model what we want it to do. Specifically, we want to include a lot of things in our fictitious product reviews. And these things should be mentioned in an unstructured way inside the text. Otherwise, the AI model might just include all the relevant information at the beginning of the text. And that would not be a particularly realistic scenario for real-world applications of text processing.

create_chat(
    vendor = 'mistral', 
    api_key = Sys.getenv('MISTRAL_DEV_KEY')
  ) |> 
    add_model('mistral-small-latest') |> 
    add_params(temperature = 0.8) |> 
    add_message(
      role = 'system',
      message = 'You write 500-word reviews for fictitious products from fictitious companies. You will mention the following things in each review:
      - the name of the company
      - the name of the product
      - a rating (up to 5 stars)
      - ways to improve the product
      - what you find particularly helpful from the product
      
    All of these things are given IN AN UNSTRUCTURED WAY INSIDE THE TEXT. Do not mention these in a header-like structure.'
    )
## Chat Engine: mistral 
## Messages: 1 
## Model: mistral-small-latest 
## Parameters: 
##    temperature: 0.8

Now, all that’s left to do is to add a user message that instructs the AI bot to write a fictitious review for a specific product. And to make sure that we can reuse this code, we will wrap all of this into a function that takes the product as an argument.

product_review <- function(product) {
  chat <- create_chat(
    vendor = 'mistral', 
    api_key = Sys.getenv('MISTRAL_DEV_KEY')
  ) |> 
    add_model('mistral-small-latest') |> 
    add_params(temperature = 0.8) |> 
    add_message(
      role = 'system',
      message = 'You write 500-word reviews for fictitious products from fictitious companies. You will mention the following things in each review:
      - the name of the company
      - the name of the product
      - a rating (up to 5 stars)
      - ways to improve the product
      - what you find particularly helpful from the product
      
    All of these things are given IN AN UNSTRUCTURED WAY INSIDE THE TEXT. Do not mention these in a header-like structure.'
    ) |> 
    add_message(glue::glue('Write a fictitious review for {product}')) |> 
    perform_chat()
  
  extract_chat(chat, silent = TRUE)$message[3]
}

Notice that this function actually performs the chat, i.e. it sends the data to mistral.ai (via perform_chat()), and then extracts the response from the chat (via extract_chat()). With that, we can iterate over a list of products and generate a review for each of them.

reviews_dat <- tibble(
  product = c(
    'a bad tech gadget',
    'a glorious beauty product',
    'a kids toy I bought as a present for my kids which was really fun but fell apart after 1 day',
    'an okay-ish newsletter subscription'
  ),
  review = map_chr(product, product_review)
)

Cool, this should give us a bunch of reviews like this:

reviews_dat |> 
  pluck('review', 1) |> 
  str_sub(1, 500) |>  # Cut off the text after 500 characters for demonstration purposes
  cat()
## In the vast and ever-expanding world of technology, it's not uncommon to stumble upon devices that leave you scratching your head in bewilderment. Today, I find myself in just such a predicament as I share my thoughts on the "TechnoBabble TangleTracker," a product from the relatively obscure company known as HyperLooptic Innovations.
## 
## Now, let me start by saying that the concept behind the TangleTracker is intriguing. This small, palm-sized gadget is designed to untangle your earphones, charging
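
One thing to keep in mind: `map_chr()` stops at the first error, so a single failed request would throw away all reviews generated so far. A minimal sketch of one way around this uses `purrr::possibly()`, which wraps a function so that errors return a fallback value instead. To keep the example runnable without an API key, `fake_review()` is a made-up stand-in for `product_review()` that fails for one input.

```r
library(purrr)

# Made-up stand-in for product_review() that fails for one product
fake_review <- function(product) {
  if (product == 'a broken request') stop('API request failed')
  paste('Review for', product)
}

# Return NA instead of an error so that the iteration continues
safe_review <- possibly(fake_review, otherwise = NA_character_)

demo_reviews <- map_chr(
  c('a bad tech gadget', 'a broken request', 'a glorious beauty product'),
  safe_review
)
demo_reviews
```

The failed entry shows up as NA, so you could filter for missing values afterwards and retry only those products.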

Save reviews as pdf-files

Finally, we should save these reviews as pdf-files. After all, the scenario we want to simulate later is processing PDF files. To do so, we’ll just fill in a Quarto document with the text and then render it as a pdf-file.

walk2(
  reviews_dat$review, 
  seq_along(reviews_dat$review), 
  \(x, y) {
    # Create temporary quarto document
    writeLines(
      c(
        '---',
        'title: "Review"',
        'format: pdf',
        '---',
        x
      ),
      'review.Rmd'
    )
    # Render the document
    quarto::quarto_render(
      'review.Rmd',
      output_format = "pdf",
      output_file = paste0('review', y, '.pdf')
    )
  }
)
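
If you prefer not to overwrite the same file in your working directory on every iteration, the document lines can be assembled in a small helper and written to a temporary file. This is only a sketch: `build_review_doc()` is a hypothetical helper, and the `.qmd` extension is an assumption for this example (the code above uses `review.Rmd`). The rendering step is left out so that the snippet runs without a Quarto installation.

```r
# Hypothetical helper: YAML header plus review text as a character vector
build_review_doc <- function(review_text, title = 'Review') {
  c(
    '---',
    paste0('title: "', title, '"'),
    'format: pdf',
    '---',
    '',
    review_text
  )
}

doc_lines <- build_review_doc('This is a made-up review text.')

# Write to a temporary file instead of the working directory
tmp_file <- tempfile(fileext = '.qmd')
writeLines(doc_lines, tmp_file)
readLines(tmp_file)
```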

Extract information from pdf-files

Nice, we now have a bunch of PDF files, one for each review.

Now, let’s extract

  • the company,
  • product,
  • rating,
  • ways to improve the product, and
  • what is particularly helpful from the product

from the pdf-files. All we have to do is to first extract the text from the pdf-files and then send it to the AI model. For the first part, we can just use the pdftools package.

pdf_text <- pdftools::pdf_text('review1.pdf') |> 
  paste0(collapse = ' ')

# Cut off the text after 500 characters for demonstration purposes
pdf_text |> 
  str_sub(1, 500) |>  
  cat()
##                                          Review
##                                         Albert Rapp
## 
## 
## 
## Table of contents
## 
## In the vast and ever-expanding world of technology, it’s not uncommon to stumble upon
## devices that leave you scratching your head in bewilderment. Today, I find myself in just such
## a predicament as I share my thoughts on the “TechnoBabble TangleTracker,” a product from
## the relatively obscure company known as HyperLooptic Innovations.
## Now, let me start by saying that the conce

Well, this doesn’t look nicely formatted. But it doesn’t matter. Hopefully, our AI model cares only about the content.
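
If the extra whitespace bothers you, it could be collapsed before the text is sent to the model. Here is a minimal sketch with `stringr::str_squish()`, using a short made-up snippet in place of the real PDF text:

```r
library(stringr)

# Made-up snippet that mimics the layout whitespace from pdf_text()
raw_text <- '              Review\n             Albert Rapp\n\n\n\nIn the vast and ever-expanding world of technology,\nit is not uncommon to stumble upon devices.'

# str_squish() trims both ends and collapses repeated whitespace (including line breaks)
clean_text <- str_squish(raw_text)
clean_text
```

Keep in mind that this also removes paragraph breaks, which may or may not matter for your use case.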

To make sure that we get good results, let’s use a large model from Anthropic. The new models from Anthropic are currently hyped in the AI community. So we might as well give them a try. Here’s how we could set up a chat just like before.

create_chat(
    vendor = 'anthropic', 
    api_key = Sys.getenv('ANTHROPIC_DEV_KEY'),
    api_version = '2023-06-01'
  ) |> 
    add_model('claude-3-sonnet-20240229') |> 
    add_params(temperature = 0.2, max_tokens = 1000) |> 
    add_message(
      role = 'system',
      message = 'You are an AI system that excels at extracting relevant information from product reviews. In the following the user will give you the text of a product review and you will extract the following information:
      
      Company: << Insert Name here >>
      
      Product: << Insert Product here >>
      
      Rating: << Insert number here >> stars
      
      Ways to improve: << Insert a short summary in at most 2 sentences here >>
      
      Helpful product features: << Insert a short summary in at most 2 sentences here >>'
    ) |> 
    add_message(pdf_text)
## Chat Engine: anthropic 
## Messages: 2 
## Model: claude-3-sonnet-20240229 
## Parameters: 
##    temperature: 0.2 
##    max_tokens: 1000
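
The printout shows two messages, the system message and the user message. Once the chat is performed, the model’s reply comes back as a third one, which is why the functions in this post pick `$message[3]`. A slightly more defensive sketch selects the reply by role instead of by position. It assumes that `extract_chat()` returns a table with `role` and `message` columns and that the reply has the role 'assistant'. The tibble below is a made-up stand-in so that the snippet runs without an API call, and `last_assistant_message()` is a hypothetical helper.

```r
library(dplyr)

# Made-up stand-in for the output of extract_chat(chat, silent = TRUE)
chat_messages <- tibble(
  role = c('system', 'user', 'assistant'),
  message = c('You extract information.', 'Some review text', 'Company: Example Corp')
)

# Hypothetical helper: take the last message whose role is 'assistant'
last_assistant_message <- function(messages) {
  messages |>
    filter(role == 'assistant') |>
    pull(message) |>
    last()
}

last_assistant_message(chat_messages)
```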

And since we want to reuse this code, let’s wrap it into a function that takes the pdf-file as an argument. And of course, we have to make sure that we actually perform the chat and extract the response.

extract_pdf_information <- function(pdf_file) {
  pdf_text <- pdftools::pdf_text(pdf_file) |> 
    paste0(collapse = ' ')

  chat <- create_chat(
    vendor = 'anthropic', 
    api_key = Sys.getenv('ANTHROPIC_DEV_KEY'),
    api_version = '2023-06-01'
  ) |> 
    add_model('claude-3-sonnet-20240229') |> 
    add_params(temperature = 0.2, max_tokens = 1000) |> 
    add_message(
      role = 'system',
      message = 'You are an AI system that excels at extracting relevant information from product reviews. In the following the user will give you the text of a product review and you will extract the following information:
      
      Company: << Insert Name here >>
      
      Product: << Insert Product here >>
      
      Rating: << Insert number here >> stars
      
      Ways to improve: << Insert a short summary in at most 2 sentences here >>
      
      Helpful product features: << Insert a short summary in at most 2 sentences here >>'
    ) |> 
    add_message(pdf_text) |> 
    perform_chat()
  
  extract_chat(chat, silent = TRUE)$message[3]
}

summaries <- tibble(
  pdf_file = paste0('review', seq_along(reviews_dat$review), '.pdf'),
  summary = map_chr(pdf_file, extract_pdf_information)
)
summaries
## # A tibble: 4 × 2
##   pdf_file    summary                                                           
##   <chr>       <chr>                                                             
## 1 review1.pdf "Company: HyperLooptic Innovations\n\nProduct: TechnoBabble Tangl…
## 2 review2.pdf "Company: Luminosa Cosmetics\n\nProduct: Enchanté Lumière Radiant…
## 3 review3.pdf "Company: WhimsyWonders\n\nProduct: Laugh-A-Lot Lion\n\nRating: 3…
## 4 review4.pdf "Company: Interstellar Ink\n\nProduct: Cosmic Chronicle (daily ne…

Clean up the summaries

Nice, it looks like Claude-3-Sonnet extracted the information and put it into our desired format. Let’s just clean up the summaries a bit so that we can have a better look.

summaries |> 
  separate_wider_delim(
    cols = summary, 
    delim = '\n\n',
    names = c('company', 'product', 'rating', 'improvement_potential', 'helpful_features')
  ) |> 
  mutate(
    across(
      company:helpful_features,
      \(x) str_remove_all(
        x, 
        'Company: |Product: |Rating: |Ways to improve: |Helpful product features: '
      )
    )
  ) 
## # A tibble: 4 × 6
##   pdf_file    company      product rating improvement_potential helpful_features
##   <chr>       <chr>        <chr>   <chr>  <chr>                 <chr>           
## 1 review1.pdf HyperLoopti… Techno… 2 sta… The grooves need to … It's surprising…
## 2 review2.pdf Luminosa Co… Enchan… 4.5 s… The price could be m… It contains nou…
## 3 review3.pdf WhimsyWonde… Laugh-… 3 sta… The product could be… The toy's inter…
## 4 review4.pdf Interstella… Cosmic… 3 sta… Establish a more rig… Provides a dive…
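
The rating is still a character column with values like '2 stars'. If you want to compute with it, one option is `readr::parse_number()`, which pulls the first number out of a string. Here is a minimal sketch with ratings typed in by hand. The table above truncates the values, so the exact strings are an assumption based on the format requested in the system message.

```r
library(readr)

# Ratings in the format requested in the system message
ratings_chr <- c('2 stars', '4.5 stars', '3 stars', '3 stars')

ratings_num <- parse_number(ratings_chr)
ratings_num
```

Inside the pipeline above, the same call could go into a `mutate()` step.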

Nice, with that we have extracted the information from the pdf-files. You could verify the results by looking at the pdf-files and comparing the extracted information with the actual content. But I’m pretty sure that the results are good enough for our fictitious scenario. Let’s leave it at that for now.

Conclusion

We used the fancy AI stuff to extract information from pdf-files. As always with AI models, your mileage may vary. Still, I hope you enjoyed this little tutorial. Have a great day and see you next time.