library(tidyverse)
library(tidychatmodels)
dotenv::load_dot_env('.env')
create_chat(
  vendor = 'mistral',
  api_key = Sys.getenv('MISTRAL_DEV_KEY')
) |>
  add_model('mistral-small-latest')
## Chat Engine: mistral
## Messages: 0
## Model: mistral-small-latest
Processing PDF documents with R and AI models using tidychatmodels
AI bots like ChatGPT and others can be used to extract information from all sorts of unstructured texts. Of course, no one wants to copy and paste the text into a chat interface and then copy the result back into a text file. That’s tedious. In this blog post I’ll show you how to do all of that automatically.
Specifically, we will use the tidychatmodels package to extract information from product reviews. We will first generate some fictitious product reviews and save them as pdf-files. Then we will extract the company, product, rating, ways to improve the product, and what is particularly helpful about the product from the pdf-files. And as always, if you want to see the video version of this blog post, you can find it on my YouTube channel.
Create fake product reviews
Alright, let’s start by creating some fictitious product reviews. To do so, we first have to set up a chat with an AI model. Here, let’s just use a simple model from mistral.ai.
Set up a chat
We will follow the steps from my last blog post, which introduced the tidychatmodels package. This means that we will load our API keys from a .env file and then create a chat with the mistral-small-latest model.
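If the key is missing from the .env file, Sys.getenv() silently returns an empty string and the problem only surfaces once a request is sent. A minimal sketch of an early check could look like this. Note that require_env_key() and DEMO_API_KEY are hypothetical names made up for illustration; they are not part of tidychatmodels or dotenv.

```r
# Hypothetical helper (not part of tidychatmodels): fail early if a key is missing
require_env_key <- function(name) {
  value <- Sys.getenv(name, unset = '')
  if (!nzchar(value)) {
    stop('Environment variable ', name, ' is not set. Check your .env file.')
  }
  invisible(value)
}

# Demonstration with a dummy variable so that the snippet runs without a real key
Sys.setenv(DEMO_API_KEY = 'not-a-real-key')
require_env_key('DEMO_API_KEY')
```

In the setup used here, you would call require_env_key('MISTRAL_DEV_KEY') right after loading the .env file.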
Afterwards, we can add some parameters to the chat. Here, let’s just throw in a temperature of 0.8. This should make the model a bit more creative in its responses.
create_chat(
  vendor = 'mistral',
  api_key = Sys.getenv('MISTRAL_DEV_KEY')
) |>
  add_model('mistral-small-latest') |>
  add_params(temperature = 0.8)
## Chat Engine: mistral
## Messages: 0
## Model: mistral-small-latest
## Parameters:
## temperature: 0.8
Then, we add a system message that tells our AI model what we want it to do. Specifically, we want to include a lot of things in our fictitious product reviews. And these things should be mentioned in an unstructured way inside the text. Otherwise, the AI model might just include all the relevant information at the beginning of the text. And that would not be a particularly realistic scenario for real-world applications of text processing.
create_chat(
  vendor = 'mistral',
  api_key = Sys.getenv('MISTRAL_DEV_KEY')
) |>
  add_model('mistral-small-latest') |>
  add_params(temperature = 0.8) |>
  add_message(
    role = 'system',
    message = 'You write 500-word reviews for fictitious products from fictitious companies. You will mention the following things in each review:
- the name of the company
- the name of the product
- a rating (up to 5 stars)
- ways to improve the product
- what you find particularly helpful from the product
All of these things are given IN AN UNSTRUCTURED WAY INSIDE THE TEXT. Do not mention these in a header-like structure.'
  )
## Chat Engine: mistral
## Messages: 1
## Model: mistral-small-latest
## Parameters:
## temperature: 0.8
Now, all that’s left to do is to add a user message that instructs the AI bot to write a fictitious review for a specific product. And to make sure that we can reuse this code, we will wrap all of this into a function that takes the product as an argument.
product_review <- function(product) {
  chat <- create_chat(
    vendor = 'mistral',
    api_key = Sys.getenv('MISTRAL_DEV_KEY')
  ) |>
    add_model('mistral-small-latest') |>
    add_params(temperature = 0.8) |>
    add_message(
      role = 'system',
      message = 'You write 500-word reviews for fictitious products from fictitious companies. You will mention the following things in each review:
- the name of the company
- the name of the product
- a rating (up to 5 stars)
- ways to improve the product
- what you find particularly helpful from the product
All of these things are given IN AN UNSTRUCTURED WAY INSIDE THE TEXT. Do not mention these in a header-like structure.'
    ) |>
    add_message(glue::glue('Write a fictitious review for {product}')) |>
    perform_chat()
  # The reply is the third message (after the system and user messages)
  extract_chat(chat, silent = TRUE)$message[3]
}
Notice that this function actually performs the chat, i.e. it sends the data to mistral.ai (via perform_chat()) and then extracts the response from the chat (via extract_chat()). And with that we can iterate over a list of products and generate a review for each of them.
reviews_dat <- tibble(
  product = c(
    'a bad tech gadget',
    'a glorious beauty product',
    'a kids toy I bought as a present for my kids which was really fun but fell apart after 1 day',
    'an okay-ish newsletter subscription'
  ),
  review = map_chr(product, product_review)
)
Cool, this should give us a bunch of reviews like this:
reviews_dat |>
  pluck('review', 1) |>
  str_sub(1, 500) |> # Cut off the text after 500 characters for demonstration purposes
  cat()
## In the vast and ever-expanding world of technology, it's not uncommon to stumble upon devices that leave you scratching your head in bewilderment. Today, I find myself in just such a predicament as I share my thoughts on the "TechnoBabble TangleTracker," a product from the relatively obscure company known as HyperLooptic Innovations.
##
## Now, let me start by saying that the concept behind the TangleTracker is intriguing. This small, palm-sized gadget is designed to untangle your earphones, charging
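Each call to product_review() goes over the network, so a single failed request would abort the whole map_chr() call. One possible safeguard, sketched here under the assumption that a missing review is acceptable, is to wrap the function with purrr::possibly(). The function fake_review() is a made-up stand-in so that the snippet runs without an API key.

```r
library(purrr)

# Stand-in for product_review() so that the sketch runs without an API key.
# It fails for one input to mimic a failed request.
fake_review <- function(product) {
  if (product == 'a bad tech gadget') stop('simulated API error')
  paste('Review of', product)
}

# possibly() returns a version of the function that yields NA instead of an error
safe_review <- possibly(fake_review, otherwise = NA_character_)

reviews <- map_chr(
  c('a bad tech gadget', 'a glorious beauty product'),
  safe_review
)
reviews
```

With the real function, you would use possibly(product_review, otherwise = NA_character_) inside map_chr() and check afterwards which reviews are NA.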
Save reviews as pdf-files
Finally, we should save these reviews as pdf-files. After all, the scenario we are simulating is that we receive PDF files and have to process them. To do so, we’ll just fill in a Quarto document with the text and then render it as a pdf-file.
walk2(
  reviews_dat$review,
  seq_along(reviews_dat$review),
  \(x, y) {
    # Create temporary quarto document
    writeLines(
      c(
        '---',
        'title: "Review"',
        'format: pdf',
        '---',
        x
      ),
      'review.Rmd'
    )
    # Render the document
    quarto::quarto_render(
      'review.Rmd',
      output_format = "pdf",
      output_file = paste0('review', y, '.pdf')
    )
  }
)
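To see what the intermediate document looks like without rendering anything, here is a small sketch that only builds and writes the file. The helper build_review_doc() and the use of tempfile() are my own assumptions for illustration; the rendering step is left out on purpose, so you would still need to check how quarto_render() behaves with your own paths.

```r
# Hypothetical helper: assemble the lines of a minimal Quarto document
build_review_doc <- function(text, title = 'Review') {
  c(
    '---',
    paste0('title: "', title, '"'),
    'format: pdf',
    '---',
    '',
    text
  )
}

# Write to a temporary file instead of the working directory
doc_path <- tempfile(fileext = '.qmd')
writeLines(build_review_doc('This gadget is fine.'), doc_path)
readLines(doc_path)
```

The YAML header between the two '---' lines is all that Quarto needs here; everything after it is treated as the body of the document.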
Extract information from pdf-files
Nice, we have a bunch of PDF files that look like this now:
Now, let’s extract
- the company,
- product,
- rating,
- ways to improve the product, and
- what is particularly helpful about the product
from the pdf-files. All we have to do is to first extract the text from the pdf-files and then send it to the AI model. For the first part, we can just use the pdftools package.
pdf_text <- pdftools::pdf_text('review1.pdf') |>
  paste0(collapse = ' ')
# Cut off the text after 500 characters for demonstration purposes
pdf_text |>
  str_sub(1, 500) |>
  cat()
## Review
## Albert Rapp
##
##
##
## Table of contents
##
## In the vast and ever-expanding world of technology, it’s not uncommon to stumble upon
## devices that leave you scratching your head in bewilderment. Today, I find myself in just such
## a predicament as I share my thoughts on the “TechnoBabble TangleTracker,” a product from
## the relatively obscure company known as HyperLooptic Innovations.
## Now, let me start by saying that the conce
Well, this doesn’t look nicely formatted. But it doesn’t matter. Hopefully, our AI model cares only about the content.
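If the layout whitespace ever did get in the way, one option would be to collapse it before sending the text. This is only a sketch on a made-up string that mimics the output above; it is not a step used in the rest of this post.

```r
library(stringr)

# Made-up snippet that mimics the layout whitespace returned by pdftools::pdf_text()
raw_text <- 'Review\nAlbert Rapp\n\n\n\nTable of contents\n\nIn the vast   world of technology'

# str_squish() trims the ends and collapses internal whitespace runs into single spaces
clean_text <- str_squish(raw_text)
clean_text
```

Keep in mind that this also removes paragraph breaks, which may or may not be what you want.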
To make sure that we get good results, let’s use a large model from Anthropic. The new models from Anthropic are currently hyped in the AI community. So we might as well give them a try. Here’s how we could set up a chat just like before.
create_chat(
  vendor = 'anthropic',
  api_key = Sys.getenv('ANTHROPIC_DEV_KEY'),
  api_version = '2023-06-01'
) |>
  add_model('claude-3-sonnet-20240229') |>
  add_params(temperature = 0.2, max_tokens = 1000) |>
  add_message(
    role = 'system',
    message = 'You are an AI system that excels at extracting relevant information from product reviews. In the following the user will give you the text of a product review and you will extract the following information:
Company: << Insert Name here >>
Product: << Insert Product here >>
Rating: << Insert number here >> stars
Ways to improve: << Insert a short summary in at most 2 sentences here >>
Helpful product features: << Insert a short summary in at most 2 sentences here >>'
  ) |>
  add_message(pdf_text)
## Chat Engine: anthropic
## Messages: 2
## Model: claude-3-sonnet-20240229
## Parameters:
## temperature: 0.2
## max_tokens: 1000
And since we want to reuse this code, let’s wrap it into a function that takes the pdf-file as an argument. And of course, we have to make sure that we actually perform the chat and extract the response.
extract_pdf_information <- function(pdf_file) {
  pdf_text <- pdftools::pdf_text(pdf_file) |>
    paste0(collapse = ' ')
  chat <- create_chat(
    vendor = 'anthropic',
    api_key = Sys.getenv('ANTHROPIC_DEV_KEY'),
    api_version = '2023-06-01'
  ) |>
    add_model('claude-3-sonnet-20240229') |>
    add_params(temperature = 0.2, max_tokens = 1000) |>
    add_message(
      role = 'system',
      message = 'You are an AI system that excels at extracting relevant information from product reviews. In the following the user will give you the text of a product review and you will extract the following information:
Company: << Insert Name here >>
Product: << Insert Product here >>
Rating: << Insert number here >> stars
Ways to improve: << Insert a short summary in at most 2 sentences here >>
Helpful product features: << Insert a short summary in at most 2 sentences here >>'
    ) |>
    add_message(pdf_text) |>
    perform_chat()
  extract_chat(chat, silent = TRUE)$message[3]
}
summaries <- tibble(
  pdf_file = paste0('review', seq_along(reviews_dat$review), '.pdf'),
  summary = map_chr(pdf_file, extract_pdf_information)
)
summaries
## # A tibble: 4 × 2
## pdf_file summary
## <chr> <chr>
## 1 review1.pdf "Company: HyperLooptic Innovations\n\nProduct: TechnoBabble Tangl…
## 2 review2.pdf "Company: Luminosa Cosmetics\n\nProduct: Enchanté Lumière Radiant…
## 3 review3.pdf "Company: WhimsyWonders\n\nProduct: Laugh-A-Lot Lion\n\nRating: 3…
## 4 review4.pdf "Company: Interstellar Ink\n\nProduct: Cosmic Chronicle (daily ne…
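Both helper functions in this post pick the reply with $message[3]. That presumably works because the chat holds exactly three messages at that point (system prompt, user message, reply). A slightly more defensive variant is sketched below on a toy tibble. It assumes that the result of extract_chat() has a role column next to message, which is worth verifying on your own output before relying on it.

```r
library(dplyr)

# Toy stand-in for the result of extract_chat(); the `role` column is an assumption
chat_tbl <- tibble(
  role = c('system', 'user', 'assistant'),
  message = c('You extract information.', 'Some review text', 'Company: Acme')
)

# Pick the last assistant message instead of hard-coding position 3
reply <- chat_tbl |>
  filter(role == 'assistant') |>
  slice_tail(n = 1) |>
  pull(message)
reply
```

This keeps working if you later add more messages to the chat before performing it.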
Clean up the summaries
Nice, it looks like Claude-3-Sonnet extracted the information and put it into our desired format. Let’s just clean up the summaries a bit so that we can have a better look.
summaries |>
  separate_wider_delim(
    cols = summary,
    delim = '\n\n',
    names = c('company', 'product', 'rating', 'improvement_potential', 'helpful_features')
  ) |>
  mutate(
    across(
      company:helpful_features,
      \(x) str_remove_all(
        x, 'Company: |Product: |Rating: |Ways to improve: |Helpful product features: '
      )
    )
  )
## # A tibble: 4 × 6
## pdf_file company product rating improvement_potential helpful_features
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 review1.pdf HyperLoopti… Techno… 2 sta… The grooves need to … It's surprising…
## 2 review2.pdf Luminosa Co… Enchan… 4.5 s… The price could be m… It contains nou…
## 3 review3.pdf WhimsyWonde… Laugh-… 3 sta… The product could be… The toy's inter…
## 4 review4.pdf Interstella… Cosmic… 3 sta… Establish a more rig… Provides a dive…
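Splitting on '\n\n' relies on the model always separating the five fields with exactly one blank line. If that ever turns out to be too brittle, a label-based extraction is one alternative. The following is a sketch on a made-up summary; extract_field() is a hypothetical helper, and turning the rating into a number with readr::parse_number() is an extra step that the code above does not do.

```r
library(stringr)

# Made-up summary in the format requested by the system message
summary_text <- 'Company: Acme Corp\n\nProduct: Rocket Skates\n\nRating: 4.5 stars\n\nWays to improve: Add brakes.\n\nHelpful product features: Very fast.'

# Hypothetical helper: grab the rest of the line that follows a given label.
# Returns NA if the label does not occur in the text.
extract_field <- function(text, label) {
  str_match(text, paste0(label, ': (.*)'))[, 2]
}

company <- extract_field(summary_text, 'Company')
# parse_number() keeps the numeric part of a string such as '4.5 stars'
rating <- readr::parse_number(extract_field(summary_text, 'Rating'))
company
rating
```

Because a missing label yields NA instead of an error, incomplete answers from the model show up as gaps in the table rather than stopping the pipeline.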
Nice, with that we have extracted the information from the pdf-files. You could verify the results by looking at the pdf-files and comparing the extracted information with the actual content. But I’m pretty sure that the results are good enough for our fictitious scenario. Let’s leave it at that for now.
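Part of that verification could be automated. A rough sketch, using made-up stand-ins for the PDF text and the extracted values, is to check whether each extracted name literally occurs in the text:

```r
library(stringr)

# Made-up stand-ins for the PDF text and the extracted values
pdf_text_demo <- 'I tried the Rocket Skates from Acme Corp and liked them.'
extracted <- c(company = 'Acme Corp', product = 'Rocket Skates', other = 'Turbo Boots')

# fixed() treats the extracted value as a literal string, not as a regex
found <- vapply(
  extracted,
  \(value) str_detect(pdf_text_demo, fixed(value)),
  logical(1)
)
found
```

This only works for fields that are copied verbatim, such as company and product names. The two summary fields are paraphrased by the model, so they still need a human look. With real PDF text, a name that is wrapped across two lines would also be missed unless the whitespace is normalized first.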
Conclusion
We used the fancy AI stuff to extract information from pdf-files. As always with AI models, your mileage may vary. Still, I hope you enjoyed this little tutorial. Have a great day and see you next time. And if you found this helpful, here are some other ways I can help you:
- 3 Minute Wednesdays: A weekly newsletter with bite-sized tips and tricks for R users
- Insightful Data Visualizations for “Uncreative” R Users: A course that teaches you how to leverage {ggplot2} to make charts that communicate effectively without being a design expert.