daiR is an R package for Google Document AI, a powerful server-based OCR processor with support for over 60 languages. The package provides an interface for the Document AI API and comes with additional tools for output file parsing and text reconstruction. See the daiR
website for more details.
Quick OCR short documents:
## NOT RUN
library(daiR)
response <- dai_sync("file.pdf")
text <- text_from_dai_response(response)
cat(text)
Batch process asynchronously via Google Storage:
## NOT RUN
library(googleCloudStorageR)
library(purrr)
my_files <- c("file1.pdf", "file2.pdf", "file3.pdf")
map(my_files, gcs_upload)
dai_async(my_files)
contents <- gcs_list_objects()
output_files <- grep("json$", contents$name, value = TRUE)
map(output_files, ~ gcs_get_object(.x, saveToDisk = file.path(tempdir(), .x)))
sample_text <- text_from_dai_file(file.path(tempdir(), output_files[1]))
cat(sample_text)
Turn images of tables into R dataframes:
Google Document AI is a paid service that requires a Google Cloud account and a Google Storage bucket. I recommend using Mark Edmondson’s googleCloudStorageR
package in combination with daiR
.
Install the latest development version from Github: