
cardboardci/pdftools

cardboardci/pdftools is a Docker image built with continuous integration in mind. Each tag contains the binaries and tools that builds need to complete successfully in a continuous integration environment, including jq, curl, bash, and utilities for working with PDFs.

Scientific articles are typically locked away in PDF, a format designed primarily for printing and poorly suited to searching or indexing. The new pdftools package extracts text and metadata from PDF files in R. From the extracted plain text, one could find articles discussing a particular drug or species name without relying on publisher-provided metadata or paywalled search engines.

You can see the CLI reference here.

Getting Started

This image can be used on continuous integration platforms that support running steps in a Docker container. For example:

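# GitHub Actions
jobs:
    my_first_job:
        # A runner is required for every job; ubuntu-latest is one common choice.
        runs-on: ubuntu-latest
        steps:
            - name: My first step
              # Run the step inside the pdftools image.
              uses: docker://ghcr.io/cardboardci/pdftools:edge
              with:
                  args: "pdftools --version"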

Pull latest image

The most recent build of the image is published under the tag edge. It isn't intended for long-term use, but for working with the latest version of the image. To pull it, run the following:

docker pull ghcr.io/cardboardci/pdftools:edge
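Because edge moves with each build, it can help to record which build a pull actually fetched. The following is a minimal sketch, not part of the image itself: the function name image_digest is hypothetical, and it assumes the image has already been pulled locally.

```shell
# Image reference taken from the pull command above.
IMAGE="ghcr.io/cardboardci/pdftools:edge"

# Hypothetical helper: print the repository digest that the local copy
# of the edge tag resolves to. Requires a prior `docker pull`.
image_digest() {
    docker inspect --format '{{index .RepoDigests 0}}' "$IMAGE"
}
```

Calling image_digest after a pull prints a reference containing an @sha256 digest, which Docker accepts in place of a tag when a fixed build is needed.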

Test interactively

Sometimes it is useful to run the image in an interactive shell for experimentation. To open a shell in the image, run the following:

docker run -it ghcr.io/cardboardci/pdftools:edge /bin/bash

Run a basic command

To run a single command inside the image, with the current directory mounted at /workspace, run the following:

docker run -it -v "$(pwd)":/workspace ghcr.io/cardboardci/pdftools:edge pdftools --version
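For repeated use, the same pattern can be wrapped in a small shell helper. This is a minimal sketch under stated assumptions: the function name run_in_image and the DRY_RUN variable are hypothetical, and --rm is used instead of -it on the assumption that CI runners have no TTY attached.

```shell
# Image reference taken from the commands above.
IMAGE="ghcr.io/cardboardci/pdftools:edge"

# Hypothetical helper: run a command inside the image with the current
# directory mounted at /workspace. When DRY_RUN=1 is set, the docker
# command is printed instead of executed.
run_in_image() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker run --rm -v $(pwd):/workspace $IMAGE $*"
    else
        docker run --rm -v "$(pwd)":/workspace "$IMAGE" "$@"
    fi
}
```

For example, run_in_image pdftools --version runs the version check from above, and prefixing the call with DRY_RUN=1 shows the command that would be executed.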

Fundamentals

All images in the CardboardCI namespace are built from cardboardci/base. The base image is intended to provide a common set of dependencies and expectations about how the images will behave. Each image is always built from the base image, so that any changes made in the base are included in the downstream image.