Simplified PDF Data Extraction

PDF Mantis

What is PDF Mantis

PDF Mantis is a Java-based, high-level API which aims to simplify PDF data extraction.

It provides a unified set of extraction features:

  • Simple text extraction
  • Identification of text coordinates, font and colour
  • Detection and extraction of images by page or page coordinates
  • Extraction of standard metadata fields, as well as exposure of hidden ones
  • Detection and extraction of simple data tables
  • Straightforward optical character recognition for scanned PDFs

Available for use on Windows, Linux and macOS.

Why was PDF Mantis created and who is it for

Despite not being particularly well regarded by the tech community, the PDF is one of the world's most popular document formats. As a result, it is not uncommon to find PDF generation nestled somewhere within a typical business's internal systems.

The idea for PDF Mantis came about while performing some work at one such client. The end output of their system was a PDF which was sent directly to their customers, and we were tasked with writing a test automation framework to validate these outputs.

After scouring GitHub for suitable OSS candidates, we managed to find a collection of different libraries which would help us complete the task.

The problem was that it took quite a bit of work to become familiar with the nuances of the libraries in conjunction with the idiosyncrasies of PDFs. After a lot of trial and error, we ended up with a solid bunch of helper classes which performed common extraction tasks, such as analysing text.

We couldn't help thinking at the time that this should be easier, much easier. We only wanted to extract from PDFs, not build them!

And that's what PDF Mantis aims to be: a simple way to extract data from PDFs without necessarily understanding the complexities of PDFs.

Requirements

PDF Mantis requires Java 8 or later, along with either Maven or Gradle.

Installation

Maven

For Maven, add the below dependency to your pom.xml file:

<dependency>
    <groupId>com.graysonnorland.pdfmantis</groupId>
    <artifactId>pdf-mantis</artifactId>
    <version>0.0.1</version>
    <scope>test</scope>
</dependency>

Gradle

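Alternatively for Gradle, add the following to your build.gradle file (on Gradle 7 and later, use testImplementation in place of testCompile, which was removed in Gradle 7.0):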

testCompile 'com.graysonnorland.pdfmantis:pdf-mantis:0.0.1'

Usage

Loading a PDF

First things first, you need to create a PdfMantis object, which effectively represents the PDF. It is from this object that you will be able to access all extraction features.

There are several ways you can load a PDF:

// Load by String path
PdfMantis pdf = new PdfMantis("/home/example.pdf");

// Load by File object
PdfMantis pdf = new PdfMantis(new File("/home/example.pdf"));

// Load by URL
PdfMantis pdf = new PdfMantis(new URL("https://www.somewebpage.com/example.pdf"));

// Load by Input Stream
PdfMantis pdf = new PdfMantis(this.getClass().getResourceAsStream("example.pdf"));

If your PDF is encrypted, just provide the password as a second parameter like so:

// Load encrypted PDF by String path
PdfMantis pdf = new PdfMantis("/home/example.pdf", "SuperSecret00!");

// Load encrypted PDF by URL
PdfMantis pdf = new PdfMantis(new URL("https://www.somewebpage.com/example.pdf"), "abc123");

Closing a PDF

It is best practice to close a PDF once your work on it is complete. You can do this like so:

// Load a PDF
PdfMantis pdf = new PdfMantis("/home/example.pdf");

// Then close it
pdf.closePdf();
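Since closePdf() should run even when an extraction step throws, one option is to wrap the work in try/finally. The sketch below is only an illustration under stated assumptions: ClosingDemo, use and FakePdf are hypothetical names, and FakePdf merely stands in for a PdfMantis object so that the example runs without the library. This README does not say whether PdfMantis implements AutoCloseable, so the sketch does not rely on try-with-resources.

```java
import java.util.function.Consumer;
import java.util.function.Function;

class ClosingDemo {

    /** Hypothetical stand-in for a PdfMantis object; it only tracks whether it was closed. */
    static class FakePdf {
        boolean closed = false;

        String getAllText() {
            return "hello";
        }

        void closePdf() {
            closed = true;
        }
    }

    /**
     * Applies work to the resource and always runs closer afterwards,
     * even if work throws an exception.
     */
    static <R, T> T use(R resource, Function<R, T> work, Consumer<R> closer) {
        try {
            return work.apply(resource);
        } finally {
            closer.accept(resource);
        }
    }

    public static void main(String[] args) {
        FakePdf pdf = new FakePdf();
        String text = use(pdf, FakePdf::getAllText, FakePdf::closePdf);
        System.out.println(text + " / closed=" + pdf.closed);
    }
}
```

With the real library, the same helper could in principle be passed a PdfMantis object, an extraction lambda and a reference to closePdf(), provided those calls do not throw checked exceptions; if they do, the functional interfaces would need adjusting.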

Text

The getText() method handles text extraction.

You can get all the text from the PDF as a String:

String allText = pdf.getText().getAll();

Or get text from a certain page:

String textFromPage2 = pdf.getText().getAllFromPage(2);

Or even get the text from a page range:

String textFromPages2To3 = pdf.getText().getAllFromPageRange(2, 3);

You can also extract text from an area on a page via coordinates:

int page = 2;
double x = 72.02;
double y = 717.45;
double height = 4.98;
double width = 95.64;
        
String textFromArea = pdf.getText().getFromArea(page, x, y, height, width);

But how does one determine these coordinates?

For that we can utilise TextIndex, which provides text coordinates and also exposes font and colour information.

TextIndex

The simplest way to utilise TextIndex is to use the getForString() method:

List<String> textIndexForPhrase = pdf.getTextIndex().getForString("test string");

This method combs the entire PDF and returns a prettified String representation of the TextIndex object for any occurrences of the provided phrase, with each occurrence looking something like this:

Page Number=1,
Word=test string,
Font=ABCDEE+Calibri,
Font Size=9.0,
Colour=FILL:GRAY 0.0;,
X=72.02,
Y=717.45,
Height=4.98,
Width=95.64
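If you want to feed such an entry back into getFromArea(), the values first have to be pulled out of the prettified String. The following is a minimal sketch, assuming each field sits on its own line in the Key=value, form printed above; TextIndexEntryParser and parse are hypothetical names and not part of PDF Mantis. The typed getters described under TextIndex Expanded are likely the more robust route, so treat this purely as an illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TextIndexEntryParser {

    /**
     * Parses one prettified entry ("Key=value," per line) into an ordered map.
     * Only the first '=' on each line is treated as the separator.
     */
    static Map<String, String> parse(String entry) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : entry.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Drop the trailing comma that separates fields
            if (trimmed.endsWith(",")) {
                trimmed = trimmed.substring(0, trimmed.length() - 1);
            }
            int eq = trimmed.indexOf('=');
            if (eq > 0) {
                fields.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String entry = String.join("\n",
                "Page Number=1,", "Word=test string,", "Font=ABCDEE+Calibri,",
                "Font Size=9.0,", "Colour=FILL:GRAY 0.0;,",
                "X=72.02,", "Y=717.45,", "Height=4.98,", "Width=95.64");
        Map<String, String> fields = parse(entry);

        int page = Integer.parseInt(fields.get("Page Number"));
        double x = Double.parseDouble(fields.get("X"));
        double y = Double.parseDouble(fields.get("Y"));
        double height = Double.parseDouble(fields.get("Height"));
        double width = Double.parseDouble(fields.get("Width"));

        // Same order as the getFromArea(page, x, y, height, width) call shown earlier
        System.out.println(page + ", " + x + ", " + y + ", " + height + ", " + width);
    }
}
```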

One thing to note about the getForString() method: if the phrase you're searching for breaks onto a new line, the returned coordinates will also cover any surrounding text that falls inside the rectangle. For example, consider the text below:

[image: sample text]

If you searched for the String Apollo 11, it would capture the coordinates for the highlighted area:

[image: highlighted area for Apollo 11]

However, if you searched for the String Kennedy Space Center, it would capture the coordinates for the highlighted area:

[image: highlighted area for Kennedy Space Center]

This is because coordinates are based on the Rectangle class: to capture the requested String, a single rectangle has to be drawn that encloses every keyword, including those that wrap onto the next line.
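The effect can be reproduced with plain rectangle arithmetic. The sketch below uses java.awt.geom.Rectangle2D as a stand-in (which Rectangle class PDF Mantis uses internally is not stated here), and the coordinates are hypothetical: one fragment at the end of a line, the other at the start of the next line.

```java
import java.awt.geom.Rectangle2D;

class BoundingBoxDemo {

    /** Returns the smallest rectangle that contains both fragments. */
    static Rectangle2D boundingBox(Rectangle2D first, Rectangle2D second) {
        return first.createUnion(second);
    }

    public static void main(String[] args) {
        // Hypothetical fragments: the phrase ends one line and continues on the next
        Rectangle2D endOfLine = new Rectangle2D.Double(300, 700, 100, 10);
        Rectangle2D startOfNextLine = new Rectangle2D.Double(72, 712, 40, 10);

        Rectangle2D box = boundingBox(endOfLine, startOfNextLine);

        // The union spans both lines and the full horizontal extent of the two fragments
        System.out.println(box.getX() + ", " + box.getY() + ", "
                + box.getWidth() + ", " + box.getHeight());

        // A point that belongs to neither fragment still lies inside the union
        System.out.println(box.contains(150, 705));
    }
}
```

Because the union is a single rectangle, anything printed between the two fragments falls inside it, which is why neighbouring text is captured for phrases that wrap.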

While the getForString() method is the simplest way to use TextIndex, you can of course drill down into it in much further detail.

TextIndex Expanded

There are two ways to build the TextIndex.

One way is to utilise the buildUnicodeIndex() method. This extracts coordinates, font and colour information for every single Unicode character in the PDF document:

List<TextIndex> unicodeIndex = pdf.getTextIndex().buildUnicodeIndex();

The other way is to utilise the buildWordIndex() method. This extracts coordinates, font and colour information for every single word in the PDF document:

List<TextIndex> wordIndex = pdf.getTextIndex().buildWordIndex();

Once we have the TextIndex, there are a number of methods available to expose key information for each entry:

// Get the index
List<TextIndex> wordIndex = pdf.getTextIndex().buildWordIndex();

// Iterate over each entry in the index
for (TextIndex word : wordIndex) {

    // And then we can expose information like so...
    word.getWord();
    word.getPageNumber();
    word.getFont();
    word.getFontSize();
    word.getColour();
    word.getX();
    word.getY();
    word.getHeight();
    word.getWidth();
}
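In a test automation setting, these getters are typically combined into a check, for example "which words on page 1 are rendered at heading size?". The sketch below shows one way to express that with streams. Entry is a hypothetical stand-in that mirrors three of the getters listed above so the example runs without the library; the real return types (for instance whether getFontSize() returns a double) are an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FontCheckDemo {

    /** Hypothetical stand-in mirroring three of the TextIndex getters. */
    static class Entry {
        private final String word;
        private final int pageNumber;
        private final double fontSize;

        Entry(String word, int pageNumber, double fontSize) {
            this.word = word;
            this.pageNumber = pageNumber;
            this.fontSize = fontSize;
        }

        String getWord() { return word; }
        int getPageNumber() { return pageNumber; }
        double getFontSize() { return fontSize; }
    }

    /** Returns the words on the given page whose font size is at least minSize. */
    static List<String> wordsAtLeast(List<Entry> index, int page, double minSize) {
        return index.stream()
                .filter(e -> e.getPageNumber() == page)
                .filter(e -> e.getFontSize() >= minSize)
                .map(Entry::getWord)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> index = Arrays.asList(
                new Entry("Invoice", 1, 14.0),
                new Entry("total", 1, 9.0),
                new Entry("Terms", 2, 14.0));
        System.out.println(wordsAtLeast(index, 1, 12.0));
    }
}
```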

If desired, we can convert TextIndex into a String List for easy viewing:

// Get your index
List<TextIndex> wordIndex = pdf.getTextIndex().buildWordIndex();
// Then prettify it!
List<String> prettyWordIndex = pdf.getTextIndex().prettifyIndex(wordIndex);

Images

The getImage() method handles image extraction.

It returns a Map; the key is a String containing the unique image name, and the value is a BufferedImage.

You can get all the images from the PDF like so:

Map<String, BufferedImage> allImages = pdf.getImage().getAllImages();

Or get all images from a certain page:

Map<String, BufferedImage> imagesFromPage3 = pdf.getImage().getImagesFromPage(3);

You can also extract an image from an area on a page via coordinates:

int page = 2;
double x = 48.699;
double y = 688.5;
double height = 120.0;
double width = 650.0;

BufferedImage actualImage = pdf.getImage().getFromArea(page, x, y, height, width);
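An extracted image is often compared against an expected one. BufferedImage does not override equals(), so a pixel-by-pixel comparison is one straightforward approach. The sketch below uses only the standard library; ImageCompare and samePixels are hypothetical helper names rather than part of PDF Mantis, and an exact match may be too strict for images that were re-encoded lossily.

```java
import java.awt.image.BufferedImage;

class ImageCompare {

    /** Returns true when both images have the same dimensions and identical ARGB pixels. */
    static boolean samePixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BufferedImage expected = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        BufferedImage actual = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        System.out.println(samePixels(expected, actual));

        // Change a single pixel to red and compare again
        actual.setRGB(0, 0, 0xFF0000);
        System.out.println(samePixels(expected, actual));
    }
}
```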

To obtain image coordinates, we can utilise ImageIndex.

ImageIndex

You can build the ImageIndex by utilising the getImageIndex() method:

List<ImageIndex> imageIndex = pdf.getImageIndex().buildIndex();

Once we have the ImageIndex, there are a number of methods available to expose key information for each entry:

// Get the index
List<ImageIndex> imageIndex = pdf.getImageIndex().buildIndex();

// Iterate over each entry in the index
for (ImageIndex image : imageIndex) {

    // And then we can expose information like so...
    image.getPageNumber();
    image.getImageName();
    image.getImage();
    image.getX();
    image.getY();
    image.getHeight();
    image.getWidth();
}

If desired, we can convert ImageIndex into a String List for easy viewing:

// Get your index
List<ImageIndex> imageIndex = pdf.getImageIndex().buildIndex();
// Then prettify it!
List<String> prettyImageIndex = pdf.getImageIndex().prettifyIndex(imageIndex);

Metadata

The getMeta() method handles metadata extraction.

You can pull several standard metadata fields like so:

String creationDate = pdf.getMeta().getCreationDate();
String modifiedDate = pdf.getMeta().getModifiedDate();
String producer = pdf.getMeta().getProducer();
String keywords = pdf.getMeta().getKeywords();
String creator = pdf.getMeta().getCreator();
String subject = pdf.getMeta().getSubject();
String author = pdf.getMeta().getAuthor();
String title = pdf.getMeta().getTitle();

We can also go a bit deeper and attempt to expose hidden metadata fields via the getAll() method, which takes advantage of Apache Tika's auto-parsing capabilities to achieve this.

It returns a String Map, with the key being the metadata field name, and the value being the metadata value extracted:

Map<String, String> allMetadata = pdf.getMeta().getAll();
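The resulting map can be large, so it may help to narrow it down by key. Here is a minimal sketch that keeps only entries whose key starts with a given prefix and sorts them for easier reading. MetadataFilter and withPrefix are hypothetical helper names, and the keys in the example are made up for illustration; the keys actually returned depend on the PDF.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class MetadataFilter {

    /** Returns the entries whose key starts with prefix (case-insensitive), sorted by key. */
    static Map<String, String> withPrefix(Map<String, String> metadata, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (entry.getKey().toLowerCase(Locale.ROOT).startsWith(wanted)) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative keys only; real field names vary from document to document
        Map<String, String> allMetadata = new LinkedHashMap<>();
        allMetadata.put("xmp:CreatorTool", "Example Writer");
        allMetadata.put("pdf:PDFVersion", "1.7");
        allMetadata.put("pdf:encrypted", "false");

        System.out.println(withPrefix(allMetadata, "pdf:"));
    }
}
```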

Tables

The getTable() method handles table extraction.

This method utilises Tabula's table detection algorithms to find and extract simple table data from PDFs.

You can either extract tables from a single page:

List<Table> tablesFromPage2 = pdf.getTable().extractFromPage(2);

Or extract tables from the entire PDF document:

List<Table> allTables = pdf.getTable().extractAll();

Once you have your table, you can navigate and query it using Tabula's Table class like so:

// Get the first table from page 1
Table firstTableFromPage1 = pdf.getTable().extractFromPage(1).get(0);

// Get the cell value from the second column on the third row (row index 2, column index 1)
String secondColumnThirdRowValue = firstTableFromPage1.getCell(2, 1).getText();

// Get the total number of rows in the table
int totalRows = firstTableFromPage1.getRowCount();

If desired, you can prettify the table into a String, so you can easily view what has been extracted:

// Get the first table from page 1
Table firstTableFromPage1 = pdf.getTable().extractFromPage(1).get(0);

// Prettify it!
String prettyTable = pdf.getTable().prettifyTables(firstTableFromPage1);

If you print the String then it will look something like this:

╔═════════════════╤══════════════════════╗
║ Number of Coils │ Number of Paperclips ║
╠═════════════════╪══════════════════════╣
║ 5               │ 3, 5, 4              ║
╟─────────────────┼──────────────────────╢
║ 10              │ 7, 8, 6              ║
╟─────────────────┼──────────────────────╢
║ 15              │ 11, 10, 12           ║
╟─────────────────┼──────────────────────╢
║ 20              │ 15, 13, 14           ║
╚═════════════════╧══════════════════════╝

Please note, the more complex the table, the harder it is to detect.
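Cell values come back as text, so numeric assertions need a parsing step first. As a small sketch, assuming a cell whose text is a comma-separated list of whole numbers (the value "3, 5, 4" below is just an example), the String returned by getText() could be converted like this. CellValues, parseInts and average are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.List;

class CellValues {

    /** Splits cell text such as "3, 5, 4" into integers, ignoring surrounding whitespace. */
    static List<Integer> parseInts(String cellText) {
        List<Integer> values = new ArrayList<>();
        for (String part : cellText.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(Integer.parseInt(trimmed));
            }
        }
        return values;
    }

    /** Returns the arithmetic mean of the values, or 0 for an empty list. */
    static double average(List<Integer> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (int value : values) {
            sum += value;
        }
        return (double) sum / values.size();
    }

    public static void main(String[] args) {
        List<Integer> values = parseInts("3, 5, 4");
        System.out.println(values + " average=" + average(values));
    }
}
```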

OCR

Please note, if you're using Windows then the OCR features will work out of the box. However, if you're using Linux or macOS then some initial set-up is required (see the FAQ).

The getOCRText() method handles OCR.

This method utilises Tesseract's deep learning neural networks to perform optical character recognition against scanned PDFs.

You can perform OCR against the whole PDF:

String allOCRText = pdf.getOCRText().getAll();

Or against a certain page:

String ocrTextFromPage3 = pdf.getOCRText().getAllFromPage(3);

Or even against a page range:

String ocrTextFromPages1To3 = pdf.getOCRText().getAllFromPageRange(1, 3);

OCR defaults to 300 DPI, as this is the resolution that works best with Tesseract, but you can override this if desired by providing the DPI as an additional parameter:

// Lower resolution (150 DPI)
String allOCRTextLowerResolution = pdf.getOCRText().getAll(150);

// Higher resolution (500 DPI)
String ocrTextFromPage3HigherResolution = pdf.getOCRText().getAllFromPage(3, 500);
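To get a feel for what a DPI change costs, it helps to work out the raster size. PDF page dimensions are measured in points, and by default one point is 1/72 inch, so a page rendered at a given DPI is scaled by DPI / 72. The sketch below assumes each page is rasterised at the requested DPI before OCR (how PDF Mantis does this internally is not described here) and estimates memory at 3 bytes per pixel, which is also an assumption.

```java
class RenderSize {

    /** Default PDF user space: 72 points per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in points to pixels at the given DPI. */
    static int toPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Rough size in bytes of an uncompressed RGB raster (3 bytes per pixel). */
    static long rgbBytes(double widthPoints, double heightPoints, int dpi) {
        return 3L * toPixels(widthPoints, dpi) * toPixels(heightPoints, dpi);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points
        for (int dpi : new int[] {150, 300, 500}) {
            System.out.println(dpi + " DPI: "
                    + toPixels(612, dpi) + " x " + toPixels(792, dpi) + " px, about "
                    + rgbBytes(612, 792, dpi) / 1_000_000 + " MB");
        }
    }
}
```

Pixel count grows with the square of the DPI, so going from 300 to 500 DPI produces nearly three times as many pixels per page.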

FAQ

Why does PDF Mantis work perfectly on some PDFs but not others

The PDF format was created so that a document could be displayed exactly the same on any machine, regardless of what operating system it was created with. However, it's important to point out that not all PDFs are created equal.

It's an incredibly versatile and highly configurable format. Generally speaking, no two PDFs are the same, and therein lies the problem.

As a result, if you are facing difficulties which are not covered within this README, it is recommended that you raise an issue and share the offending PDF.

Why am I struggling to extract certain tables

When Tabula's table detection algorithms work, it feels a bit like magic. Alas, it's not actually magic but just some very well-designed code. However, there are some occasions where the algorithms cannot correctly identify a table due to its complexity.

There are two options in this case:

  • You can use TextIndex to determine the coordinates of the columns and extract accordingly.
  • You can switch to Tabula itself, which offers a host of different tuning options to identify trickier tables.
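The first option can be sketched roughly as follows: group words into rows by their Y coordinate, then assign each word to a column based on its X coordinate. This is only an illustration under several assumptions. Word is a hypothetical stand-in for a TextIndex entry, the column start positions are supplied by hand in ascending order, and rows are emitted in ascending Y order, which may or may not be top-to-bottom for your PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnGrouper {

    /** Hypothetical stand-in for a TextIndex entry: text plus its X and Y coordinates. */
    static class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Index of the last column whose start is at or before x (columnStarts ascending). */
    static int columnFor(double x, double[] columnStarts) {
        int column = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Groups words into rows (Y within yTolerance of the row's first word) and columns. */
    static List<List<String>> toRows(List<Word> words, double[] columnStarts, double yTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.y));

        // First pass: split the Y-sorted words into row groups
        List<List<Word>> groups = new ArrayList<>();
        double rowY = 0;
        for (Word word : sorted) {
            if (groups.isEmpty() || Math.abs(word.y - rowY) > yTolerance) {
                groups.add(new ArrayList<>());
                rowY = word.y;
            }
            groups.get(groups.size() - 1).add(word);
        }

        // Second pass: within each row, order by X and append words to their column's cell
        List<List<String>> rows = new ArrayList<>();
        for (List<Word> group : groups) {
            group.sort(Comparator.comparingDouble((Word w) -> w.x));
            String[] cells = new String[columnStarts.length];
            Arrays.fill(cells, "");
            for (Word word : group) {
                int column = columnFor(word.x, columnStarts);
                cells[column] = cells[column].isEmpty() ? word.text : cells[column] + " " + word.text;
            }
            rows.add(Arrays.asList(cells));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
                new Word("5", 72, 100), new Word("3,", 200, 100),
                new Word("5,", 215, 100.2), new Word("4", 230, 100),
                new Word("10", 72, 120), new Word("7,", 200, 120));
        System.out.println(toRows(words, new double[] {72, 200}, 2.0));
    }
}
```

In practice the column starts would come from the X values of the header words found through TextIndex, and the tolerance would need tuning per document.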

How can I get OCR working on Linux and macOS

While the Tesseract binaries for Windows are included in Tess4j, those for Linux and macOS are not. This means you will have to build or install the necessary libraries before you can use the OCR methods on these platforms.

The current version of PDF Mantis uses Tess4j version 4.5.4, which requires Tesseract version 4.1.1.

You can generally just use a package manager to install the necessary components, such as apt install tesseract-ocr if you're on Linux or brew install tesseract if you're on macOS.

If you face issues installing via package managers, you can install directly from git instead.
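After installing, it can be useful to confirm from Java which Tesseract version is actually on the PATH. The sketch below runs tesseract --version and extracts the version number. It assumes the first line of output looks like "tesseract 4.1.1" (optionally with a leading "v" before the number); TesseractCheck and its methods are hypothetical helper names, not part of PDF Mantis or Tess4j.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TesseractCheck {

    private static final Pattern VERSION = Pattern.compile("tesseract\\s+v?(\\d+(?:\\.\\d+)*)");

    /** Extracts e.g. "4.1.1" from "tesseract 4.1.1"; returns null if no version is found. */
    static String parseVersion(String firstLine) {
        if (firstLine == null) {
            return null;
        }
        Matcher matcher = VERSION.matcher(firstLine);
        return matcher.find() ? matcher.group(1) : null;
    }

    /** Runs "tesseract --version" and returns the reported version, or null if unavailable. */
    static String installedVersion() {
        try {
            Process process = new ProcessBuilder("tesseract", "--version")
                    .redirectErrorStream(true) // some builds print the version to stderr
                    .start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String firstLine = reader.readLine();
                // Drain the remaining output so the process can exit cleanly
                while (reader.readLine() != null) {
                    // nothing to do with the remaining lines
                }
                process.waitFor();
                return parseVersion(firstLine);
            }
        } catch (IOException e) {
            return null; // the tesseract executable could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        String version = installedVersion();
        System.out.println(version == null ? "tesseract not found" : "tesseract " + version);
    }
}
```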

I have a question

Before you raise an issue, it is best to search for existing issues that might help you. If you have found a suitable issue but still need clarification, you can write your question in that issue.

Also, consider that PDF Mantis is effectively an abstraction layer which sits over a bunch of well-established projects. As such, you may very well find an answer to your problem on Stack Overflow and the like, so it's worth having a scan of the internet first.

If you still need to ask a question after that, we recommend the following:

  • Open an Issue
  • Provide as much context as you can about what problem you're running into.
  • Attach the PDF.

I want to contribute

Make your changes, add your tests, then create a pull request and submit it for review.

Please note, if you're making changes to OCR and working on Linux or macOS, the associated tests are skipped by default because of the extra set-up these platforms require (see "How can I get OCR working on Linux and macOS").

You can force them to run by passing the system property -DforceOCRTests=true like so:

mvn clean test -DforceOCRTests=true
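How PDF Mantis implements this skip is not shown here, but the general idea of gating tests on the operating system and a system property can be sketched as below. OcrTestGate and shouldRun are hypothetical names. The only property name taken from this README is forceOCRTests.

```java
import java.util.Locale;

class OcrTestGate {

    /**
     * Returns true when OCR tests should run: always on Windows,
     * and on other platforms only when explicitly forced.
     */
    static boolean shouldRun(String osName, boolean forced) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows || forced;
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        // Boolean.getBoolean is true only when -DforceOCRTests=true was passed to the JVM
        boolean forced = Boolean.getBoolean("forceOCRTests");
        System.out.println("Run OCR tests: " + shouldRun(osName, forced));
    }
}
```

A test framework hook (for example an assumption at the start of each OCR test) could then call a check like this and skip the test when it returns false.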

Acknowledgments

PDF Mantis makes use of several open source projects, including those referenced above: Apache Tika, Tabula, Tess4j and Tesseract.
