The Apache PDFBox library is an open source Java tool for working with PDF documents

Overview


Apache PDFBox

The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities. PDFBox is published under the Apache License, Version 2.0.

PDFBox is a project of the Apache Software Foundation.

Binary Downloads

You can download binary versions for releases currently under development or older releases from our Download Page.

Build

You need Java 8 (or higher) and Maven 3 to build PDFBox. The recommended build command is:

mvn clean install

The default build will compile the Java sources and package the binary classes into jar packages. See the Maven documentation for all the other available build options.

Contribute

There are various ways to help us improve PDFBox.

Support

Please follow the guidelines at our Support Page.

If you have questions about how to use PDFBox, please ask on the Users Mailing List. This will get you help from the entire community.

The PDFBox examples and the test code in the sources will also provide additional information.

And there are additional resources available on sites such as Stack Overflow.

If you are sure you have found a bug, then please report the issue in our Issue Tracker.

Known Limitations and Problems

See the Issue Tracker for the full list of known issues and requested features. Some of the more common issues are:

  1. You get text like "G38G43G36G51G5" instead of what you expect when you are extracting text. This is because the characters are a meaningless internal encoding that points to glyphs embedded in the PDF document. The only way to access the text is to use OCR. This may be a future enhancement.

  2. You get an error message like "java.io.IOException: Can't handle font width". This MIGHT be because you don't have the org/apache/pdfbox/resources directory in your classpath. The easiest solution is to simply include the apache-pdfbox-x.x.x.jar in your classpath.

  3. You get text that has the correct characters, but in the wrong order. This might be because you have not enabled sorting. The text in PDF files is stored in chunks and the chunks do not need to be stored in the order that they are displayed on a page. By default, PDFBox does not sort the text.
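
The ordering problem in item 3 can be illustrated with a toy model. In PDFBox itself, sorting is switched on with PDFTextStripper.setSortByPosition(true). The sketch below does not use PDFBox at all: the Chunk type and both methods are hypothetical, and it only shows why chunks stored in arbitrary order need a sort by position before they read correctly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: Chunk is a hypothetical type, not a PDFBox class.
final class ChunkSortSketch {

    static final class Chunk {
        final float x;      // left edge of the chunk on the page
        final float y;      // distance of the baseline from the top of the page
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Joins chunks in the order in which they are stored.
    static String inStoredOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Joins chunks top to bottom, then left to right.
    static String inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.<Chunk>comparingDouble(c -> c.y)
                .thenComparingDouble(c -> c.x));
        return inStoredOrder(sorted);
    }

    public static void main(String[] args) {
        // Stored order differs from the visual order on the page.
        List<Chunk> chunks = Arrays.asList(
                new Chunk(60f, 10f, "World"),
                new Chunk(10f, 10f, "Hello "));
        System.out.println(inStoredOrder(chunks));
        System.out.println(inReadingOrder(chunks));
    }
}
```

A real text stripper also has to deal with rotated text, columns and overlapping lines, so this comparator is only a starting point.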

License (see also LICENSE.txt)

Collective work: Copyright 2015 The Apache Software Foundation.

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Export control

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See https://www.wassenaar.org/ for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache PDFBox uses the Java Cryptography Architecture (JCA) and the Bouncy Castle libraries for handling encryption in PDF documents.

Comments
  • Lazier clipping


    Calculating the intersection of two Area can take a lot of time. However, depending on the Graphics2D that is used for rendering, it may not be necessary to actually perform this operation.

    For instance, when generating an SVG, the individual clipping paths can be serialized individually, and the intersection is then calculated at runtime, when the SVG file is rendered.

    The idea of this PR is to replace PDGraphicsState.clippingPath with a list of GeneralPaths, which is lazily evaluated, truncated & cached when getCurrentClippingPath() is called (effectively leaving the current behaviour of PdfBox unchanged, and it should also not have any significant impact on the performance).

    Additionally, a new method protected void transferClip(PDGraphicsState graphicsState, Graphics2D graphics) is added to PageDrawer. By default, this method makes use of getCurrentClippingPath() to call graphics.setClip(...), which again is what PdfBox currently does.

    However, classes that extend PageDrawer can override this method, and directly access the individual clipping paths, without any need to calculate their intersection.

    In some cases (shading fills & transparency groups), it is still necessary to calculate the intersection.
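
The lazy evaluation described above could look roughly like the following sketch. It is not the code from the PR: apart from getCurrentClippingPath(), all class and method names are made up for illustration, and the truncation of the path list after evaluation is left out for brevity.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a lazily intersected clipping path.
final class LazyClipSketch {
    private final List<GeneralPath> paths = new ArrayList<>();
    private Area cached; // intersection of all paths, null until requested

    LazyClipSketch(GeneralPath initialClip) {
        paths.add(initialClip);
    }

    // Convenience helper to build a rectangular path.
    static GeneralPath rect(double x, double y, double w, double h) {
        return new GeneralPath(new Rectangle2D.Double(x, y, w, h));
    }

    // Records a further clipping path without intersecting anything yet.
    void intersectClippingPath(GeneralPath path) {
        paths.add(path);
        cached = null; // the cached intersection is now stale
    }

    // Renderers that can clip on their own (e.g. an SVG writer)
    // read the individual paths and skip the intersection entirely.
    List<GeneralPath> getClippingPaths() {
        return Collections.unmodifiableList(paths);
    }

    // Computes the intersection on first use and caches the result.
    Area getCurrentClippingPath() {
        if (cached == null) {
            Area result = new Area(paths.get(0));
            for (int i = 1; i < paths.size(); i++) {
                result.intersect(new Area(paths.get(i)));
            }
            cached = result;
        }
        return cached;
    }
}
```

With this layout the expensive Area.intersect() calls only happen when a caller actually asks for the combined clip.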

    opened by sschwieb 12
  • Faster PDImageXObject.applyMask


    Fewer data copy operations overall. Line-wise bulk copy instead of per-pixel copy.

    The softMask / matte calculation might need to be checked. I have only found one PDF in https://issues.apache.org/jira/browse/PDFBOX-4267

    opened by Schmidor 12
  • optimize applyMask in PdImageXObject


    There was a severe performance issue with really big masks when the image needs to be scaled to the mask size (e.g. 10000*10000 pixels). Bicubic scaling can take 6-10 seconds. This patch switches to bilinear resizing for these cases, although the threshold might still need fine-tuning.

    There was also a double allocation for the final masked image when we can simply use the image since applyMask() is always fed with a newly created one. Reference hogging and needless allocation have been removed.

    Additionally, the alpha blending routines were very slow because they worked pixel by pixel. There is now a staggered approach:

    • direct byte masking, which is very fast even for big images (right now it does not work with padded buffers),
    • exploiting the data buffer's sample system to merge the alpha component into the ARGB image, letting the sample model do the bit masking,
    • slow pixel expansion to reverse premultiplied matte values (but using fixed point integer arithmetic).

    Additionally, the interpolation flag of the mask is now used to decide whether the mask should be interpolated.
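
The second strategy in the list, letting the sample model merge the alpha component, can be sketched with plain java.awt.image calls. This is an illustration under assumptions and not the patch itself: the target is taken to be a TYPE_INT_ARGB image, the mask a TYPE_BYTE_GRAY image of the same size, and the class and method names are invented.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Sketch only: merges a gray mask into the alpha band of an ARGB image.
final class AlphaMergeSketch {

    static void applyAlpha(BufferedImage argb, BufferedImage grayMask) {
        int w = argb.getWidth();
        int h = argb.getHeight();
        // Read all gray samples (band 0) of the mask in one bulk call.
        int[] alpha = grayMask.getRaster().getSamples(0, 0, w, h, 0, (int[]) null);
        // Band 3 of a TYPE_INT_ARGB raster is alpha. The sample model
        // shifts and masks each value into the packed int pixel.
        WritableRaster target = argb.getRaster();
        target.setSamples(0, 0, w, h, 3, alpha);
    }
}
```

Because the work happens in two bulk raster calls, no per-pixel getRGB()/setRGB() round trip is needed.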

    opened by gunnar-ifp 7
  • Enable rendering of Indian languages, by reading and utilizing the GSUB table


    Implemented proper rendering of Indian languages, which need extensive Glyph substitution. The GSUB table has been read and used effectively to replace some compound words with their respective Glyphs. All tests are passing. I have tested this for the Bengali font. Please review these changes and let me know if it makes sense to incorporate these.

    Thanks, Palash.

    opened by paawak 6
  • Add build using GitHub actions


    This PR adds a GitHub workflow that runs the checks on all JDKs from version 8 to 15.

    The output of GitHub Actions is better integrated in GitHub. Furthermore, the errors are highlighted better.

    opened by koppor 5
  • CCITTFaxDecoderStream Update


    I want to propose the following update of the CCITT decoder.

    We had a few fixes since the version from 2016. Further, I later helped to use the decoder in ICEpdf and made some changes for the EncodedByteAlign flag.

    opened by Schmidor 5
  • Fix Potential Flaky Test in PDFObjectStreamParserTest.java


    This PR fixes the potential flaky test in PDFObjectStreamParserTest.java file.

    In the test testOffsetParsing(), we convert the keys of objectNumbers to the array and assign this array to numbers.

    class PDFObjectStreamParserTest
    {
        @Test
        void testOffsetParsing() throws IOException
        {
            COSStream stream = new COSStream();
            stream.setItem(COSName.N, COSInteger.TWO);
            stream.setItem(COSName.FIRST, COSInteger.get(8));
            OutputStream outputStream = stream.createOutputStream();
            outputStream.write("1 0 2 5 true false".getBytes());
            outputStream.close();
            PDFObjectStreamParser objectStreamParser = new PDFObjectStreamParser(stream, null);
            Map<Long, Integer> objectNumbers = objectStreamParser.readObjectNumbers();
            assertEquals(2, objectNumbers.size());
            Long[] numbers = objectNumbers.keySet().toArray(new Long[0]); // <<<<<<<<<<<<<<<< LINE: 48
            objectStreamParser = new PDFObjectStreamParser(stream, null);
            assertEquals(COSBoolean.TRUE, objectStreamParser.parseObject(numbers[0]));
            objectStreamParser = new PDFObjectStreamParser(stream, null);
            assertEquals(COSBoolean.FALSE, objectStreamParser.parseObject(numbers[1]));
        }
    }
    

    However, as the objectNumbers is initialized as a HashMap, it is not guaranteed that the order will remain constant over time according to the Java documentation.

    This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

    This nondeterministic iteration order of HashMap can cause the test to fail on platforms or machines whose JVM iterates the map in a different order.

    To fix it, I looked over our codebase and thought it was doable to initialize our objectNumbers as a LinkedHashMap rather than HashMap. In this way, the line 48 (variable numbers) in PDFObjectStreamParserTest.java file will always be deterministic.
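
The guarantee this fix relies on can be shown in isolation. The sketch below uses only java.util and is not taken from the PDFBox sources; the class and method names are illustrative. A LinkedHashMap iterates its keys in insertion order, so an array built from keySet() has a predictable layout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: demonstrates the insertion-order guarantee of LinkedHashMap.
final class InsertionOrderSketch {

    // Returns the keys in the order in which they were put into the map.
    static List<Long> keysInInsertionOrder() {
        Map<Long, Integer> objectNumbers = new LinkedHashMap<>();
        objectNumbers.put(2L, 5); // inserted first
        objectNumbers.put(1L, 0); // inserted second
        return new ArrayList<>(objectNumbers.keySet());
    }
}
```

With a plain HashMap the same code would compile and usually run, but the order of the returned keys would be an implementation detail rather than a guarantee.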

    opened by xiedaxia1hao 4
  • Update PDPageContentStream.java


    This function writes bytes as ASCII characters directly to the page.

    For use cases where only ASCII text needs to be written, this solves a few problems:

    1. Avoids the "No glyph for X" issue (e.g. https://stackoverflow.com/questions/42228567/remove-illegal-characters-from-string-with-pdfbox).
    2. While it's possible to pre-process text before using showText(), this means the data will be looped through at least twice, since PDFont.encode() also loops over the text.
    3. Performs better. Everything can be kept as bytes without having to go to text.

    Benchmarks vs showText():
    2000 page pdf: showText(): 579ms / writeAscii(): 419ms
    1000 page pdf: showText(): 283ms / writeAscii(): 193ms

    opened by jknight 4
  • Update SampledImageReader.java


    This fixes an issue with a missed logged error + loop breaking if an erroneous read falls onto a skipped row.

    It also at least doubles the speed (due to a sign test instead of bit masking and a boolean flag for subsampling). It will be even faster with regions, as there was a poor design choice of running the full inner loop computations even if x was below startX.

    (The method wasn't very slow to begin with, though.)

    opened by gunnar-ifp 3
  • [PDFBOX-5055] - Minor Improvement:


    • Add final
    • Unnecessary semicolon ';'
    • Remove Unnecessary interface modifier
    • Remove Unnecessary semicolon
    • Remove Unused import
    • Use Standard Charset Object
    • Change License to Text Plain
    opened by arturobernalg 3
  • PDFBOX-4373: Add additional unit tests


    Hi,

    I ran JaCoCo over the pdfbox module, and found a few functions that were missing unit test coverage.

    I've written some tests for these functions, with the help of Diffblue Cover.

    Hopefully these tests should help you detect regressions caused by future code changes.

    opened by JohnLBergqvist 3
  • Proof of concept for advanced glyph layout (2nd try)


    Based on https://github.com/danfickle/pdfbox. See examples/src/main/java/org/apache/pdfbox/examples/pdmodel/AdvancedTextLayoutSequencesDin91379.java

    Glyph layout looks good with NotoSans-Regular. Combinations with two diacritics work after reordering glyphs, using the patch for FOP-2969 and setting gdef in GlyphPositioningTable.

    Output of test is now: TestDin91379AdvancedLayout-NotoSans-Regular.ttf-20.0.pdf

    opened by vk-github18 0
  • Replace finalize() with Cleaner


    Finalizers (method finalize()) are going to be deprecated for removal with JDK 18. See https://openjdk.java.net/jeps/421 for details.

    The best way to replace the finalize() methods is to use the JDK 9 java.lang.ref.Cleaner. As PDFBox 3 targets JDK 8, this cannot be used directly.

    The attached patch implements a Cleaner using finalizers for JDK <= 8 and using java.lang.ref.Cleaner by reflection for JDK 9+.

    The two remaining finalize() implementing classes are migrated to the new Cleaner.

    I’m not really happy with the name and package org.apache.fontbox.util.PdfBoxInternalCleaner of the cleaner. Maybe you have an idea for a better place and name.

    In theory this patch could be back ported to PDFBox 2, but I’m not sure if this is worth the risk.
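
For readers unfamiliar with java.lang.ref.Cleaner, the direct JDK 9+ usage that the patch wraps looks roughly like this. The sketch is not the patch's PdfBoxInternalCleaner and does not include its JDK 8 fallback or the reflection layer; the class names and the AtomicBoolean stand-in for a native resource are assumptions made for illustration.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

// Requires JDK 9+. Illustrative names, not taken from the patch.
final class CleanerSketch implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup state must not reference the owning object,
    // otherwise the owner could never become unreachable.
    private static final class State implements Runnable {
        private final AtomicBoolean released;

        State(AtomicBoolean released) {
            this.released = released;
        }

        @Override
        public void run() {
            released.set(true); // stands in for freeing a real resource
        }
    }

    private final Cleaner.Cleanable cleanable;

    CleanerSketch(AtomicBoolean released) {
        // The action runs when this object becomes phantom reachable,
        // or earlier if close() is called explicitly.
        this.cleanable = CLEANER.register(this, new State(released));
    }

    @Override
    public void close() {
        cleanable.clean(); // runs the action at most once
    }
}
```

Unlike finalize(), the cleanup action here is a separate object, which is why it is written as a static nested class that holds only the state it needs.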

    opened by rototor 1
  • PDFBOX-5284: Refactor PDFTabulaTextStripper to improve test …


    Jira

    Description

    Replace test class PDFTabulaTextStripper by spying object and improve test design


    Motivation
    • Decouple test class PDFTabulaTextStripper from production class PDFTextStripper.
    • Remove the redundant test child class PDFTabulaTextStripper
    • Remove the redundant constructor.

    Key changed/added classes in this PR
    • Created a spy object to replace the test subclass PDFTabulaTextStripper, decoupling the test from production code.
    • Created a method that returns the mock object for code reuse.

    opened by wx930910 0
  • lenient DomXmpParser


    The XMPBox library is nice, but out in the wild there are PDF files that fail parsing. For example, dc.creator is a Bag instead of a Seq.

    Ideally the parser would have a mode where it tries to read as many properties as possible by simply discarding unreadable ones. This is not good if you want to write back a PDF but if you just want to extract Metadata, such a mode would be nice. In this case this invalid dc.creator value would be dropped. This would require doing some more work.

    I've seen that there is a non-strict parsing mode, which I don't think should be confused with this proposed lenient mode, but as the name suggests it should be less strict. So in this mode Sequences could be read from Bags and vice versa. I left Alt cardinality as an error because it doesn't really fit in.

    Maybe in one of the modes an element that should be an array but isn't could automagically be wrapped into one...

    (I also believe that a Bag could always be read from a Sequence...)
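
The tolerance described above, reading a Seq where a Bag was written and vice versa, can be sketched with the JDK's DOM parser. This is not XMPBox code and does not use DomXmpParser; the class and method names are hypothetical, and the sketch only handles the dc:creator case mentioned in the description.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch only: accepts rdf:Seq or rdf:Bag for dc:creator, rejects rdf:Alt.
final class LenientArraySketch {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC = "http://purl.org/dc/elements/1.1/";

    static List<String> readCreators(String xml) {
        List<String> values = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList creators = doc.getElementsByTagNameNS(DC, "creator");
            if (creators.getLength() == 0) {
                return values;
            }
            Node creator = creators.item(0);
            for (Node n = creator.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (!(n instanceof Element) || !RDF.equals(n.getNamespaceURI())) {
                    continue;
                }
                String kind = n.getLocalName();
                // Lenient part: both container types are treated alike.
                if ("Seq".equals(kind) || "Bag".equals(kind)) {
                    NodeList items = ((Element) n).getElementsByTagNameNS(RDF, "li");
                    for (int i = 0; i < items.getLength(); i++) {
                        values.add(items.item(i).getTextContent().trim());
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable XMP fragment", e);
        }
        return values;
    }
}
```

A production parser would also need to decide what to do with the original container type when writing the metadata back, which is the concern raised in the description.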

    opened by gunnar-ifp 0
Owner
The Apache Software Foundation