Extract text from a PDF (PDF to text). API for PHP, JS, Python, and other languages.

Overview

Extract text from a PDF (PDF to text) through an HTTP API that runs in Docker.

Why did we create this project?

  1. In a Laravel project, we needed to extract text from large files. Existing packages did not work with files larger than 50 megabytes.
  2. Text extraction is an expensive operation. Running it on a separate server reduces the load on the main application.
  3. We needed to generate a cover image for the source document.

Installation

Install Docker and Docker Compose, then run:

git clone https://github.com/dotcode-moscow/pdf-api.git
cd pdf-api
docker-compose up -d pdf-api

Method /api/extractText

Extracts text from a file. Pass the URL of the file in the url parameter.

Method /api/pdf/ping

Ping-pong method for checking that the service responds.
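
A minimal Python sketch of a liveness check against this method. The base address is taken from the curl example in this README. The use of GET, the 5-second timeout, and treating HTTP 200 as "alive" are assumptions, since the README does not describe the ping response. The names ping_url and is_alive are illustrative only.

```python
import urllib.request

# Assumption: the service listens on the address used in the curl example.
DEFAULT_BASE_URL = "http://localhost:8080"


def ping_url(base_url=DEFAULT_BASE_URL):
    """Return the full URL of the ping method for a given base address."""
    return base_url.rstrip("/") + "/api/pdf/ping"


def is_alive(base_url=DEFAULT_BASE_URL, timeout=5):
    """Return True if the ping method answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False
```

Calling is_alive() before submitting large files avoids waiting on an extraction request when the container is not running.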

Method /api/imageToPDF

Image to PDF converter.

Basic example

curl -d "url=https://trove.nla.gov.au/newspaper/rendition/nla.news-page29291123.pdf" "http://localhost:8080/api/extractText"

POST (HTTP) example

http://localhost:8080/api/extractText?url=https://trove.nla.gov.au/newspaper/rendition/nla.news-page29291123.pdf
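
The curl call above can be mirrored in Python with the standard library only. This is a sketch, not an official client: the function names and the 300-second timeout are assumptions, and the request is sent as a form-encoded POST body, which is what curl -d does.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the service is reachable at the address used in the curl example.
BASE_URL = "http://localhost:8080"


def build_extract_request(pdf_url, base_url=BASE_URL):
    """Build a form-encoded POST request for /api/extractText."""
    body = urllib.parse.urlencode({"url": pdf_url}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/extractText", data=body, method="POST"
    )


def extract_text(pdf_url, base_url=BASE_URL, timeout=300):
    """Send the request and decode the JSON response into a dict."""
    request = build_extract_request(pdf_url, base_url)
    # Large files can take a while, hence the generous (assumed) timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, extract_text("https://trove.nla.gov.au/newspaper/rendition/nla.news-page29291123.pdf") would return the decoded JSON response as a dictionary.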

Response (JSON) example

Each key is a page number (keys are not sorted) and each value is the text extracted from that page.
The "img" key holds the front page cover as a base64-encoded JPEG.

{
  "1":"National Library of Australia...",
  "img": "data:image/jpeg;base64..."
}
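
Because the page keys arrive unsorted and as strings, a client usually needs to order them numerically before joining the text. The sketch below does that and also decodes the cover image. It assumes the "img" value is a standard data URI with the base64 payload after the first comma (the example above is truncated, so this is not confirmed). The name split_response is hypothetical.

```python
import base64


def split_response(payload):
    """Separate ordered page texts from the cover image in a decoded response dict."""
    # Keep only numeric keys and sort them as integers, so "10" follows "2".
    pages = {int(key): text for key, text in payload.items() if key.isdigit()}
    ordered_text = [pages[number] for number in sorted(pages)]

    cover = None
    img = payload.get("img")
    if img and "," in img:
        # Assumed data URI layout: "data:image/jpeg;base64,<payload>"
        cover = base64.b64decode(img.split(",", 1)[1])
    return ordered_text, cover
```

The returned cover bytes can be written straight to a .jpg file, and "\n".join(ordered_text) gives the full document text in page order.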

Production mode

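For production, enable host networking by setting this option on the service in the Docker Compose file:

network_mode: "host"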

Credit

PDFBox

Contributing

Pull requests are welcome.

Owner

dotcode.moscow (Software Development)