Extracts raw text from web archives (WARCs).

Extracts raw text from web archives (WARCs), usually obtained by web crawling.

Requirements

  • Java 8
  • Maven 3 (for development)

Note

Test coverage is currently not representative because some tests that relied on data we do not want to make public were removed for the public release.

This will be fixed for the next release.

Building from source

To build jwarcex from source, run mvn clean install in the main directory:

cd jwarcex
mvn clean install

For command-line usage, the build produces a jar file at jwarcex/jwarcex-standalone/target/jwarcex-standalone-<VERSION>.jar

Examples

Print the help

java -jar jwarcex-standalone-<VERSION>-SNAPSHOT.jar -h

Process a WARC file

java -jar jwarcex-standalone-<VERSION>-SNAPSHOT.jar /path/to/my.warc /path/to/output.source

Read gzipped WARCs

java -jar jwarcex-standalone-<VERSION>-SNAPSHOT.jar /path/to/my.warc.gz /path/to/output.source -c
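
Process a directory of gzipped WARCs

A minimal sketch of batch processing that uses only the invocation shown above (input path, output path, -c for gzipped input). The helper names output_for and extract_all, the JAR variable, and the convention of naming each output after its input with a .source suffix are assumptions made for illustration; they are not part of jwarcex.

```shell
#!/bin/sh
# Path to the standalone jar; replace <VERSION> with the version you built.
JAR="${JAR:-jwarcex-standalone-<VERSION>-SNAPSHOT.jar}"

# Hypothetical helper: derive an output path from a WARC path,
# e.g. /data/crawl.warc.gz -> <out_dir>/crawl.source
output_for() {
    name=$(basename "$1")
    name=${name%.gz}      # strip a trailing .gz, if present
    name=${name%.warc}    # strip a trailing .warc, if present
    printf '%s/%s.source\n' "$2" "$name"
}

# Hypothetical helper: run jwarcex once per gzipped WARC in a directory.
extract_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for warc in "$in_dir"/*.warc.gz; do
        [ -e "$warc" ] || continue   # skip when the glob matched nothing
        java -jar "$JAR" "$warc" "$(output_for "$warc" "$out_dir")" -c
    done
}
```

After sourcing the script, extract_all /path/to/warcs /path/to/output would write one .source file per input archive.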
Owner
Leipzig Corpora Collection / Wortschatz Leipzig