SitemapGen4j is a library to generate XML sitemaps in Java.

Overview

sitemapgen4j

What's an XML sitemap?

Quoting from sitemaps.org:

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.
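
To make the quoted description concrete, here is a minimal sketch in plain Java that prints a sitemap for a single URL with the three optional metadata fields. It does not use SitemapGen4j at all; the class and method names are made up for illustration, and only the element names and the namespace come from the sitemaps.org protocol.

```java
import java.util.Locale;

// Illustrative only: SitemapGen4j writes this XML for you.
class MinimalSitemapSketch {

    /** Escapes the characters that must be entity-encoded inside XML text. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Builds one <url> entry; lastmod and changefreq are passed as ready-made strings. */
    static String urlEntry(String loc, String lastmod, String changefreq, double priority) {
        return "  <url>\n"
             + "    <loc>" + escapeXml(loc) + "</loc>\n"
             + "    <lastmod>" + lastmod + "</lastmod>\n"
             + "    <changefreq>" + changefreq + "</changefreq>\n"
             + "    <priority>" + String.format(Locale.ROOT, "%.1f", priority) + "</priority>\n"
             + "  </url>\n";
    }

    /** Wraps the given entries in the <urlset> root element. */
    static String sitemap(String... entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String entry : entries) {
            sb.append(entry);
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(sitemap(
            urlEntry("http://www.example.com/index.html", "2009-02-07", "hourly", 1.0)));
    }
}
```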

Getting started

The easiest way to get started is to use the WebSitemapGenerator class, passing it your site's base URL and an output directory (myDir, a java.io.File), like this:

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
wsg.addUrl("http://www.example.com/index.html"); // repeat multiple times
wsg.write();

Configuring options

But there are a lot of nifty options available for URLs and for the generator as a whole. To configure the generator, use a builder:

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .gzip(true).build(); // enable gzipped output
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

To configure the URLs, construct a WebSitemapUrl with WebSitemapUrl.Options.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
    .lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
// this will configure the URL with lastmod=now, priority=1.0, changefreq=hourly 
wsg.addUrl(url);
wsg.write();

Configuring the date format

One important configuration option for the sitemap generator is the date format. The W3C datetime standard allows you to choose the precision of your datetime (anything from just specifying the year like "1997" to specifying the fraction of the second like "1997-07-16T19:20:30.45+01:00"); if you don't specify one, we'll try to guess which one you want, and we'll use the default timezone of the local machine, which might not be what you prefer.

// Use DAY pattern (2009-02-07), Greenwich Mean Time timezone
W3CDateFormat dateFormat = new W3CDateFormat(Pattern.DAY); 
dateFormat.setTimeZone(TimeZone.getTimeZone("GMT"));
WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .dateFormat(dateFormat).build(); // actually use the configured dateFormat
wsg.addUrl("http://www.example.com/index.html");
wsg.write();
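
To see why precision and timezone both matter, the following sketch formats one instant at three W3C datetime precisions and then shows the day changing when the same instant is viewed from UTC. It is an illustration using the JDK's java.time classes, not SitemapGen4j's W3CDateFormat; the class name, constants, and helper are assumptions made for this example.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative only: uses java.time, not SitemapGen4j's W3CDateFormat.
class W3cDatePrecisionSketch {
    static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("uuuu");
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuu-MM-dd");
    static final DateTimeFormatter SECOND =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ssxxx");

    /** Formats the same instant as seen from UTC rather than from its own offset. */
    static String inUtc(ZonedDateTime t, DateTimeFormatter f) {
        return t.withZoneSameInstant(ZoneOffset.UTC).format(f);
    }

    public static void main(String[] args) {
        // Half past midnight on 2009-02-07 in a +01:00 zone.
        ZonedDateTime t = ZonedDateTime.of(2009, 2, 7, 0, 30, 0, 0, ZoneOffset.ofHours(1));
        System.out.println(t.format(YEAR));    // year precision
        System.out.println(t.format(DAY));     // day precision, local offset
        System.out.println(t.format(SECOND));  // second precision with offset
        System.out.println(inUtc(t, DAY));     // day precision, UTC: one day earlier
    }
}
```

With day precision the timezone decides which calendar date is written, which is why it is worth setting explicitly rather than relying on the machine default.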

Lots of URLs: a sitemap index file

One sitemap can contain a maximum of 50,000 URLs. (Some sitemaps, like Google News sitemaps, can contain only 1,000 URLs.) If you need to put more URLs than that in a sitemap, you'll have to use a sitemap index file. Fortunately, WebSitemapGenerator can manage the whole thing for you.

WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
for (int i = 0; i < 60000; i++) wsg.addUrl("http://www.example.com/doc"+i+".html");
wsg.write();
wsg.writeSitemapsWithIndex(); // generate the sitemap_index.xml

That will generate two sitemaps for 60K URLs: sitemap1.xml (with 50K URLs) and sitemap2.xml (with the remaining 10K), and then generate a sitemap_index.xml file describing the two.
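
The split is plain ceiling division over the 50,000-URL limit. The sketch below only reproduces that arithmetic; it is not the library's code, and the class, constant, and method names are hypothetical.

```java
// Illustrative arithmetic only; SitemapGen4j does the actual splitting and file naming.
class SitemapSplitSketch {
    static final int MAX_URLS_PER_SITEMAP = 50_000;

    /** Number of sitemap files needed for the given number of URLs (ceiling division). */
    static int sitemapCount(int totalUrls) {
        return (totalUrls + MAX_URLS_PER_SITEMAP - 1) / MAX_URLS_PER_SITEMAP;
    }

    /** Number of URLs that land in the fileNumber-th sitemap (1-based). */
    static int urlsInSitemap(int totalUrls, int fileNumber) {
        int before = (fileNumber - 1) * MAX_URLS_PER_SITEMAP;
        return Math.max(0, Math.min(MAX_URLS_PER_SITEMAP, totalUrls - before));
    }

    public static void main(String[] args) {
        int total = 60_000;
        for (int i = 1; i <= sitemapCount(total); i++) {
            System.out.println("sitemap" + i + ".xml: " + urlsInSitemap(total, i) + " URLs");
        }
    }
}
```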

It's also possible to carefully organize your sub-sitemaps. For example, it's recommended to group URLs with the same changeFreq together (have one sitemap for changeFreq "daily" and another for changeFreq "yearly"), so you can modify the lastMod of the daily sitemap without modifying the lastMod of the yearly sitemap. To do that, just construct your sitemaps one at a time using the WebSitemapGenerator, then use the SitemapIndexGenerator to create a single index for all of them.

WebSitemapGenerator wsg;
// generate foo sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("foo").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/foo"+i+".html");
wsg.write();
// generate bar sitemap
wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .fileNamePrefix("bar").build();
for (int i = 0; i < 5; i++) wsg.addUrl("http://www.example.com/bar"+i+".html");
wsg.write();
// generate sitemap index for foo + bar 
SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.example.com", myFile);
sig.addUrl("http://www.example.com/foo.xml");
sig.addUrl("http://www.example.com/bar.xml");
sig.write();

You could also use the SitemapIndexGenerator to incorporate sitemaps generated by other tools. For example, you might use Google's official Python sitemap generator to generate some sitemaps, and use WebSitemapGenerator to generate some sitemaps, and use SitemapIndexGenerator to make an index of all of them.
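
An index file is itself a small XML document that lists sitemap locations, whichever tool produced them. The following sketch builds one by hand to show the shape of the output; it does not use SitemapIndexGenerator, the class and method names are made up for this example, and only the element names and namespace are taken from the sitemaps.org protocol.

```java
// Illustrative only: SitemapIndexGenerator produces this kind of file for you.
class SitemapIndexSketch {

    /** Builds a sitemap index that points at the given sitemap URLs. */
    static String index(String... sitemapUrls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : sitemapUrls) {
            // Escape the characters that are not allowed raw inside XML text.
            String escaped = url.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
            sb.append("  <sitemap>\n");
            sb.append("    <loc>").append(escaped).append("</loc>\n");
            sb.append("  </sitemap>\n");
        }
        sb.append("</sitemapindex>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(index(
            "http://www.example.com/foo.xml",
            "http://www.example.com/bar.xml"));
    }
}
```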

Validate your sitemaps

SitemapGen4j can also validate your sitemaps using the official XML Schema Definition (XSD). If you used SitemapGen4j to make the sitemaps, you shouldn't need to do this unless there's a bug in our code. But you can use it to validate sitemaps generated by other tools, and it provides an extra level of safety.

It's easy to configure the WebSitemapGenerator to automatically validate your sitemaps right after you write them (but this does slow things down, naturally).

WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.example.com", myDir)
    .autoValidate(true).build(); // validate the sitemap after writing
wsg.addUrl("http://www.example.com/index.html");
wsg.write();

You can also use the SitemapValidator directly to validate sitemaps. It has two methods: validateWebSitemap(File f) and validateSitemapIndex(File f).
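
For readers unfamiliar with XSD validation in general, the sketch below shows the underlying technique with the JDK's javax.xml.validation API. It is not SitemapValidator's implementation, and the inline schema is a deliberately tiny, hypothetical stand-in (no namespace, only urlset/url/loc), not the official sitemap XSD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Illustrative only: a generic XSD check, not SitemapGen4j's SitemapValidator.
class XsdValidationSketch {

    // Hypothetical, heavily simplified schema; NOT the official sitemap XSD.
    static final String TINY_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"urlset\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"url\" maxOccurs=\"unbounded\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"loc\" type=\"xs:anyURI\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if the XML is well-formed and valid against TINY_XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(TINY_XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            // Validation errors and malformed XML both surface as exceptions.
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<urlset><url><loc>http://www.example.com/index.html</loc></url></urlset>";
        String bad = "<urlset><url><title>Home</title></url></urlset>";
        System.out.println(isValid(good));
        System.out.println(isValid(bad));
    }
}
```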

Google-specific sitemaps

Google can understand a wide variety of custom sitemap formats that they made up, including Mobile sitemaps, Geo sitemaps, Code sitemaps (for Google Code search), Google News sitemaps, and Video sitemaps. SitemapGen4j can generate any or all of these types of sitemaps.

To generate a special type of sitemap, just use GoogleMobileSitemapGenerator, GoogleGeoSitemapGenerator, GoogleCodeSitemapGenerator, GoogleNewsSitemapGenerator, or GoogleVideoSitemapGenerator instead of WebSitemapGenerator.

You can't mix-and-match regular URLs with Google-specific sitemaps, so you'll also have to use a GoogleMobileSitemapUrl, GoogleGeoSitemapUrl, GoogleCodeSitemapUrl, GoogleNewsSitemapUrl, or GoogleVideoSitemapUrl instead of a WebSitemapUrl. Each of them has unique configurable options not available to regular web URLs.

Comments
  • Allow to create index even when only (one) sitemap.xml exists

    This is a fix for https://code.google.com/p/sitemapgen4j/issues/detail?id=8 (issue 8). When the total number of URLs is smaller than maxUrls (50,000) and writeSitemapsWithIndex is then used, only sitemap.xml is generated and the count of numbered sitemaps (e.g. sitemap1.xml, sitemap2.xml) is 0. As a result, trying to write the sitemap index file with writeSitemapsWithIndex throws an exception:

     java.lang.RuntimeException: No URLs added, sitemap index would be empty; you must add some URLs with addUrls
    

    This pull request is basically the patch from comment 8.

    @dfabulich Could you please merge and also release a new version? Thanks!

    opened by mkurz 14
  • Escape XML entities

    Having & in a URL causes sitemapgen4j to produce invalid XML:

    WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", dir);
    wsg.addUrl("http://www.example.com/Tips&Tricks.html");
    wsg.write();
    

    Outcome:

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
      <url>
        <loc>http://www.example.com/Tips&Tricks.html</loc>
      </url>
    </urlset>
    

    instead of:

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
      <url>
        <loc>http://www.example.com/Tips&amp;Tricks.html</loc>
      </url>
    </urlset>
    
    opened by eximius313 12
  • Required fields Google News

    According to the documentation, there are some more required fields. I added a couple of those fields.

    Documentation I relied on: https://support.google.com/news/publisher/answer/74288?hl=en

    opened by TomWalbers 6
  • Close streams in finally clause

    The streams have to be closed in a finally clause. Otherwise they might be left open if an exception is thrown in the meantime. (In our production environment this resulted in a "too many open files" error.)

    opened by spekr 5
  • Adding support for custom suffix indexes for sitemap generation

    @dfabulich please review my change, which adds support for a suffix index (which can be some sort of string pattern). It can be turned on or off, and the property is set via the options when generating the sitemaps. For example, in this case it will be:

    WebSitemapGenerator wsg = WebSitemapGenerator.builder( .... all the custom options .. ).build();

    Let me know what you think and your comments please!

    Thanks! Ankita

    opened by anellimarla 4
  • URL using non ASCII characters [utf-8 support]

    I think this old issue is still up to date: http://code.google.com/p/sitemapgen4j/issues/detail?id=16 (even though the sitemap is UTF-8, we can't use URLs with French accents like éèà, etc.).

    I tried the fix described in the issue and it works.

    Best regards, Denis

    opened by dgosset 4
  • In-memory support

    Thanks for sharing this lib, saved me some time already! It would be great if the lib would be able to create a (file-) result without accessing the hard disk. Any plans for this feature?

    opened by cnmuc 4
  • Accessing ISitemapUrl interface throws IllegalAccessError

    Trying to use the following code snippet in my Kotlin project.

    WebSitemapGenerator wsg = new WebSitemapGenerator("http://www.example.com", myDir);
    WebSitemapUrl url = new WebSitemapUrl.Options("http://www.example.com/index.html")
        .lastMod(new Date()).priority(1.0).changeFreq(ChangeFreq.HOURLY).build();
    wsg.addUrl(url);
    wsg.write();
    
    

    I'm getting the following error at runtime.

    Exception in thread "main" java.lang.IllegalAccessError: tried to access class com.redfin.sitemapgenerator.ISitemapUrl from class Main
    	at Main.run(Main.kt:20)
    	at Main$Companion.main(Main.kt:38)
    	at Main.main(Main.kt)
    
    

    The ISitemapUrl interface is not public, which causes the problem when I try to call wsg.addUrl(url).

    Solution:

    • Changing the scope of the ISitemapUrl interface from default to public fixes the issue.
    opened by ramsrib 3
  • siteindex.xsd out of date

    The siteindex.xsd file included is out of date. It only allows 1,000 sitemap URLs. New version from http://www.sitemaps.org/schemas/sitemap/siteindex.xsd will allow up to 50,000 sitemap URLs.

    opened by jiwhiz 3
  • added new generator for google link extension

    I've added the generator for the Google link extension of the sitemap standard, as requested in ticket https://github.com/dfabulich/sitemapgen4j/issues/30

    I also changed a couple more files to close a stream that was left open (it is test code, so there is no risk of a memory leak, but I have FindBugs installed and it was complaining anyway).

    opened by sergiovm 2
  • Added checked exceptions for InvalidURLException

    Hello, I really like this library, but there are far too many unchecked exceptions. This is just a small start on migrating some of the exceptions. This breaks backwards compatibility, but I feel it's important that a library not be filled with so many unchecked/runtime exceptions. Thoughts?

    opened by jamesbrink 2
  • Add image to GoogleLinkSitemapUrl

    Currently we can add alternates by using GoogleLinkSitemapUrl and images by using GoogleImageSitemapUrl. In my case I want to have both at the same time to achieve the following output:

    <url>
      <loc>http://www.example.com/en/product-1</loc>
      <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/de/product-1" />
      <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/en/product-1" />
      <image:image>
       <image:loc>http://www.example.com/image1.jpg</image:loc>
      </image:image>
      <image:image>
       <image:loc>http://www.example.com/image2.jpg</image:loc>
      </image:image>
    </url>
    

    What do you think? My idea is to add images to GoogleLinkSitemapUrl and maybe deprecate GoogleImageSitemapUrl. I will implement it myself but want to confirm the solution with the community first.

    opened by Klapsa2503 0
  • add customizable file name for sitemap_index.xml

    Nothing fancy, but last week I came across a rare situation that needs a non-standard file name for sitemap_index.xml. The name is then referenced in robots.txt to trigger a re-read by Googlebot and possibly fix a badly scored website.

    opened by greenflute 0
  • Possibility to customize the name of sitemap_index.xml in Method com.redfin.sitemapgenerator.SitemapGenerator.writeSitemapsWithIndex()

    Is there any way to customize the name of sitemap_index.xml in the method com.redfin.sitemapgenerator.SitemapGenerator.writeSitemapsWithIndex()? Currently the file name is hard-coded. I must admit that one very seldom needs a different name for the sitemap XML, but it would be nice to be able to parameterize it.

    best Regards

    opened by greenflute 2
  • Missing documentation

    First, thank you for creating sitemapgen4j; it is really appreciated. Sorry if I'm posting this in the wrong place. I couldn't find a full user guide or docs for sitemapgen4j, so I don't know what ChangeFreq is used for. Can someone explain it to me?

    Thank you in advance. Please also point me to the documentation for this project if it already exists.

    opened by hungbang 0
  • Autovalidation for GoogleImageSitemapGenerator throws RuntimeException

    Using auto validation in combination with a GoogleImageSitemapGenerator fails.

    Code extract:

    GoogleImageSitemapGenerator generator = GoogleImageSitemapGenerator
        .builder("https://www.google.com", new File(System.getProperty("java.io.tmpdir")))
        .gzip(false)
        .autoValidate(true)
        .allowEmptySitemap(false)
        .allowMultipleSitemaps(true)
        .build();

    Image image = new Image.ImageBuilder("https://www.google.com/bug.jpg").build();

    generator.addUrl(new GoogleImageSitemapUrl.Options("https://www.google.com/any")
        .images(image)
        .changeFreq(ChangeFreq.DAILY)
        .priority(Priority.DEFAULT.getValue())
        .lastMod(new Date())
        .build());

    generator.write();
    

    Exception:

    Exception in thread "main" java.lang.RuntimeException: Sitemap file failed to validate (bug?)
    	at com.redfin.sitemapgenerator.SitemapGenerator.writeSiteMap(SitemapGenerator.java:280)
    	at com.redfin.sitemapgenerator.SitemapGenerator.write(SitemapGenerator.java:173)
    	at com.redfin.sitemapgenerator.GoogleImageSitemapGenerator.write(GoogleImageSitemapGenerator.java:11)
    	at be.netmediaeurope.promoplatform.promobutler.controllers.sitemap.service.v2.delegate.ProducerDetailSitemapGeneratorDelegate.main(ProducerDetailSitemapGeneratorDelegate.java:108)
    Caused by: org.xml.sax.SAXParseException; lineNumber: 8; columnNumber: 18; cvc-complex-type.2.4.c: The matching wildcard is strict, but no declaration can be found for element 'image:image'.
    

    The sitemap-image.xsd schema file appears to be missing.

    opened by skubski 0
Releases(v1.1.1)
Owner
Dan Fabulich