A scalable web crawler framework for Java.

Overview

Readme in Chinese

A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of specific crawlers.

Features:

  • Simple core with high flexibility.
  • Simple API for HTML extraction.
  • Customize a crawler with annotated POJOs; no configuration needed.
  • Multi-threading and distributed crawling support.
  • Easy to integrate.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.4</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.4</version>
</dependency>

WebMagic uses slf4j with the slf4j-log4j12 binding. If you use your own slf4j implementation, please exclude slf4j-log4j12:

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
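In the actual pom, this `<exclusions>` element sits inside a `<dependency>` declaration. For example (assuming you apply the exclusion to webmagic-extension; apply it to whichever webmagic artifact pulls in the binding):

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.4</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```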

Get Started:

First crawler:

Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information.

public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name")==null){
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
  • page.addTargetRequests(links)

    Add URLs to crawl.

You can also use the annotation style:

@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}

Docs and samples:

Documents: http://webmagic.io/docs/

The architecture of WebMagic (inspired by Scrapy):

There are more examples in the webmagic-samples package.

License:

Licensed under the Apache 2.0 license.

Thanks:

To write WebMagic, I referred to the projects below:

Mail-list:

https://groups.google.com/forum/#!forum/webmagic-java

http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988

QQ Group: 373225642 542327088

Related Project

  • Gather Platform

    A web console based on WebMagic for Spider configuration and management.

Comments
  • WebMagic-Avalon project plan

    WebMagic-Avalon project plan

    The goal of the WebMagic-Avalon project is to build a configurable, manageable crawler, together with a platform for sharing crawler configurations/scripts, so that experienced developers have less to build and people unfamiliar with Java can use a crawler easily.

    ~~Part 1: webmagic-scripts~~

    Goal: make it possible to write a crawler as a simple script, so that reusable scripts can be shared for common scenarios. For example, to crawl GitHub repository data, one could write a script like this (JavaScript):

    https://github.com/code4craft/webmagic/tree/master/webmagic-scripts

    This feature is partially implemented, but the result is still experimental. Participation and feedback are welcome. ~~Since building a stable scripting language requires a lot of domain experience, I later decided to build the back end first.~~

    Part 2: webmagic-panel

    A back end that loads scripts and manages crawlers. Currently under development.

    ~~Part 3: webmagic-market~~

    A site for sharing, searching, and downloading scripts. Planned.

    How to participate

    webmagic is maintained by the author in his spare time, purely for sharing and self-improvement; it makes no profit and there are no plans to commercialize it.

    The following kinds of contribution are welcome:

    1. Suggestions for improving webmagic itself, via the mailing list, QQ, oschina, or (preferred) a GitHub issue.
    2. Discussion of the WebMagic-Avalon plan, including product design and technology choices; reply directly to this issue.
    3. Code contributions: fork the repository, make your changes, and send me a pull request. Please work from the newest version and describe what you changed. Once a pull request is accepted, I will add you as a committer.
    newFeature suspend 
    opened by code4craft 33
  • Cannot crawl HTTPS sites that only support TLS 1.2

    Cannot crawl HTTPS sites that only support TLS 1.2

    WebMagic's default HttpClient only negotiates TLSv1, so sites that support only TLS 1.2 (e.g. https://juejin.im/) fail with:

    javax.net.ssl.SSLException: Received fatal alert: protocol_version
    	at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
    	at sun.security.ssl.Alerts.getSSLException(Alerts.java:154)
    	at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2023)
    	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1125)
    	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:394)
    	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:353)
    	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
    	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
    	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
    	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    	at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
    

    The current fix is to enable the newer protocol versions when building the SSLConnectionSocketFactory in HttpClientGenerator.
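    As a sketch of the idea, an SSLEngine (or, analogously, a socket) can have TLSv1.2 explicitly enabled via the JDK's javax.net.ssl API; the same protocol array is what a patched HttpClientGenerator would pass as the supportedProtocols argument of SSLConnectionSocketFactory. This is a minimal, self-contained illustration, not the actual WebMagic patch; the class name Tls12Workaround is made up.

    ```java
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLEngine;
    import java.util.Arrays;
    import java.util.List;

    public class Tls12Workaround {

        // Build an SSLEngine with TLSv1.2 explicitly enabled. A downloader
        // would pass the same protocol list when creating its socket factory.
        public static List<String> enabledProtocols() throws Exception {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, null, null); // default key/trust managers
            SSLEngine engine = context.createSSLEngine();
            engine.setEnabledProtocols(new String[]{"TLSv1.2"});
            return Arrays.asList(engine.getEnabledProtocols());
        }

        public static void main(String[] args) throws Exception {
            System.out.println(enabledProtocols()); // [TLSv1.2]
        }
    }
    ```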

    bug 
    opened by code4craft 24
  • [Bug] Possible thread-safety problem in exec() of CountableThreadPool

    [Bug] Possible thread-safety problem in exec() of CountableThreadPool

        if (threadAlive.get() >= threadNum) { // 1: this check looks racy
            try {
                reentrantLock.lock();
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();
                    } catch (InterruptedException e) {
                    }
                }
            } finally {
                reentrantLock.unlock();
            }
        }
        threadAlive.incrementAndGet();  // 2
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } finally {
                    try {
                        reentrantLock.lock();
                        threadAlive.decrementAndGet();
                        condition.signal();
                    } finally {
                        reentrantLock.unlock();
                    }
                }
            }
        });
    

    For example, suppose two threads execute this method with threadNum = 10 and threadAlive.get() = 9. Thread 1 evaluates the check at comment 1 as false and is about to run the statement at comment 2, but is descheduled first. Thread 2 now evaluates the check at comment 1; since thread 1 has not yet executed threadAlive.incrementAndGet(), the check is still false, so both threads proceed past the limit. That is not the intended behavior.

    I am not sure whether this understanding is correct; I would appreciate an explanation. Thanks!

    My proposed fix:

        // my modification
        try {
            reentrantLock.lock();
            if (threadAlive.get() >= threadNum)
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();
                    } catch (InterruptedException e) {
                    }
                }
            else
                threadAlive.incrementAndGet();
        } finally {
            reentrantLock.unlock();
        }
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } finally {
                    try {
                        reentrantLock.lock();
                        threadAlive.decrementAndGet();
                        condition.signal();
                    } finally {
                        reentrantLock.unlock();
                    }
                }
            }
        });
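
    The essence of the proposed fix can be distilled to the pattern below: the capacity check and the increment happen under the same lock, so no thread can observe a stale count between them. This is a minimal self-contained sketch, not WebMagic code; BoundedCounter and its method names are made up for illustration.

    ```java
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    public class BoundedCounter {
        private final int max;
        private final AtomicInteger alive = new AtomicInteger();
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notFull = lock.newCondition();

        public BoundedCounter(int max) { this.max = max; }

        // Check and increment under the same lock, so no second thread can
        // slip in between the capacity test and the increment.
        public void acquire() {
            lock.lock();
            try {
                while (alive.get() >= max) {
                    try { notFull.await(); } catch (InterruptedException ignored) { }
                }
                alive.incrementAndGet();
            } finally {
                lock.unlock();
            }
        }

        public void release() {
            lock.lock();
            try {
                alive.decrementAndGet();
                notFull.signal();
            } finally {
                lock.unlock();
            }
        }

        public int current() { return alive.get(); }
    }
    ```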
    
    opened by INotWant 12
  • Download failure: cannot add new requests, cookies expire

    Download failure: cannot add new requests, cookies expire

    Hello, I have a problem: currently addTargetRequest in our PageProcessor only adds new requests when the download succeeds. On failure the request is retried, but once the maximum retry count is reached, no new requests can be added. I tried building and adding all requests up front with addRequests when the Spider starts, but then each Request's cookie header is fixed, so expired cookies cannot be refreshed flexibly. How should this situation be handled? Could adding new Requests on failure be supported?

    question 
    opened by jason2015lxj 10
  • First project fails with com.google.common.collect.Table

    First project fails with com.google.common.collect.Table

    I created a project and imported us.codecraft webmagic-core 0.5.3 and us.codecraft webmagic-extension 0.5.3. Running it throws:

        Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Table
            at com.leozhou.webmagic.GithubRepoPageProcessor.(GithubRepoPageProcessor.java:14)
            at com.leozhou.webmagic.GithubRepoPageProcessor.main(GithubRepoPageProcessor.java:39)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
        Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Table
            at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            ... 7 more

    opened by le0zh0u 10
  • RedisScheduler code in the Maven Central artifact is out of date

    RedisScheduler code in the Maven Central artifact is out of date

    The RedisScheduler class in the Maven Central artifact has not been updated; parts of the updated RedisScheduler code on GitHub are not in the Maven artifact. As a result, only redis-client 2.x can be used, not 3.x and above, because the old code calls pool.returnResource(jedis), which is deprecated; during Redis deduplication, pool.returnResource(jedis) throws IllegalAccessError. Please republish the jar to the Maven Central repository as a small version bump. (After investigation, this failure turned out to be caused by the Maven Central artifact not using the updated code; a real trap!)

    opened by 404area 9
  • How can I stop the task in PageProcessor based on my own condition, e.g. a news ID?

    How can I stop the task in PageProcessor based on my own condition, e.g. a news ID?

    How can I effectively stop the task in PageProcessor based on my own condition, such as a news ID? How do I stop the task inside the method below?

        public void process(Page page) {
            // list page
            if (page.getUrl().regex(URL_LIST).match()) {
                if (isFlag) {
                    spider.stop();
                } else {
                    page.addTargetRequests(page.getHtml().xpath("//div[@class=\"list-con\"]").links().regex(URL_GET).all());
                    page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all());
                }
            }
        }

    opened by lidaoyang 9
  • Entry URL added by addUrl is filtered by the Scheduler, so the next run cannot start

    Entry URL added by addUrl is filtered by the Scheduler, so the next run cannot start

    I added the entry with Spider.create(HahaProcessor()).addUrl("http://www.haha.mx/rising") and set a FileCacheQueueScheduler as the Scheduler. After one crawl, this URL is marked as crawled. In fact, I fetch the hottest items from this page every half hour, so I want this entry URL never to be filtered, while content-page URLs are still deduplicated as usual. How can I do that?

    question 
    opened by JsonSong89 9
  • Question about regex matching, please help

    Question about regex matching, please help

        import us.codecraft.webmagic.Page;
        import us.codecraft.webmagic.Site;
        import us.codecraft.webmagic.Spider;
        import us.codecraft.webmagic.pipeline.JsonFilePipeline;
        import us.codecraft.webmagic.processor.PageProcessor;

        /**
         * @author [email protected]
         */
        public class Client58Processor implements PageProcessor {

            public static final String URL_LIST = "http://cs\.58\.com/zhaozu/pn\d+\/pve_1092_1/?PGTID=0d30000d-0019-\w+\-\w+\-\w+\&ClickID=1";

            public static final String URL_POST = "http://cs\.58\.com/zhaozu/\w+\.shtml?psid=124606730193983807135678330&entinfo=\w+\&iuType=p_0&PGTID=0d30000d-0019-e991-490c-51f69d09bf8f&ClickID=\w+\";

            private Site site = Site.me().setDomain("cs.58.com").setSleepTime(6000)
                    .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");

            @Override
            public void process(Page page) {
                // list page
                if (page.getUrl().regex(URL_LIST).match()) {
                    page.addTargetRequests(page.getHtml().xpath("//div[@class=\"tbimg\"]").links().regex(URL_POST).all());
                    page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all());
                // article page
                } else {
                    /* page.putField("title", page.getHtml().xpath("//div[@class='headline']"));
                       page.putField("content", page.getHtml().xpath("//div[@class='infocs']")); */
                    page.putField("title", page.getHtml().xpath("//div[@class='filterbar']"));
                    page.putField("content", page.getHtml().xpath("//div[@class='filterbar']"));
                    /* page.putField("date", page.getHtml().xpath("//div[@id='articlebody']//span[@class='time SG_txtc']")); */
                }
            }

            @Override
            public Site getSite() {
                return site;
            }

            public static void main(String[] args) {
                Spider.create(new Client58Processor())
                        .addUrl("http://cs.58.com/zhaozu/pn1/pve_1092_1/?PGTID=0d30000d-0019-e4b5-3eff-f37b576ebd78&ClickID=1")
                        .addPipeline(new JsonFilePipeline("D:\webmagic58\"))
                        .run();
            }
        }

    Given the code above, how do I match the list data? The regex does not match.

    question 
    opened by EdronCai 8
  • The org.apache.commons.lang3 dependency of webmagic-core is unavailable, please advise!

    The org.apache.commons.lang3 dependency of webmagic-core is unavailable, please advise!

    Tool: IDEA. Build: Maven. After creating a project and depending on webmagic-core and webmagic-extension via the pom, running the official demo fails with java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils. PS: the project already includes the org.apache.commons.lang3 jar, but webmagic-core keeps failing. Frustrating.

    bug 
    opened by KfcRun 8
  • Error when running the first demo

    Error when running the first demo

    The demo at http://webmagic.io/docs/zh/posts/ch2-install/first-project.html does not run successfully. I then tried the demo below it, and the demo on GitHub; the errors are identical. I'm new to crawlers; any guidance from the author would be appreciated.

    Error message:

        Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Table
            at test.GithubRepoPageProcessor.<init>(GithubRepoPageProcessor.java:55)
            at test.GithubRepoPageProcessor.main(GithubRepoPageProcessor.java:75)
        Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Table
            at java.net.URLClassLoader$1.run(Unknown Source)
            at java.net.URLClassLoader$1.run(Unknown Source)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(Unknown Source)
            at java.lang.ClassLoader.loadClass(Unknown Source)
            at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
            at java.lang.ClassLoader.loadClass(Unknown Source)
            ... 2 more

    opened by theoneLee 8
  • Please support jQuery-style traversal APIs

    Please support jQuery-style traversal APIs

    CSS selectors are limited. For example, in both axes CSS can only select forward, i.e. only child elements and following siblings, not parent elements or preceding siblings. The jQuery API solves this: $("selector").parent() returns the parent of the matched element, and from there I can select the parent's following sibling with $("selector").parent().next(). See https://www.w3school.com.cn/jquery/jquery_ref_traversing.asp for the relevant API.

    opened by w3l7 0
  • Integrate URLFrontier as a backend for URL storage

    Integrate URLFrontier as a backend for URL storage

    Hi,

    This is a question, not a bug report.

    url-frontier is an API to define a crawl frontier. It uses gRPC and has a service implementation. It is crawler-neutral and is suitable to being used in a distributed environment.

    How could URLFrontier fit within Webmagic? Is there a similar abstraction over the URL storage? Is there something missing in URLFrontier's API?

    Thanks

    opened by jnioche 0
  • What is the point of CountableThreadPool?

    What is the point of CountableThreadPool?

       public void execute(final Runnable runnable) {
    
    
            if (threadAlive.get() >= threadNum) {
                try {
                    reentrantLock.lock();
                    while (threadAlive.get() >= threadNum) {
                        try {
                            condition.await();
                        } catch (InterruptedException e) {
                        }
                    }
                } finally {
                    reentrantLock.unlock();
                }
            }
            threadAlive.incrementAndGet();
            executorService.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        runnable.run();
                    } finally {
                        try {
                            reentrantLock.lock();
                            threadAlive.decrementAndGet();
                            condition.signal();
                        } finally {
                            reentrantLock.unlock();
                        }
                    }
                }
            });
        }
    

    In this method, reading the current number of active threads is not protected against concurrency, so the lock does not actually accomplish anything.

    opened by MaLuxray 0
  • Is there a difference between setting a proxy on the downloader and on the Site?

    Is there a difference between setting a proxy on the downloader and on the Site?

    I found that a proxy can be set on the Site object.

    A proxy can also be set on the downloader:

            HttpClientDownloader downloader = new HttpClientDownloader();
            downloader.setProxyProvider(SimpleProxyProvider.from(
                    new Proxy("127.0.0.1", 7890)
            ));
    
            Spider.create(this)
                    .addUrl(starturl)
                    .addPipeline(new XiannvPipeline())
                    .setDownloader(downloader)
                    .runAsync();
    

    My question is: are these two approaches equivalent, and why is it designed this way?

    opened by nesteiner 1
Releases(WebMagic-0.8.0)
  • WebMagic-0.8.0(Nov 23, 2022)

  • WebMagic-0.7.6(Oct 24, 2022)

    What's Changed

    • perfect Spider.run to avoid some rare concurrent issue, change the Sp… by @carl-don-it in https://github.com/code4craft/webmagic/pull/1033
    • [Snyk] Security upgrade org.jruby:jruby from 9.2.14.0 to 9.3.0.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1036
    • [Snyk] Security upgrade com.jayway.jsonpath:json-path from 2.5.0 to 2.6.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1048
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 1.2.0 to 1.3.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1050
    • [Snyk] Security upgrade net.sourceforge.htmlcleaner:htmlcleaner from 2.9 to 2.26 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1056
    • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.0-rc1 to 2.13.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1060
    • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.0 to 2.13.2 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1063
    • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.2 to 2.13.2.1 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1064
    • [Snyk] Security upgrade org.jetbrains.kotlin:kotlin-stdlib from 1.1.2-2 to 1.6.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1065
    • change dependency versions into properties by @davidhsing in https://github.com/code4craft/webmagic/pull/1067
    • [Snyk] Security upgrade com.alibaba:fastjson from 1.2.75 to 1.2.83 by @code4craft in https://github.com/code4craft/webmagic/pull/1071
    • [Snyk] Security upgrade us.codecraft:xsoup from 0.3.2 to 0.3.4 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1072
    • [Snyk] Security upgrade org.seleniumhq.selenium:selenium-java from 3.141.59 to 4.0.0 by @code4craft in https://github.com/code4craft/webmagic/pull/1075
    • Common the downloader status process and pass error information when … by @vioao in https://github.com/code4craft/webmagic/pull/1082
    • Revert "Common the downloader status process and pass error information when …" by @sutra in https://github.com/code4craft/webmagic/pull/1083
    • Common downloader error process by @vioao in https://github.com/code4craft/webmagic/pull/1085
    • Enhance Jsoup could parse tr td tag directly by @vioao in https://github.com/code4craft/webmagic/pull/1086
    • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.2.1 to 2.13.4 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1087
    • [Snyk] Security upgrade us.codecraft:xsoup from 0.3.4 to 0.3.6 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1088
    • `:` is an illegal character and cannot be used in a file name by @jialigit in https://github.com/code4craft/webmagic/pull/762
    • [Snyk] Security upgrade com.fasterxml.jackson.core:jackson-databind from 2.13.4 to 2.13.4.2 by @code4craft in https://github.com/code4craft/webmagic/pull/1089

    New Contributors

    • @carl-don-it made their first contribution in https://github.com/code4craft/webmagic/pull/1033
    • @davidhsing made their first contribution in https://github.com/code4craft/webmagic/pull/1067
    • @vioao made their first contribution in https://github.com/code4craft/webmagic/pull/1082
    • @jialigit made their first contribution in https://github.com/code4craft/webmagic/pull/762

    Full Changelog: https://github.com/code4craft/webmagic/compare/WebMagic-0.7.5...WebMagic-0.7.6

    Source code(tar.gz)
    Source code(zip)
  • WebMagic-0.7.5(Sep 2, 2022)

    What's Changed

    • [Fix] #698 Fix Request losing its extra information when using Redis by @jianyun8023 in https://github.com/code4craft/webmagic/pull/702
    • [Fix] Correct a wrong method name by @jianyun8023 in https://github.com/code4craft/webmagic/pull/703
    • fix the typo by @aristotll in https://github.com/code4craft/webmagic/pull/658
    • [Snyk] Fix for 2 vulnerabilities by @snyk-bot in https://github.com/code4craft/webmagic/pull/895
    • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in https://github.com/code4craft/webmagic/pull/897
    • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in https://github.com/code4craft/webmagic/pull/899
    • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in https://github.com/code4craft/webmagic/pull/908
    • [Snyk] Fix for 1 vulnerabilities by @snyk-bot in https://github.com/code4craft/webmagic/pull/911
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/914
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/920
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/922
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/923
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 0.11.0 to 1.0.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/925
    • [Snyk] Fix for 15 vulnerable dependencies by @snyk-bot in https://github.com/code4craft/webmagic/pull/889
    • [Snyk] Fix for 1 vulnerable dependencies by @snyk-bot in https://github.com/code4craft/webmagic/pull/882
    • Add unit tests for us.codecraft.webmagic.utils.NumberUtils by @ThomasPerkins1123 in https://github.com/code4craft/webmagic/pull/885
    • build: manage plugin version & remove build WARNING by @leeyazhou in https://github.com/code4craft/webmagic/pull/939
    • [Snyk] Security upgrade com.alibaba:fastjson from 1.2.68 to 1.2.69 by @snyk-bot in https://github.com/code4craft/webmagic/pull/946
    • [Snyk] Security upgrade org.apache.httpcomponents:httpclient from 4.5.12 to 4.5.13 by @snyk-bot in https://github.com/code4craft/webmagic/pull/955
    • [Snyk] Security upgrade junit:junit from 4.13 to 4.13.1 by @snyk-bot in https://github.com/code4craft/webmagic/pull/957
    • [Snyk] Security upgrade com.google.guava:guava from 29.0-jre to 30.0-android by @snyk-bot in https://github.com/code4craft/webmagic/pull/959
    • Allow sub-tasks to use different downloaders by @itranlin in https://github.com/code4craft/webmagic/pull/974
    • Add to and modify the proxy functionality by @yaoqiangpersonal in https://github.com/code4craft/webmagic/pull/976
    • Remove useless imports to fix build. by @sutra in https://github.com/code4craft/webmagic/pull/977
    • Add validation in SpiderStatus.getPagePerSecond() to avoid null pointers and division by zero by @yqia182 in https://github.com/code4craft/webmagic/pull/993
    • Add getters for List fields so SpiderMonitor subclasses can access them by @thebirdandfish in https://github.com/code4craft/webmagic/pull/1000
    • [Snyk] Security upgrade com.github.dreamhead:moco-core from 1.1.0 to 1.2.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1011
    • Add an example of resumable crawling by @linweisen in https://github.com/code4craft/webmagic/pull/1013
    • Update to Jedis 3.6.0 by @gkorland in https://github.com/code4craft/webmagic/pull/1025
    • [Snyk] Security upgrade com.jayway.jsonpath:json-path from 2.4.0 to 2.6.0 by @snyk-bot in https://github.com/code4craft/webmagic/pull/1029

    New Contributors

    • @jianyun8023 made their first contribution in https://github.com/code4craft/webmagic/pull/702
    • @aristotll made their first contribution in https://github.com/code4craft/webmagic/pull/658
    • @ThomasPerkins1123 made their first contribution in https://github.com/code4craft/webmagic/pull/885
    • @leeyazhou made their first contribution in https://github.com/code4craft/webmagic/pull/939
    • @itranlin made their first contribution in https://github.com/code4craft/webmagic/pull/974
    • @yaoqiangpersonal made their first contribution in https://github.com/code4craft/webmagic/pull/976
    • @sutra made their first contribution in https://github.com/code4craft/webmagic/pull/977
    • @yqia182 made their first contribution in https://github.com/code4craft/webmagic/pull/993
    • @thebirdandfish made their first contribution in https://github.com/code4craft/webmagic/pull/1000
    • @linweisen made their first contribution in https://github.com/code4craft/webmagic/pull/1013
    • @gkorland made their first contribution in https://github.com/code4craft/webmagic/pull/1025

    Full Changelog: https://github.com/code4craft/webmagic/compare/WebMagic-0.7.3...WebMagic-0.7.5

    Source code(tar.gz)
    Source code(zip)
  • WebMagic-0.7.3(Jul 30, 2017)

    This release adds some features to the Downloader module.

    • #609 Fix a bug where HttpRequestBody could not be deserialized because it had no default constructor.
    • #631 The static factory methods of HttpRequestBody no longer throw the checked UnsupportedEncodingException.

    • #571 Page gains a bytes field for fetching binary data. When downloading a purely binary page, set request.setBinarayContent(true) so the binary content is not converted to a String, reducing overhead.

    • #629 HttpUriRequestConverter now automatically escapes or filters characters that would make the URI invalid.

    • #610 Charset auto-detection now recognizes an upper-case charset in Content-Type.
    • #627 The page charset can be set per Request, for sites that mix multiple encodings.
    • #613 Page gains a charset field, holding the charset set on the request/site, or the auto-detected charset when not defined.

    • #606 Upgrade jsonpath to 2.4.0
    • #608 Upgrade jsoup to 1.10.3

    Source code(tar.gz)
    Source code(zip)
    webmagic-0.7.3-all.tar.gz(5.07 MB)
  • WebMagic-0.7.2(Jun 17, 2017)

    This release fixes several bugs from versions 0.7.0-0.7.1.

    1. #594 HttpRequestBody in Request now implements the Serializable interface.
    2. #596 Fix proxy authentication, which had been broken since 0.7.0.
    3. #601 Improve the error message for abnormal page status.
    4. #605 Fix monitoring errors caused by onSuccess and onError being called repeatedly since 0.7.0.
    Source code(tar.gz)
    Source code(zip)
    webmagic-0.7.2-all.tar.gz(5.02 MB)
  • WebMagic-0.7.1(Jun 4, 2017)

    This release contains a few significant bugfixes plus improvements to long-standing issues.

    • Fix the bug introduced in 0.7.0 that made RedisScheduler unusable. #583
    • In annotation mode, JsonPath now uses RawText as its source by default, so content is no longer automatically wrapped in <html> tags and broken by parsing. #589
    • In earlier versions, RegexSelector matched group 1 by default and wrapped regexes without a capturing group in parentheses to unify extraction. Since 0.7.1 the regex is left untouched, and extraction chooses between group 0 and group 1 instead; see #559. The new approach reduces errors for some special constructs, such as zero-width assertions (#556).
    • Refactor the ObjectFormatter part and fix a bug where ObjectFormatter could not initialize its parameters. #570
    Source code(tar.gz)
    Source code(zip)
    webmagic-0.7.1-all.tar.gz(5.02 MB)
  • WebMagic-0.7.0(May 29, 2017)

    This release rewrites HttpClientDownloader, completes support for POST and other HTTP methods, and rewrites the proxy API to be simpler and easier to extend.

    POST support

    • New POST API supporting various RequestBody types #513
    Request request = new Request("http://xxx/path");
    request.setMethod(HttpConstant.Method.POST);
    request.setRequestBody(HttpRequestBody.json("{'id':1}","utf-8"));
    
    • Removed the old way of setting NameValuePair via request.extra
    • POST requests are no longer deduplicated #484

    Proxy support

    • New proxy API ProxyProvider, which can be freely extended
    • The default implementation, SimpleProxyProvider, is a simple round-robin over any number of proxies.
    HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
    SimpleProxyProvider proxyProvider = SimpleProxyProvider.from(new Proxy("127.0.0.1", 1087), new Proxy("127.0.0.1", 1088));
    httpClientDownloader.setProxyProvider(proxyProvider);
    
    • Removed the proxy settings on Site (setProxy etc.); proxy configuration now lives on HttpClientDownloader.

    The new SimpleHttpClient

    • For simple one-off download-and-parse tasks, SimpleHttpClient is sufficient
    SimpleHttpClient simpleHttpClient = new SimpleHttpClient();
    GithubRepo model = simpleHttpClient.get("github.com/code4craft/webmagic",GithubRepo.class);
    

    Other changes

    • Add status code and HTTP headers to Page #406
    • Support setting HTTP headers and cookies per Request
    • Remove Site.addStartRequest(); use Spider.addStartRequest() instead #494
    • Major refactor of HttpClientDownloader: Request conversion is abstracted into HttpUriRequestConverter (existing subclasses of HttpClientDownloader may need corresponding changes) #524
    • Move the CycleRetry and statusCode logic from Downloader into Spider #527
    • Use Page.isDownloadSuccess, rather than a null Page object, to detect download failure
    • Add an option for PageModel not to discover new URLs #575
    • Add a disableCookieManagement property to Site, for when cookies are unwanted #577
    Source code(tar.gz)
    Source code(zip)
    webmagic-0.7.0-all.tar.gz(4.52 MB)
  • webmagic-parent-0.6.1(Jan 21, 2017)

    This release fixes some 0.6.0 issues and adds small optimizations.

    • Change the default policy to trust all HTTPS certificates #444 @ckex
    • Fix a null pointer when adding URLs via startUrls while cookies are in use #438
    • PhantomJSDownloader supports a custom crawl.js path #414 @jsbd
    • POST requests follow 302 redirects #443 @xbynet

    Note: trusting all certificates carries a risk of content forgery, but it was added for crawling convenience; users must judge the safety of the content themselves.

    Source code(tar.gz)
    Source code(zip)
    webmagic-0.6.1-all.tar.gz(4.97 MB)
  • WebMagic-0.6.0(Dec 18, 2016)

    This release mainly upgrades dependencies and fixes bugs.

    • #290 Add username/password authentication for proxies @hepan

    • #194 Refactor the proxy pool code to support custom proxy pools @EdwardsBean

    • #314 Fix an error caused by old json-path versions depending on StringUtils from 2.x

    • #380 Upgrade fastjson to 1.2.21

    • #301 Fix JsonPath being unusable in annotation mode @Salon-sai

    • #377 Fix an error in the monitoring module when the URL contains a port

    • #400 Fix a NullPointerException in FileCacheQueueScheduler

    • #407 Add a constructor to PhantomJSDownloader that supports a custom phantomjs command @jsbd

    • #419 Fix threads crawling https links never finishing, which kept the process running forever @cpaladin

    • #374 Upgrade HttpClient to 4.5.2, fixing some security issues

    • #424 Remove the Guava dependency

      Different Guava versions are poorly compatible and often broke the demo, so I finally decided to drop the Guava dependency. Users of BloomFilterDuplicateRemover need to add a Guava dependency themselves.

    • #426 Remove the Avalon packages

      Avalon was the one-stop crawling platform planned earlier. A friend built a similar implementation based on WebMagic, Gather Platform, so Avalon was abandoned in favor of supporting that project. The WebMagic core will focus on being an in-application framework.

    Source code(tar.gz)
    Source code(zip)
    webmagic-0.6.0-all.tar.gz(4.97 MB)
  • webmagic-0.5.3(Jan 21, 2016)

    After a year and a half, the author is finally back. This release mainly fixes earlier bugs; features will continue to be completed gradually.

    • Upgrade Xsoup to 0.3.1, supporting the //div[contains(@id,'te')] syntax.
    • #245 Upgrade Jsoup to 1.8.3, fixing a binary incompatibility in the n-th selector.
    • #139 Fix the save path of JsonFilePipeline
    • #144 Fix links not being extracted when @TargetUrl is combined with SourceRegion
    • #157 Fix deduplication in FileCacheQueueScheduler occasionally not working @zhugw
    • #188 Add a retry interval, 1 second by default @edwardsbean
    • #193 Fix a possible concurrency problem in the pagination feature MultiPagePipeline @edwardsbean
    • #198 Fix site.setHttpProxy() having no effect @okuc
    Source code(tar.gz)
    Source code(zip)
  • WebMagic-0.5.2(Jun 4, 2014)

    此次主要包括对于Selector部分的重构,以及一些功能的完善和一些Bugfix。

    • 重构了Selector部分,使得结构更清晰,并且能够更好的支持链式的XPath抽取了。 [Issue #113]

    • 支持对于选取出来的结果,进行外部迭代。例如:

      List<Selectable> divs = html.xpath("//div").nodes();
      for (Selectable div : divs) {
          System.out.println(div.xpath("//h2").get());
      }
      
    • 增强自动编码识别机制,现在除了从HTTP头中,还会从HTML Meta信息中判断编码,感谢@fengwuze @sebastian1118提交代码和建议。[Issue #126]

    • 升级Xsoup版本为0.2.4,增加了判断XPath最终抽取结果(是元素还是属性)的API,完善了一些特殊字符处理的功能。

    • 增加PageMapper功能,以后可以在任何地方使用注解模式来解析页面了![Issue #120] 例如:

      public void process(Page page) {
              //新建Mapper,GithubRepo是一个带注解的POJO
              PageMapper<GithubRepo> githubRepoPageMapper = new PageMapper<GithubRepo>(GithubRepo.class);
              //直接解析页面,得到解析后的结果
              GithubRepo githubRepo = githubRepoPageMapper.get(page);
              page.putField("repo",githubRepo);
          }
      
    • 增加多个代理以及智能切换的支持,感谢@yxssfxwzy 贡献代码,使用Site.setHttpProxyPool可开启此功能。[Pull #128]

      public void process(Page page) {
          Site site = Site.me().setHttpProxyPool(
                  Lists.newArrayList(
                          new String[]{"192.168.0.2","8080"},
                          new String[]{"192.168.0.3","8080"}));
      }
      

    Bugfix:

    • Fixed JsonFilePipeline not creating directories automatically. [Issue #122]
    • Fixed improper matching of special characters when removing JSONP padding via removePadding. [Issue #124]
    • Fixed a type-conversion error when the result selected by JsonPathSelector is not a String. [Issue #129]
    Source code(tar.gz)
    Source code(zip)
    all-dependencies-0.5.2.tar.gz(6.14 MB)
    webmagic-core-0.5.2-sources.jar(54.57 KB)
    webmagic-core-0.5.2.jar(93.15 KB)
    webmagic-extension-0.5.2-sources.jar(45.77 KB)
    webmagic-extension-0.5.2.jar(91.68 KB)
    xsoup-0.2.4-sources.jar(14.89 KB)
    xsoup-0.2.4.jar(38.83 KB)
  • WebMagic-0.5.1(May 2, 2014)

    This update mainly contains changes to Scheduler; users who have customized their own Scheduler are strongly encouraged to upgrade.

    • Fixed a bug where RedisScheduler failed to deduplicate; thanks to @codev777 for testing carefully and finding the problem. #117
    • Refactored Scheduler and added the DuplicateRemover interface, abstracting deduplication out on its own so that different strategies can be chosen within the same Scheduler. #118
    • Added BloomFilter-based deduplication. A Bloom filter is a data structure that can deduplicate a huge number of URLs with very little memory, at the cost of a small fraction of non-duplicate URLs (under 0.5%) being misjudged as duplicates and lost.

    Use the following to switch deduplication from the default HashSet to BloomFilter:

    spider.setScheduler(new QueueScheduler()
            .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000))); // 10000000 is the estimated number of pages
    
    Source code(tar.gz)
    Source code(zip)
    all-dependencies-0.5.1.tar.gz(6.12 MB)
    webmagic-core-0.5.1-sources.jar(47.00 KB)
    webmagic-core-0.5.1.jar(77.53 KB)
    webmagic-extension-0.5.1-sources.jar(43.59 KB)
    webmagic-extension-0.5.1.jar(86.50 KB)
  • WebMagic-0.5.0(Apr 27, 2014)

    This update mainly adds monitoring and rewrites the multi-threading part, greatly improving performance under multiple threads. It also includes some annotation-mode optimizations and support for multiple pages.

    Overall project progress:

    • The official site webmagic.io is live! Also launched is the detailed official documentation at http://webmagic.io/docs, making WebMagic easier to use than ever!
    • Three new collaborators, @ccliangbo @ouyanghuangzheng @linkerlin, joined to help maintain the project.
    • The official forum http://bbs.webmagic.io/ and the official QQ group 373225642 are online; community building will get more attention from now on.

    Monitoring:

    Multi-threading:

    • Rewrote the multi-threading part, fixing the main dispatch thread being blocked by worker threads; efficiency under multi-threading is greatly improved, and all users are encouraged to upgrade. #110
    • Added a timeout to the wait/notify mechanism the main thread uses while waiting for new URLs, preventing the crawler from occasionally hanging. #111

    Extraction API:

    • Added JSON support: page.getJson().jsonPath() parses an AJAX response with JsonPath, and page.getJson().removePadding().jsonPath() parses a JSONP response. #101
    • Fixed a Selectable caching issue that made two reads return inconsistent results. #73 Thanks to @seveniu for finding the problem.
    • A Spider can now have multiple PageProcessors, dispatched by URL; thanks to @sebastian1118 for the patch. Usage example: PatternProcessorExample #86
    • Fixed nth-of-type selection not working for uncommon tags (e.g. //div/svg[2]). #75
    • Fixed parsing failures when an XPath contains special characters, even escaped ones. #77
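
    The JSON APIs above might be used like this. This is a hedged sketch: the JsonPath expressions and field names are made up for illustration, and the exact removePadding signature may differ.

```java
// Sketch of the JSON extraction API described above (illustrative fields).
public void process(Page page) {
    // Plain JSON (AJAX) response: extract with JsonPath
    page.putField("name", page.getJson().jsonPath("$.name").get());
    // JSONP response: strip the callback padding first, then apply JsonPath
    page.putField("id", page.getJson().removePadding().jsonPath("$.id").get());
}
```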

    Annotation mode:

    • Annotation mode now supports inheritance! Annotations on a parent class also apply to its subclasses. #103
    • Fixed annotation mode possibly not taking effect when one Spider uses multiple Models; thanks to @ccliangbo for finding the problem. #85
    • Fixed only one URL being extracted from a sourceRegion; thanks to @jsinak for finding the problem. #107
    • Fixed a bug in the automatic type-conversion Formatter; custom Formatters are now supported. If you are new to Formatter, see: type conversion of results in annotation mode. #100

    Other components:

    • Downloader now supports HTTP methods besides GET, including POST, HEAD, PUT, DELETE and TRACE; thanks to @usenrong for the suggestion. #108
    • Cookies set on Site can now specify a domain instead of being limited to the default one. #109
    • When setScheduler() is called and the previous Scheduler already holds URLs, they are transferred to the new Scheduler first to avoid losing URLs. #104
    • Removed log4j.xml from the release package to avoid conflicts with user applications; thanks to @cnjavaer for finding the problem. #82
    Source code(tar.gz)
    Source code(zip)
    all-dependencies-0.5.0.tar.gz(6.12 MB)
    webmagic-core-0.5.0-sources.jar(45.19 KB)
    webmagic-core-0.5.0.jar(75.14 KB)
    webmagic-extension-0.5.0-sources.jar(43.58 KB)
    webmagic-extension-0.5.0.jar(86.42 KB)
  • webmaigc-0.4.3(Mar 13, 2014)

    Bugfix:

    • Fix cycleRetryTimes not working #58 #60 #62 @yxssfxwzy
    • Fix NullPointerException in FileCachedQueueScheduler #53 @xuchaoo
    • Fix Selenium driver not quitting #57 @d0ngw

    Enhancement:

    • Enhance the RegexSelector group check #51 @SimpleExpress
    • Add XPath syntax support: contains, or/and, "|" #64
    • Add text attribute selection to CssSelector #66
    • Change logger to slf4j #55
    • Update HttpClient version to 4.3.3 #59
    Source code(tar.gz)
    Source code(zip)
  • webmagic-0.4.2(Dec 3, 2013)

    Enhancement: #45 Remove the multi option from ExtractBy; whether a field is multi is now auto-detected from its type. Bugfix: #46 Downloader thread hangs up sometimes.

    Source code(tar.gz)
    Source code(zip)
  • webmagic-0.4.1(Nov 28, 2013)

    More support for ajax:

    • #39 Parsing html after page.getHtml()
    • #42 Add jsonpath support in annotation mode
    • #35 Add more http info to page
    • #41 Add more status monitor method to Spider
    Source code(tar.gz)
    Source code(zip)
  • webmagic-0.4.0(Nov 6, 2013)

    Improve performance of Downloader.

    • Update HttpClient to 4.3.1 and rewrite the code of HttpClientDownloader #32.
    • Use gzip by default to reduce the transport cost #31.
    • Enable HTTP Keep-Alive and connection persistence, fixing the wrong usage of PoolConnectionManager #30.

    The performance of Downloader is improved by 90% in my test. Test code: Kr36NewsModel.java.

    Add a synchronous API for small tasks #28.

            OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), BaiduBaike.class);
            BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
            System.out.println(baike);
    

    More config for Site

    • Http proxy support by Site.setHttpProxy #22.
    • More http header customizing support by Site.addHeader #27.
    • Allow disable gzip by Site.setUseGzip(false).
    • Move Site.addStartUrl to Spider.addUrl, because I think startUrl is more a property of Spider than of Site.
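
    Taken together, the Site options above might be combined like this. This is an illustrative sketch: the proxy host, header values and url are placeholders, and setHttpProxy is assumed to take an HttpClient HttpHost.

```java
// Sketch combining the 0.4.0 Site options listed above.
Site site = Site.me()
        .setHttpProxy(new HttpHost("127.0.0.1", 8080)) // #22: http proxy
        .addHeader("Referer", "https://github.com")    // #27: custom header
        .setUseGzip(false);                            // opt out of gzip compression
```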

    Code refactor in Spider

    • Refactor the multi-thread part of Spider and fix some concurrency problems.
    • Import Google Guava API for simpler code.
    • Allow adding a request with more information via Spider.addRequest() instead of addUrl #29.
    • Allow downloading only the start urls, without spawning the urls extracted from pages, via Spider.setSpawnUrl(false).
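
    A minimal sketch of the new Spider options mentioned above (the url is a placeholder and GithubRepoPageProcessor is the example processor from the README):

```java
// Sketch of Spider.addRequest() and setSpawnUrl(false).
Spider.create(new GithubRepoPageProcessor())
        // addRequest() can carry more information than a bare url
        .addRequest(new Request("https://github.com/code4craft/webmagic"))
        // download only the start urls; do not follow extracted links
        .setSpawnUrl(false)
        .run();
```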
    Source code(tar.gz)
    Source code(zip)
  • webmagic-0.3.2(Sep 23, 2013)

    • #13 Add type casting to the annotation crawler.
    • Allow customizing the class caster by implementing ObjectFormatter and registering it via ObjectFormatters.put().
    • Fix a thread-pool rejection exception when calling Spider.stop()
    Source code(tar.gz)
    Source code(zip)
  • webmagic-parent-0.3.1(Sep 8, 2013)

    • #26 Bugfix: Annotation extractor does not work.
    • #25 Bugfix: UrlUtils.canonicalizeUrl does not work with "../../" path.
    • #24 Enhancement: Add stop method to Spider.
    Source code(tar.gz)
    Source code(zip)
  • webmagic-0.3.0(Sep 4, 2013)

    • Change default XPath selector from HtmlCleaner to Xsoup.

      Xsoup is an XPath selector based on Jsoup written by me. It has much better performance than HtmlCleaner.

      Time of processing a page is reduced from 7~9ms to 0.4ms.

      If Xsoup is not stable for your usage, just use Spider.xsoupOff() to turn it off, and report an issue to me!

    • Add cycle retry times for Site.

      When cycle retry times is set, Spider will put a url whose download failed back into the scheduler and retry it after a full cycle of the queue.
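
      Enabling it might look like the sketch below; the setter name setCycleRetryTimes is an assumption based on the feature name.

```java
// Sketch: with cycle retry set, a failed url goes back to the scheduler
// and is retried after a full cycle of the queue, up to 3 times here.
private Site site = Site.me()
        .setCycleRetryTimes(3) // assumed setter name
        .setSleepTime(1000);
```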

    Source code(tar.gz)
    Source code(zip)
  • webmagic-parent-0.2.1(Aug 20, 2013)

    ComboExtractor support for annotation.

    Request priority support (using PriorityScheduler).

    Complete some I18n work (comments and documents).

    More convenient extractor API:

    • Add attribute selection by name to CssSelector.

    • The capture group of a regex selector can be specified.

    • Add OrSelector.

    • Add Selectors, import static Selectors.* for fluent API such as:

      or(regex("<title>(.*)</title>"), xpath("//title"), $("title")).select(s);
      
    • Add JsonPathSelector for Json parse.

    Source code(tar.gz)
    Source code(zip)
  • version-0.2.0(Aug 30, 2013)

    The theme of this update is "convenience" (the previous theme was "flexibility").

    Added the webmagic-extension module.

    Added annotation support: a crawler can now be written as a POJO plus annotations, which fits Java development habits better. Below is the complete code for crawling a blog:

        @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
        public class OschinaBlog {
    
            @ExtractBy("//title")
            private String title;
    
            @ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
            private String content;
    
            @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
            private List<String> tags;
    
            public static void main(String[] args) {
                OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
                new ConsolePageModelPipeline(), OschinaBlog.class)
                .scheduler(new RedisScheduler("127.0.0.1")).thread(5).run();
            }
    
        }
    

    Added a Spider.test(url) method for debugging while developing a crawler.

    Added Redis-based distributed support.

    Added XPath 2.0 syntax support (the webmagic-saxon module).

    Added Selenium-based browser rendering support for crawling dynamically loaded content (the webmagic-selenium module).

    Fixed a bug where https was not supported.

    Added documentation: the webmagic-0.2.0 user manual.

    Source code(tar.gz)
    Source code(zip)
  • version-0.1.0(Jul 25, 2013)

    The first stable release.

    Revised several APIs for better extensibility; each task is assigned an ID, so different tasks can be distinguished by ID.

    Rewrote the Pipeline interface: extraction results are wrapped in a ResultItems object instead of sharing one generic Page object, making separation of logic easier.

    Added a download retry mechanism, gzip support, and custom UA/cookie support.

    Added multi-threaded crawling; just specify the thread count at initialization.

    Added a jQuery-style CSS Selector API; elements can be extracted with page.getHtml().$("div.body").

    Improved the documentation. Architecture notes: "The design and principles of webmagic - how to develop a Java crawler"; Javadoc: http://code4craft.github.io/webmagic/docs

    Source code(tar.gz)
    Source code(zip)