Fix broken headings in Markdown files #55

Open
wants to merge 1 commit into
base: master
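
The change in this pull request is mechanical: CommonMark-style renderers only treat a line as an ATX heading when the opening `#` run is followed by a space, so `##HomePage` renders as plain text while `## HomePage` renders as a heading. As an illustration only (this is not the script used for this PR, and the name `fix_headings` is hypothetical), a minimal Python sketch of the same transformation, assuming headings use one to six `#` characters and that lines inside fenced code blocks must be left alone:

```python
import re

# One to six leading '#' characters followed directly by a character that is
# neither '#' nor whitespace, i.e. a heading that is missing its space.
HEADING = re.compile(r"^(#{1,6})(?=[^#\s])")


def fix_headings(text: str) -> str:
    """Insert the missing space after the '#' run of ATX-style headings."""
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.lstrip().startswith("```"):
            # Toggle on every fence line so code such as '#include' is skipped.
            in_fence = not in_fence
        elif not in_fence:
            line = HEADING.sub(r"\1 ", line)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(fix_headings("#WebCollector\n##HomePage\n###2.x:"))
```

Headings that already have the space, and runs of seven or more `#`, are left unchanged by this sketch.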
14 changes: 7 additions & 7 deletions README.md
@@ -1,28 +1,28 @@
-#WebCollector
+# WebCollector
WebCollector is an open source web crawler framework based on Java.It provides
some simple interfaces for crawling the Web,you can setup a
multi-threaded web crawler in less than 5 minutes.




-##HomePage
+## HomePage
[https://github.com/CrawlScript/WebCollector](https://github.com/CrawlScript/WebCollector)

-##Document
+## Document
[WebCollector-GitDoc](https://github.com/CrawlScript/WebCollector-GitDoc)



-##Installation
+## Installation

### Without Maven
WebCollector jars are available on the [HomePage](https://github.com/CrawlScript/WebCollector).

+ __webcollector-version-bin.zip__ contains core jars.


-##Quickstart
+## Quickstart
Lets crawl some news from hfut news.This demo prints out the titles and contents extracted from news of hfut news.

[NewsCrawler.java](https://github.com/CrawlScript/WebCollector/blob/master/NewsCrawler.java):
@@ -96,7 +96,7 @@ public class NewsCrawler extends BreadthCrawler {



-##Content Extraction
+## Content Extraction
WebCollector could automatically extract content from news web-pages:

```java
@@ -114,6 +114,6 @@ Element contentElement = ContentExtractor.getContentElementByUrl(url);
```


-##Other Documentation
+## Other Documentation

+ [中文文档](https://github.com/CrawlScript/WebCollector/blob/master/README.zh-cn.md)
12 changes: 6 additions & 6 deletions README.zh-cn.md
@@ -1,17 +1,17 @@
WebCollector
============

-###爬虫简介
+### 爬虫简介
WebCollector是一个无须配置、便于二次开发的JAVA爬虫框架(内核),它提供精简的的API,只需少量代码即可实现一个功能强大的爬虫。

-###爬虫内核:
+### 爬虫内核:
WebCollector致力于维护一个稳定、可扩的爬虫内核,便于开发者进行灵活的二次开发。内核具有很强的扩展性,用户可以在内核基础上开发自己想要的爬虫。源码中集成了Jsoup,可进行精准的网页解析。

-###教程:
+### 教程:
WebCollector的开源中国项目主页中可找到教程列表:[http://www.oschina.net/p/webcollector](http://www.oschina.net/p/webcollector)


-###2.x:
+### 2.x:
WebCollector 2.x版本特性:
* 1)自定义遍历策略,可完成更为复杂的遍历业务,例如分页、AJAX
* 2)可以为每个URL设置附加信息(MetaData),利用附加信息可以完成很多复杂业务,例如深度获取、锚文本获取、引用页面获取、POST参数传递、增量更新等。
@@ -24,15 +24,15 @@ WebCollector 2.x版本特性:



-###Jar包
+### Jar包
可在[WebCollector的github主页](https://github.com/CrawlScript/WebCollector)下载所需jar包.

+ __webcollector-version-bin.zip__ 包含核心jar包.




-###__通过捐款支持WebCollector__
+### __通过捐款支持WebCollector__


维护WebCollector及教程需要花费较大的时间和精力,如果你喜欢WebCollector的话,欢迎通过捐款的方式,支持开发者的工作,非常感谢!