-
Notifications
You must be signed in to change notification settings - Fork 1
LittleRedHat/moh
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
## 环境安装
安装虚拟环境 virtualenv venv
激活虚拟环境 source ./venv/bin/activate(mac/linux) windows上 venv/Scripts/activate
首次运行安装依赖包: pip install -r requirements.txt
## 爬虫运行
cd crawl
运行命令:
scrapy crawl moh -a domain=[NATION] -a debug=True
命令停止用ctrl+\
ctrl+z or ctrl+c无法终止scrapy,会导致scrapy在后台运行
##命令说明
domain对应国家的缩写,debug选项打开调试模式(此时不会将爬下来的网页写到数据库中,并会输出爬取得网页信息基本信息,用于查看时间,标题是否正确
## 爬取规则制定说明
## TODO
css 内部字体下载
过滤掉视频等不需要文件下载
## api 修改
添加title搜索和日期过滤
示例如下
{
"should": [
[
"cancer"
]
],
"by": "title",
"size": "20",
"from": 0,
"sort": "score",
"filters": [
{
"name": "type",
"value": [
"html"
]
},
{
"name": "publish",
"value": [
"2014-01-01",
"2017-01-01"
]
},
{
"name": "nation",
"value": [
"all"
]
},
{
"name": "language",
"value": [
"all"
]
}
]
}
by -> title 或者 content
日期过滤 -> 时间格式为 'yyyy-mm-dd'
{
name:"date",
"value":[开始时间,截止时间]
}About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published