
Scraping the titles and links of CSDN articles whose titles contain the keyword "語義" (semantics) with Scrapy


July 18, 2022

Implementation steps

Chinese character comparison
First, get familiar with ASCII, Unicode and UTF-8: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html In short: ASCII is a single-byte encoding and can represent only a limited set of characters; Unicode can represent every character in the world, but storing it naively wastes space; UTF-8 is a variable-length encoding of Unicode that saves space effectively.
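A quick interactive illustration of the relationship (Python 2, assuming a UTF-8 terminal):

# -*- coding: utf-8 -*-
u = u'語義'                   # a unicode string: two code points
b = u.encode('utf-8')         # its UTF-8 form: 3 bytes per CJK character
print len(u), len(b)          # 2 6
print b.decode('utf-8') == u  # True: decoding round-trips losslessly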
Python encoding conversion and Chinese text handling: http://www.jianshu.com/p/53bb448fe85b
Official Python documentation on Unicode and encodings: https://docs.python.org/2/howto/unicode.html#encodings
Matching Chinese characters in CSDN titles
Since we need to handle Chinese characters from CSDN pages, take the title of one article as an example: http://blog.csdn.net/searobbers_duck/article/details/51669224 First, fetch the page in the Scrapy shell:
scrapy shell http://blog.csdn.net/searobbers_duck/article/details/51669224

Get the title:
title = response.xpath('//title/text()').extract()
print title


Check the encoding:
import chardet
chardet.detect(title[0])
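Note that extract() returns unicode while chardet expects a byte string, so encode first; the returned dict carries 'encoding' and 'confidence' keys:

result = chardet.detect(title[0].encode('utf-8'))
print result  # e.g. {'confidence': 0.99, 'encoding': 'utf-8'}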


Convert to Unicode:
utf_8_title = title[0].encode('utf-8').decode('utf-8')  # extract() already returns unicode, so this round-trip is effectively a no-op
print utf_8_title

Check whether the title contains the given Chinese substring:
cn_str = u'可視化'
pos = utf_8_title.find(cn_str)
if pos != -1:
    print "Found the Chinese word!"
else:
    print "Can't find the Chinese word!"


MySQL database operations
Install python-mysql: http://blog.csdn.net/searobbers_duck/article/details/51839799
Basic SQL statements: http://blog.csdn.net/searobbers_duck/article/details/51889556
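Once installed, a minimal connectivity check looks like this (host and credentials below are placeholders; replace them with your own):

# -*- coding: utf-8 -*-
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='secret', charset='utf8')
cur = conn.cursor()
cur.execute("select version()")
print cur.fetchone()  # e.g. ('5.7.x',)
cur.close()
conn.close()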
The crawler
Create the project
scrapy startproject csdn_semantics_spider
cd csdn_semantics_spider

Edit items.py
items.py defines the fields to collect during the crawl. Here we scrape the title (title), the link (link), and a description (desc).
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class CsdnSemanticsSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
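Items behave like dicts; a quick sketch of how the spider will fill one in (the values here are made up):

item = CsdnSemanticsSpiderItem()
item['title'] = u'語義分割簡介'  # hypothetical title
item['link'] = 'http://blog.csdn.net/someuser/article/details/12345678'  # hypothetical URL
item['desc'] = item['title']
print dict(item)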

Create the spider file:
gedit csdn_semantics_spider/spiders/csdn_semantics_spider1.py

with the following content:
#coding=utf-8
import re
import json
from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
from csdn_semantics_spider.items import *

class CsdnSemanticsSpider(CrawlSpider):
    # name of the spider
    name = "CsdnSemanticsSpider"
    # domains allowed to be crawled; links outside this list are skipped
    allowed_domains = ["blog.csdn.net"]
    # entry URL of the crawl
    start_urls = [
        "http://blog.csdn.net/searobbers_duck/article/details/51839799"
    ]
    # rule for the URLs to follow, with parse_item as the callback.
    # Note: regex metacharacters in a pattern copied from a URL must be escaped.
    rules = [
        Rule(sle(allow=(r"/\S+/article/details/\d+",)),
             follow=True,
             callback='parse_item')
    ]

    # callback: extract the data into items using XPath/CSS selectors
    def parse_item(self, response):
        items = []
        sel = Selector(response)
        base_url = get_base_url(response)
        title = sel.css('title').xpath('text()').extract()
        key_substr = u'語義'
        for index in range(len(title)):
            item = CsdnSemanticsSpiderItem()
            item['title'] = title[index]  # extract() already returns unicode
            pos = item['title'].find(key_substr)
            print item['title']
            # keep the page only if the title contains the keyword
            if pos != -1:
                item['link'] = base_url
                item['desc'] = item['title']
                items.append(item)
        return items
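To sanity-check the rule's pattern before a full crawl, it can be tried against a sample article URL:

import re

pattern = re.compile(r'/\S+/article/details/\d+')
url = 'http://blog.csdn.net/searobbers_duck/article/details/51669224'
print bool(pattern.search(url))  # True: the rule would follow this link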

Edit pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs

class JsonWithEncodingCsdnSemanticsPipeline(object):
    def __init__(self):
        self.file = codecs.open('csdn_semantics.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line; keep non-ASCII characters readable
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()
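One caveat: Scrapy closes pipelines through a method named close_spider; a method named spider_closed only runs if it is connected to the spider_closed signal. A sketch of the wiring used with the old Scrapy versions this code targets (scrapy.xlib.pydispatch is assumed available there):

import codecs
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class JsonWithEncodingCsdnSemanticsPipeline(object):
    def __init__(self):
        self.file = codecs.open('csdn_semantics.json', 'w', encoding='utf-8')
        # close the file when the spider finishes
        dispatcher.connect(self.spider_closed, signals.spider_closed)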

Edit settings.py and add the following:
ITEM_PIPELINES = {
    'csdn_semantics_spider.pipelines.JsonWithEncodingCsdnSemanticsPipeline': 300,
}
LOG_LEVEL = 'INFO'

After making the changes above, run the crawler:
scrapy crawl CsdnSemanticsSpider


Writing the scraped data to the database
Database setup: create a database named "csdn_semantics_db" and its table with the following statements:
drop database if exists csdn_semantics_db;
create database if not exists csdn_semantics_db default character set utf8 collate utf8_general_ci;
use csdn_semantics_db;
create table if not exists csdn_semantics_info(
    linkmd5id char(32) NOT NULL,
    title text,
    link text,
    description text,
    updated datetime DEFAULT NULL,
    primary key(linkmd5id)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
select * from csdn_semantics_info;

Update the contents of pipelines.py:
The code follows; if you have questions about the SQL statements, see: http://blog.csdn.net/searobbers_duck/article/details/51889556

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import codecs
from datetime import datetime
from hashlib import md5
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy import log

class JsonWithEncodingCsdnSemanticsPipeline(object):
    def __init__(self):
        self.file = codecs.open('csdn_semantics.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()

class MySQLStoreCsdnSemanticsPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    # called by Scrapy for every item
    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_upinsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    # insert each row, or update it if the link already exists
    def _do_upinsert(self, conn, item, spider):
        linkmd5id = self._get_linkmd5id(item)
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        # use query parameters instead of string interpolation so that
        # quotes inside the title cannot break the statement
        insertcmd = ("insert into csdn_semantics_info values(%s, %s, %s, %s, %s) "
                     "on duplicate key update title=%s, link=%s, description=%s, updated=%s")
        conn.execute(insertcmd, (linkmd5id, item['title'], item['link'], item['desc'], now,
                                 item['title'], item['link'], item['desc'], now))

    # md5 of the URL, used as the primary key to avoid duplicate rows
    def _get_linkmd5id(self, item):
        return md5(item['link']).hexdigest()

    # error handling
    def _handle_error(self, failure, item, spider):
        log.err(failure)
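from_settings() above reads its connection parameters from settings.py, and the new pipeline must also be registered there; the keys below mirror the names used in the code, while the values are placeholders:

ITEM_PIPELINES = {
    'csdn_semantics_spider.pipelines.JsonWithEncodingCsdnSemanticsPipeline': 300,
    'csdn_semantics_spider.pipelines.MySQLStoreCsdnSemanticsPipeline': 800,
}
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'csdn_semantics_db'
MYSQL_USER = 'root'       # placeholder
MYSQL_PASSWD = 'secret'   # placeholder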

Source code: https://github.com/searobbersduck/CsdnSemanticScrapy.git
