Python Scrapy Basics (Train Ticket Booking Scenario)

Editor : Frank
Date : 2019/08/13
References


Example Scenario

Crawl a train timetable web page and retrieve the records that match a given set of query parameters.

Creating the Scrapy Project

Create the project with this command:

scrapy startproject train_timetable

Creating the project generates the following files:

train_timetable/
├── scrapy.cfg
└── train_timetable/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Running the Crawler

scrapy crawl train_timetable

Writing the Scrapy Crawler

Setting up items.py

First, define the fields to extract: train type, train number, departure and terminal stations, departure time, arrival time, and so on.

import scrapy

class TrainTimetableItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    trainType = scrapy.Field()
    trainClass = scrapy.Field()
    startStation = scrapy.Field()
    endStation = scrapy.Field()
    startTime = scrapy.Field()
    arrivalTime = scrapy.Field()

Setting up spiders/crawler.py

Next, create the main crawler script inside the spiders folder. In it, define a class that inherits from scrapy.Spider and implement the crawling logic.

scrapy.Spider attributes:

  • name : spider name
  • allowed_domains : domains the spider is allowed to crawl
  • start_urls : URLs to start crawling from

scrapy request types:

  • Request : GET
  • FormRequest : POST
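
FormRequest sends its formdata as a URL-encoded POST body. A minimal stdlib sketch of that encoding, using a few of the form fields from the payload later in this tutorial:

```python
from urllib.parse import urlencode

# a few of the form fields later passed to scrapy.FormRequest(formdata=...)
payload = {
    'FromStation': '1115',
    'ToStation': '1203',
    'searchdate_to': '2019-08-13',
}

# FormRequest url-encodes the dict into the request body
body = urlencode(payload)
print(body)
```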

scrapy selectors:

  • css
  • xpath

Send a FormRequest with the corresponding payload and url to fetch the page, and hand the response to the callback for processing.
Use xpath to capture the train timetable information embedded in the page's script tag.

# coding: utf-8
import scrapy
from train_timetable.items import TrainTimetableItem
import json

class Train_Timetable_Spider(scrapy.Spider):
    name = 'train_timetable'
    #start_urls = ['']

    def start_requests(self):
        url = 'http://twtraffic.tra.gov.tw/twrail/TW_CTSearchResult.aspx'

        payload = {
            'FromCity': '4',
            'FromStation': '1115',
            'FromStationName': '0',
            'ToCity': '5',
            'ToStation': '1203',
            'ToStationName': '0',
            'ToBackSelect': '0',
            'searchdate_to': '2019-08-13',
            'FromTimeSelect': '0600'
        }

        yield scrapy.FormRequest(
            url=url,
            formdata=payload,
            callback=self.parse_list)

    def parse_list(self, response):
        # the second <script> tag holds the timetable as a JSON array
        train = response.xpath('//script[2]/text()')
        train = train.extract()[0]

        # slice out the JSON array embedded in the script text
        start = train.find('[')
        end = train.find(']') + 1
        data = json.loads(train[start:end])

        for count, d in enumerate(data, 1):
            print(count)
            for key in d:
                print(key, d[key])
            print('--------------')
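
The bracket-slicing step in parse_list can be exercised on its own. A self-contained sketch with a fabricated script string (the real page embeds a much longer array; note that find(']') grabs the first closing bracket, which works here because the array holds only flat objects):

```python
import json

# fabricated stand-in for the <script> text the page embeds
script_text = 'var jsonData = [{"Train_Code": "103", "Fare": 83}];'

# slice out the JSON array and parse it
start = script_text.find('[')
end = script_text.find(']') + 1
data = json.loads(script_text[start:end])
print(data[0]['Train_Code'])
```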

After finishing the script, run the crawl command to check that the data is fetched successfully:

scrapy crawl train_timetable

The output shows that the expected fields were retrieved:

1
End_Code 1411
Discount_Price_Adult None
TrainType 0
Dining N
Train_Code 103
Fare 83
Direction 1
TicketLink N
To_Ticket_Code 151
Begin_Code 1003
Class_Code 1108
End_Name 潮州
From_Departure_Time 0859
Begin_EName Qidu
Package N
Handicapped Y
Begin_Name 七堵
Discount_End_Date None
Comment 每日行駛。
End_EName Chaozhou
MainViaRoad 2
Discount_Begin_Date None
Over_Night 0
From_Ticket_Code 131
To_Arrival_Time 0928
Everyday Y
--------------
2
End_Code 1242
Discount_Price_Adult None
TrainType 0
Dining N
Train_Code 507
Fare 64
Direction 1
TicketLink N
To_Ticket_Code 151
Begin_Code 1003
Class_Code 1111
End_Name 新左營
From_Departure_Time 1035
Begin_EName Qidu
Package N
Handicapped Y
Begin_Name 七堵
Discount_End_Date None
Comment 每日行駛。
End_EName Xinzuoying
MainViaRoad 2
Discount_Begin_Date None
Over_Night 0
From_Ticket_Code 131
To_Arrival_Time 1114
Everyday Y
--------------

Next, declare an item to store the extracted fields:

# coding: utf-8
import scrapy
from train_timetable.items import TrainTimetableItem
import json

class Train_Timetable_Spider(scrapy.Spider):
    name = 'train_timetable'
    #start_urls = ['']

    def start_requests(self):
        url = 'http://twtraffic.tra.gov.tw/twrail/TW_CTSearchResult.aspx'

        payload = {
            'FromCity': '4',
            'FromStation': '1115',
            'FromStationName': '0',
            'ToCity': '5',
            'ToStation': '1203',
            'ToStationName': '0',
            'ToBackSelect': '0',
            'searchdate_to': '2019-08-13',
            'FromTimeSelect': '0600'
        }

        yield scrapy.FormRequest(
            url=url,
            formdata=payload,
            callback=self.parse_list)

    def parse_list(self, response):
        train = response.xpath('//script[2]/text()')
        train = train.extract()[0]

        start = train.find('[')
        end = train.find(']') + 1
        data = json.loads(train[start:end])

        for d in data:
            # create a fresh item per record so each yielded item is independent
            item = TrainTimetableItem()
            item['trainType'] = d['TrainType']
            item['trainClass'] = d['Train_Code']
            item['startStation'] = d['Begin_Name']
            item['endStation'] = d['End_Name']
            item['startTime'] = d['From_Departure_Time']
            item['arrivalTime'] = d['To_Arrival_Time']
            yield item

Finally, save the extracted data:

scrapy crawl train_timetable -o output.json # export as a JSON file
scrapy crawl train_timetable -o output.csv # export as a CSV file
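
The exported CSV can then be post-processed with the stdlib csv module. A minimal sketch over two inlined sample rows (a real run would open the exported output.csv; the header names follow the item fields, though the actual column order of Scrapy's export may differ):

```python
import csv
import io

# inlined sample standing in for the exported output.csv
sample = (
    'startStation,endStation,trainType,startTime,arrivalTime,trainClass\n'
    '七堵,潮州,0,0859,0928,103\n'
    '七堵,新左營,0,1035,1114,507\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
for r in rows:
    print(r['trainClass'], r['startStation'], '->', r['endStation'])
```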

output.csv

Data fields

  • departure station, destination station, mountain/coast line, departure time, arrival time, train number
    七堵,潮州,0,0859,0928,103
    七堵,新左營,0,1035,1114,507
    七堵,潮州,0,1216,1249,511
    七堵,潮州,0,1417,1453,513
    花蓮,潮州,0,1644,1724,561
    七堵,新左營,0,1918,1952,521
    七堵,潮州,0,2047,2116,145
    花蓮,員林,0,2240,2307,285
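
The HHMM time strings make it straightforward to derive trip duration. A minimal stdlib sketch (assumes same-day arrival; the raw data's Over_Night field flags overnight runs):

```python
from datetime import datetime

def trip_minutes(start_time, arrival_time):
    """Minutes between two HHMM strings, assuming same-day arrival."""
    fmt = '%H%M'
    delta = datetime.strptime(arrival_time, fmt) - datetime.strptime(start_time, fmt)
    return int(delta.total_seconds() // 60)

print(trip_minutes('0859', '0928'))
```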

Train Timetable Query System