How to override/use cookies in Scrapy
I want to scrape http://www.3andena.com/. This website starts in Arabic by default and stores the language setting in cookies. If you try to access a language version directly through the URL (), it causes a problem and the server returns an error.
So I want to set the value of the cookie "store_language" to "en", and then start scraping the website using that cookie value.
I'm using CrawlSpider with a couple of rules.
Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]
    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})
        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())
        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()
        '''
        some parsing
        '''
        items.append(item)
        return items

SPIDER = AndenaSpider()
Here is the log:
2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
I tried this before posting my question, but it doesn't work –
Could you post your source code? – VenkatH
I just added it –