How can I get the absolute URL when extracting links using Nokogiri?

I'm using Nokogiri to extract links from a page, but I would like to get the absolute path even when the link on the page is relative. How can I accomplish this?


2011-02-01 Mridang Agarwalla

Nokogiri is unrelated, other than the fact that it gives you the link anchor to begin with. Use Ruby's URI library to manage paths:

absolute_uri = URI.join(page_url, href).to_s

Seen in action:

require 'uri' 

# The URL of the page with the links 
page_url = 'http://foo.com/zee/zaw/zoom.html' 

# A variety of links to test. 
hrefs = %w[ 
    http://zork.com/    http://zork.com/#id 
    http://zork.com/bar   http://zork.com/bar#id 
    http://zork.com/bar/   http://zork.com/bar/#id 
    http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id 
    /bar       /bar#id 
    /bar/      /bar/#id 
    /bar/jim.html    /bar/jim.html#id 
    jim.html      jim.html#id 
    ../jim.html     ../jim.html#id 
    ../       ../#id 
    #id 
] 

hrefs.each do |href| 
    root_href = URI.join(page_url,href).to_s 
    puts "%-32s -> %s" % [ href, root_href ] 
end 
#=> http://zork.com/                 -> http://zork.com/
#=> http://zork.com/#id              -> http://zork.com/#id
#=> http://zork.com/bar              -> http://zork.com/bar
#=> http://zork.com/bar#id           -> http://zork.com/bar#id
#=> http://zork.com/bar/             -> http://zork.com/bar/
#=> http://zork.com/bar/#id          -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html     -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id  -> http://zork.com/bar/jim.html#id
#=> /bar                             -> http://foo.com/bar
#=> /bar#id                          -> http://foo.com/bar#id
#=> /bar/                            -> http://foo.com/bar/
#=> /bar/#id                         -> http://foo.com/bar/#id
#=> /bar/jim.html                    -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id                 -> http://foo.com/bar/jim.html#id
#=> jim.html                         -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id                      -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html                      -> http://foo.com/zee/jim.html
#=> ../jim.html#id                   -> http://foo.com/zee/jim.html#id
#=> ../                              -> http://foo.com/zee/
#=> ../#id                           -> http://foo.com/zee/#id
#=> #id                              -> http://foo.com/zee/zaw/zoom.html#id

This answer previously used the more complicated URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.


2011-02-01 15:05:33 Phrogz

Nokogiri could be related to this. Here's how: if an HTML document contains a base tag, the solution above will not work correctly. In that case, the value of the base tag's href attribute should be used instead of page_url. Take a look at @david-thomas's more detailed explanation here: http://stackoverflow.com/questions/5559578/havling-links-relative-to-root – draganstankovic
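
To illustrate the base tag point, here is a minimal sketch using only Ruby's URI library. The helper name absolute_url and its base_href parameter are hypothetical, not part of any library; the sketch assumes you have already read the base tag's href yourself (with Nokogiri, something like doc.at_css('base[href]') would be one way to find it). A relative base href is itself resolved against the page URL first.

```ruby
require 'uri'

# Hypothetical helper (not a library API): resolve href against the page URL,
# honoring an optional <base href="..."> value when one was found in the document.
def absolute_url(page_url, href, base_href = nil)
  # Without a base tag, the page URL itself is the base.
  # A base href may be relative, so resolve it against the page URL first.
  base = if base_href.nil? || base_href.empty?
           page_url
         else
           URI.join(page_url, base_href).to_s
         end
  URI.join(base, href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts absolute_url(page_url, 'jim.html')
puts absolute_url(page_url, 'jim.html', 'http://bar.com/assets/')
puts absolute_url(page_url, 'jim.html', '/static/')
```

The first call behaves exactly like the plain URI.join(page_url, href) from the answer; the other two show how a base tag would change the result.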

You need to check whether the URL is absolute or relative by testing whether it starts with http:. If the URL is relative, you need to add the host to it. You can't do that with Nokogiri. You need to process all the URLs in the page to render them as absolute.
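
A rough sketch of that check-then-fix pass, assuming the hrefs have already been collected into an array. The method name absolutize_all is made up for illustration. Instead of testing for a literal "http:" prefix, it asks URI whether the link has a scheme, and it uses URI.join rather than plain string concatenation so that links like ../jim.html also resolve correctly.

```ruby
require 'uri'

# Hypothetical helper: return every href as an absolute URL.
# Links that already have a scheme are kept as they are;
# relative ones are resolved against the URL of the page they came from.
def absolutize_all(hrefs, page_url)
  hrefs.map do |href|
    if URI.parse(href).absolute?
      href
    else
      URI.join(page_url, href).to_s
    end
  end
end

links = ['http://zork.com/bar', '/bar', 'jim.html', '../jim.html']
puts absolutize_all(links, 'http://foo.com/zee/zaw/zoom.html')
```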


2011-02-01 11:08:43 shingara

Phrogz's answer is fine, but more simply:

URI.join(base, url).to_s


2012-01-04 06:50:28 pguardiario

Can you give an example of what base and url are? – lulalala

base = "http://www.google.com/somewhere"; url = '/over/there'; I think pguardino's variable names are a bit imprecise –
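
For what it's worth, plugging the values from that comment into the one-liner gives the following. The variable name absolute is only for illustration.

```ruby
require 'uri'

base = 'http://www.google.com/somewhere'  # URL of the page the link was found on
url  = '/over/there'                      # href as it appears in the page

# A root-relative href replaces the whole path of the base URL.
absolute = URI.join(base, url).to_s
puts absolute
```

Here base is the URL of the page being scraped and url is the raw href value, which is probably what more descriptive names like page_url and href would have made obvious.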
