The Perl LWP Library

perldoc LWP

Three main components:

  • Request Object
  • Response Object
  • UserAgent

Request Object

In LWP, a request object is an instance of the HTTP::Request class (despite the name, it is not limited to HTTP requests). Its main attributes are method, uri, headers, and content.
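
As a small illustration of those four attributes, the sketch below builds a request without sending it and reads the attributes back. The URL and the Accept header value are placeholders chosen for this example.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a request object; nothing is sent over the network here.
my $req = HTTP::Request->new(GET => 'http://www.example.com/index.html');
$req->header(Accept => 'text/html');    # add one header field

print $req->method, "\n";               # the request method
print $req->uri, "\n";                  # the request URI (a URI object that stringifies)
print $req->header('Accept'), "\n";     # a single header field
print length($req->content), "\n";      # length of the body, which was never set
```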

Response Object

In LWP, a response object is an instance of the HTTP::Response class. Its main attributes are code, message, headers, and content. The methods is_success() and is_error() are commonly used to check the code attribute.
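
To see how the two status checks behave, one option is to construct responses by hand instead of fetching them. This is only a sketch: in real code the response normally comes back from the user agent, and the two status codes here are picked purely for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built responses: one in the 2xx range, one in the 4xx range.
my $ok       = HTTP::Response->new(200, 'OK');
my $notfound = HTTP::Response->new(404, 'Not Found');

for my $res ($ok, $notfound) {
    printf "%s: is_success=%d is_error=%d\n",
        $res->status_line,          # code and message joined, e.g. "200 OK"
        $res->is_success ? 1 : 0,
        $res->is_error   ? 1 : 0;
}
```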

User Agent

Hand a request object to the user agent, and the user agent returns a response object.

In LWP, the user agent is an instance of the LWP::UserAgent class.

The request() method takes a request object and returns a response object.

Commonly used attributes:

  • timeout
  • agent
  • from
  • parse_head
  • proxy, no_proxy
  • credentials
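
The attributes listed above are set through accessor methods of the same name. A minimal sketch follows; the host names, proxy address, realm, and credentials are all made-up values, and no request is actually sent.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);                        # give up after 10 seconds
$ua->agent('MyApp/0.1');                 # value of the User-Agent header
$ua->from('webmaster@example.com');      # value of the From header
$ua->parse_head(0);                      # do not parse the <head> of HTML responses
$ua->proxy(http => 'http://proxy.example.com:8080/');   # proxy for http:// URLs
$ua->no_proxy('localhost', 'example.org');               # domains that bypass the proxy
$ua->credentials('www.example.com:80', 'Private', 'user', 'secret');

# Called without arguments, the accessors return the current value.
print $ua->timeout, "\n";
print $ua->agent, "\n";
```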

Example

# Create a user agent object
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");

# Create a request
my $req = HTTP::Request->new(POST => 'http://search.cpan.org/search');
$req->content_type('application/x-www-form-urlencoded');
$req->content('query=libwww-perl&mode=dist');

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
if ($res->is_success) {
    print $res->content;
}
else {
    print $res->status_line, "\n";
}

Modules related to LWP:

LWP::MemberMixin -- Access to member variables of Perl5 classes
LWP::UserAgent -- WWW user agent class
LWP::RobotUA -- When developing robot applications
LWP::Protocol -- Interface to various protocol schemes
LWP::Protocol::http -- http:// access
LWP::Protocol::file -- file:// access
LWP::Protocol::ftp -- ftp:// access
...

LWP::Authen::Basic -- Handle 401 and 407 responses
LWP::Authen::Digest

HTTP::Headers -- MIME/RFC822 style header (used by HTTP::Message)
HTTP::Message -- HTTP style message
HTTP::Request -- HTTP request
HTTP::Response -- HTTP response
HTTP::Daemon -- A HTTP server class

WWW::RobotRules -- Parse robots.txt files
WWW::RobotRules::AnyDBM_File -- Persistent RobotRules

Net::HTTP -- Low level HTTP client

The following modules provide various functions and definitions.

LWP -- This file. Library version number and documentation.
LWP::MediaTypes -- MIME types configuration (text/html etc.)
LWP::Simple -- Simplified procedural interface for common functions
HTTP::Status -- HTTP status code (200 OK etc)
HTTP::Date -- Date parsing module for HTTP date formats
HTTP::Negotiate -- HTTP content negotiation calculation
File::Listing -- Parse directory listings
HTML::Form -- Processing for <form>s in HTML documents

Processing HTML with Mojo::DOM

See perldoc Mojo::DOM for the documentation.

It parses HTML into a tree of nodes. There are 8 node types:

  • cdata
  • comment
  • doctype
  • pi
  • raw
  • root
  • tag
  • text

A typical tree looks like this:

root
|- doctype (html)
+- tag (html)
   |- tag (head)
   |  +- tag (title)
   |     +- raw (Hello)
   +- tag (body)
      +- text (World!)

Every node is a Mojo::DOM object.
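
One way to check this is to parse a small document and list the type of every node below the root. The HTML string is an assumption chosen to match the tree shown above; descendant_nodes and type are both covered later in these notes.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new(
    '<!DOCTYPE html><html><head><title>Hello</title></head><body>World!</body></html>'
);

# Collect the type of each node below the root.
my @types = $dom->descendant_nodes->map('type')->each;
print join(', ', @types), "\n";

# Each node is itself a Mojo::DOM object.
print ref($_), "\n" for $dom->descendant_nodes->each;
```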

Creating a Mojo::DOM object

use Mojo::DOM;

my $dom = Mojo::DOM->new;
my $dom = Mojo::DOM->new('<foo bar="baz">I ♥ Mojolicious!</foo>');

Creating an object that contains a single tag:

my $tag = Mojo::DOM->new_tag('div');
my $tag = $dom->new_tag('div');
my $tag = $dom->new_tag('div', id => 'foo', hidden => undef);
my $tag = $dom->new_tag('div', 'safe content');
my $tag = $dom->new_tag('div', id => 'foo', 'safe content');
my $tag = $dom->new_tag('div', data => {mojo => 'rocks'}, 'safe content');
my $tag = $dom->new_tag('div', id => 'foo', sub { 'unsafe content' });

Parsing HTML into a Mojo::DOM object

$dom = $dom->parse('<foo bar="baz">I ♥ Mojolicious!</foo>');

Getting the root node

my $root = $dom->root;

Extracting the text content of all descendant nodes of an element

my $text = $dom->all_text;

# "foo\nbarbaz\n"
$dom->parse("<div>foo\n<p>bar</p>baz\n</div>")->at('div')->all_text;

Finding ancestor elements

my $collection = $dom->ancestors;
my $collection = $dom->ancestors('div ~ p');
say $dom->ancestors->map('tag')->join("\n");

Appending to an element's content

$dom = $dom->append_content('<p>I ♥ Mojolicious!</p>');
$dom = $dom->append_content(Mojo::DOM->new);

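(The difference from append: append inserts the fragment after this node, as a sibling, while append_content adds it to the end of this node's own content.)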
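
The difference between the two methods shows up when both are applied to the same element. The sketch below uses a tiny made-up document and prints the whole tree after each call.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $html = '<div><h1>Test</h1></div>';

# append: the fragment becomes a sibling placed right after <h1>
my $appended = Mojo::DOM->new($html)->at('h1')->append('<h2>123</h2>')->root;
print "$appended\n";    # <div><h1>Test</h1><h2>123</h2></div>

# append_content: the fragment goes inside <h1>, after its existing content
my $inside = Mojo::DOM->new($html)->at('h1')->append_content('123')->root;
print "$inside\n";      # <div><h1>Test123</h1></div>
```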

Finding the first element that matches a CSS selector

my $result = $dom->at('div ~ p');
my $result = $dom->at('svg|line', svg => 'http://www.w3.org/2000/svg');

Getting, setting, and removing attributes

my $hash = $dom->attr;
my $foo = $dom->attr('foo');
$dom = $dom->attr({foo => 'bar'});
$dom = $dom->attr(foo => 'bar');

# This element's attributes.

# Remove an attribute
delete $dom->attr->{id};

# Attribute without value
$dom->attr(selected => undef);

# List id attributes
say $dom->find('*')->map(attr => 'id')->compact->join("\n");

Getting child nodes

# Return a Mojo::Collection object containing all child nodes of this element as Mojo::DOM objects.
my $collection = $dom->child_nodes;

# "<p><b>123</b></p>"
$dom->parse('<p>Test<b>123</b></p>')->at('p')->child_nodes->first->remove;

# "<!DOCTYPE html>"
$dom->parse('<!DOCTYPE html><b>123</b>')->child_nodes->first;

# " Test "
$dom->parse('<b>123</b><!-- Test -->')->child_nodes->last->content;

Getting child elements with a selector:

# Find all child elements of this element matching the CSS selector and return a Mojo::Collection object containing these elements as Mojo::DOM objects. All selectors from "SELECTORS" in Mojo::DOM::CSS are supported.

my $collection = $dom->children;
my $collection = $dom->children('div ~ p');

# Show tag name of random child element
say $dom->children->shuffle->first->tag;

Getting and setting a node's content

# Return this node's content or replace it with HTML/XML fragment (for "root" and "tag" nodes) or raw content.
my $str = $dom->content;
$dom = $dom->content('<p>I ♥ Mojolicious!</p>');
$dom = $dom->content(Mojo::DOM->new);

# "<b>Test</b>"
$dom->parse('<div><b>Test</b></div>')->at('div')->content;

# "<div><h1>123</h1></div>"
$dom->parse('<div><h1>Test</h1></div>')->at('h1')->content('123')->root;

# "<p><i>123</i></p>"
$dom->parse('<p>Test</p>')->at('p')->content('<i>123</i>')->root;

# "<div><h1></h1></div>"
$dom->parse('<div><h1>Test</h1></div>')->at('h1')->content('')->root;

# " Test "
$dom->parse('<!-- Test --><br>')->child_nodes->first->content;

# "<div><!-- 123 -->456</div>"
$dom->parse('<div><!-- Test -->456</div>')
->at('div')->child_nodes->first->content(' 123 ')->root;

Getting all descendant nodes

# Return a Mojo::Collection object containing all descendant nodes of this element as Mojo::DOM objects.
my $collection = $dom->descendant_nodes;

# "<p><b>123</b></p>"
$dom->parse('<p><!-- Test --><b>123<!-- 456 --></b></p>')
->descendant_nodes->grep(sub { $_->type eq 'comment' })
->map('remove')->first;

# "<p><b>test</b>test</p>"
$dom->parse('<p><b>123</b>456</p>')
->at('p')->descendant_nodes->grep(sub { $_->type eq 'text' })
->map(content => 'test')->first->root;

Finding descendant elements with a selector

# Find all descendant elements of this element matching the CSS selector and return a Mojo::Collection object containing these elements as Mojo::DOM objects. All selectors from "SELECTORS" in Mojo::DOM::CSS are supported.
my $collection = $dom->find('div ~ p');
my $collection = $dom->find('svg|line', svg => 'http://www.w3.org/2000/svg');

# Find a specific element and extract information
my $id = $dom->find('div')->[23]{id};

# Extract information from multiple elements
my @headers = $dom->find('h1, h2, h3')->map('text')->each;

# Count all the different tags
my $hash = $dom->find('*')->reduce(sub { $a->{$b->tag}++; $a }, {});

# Find elements with a class that contains dots
my @divs = $dom->find('div.foo\.bar')->each;

Finding the sibling elements after a node

# Find all sibling elements after this node matching the CSS selector and return a Mojo::Collection object containing these elements as Mojo::DOM objects. All selectors from "SELECTORS" in Mojo::DOM::CSS are supported.
my $collection = $dom->following;
my $collection = $dom->following('div ~ p');

# List tags of sibling elements after this node
say $dom->following->map('tag')->join("\n");

Or, to get all sibling nodes after this node (not only elements):

# Return a Mojo::Collection object containing all sibling nodes after this node as Mojo::DOM objects.
my $collection = $dom->following_nodes;

# "C"
$dom->parse('<p>A</p><!-- B -->C')->at('p')->following_nodes->last->content;

Getting the next sibling element:

# Return Mojo::DOM object for next sibling element, or "undef" if there are no more siblings.
my $sibling = $dom->next;

# "<h2>123</h2>"
$dom->parse('<div><h1>Test</h1><h2>123</h2></div>')->at('h1')->next;

Getting the next sibling node:

# Return Mojo::DOM object for next sibling node, or "undef" if there are no more siblings.
my $sibling = $dom->next_node;

# "456"
$dom->parse('<p><b>123</b><!-- Test -->456</p>')
->at('b')->next_node->next_node;

# " Test "
$dom->parse('<p><b>123</b><!-- Test -->456</p>')
->at('b')->next_node->content;

Getting the sibling elements before a node:

# Find all sibling elements before this node matching the CSS selector and return a Mojo::Collection object containing these elements as Mojo::DOM objects. All selectors from "SELECTORS" in Mojo::DOM::CSS are supported.
my $collection = $dom->preceding;
my $collection = $dom->preceding('div ~ p');

# List tags of sibling elements before this node
say $dom->preceding->map('tag')->join("\n");

Getting the sibling nodes before a node:

# Return a Mojo::Collection object containing all sibling nodes before this node as Mojo::DOM objects.
my $collection = $dom->preceding_nodes;

# "A"
$dom->parse('A<!-- B --><p>C</p>')->at('p')->preceding_nodes->first->content;

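(previous and previous_node are also available. Each returns the single sibling immediately before this node, an element or a node respectively, rather than a collection.)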
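
A short sketch of previous and previous_node, mirroring the next and next_node examples earlier. The markup is made up for the example.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>123<!-- Test --><b>456</b><i>789</i></p>');

# previous: the closest sibling *element* before <i>
my $prev_el = $dom->at('i')->previous;
print "$prev_el\n";                 # <b>456</b>

# previous_node: the closest sibling *node* before <b>, here a comment
my $prev_node = $dom->at('b')->previous_node;
print $prev_node->type, "\n";       # comment
print $prev_node->content, "\n";    # " Test " (without the quotes)
```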

Prepending a fragment before a node

# Prepend HTML/XML fragment to this node (for all node types other than "root").
$dom = $dom->prepend('<p>I ♥ Mojolicious!</p>');
$dom = $dom->prepend(Mojo::DOM->new);

# "<div><h1>Test</h1><h2>123</h2></div>"
$dom->parse('<div><h2>123</h2></div>')
->at('h2')->prepend('<h1>Test</h1>')->root;

# "<p>Test 123</p>"
$dom->parse('<p>123</p>')
->at('p')->child_nodes->first->prepend('Test ')->root;

Prepending to a node's content

# Prepend HTML/XML fragment (for "root" and "tag" nodes) or raw content to this node's content.
$dom = $dom->prepend_content('<p>I ♥ Mojolicious!</p>');
$dom = $dom->prepend_content(Mojo::DOM->new);

# "<div><h2>Test123</h2></div>"
$dom->parse('<div><h2>123</h2></div>')
->at('h2')->prepend_content('Test')->root;

# "<!-- Test 123 --><br>"
$dom->parse('<!-- 123 --><br>')
->child_nodes->first->prepend_content(' Test')->root;

# "<p><i>123</i>Test</p>"
$dom->parse('<p>Test</p>')->at('p')->prepend_content('<i>123</i>')->root;

Removing a node

# Remove this node and return "root" (for "root" nodes) or "parent".
my $parent = $dom->remove;

# "<div></div>"
$dom->parse('<div><h1>Test</h1></div>')->at('h1')->remove;

# "<p><b>456</b></p>"
$dom->parse('<p>123<b>456</b></p>')
->at('p')->child_nodes->first->remove->root;

Replacing a node

# Replace this node with HTML/XML fragment and return "root" (for "root" nodes) or "parent".
my $parent = $dom->replace('<div>I ♥ Mojolicious!</div>');
my $parent = $dom->replace(Mojo::DOM->new);

# "<div><h2>123</h2></div>"
$dom->parse('<div><h1>Test</h1></div>')->at('h1')->replace('<h2>123</h2>');

# "<p><b>123</b></p>"
$dom->parse('<p>Test</p>')
->at('p')->child_nodes->[0]->replace('<b>123</b>')->root;

Getting the parent node

# Return Mojo::DOM object for parent of this node, or "undef" if this node has no parent.
my $parent = $dom->parent;

# "<b><i>Test</i></b>"
$dom->parse('<p><b><i>Test</i></b></p>')->at('i')->parent;

Checking whether an element matches a selector

# Check if this element matches the CSS selector. All selectors from "SELECTORS" in Mojo::DOM::CSS are supported.
my $bool = $dom->matches('div ~ p');
my $bool = $dom->matches('svg|line', svg => 'http://www.w3.org/2000/svg');

# True
$dom->parse('<p class="a">A</p>')->at('p')->matches('.a');
$dom->parse('<p class="a">A</p>')->at('p')->matches('p[class]');

# False
$dom->parse('<p class="a">A</p>')->at('p')->matches('.b');
$dom->parse('<p class="a">A</p>')->at('p')->matches('p[id]');

Getting a unique CSS selector for an element

# Get a unique CSS selector for this element.
my $selector = $dom->selector;

# "ul:nth-child(1) > li:nth-child(2)"
$dom->parse('<ul><li>Test</li><li>123</li></ul>')->find('li')->last->selector;

# "p:nth-child(1) > b:nth-child(1) > i:nth-child(1)"
$dom->parse('<p><b><i>Test</i></b></p>')->at('i')->selector;

Removing an element while keeping its content

# Remove this element while preserving its content and return "parent".
my $parent = $dom->strip;

# "<div>Test</div>"
$dom->parse('<div><h1>Test</h1></div>')->at('h1')->strip;

Getting or setting an element's tag name

# This element's tag name.
my $tag = $dom->tag;
$dom = $dom->tag('div');

# List tag names of child elements
say $dom->children->map('tag')->join("\n");

Using tap from Mojo::Base

# Alias for "tap" in Mojo::Base.
$dom = $dom->tap(sub {...});

Getting the text content of the element itself

# Extract text content from this element only (not including child elements).
my $text = $dom->text;

# "bar"
$dom->parse("<div>foo<p>bar</p>baz</div>")->at('p')->text;

# "foo\nbaz\n"
$dom->parse("<div>foo\n<p>bar</p>baz\n</div>")->at('div')->text;

Getting a node's type

# This node's type, usually "cdata", "comment", "doctype", "pi", "raw", "root", "tag" or "text".
my $type = $dom->type;

# "cdata"
$dom->parse('<![CDATA[Test]]>')->child_nodes->first->type;

# "comment"
$dom->parse('<!-- Test -->')->child_nodes->first->type;

# "doctype"
$dom->parse('<!DOCTYPE html>')->child_nodes->first->type;

# "pi"
$dom->parse('<?xml version="1.0"?>')->child_nodes->first->type;

# "raw"
$dom->parse('<title>Test</title>')->at('title')->child_nodes->first->type;

# "root"
$dom->parse('<p>Test</p>')->type;

# "tag"
$dom->parse('<p>Test</p>')->at('p')->type;

# "text"
$dom->parse('<p>Test</p>')->at('p')->child_nodes->first->type;

Getting the value of a form element

# Extract value from form element (such as "button", "input", "option", "select" and "textarea"), or return "undef" if this element has no value. In the case of "select" with "multiple" attribute, find "option" elements with "selected" attribute and return an array reference with all values, or "undef" if none could be found.
my $value = $dom->val;

# "a"
$dom->parse('<input name=test value=a>')->at('input')->val;

# "b"
$dom->parse('<textarea>b</textarea>')->at('textarea')->val;

# "c"
$dom->parse('<option value="c">Test</option>')->at('option')->val;

# "d"
$dom->parse('<select><option selected>d</option></select>')
->at('select')->val;

# "e"
$dom->parse('<select multiple><option selected>e</option></select>')
->at('select')->val->[0];

# "on"
$dom->parse('<input name=test type=checkbox>')->at('input')->val;

Wrapping a node in a fragment

# Wrap HTML/XML fragment around this node (for all node types other than "root"), placing it as the last child of the first innermost element.
$dom = $dom->wrap('<div></div>');
$dom = $dom->wrap(Mojo::DOM->new);

# "<p>123<b>Test</b></p>"
$dom->parse('<b>Test</b>')->at('b')->wrap('<p>123</p>')->root;

# "<div><p><b>Test</b></p>123</div>"
$dom->parse('<b>Test</b>')->at('b')->wrap('<div><p></p>123</div>')->root;

# "<p><b>Test</b></p><p>123</p>"
$dom->parse('<b>Test</b>')->at('b')->wrap('<p></p><p>123</p>')->root;

# "<p><b>Test</b></p>"
$dom->parse('<p>Test</p>')->at('p')->child_nodes->first->wrap('<b>')->root;

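(The difference from wrap_content: wrap puts the fragment around this node itself, while wrap_content puts it around this node's content, inside the node.)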
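
Applying both methods to the same element makes the difference visible. This is a minimal sketch with made-up markup; the wrapper fragments are arbitrary.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $html = '<p>Test</p>';

# wrap: the fragment ends up around the <p> element itself
my $wrapped = Mojo::DOM->new($html)->at('p')->wrap('<div></div>')->root;
print "$wrapped\n";    # <div><p>Test</p></div>

# wrap_content: the fragment ends up inside <p>, around its content
my $inner = Mojo::DOM->new($html)->at('p')->wrap_content('<b></b>')->root;
print "$inner\n";      # <p><b>Test</b></p>
```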

XML mode switch

my $bool = $dom->xml;
$dom = $dom->xml($bool);
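
A visible effect of the switch is case-sensitivity. In the sketch below, the same made-up source string is parsed once with the default HTML semantics and once with XML mode turned on.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $source = '<P ID="greeting">Hi!</P>';

# Default HTML semantics: tag and attribute names are lowercased
my $html = Mojo::DOM->new($source);
print "$html\n";    # <p id="greeting">Hi!</p>

# XML mode: names are kept exactly as written
my $xml = Mojo::DOM->new->xml(1)->parse($source);
print "$xml\n";     # <P ID="greeting">Hi!</P>
```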

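Filtering with grep

grep is a Mojo::Collection method. Here its argument is an anonymous function:

# Keep <img> elements whose src contains "http"; a missing src is treated as an empty string
my $img_collection = $dom->find('img')->grep(sub { ($_->attr('src') // '') =~ /http/ });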

perldoc LWP::Simple


http://example.com/2023/10/26/Perl-LWP库/
Author: Jie
Published: October 26, 2023