(redirected from Main.GettingTheHeadOfAWebPage)
On this page... (hide)
Sometimes, all you want to retrieve is just the headers from a web page. This can be done in a number of ways depending on what language you are using. Here, I will show how one can do it in PHP and Perl.
PHP methods
HTTP.php
Installation
First, you need to install the HTTP.php module. This is done with the Pear? utility:
Using HTTP.php
Once that is installed, you need to include it in your script:
Then, you can call the head method with the URL you are seeking information from:
You can check if there are valid headers by checking to see what type of object $headers is:
$errors[] = "Invalid URI.";
return NULL;
}
From then on, you can discover more about the headers that got returned with the call.
# process error code
}
PHP cURL Library
CURL is a library that handles fetching of web items that is built in to PHP. It has many many options and can also be used to fetch just the head of the page. It's not quite as convenient as using the HTTP.php pear extension, but if you can't use that, CURL is still available.
From the basic curl example page on php.net:
The option we're most interested in here is the CURLOPT_HEADER option, which should be set to 1 to retrieve the header.
- $ch = curl_init("http://www.tamaratemple.com/"); // sets the URL to retrieve
- $fp = fopen("example_homepage.txt", "w"); // opens a file to receive the body of the page
- $hp = fopen("example_homepage_header.txt", "w"); // opens of file to receive the header of the page
- curl_setopt($ch, CURLOPT_FILE, $fp); // writes page contents to file
- curl_setopt($ch, CURLOPT_HEADER, false); // do not include the header with the page contents
- curl_setopt($ch, CURLOPT_WRITEHEADER, $hp); // file to hold the header
- curl_exec($ch); // execute the curl operation
- // close curl and file handles
- curl_close($ch);
- fclose($fp);
- fclose($hp);
Then, you can read in the header and do something with it:
- $header = file_get_contents("example_homepage_header.txt");
- echo "Header:".PHP_EOL;
- echo "----".PHP_EOL.$header."----".PHP_EOL;
- echo PHP_EOL;
- // Processing the header
- $header_lines = explode("\n",$header); // turn the text into lines in an array
- $header_lines = array_map("rtrim",$header_lines);
- print_r($header_lines);
- // First line of header is the response text and code
- $response = explode(" ",$header_lines[0], 3);
- $protocol = explode("/",$response[0]);
- $header_hash['Protocol'] = $protocol[0];
- $header_hash['ProtocolVersion'] = $protocol[1];
- $header_hash['ResponseCode'] = $response[1];
- $header_hash['ResponseText'] = $response[2];
- // Remainder of header is field: value lines
- for ($i=1; $i<count($header_lines); $i++) {
- if (empty($header_lines[$i])) continue;
- $line = explode(": ",$header_lines[$i],2);
- $header_hash[$line[0]] = $line[1];
- }
- print_r($header_hash);
The output from the combined example would be:
$ php curlexample.php
Header:
----
HTTP/1.1 200 OK
Date: Mon, 09 Apr 2012 22:00:17 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.13
X-Pingback: http://www.tamaratemple.com/xmlrpc.php
Set-Cookie: PHPSESSID=h3h24ha5po4o1c2tjb76d4obk1; path=/
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
----
Array
(
[0] => HTTP/1.1 200 OK
[1] => Date: Mon, 09 Apr 2012 22:00:17 GMT
[2] => Server: Apache/2.2.3 (CentOS)
[3] => X-Powered-By: PHP/5.2.13
[4] => X-Pingback: http://www.tamaratemple.com/xmlrpc.php
[5] => Set-Cookie: PHPSESSID=h3h24ha5po4o1c2tjb76d4obk1; path=/
[6] => Transfer-Encoding: chunked
[7] => Content-Type: text/html; charset=UTF-8
[8] =>
[9] =>
)
Array
(
[Protocol] => HTTP
[ProtocolVersion] => 1.1
[ResponseCode] => 200
[ResponseText] => OK
[Date] => Mon, 09 Apr 2012 22:00:17 GMT
[Server] => Apache/2.2.3 (CentOS)
[X-Powered-By] => PHP/5.2.13
[X-Pingback] => http://www.tamaratemple.com/xmlrpc.php
[Set-Cookie] => PHPSESSID=h3h24ha5po4o1c2tjb76d4obk1; path=/
[Transfer-Encoding] => chunked
[Content-Type] => text/html; charset=UTF-8
)
Perl
The standard way of getting urls in perl is via the estimable LWP::UserAgent? module. A simple example:
- #!/opt/local/bin/perl -w
- use strict;
- use LWP::UserAgent;
- use Data::Dumper::Names;
- my $ua = LWP::UserAgent->new;
- my $response = $ua->get("http://www.tamaratemple.com");
- die "$response->status_line" if ! $response->is_success;
- # The header is in $response->{_headers}
- # Access fields in header with $response->header($field)
- print Dumper($response->{_headers}) . "\n\n";
- print "Content-type is: " . $response->header("Content-type") . "\n";
Produces:
$VAR1 = bless( {
'x-meta-viewport' => 'width=device-width',
'connection' => 'close',
'set-cookie' => 'PHPSESSID=h15e25e00vqcg06fviallshsk5; path=/',
'date' => 'Mon, 09 Apr 2012 22:49:39 GMT',
'client-peer' => '97.74.126.203:80',
'client-date' => 'Mon, 09 Apr 2012 22:49:40 GMT',
'content-type' => 'text/html; charset=UTF-8',
'client-transfer-encoding' => [
'chunked'
],
'server' => 'Apache/2.2.3 (CentOS)',
'link' => [
'<http://gmpg.org/xfn/11>; rel="profile"',
'<http://www.tamaratemple.com/wp-content/themes/twentyeleven/style.css>; media="all"; rel="stylesheet"; type="text/css"',
'<http://www.tamaratemple.com/xmlrpc.php>; rel="pingback"',
'<http://www.tamaratemple.com/feed/>; rel="alternate"; title="Tamara Temple\'s Blog » Feed"; type="application/rss+xml"',
'<http://www.tamaratemple.com/comments/feed/>; rel="alternate"; title="Tamara Temple\'s Blog » Comments Feed"; type="application/rss+xml"',
'<http://www.tamaratemple.com/showcase/feed/>; rel="alternate"; title="Tamara Temple\'s Blog » Showcase Comments Feed"; type="application/rss+xml"',
'<http://www.tamaratemple.com/wp-content/themes/twentyeleven/colors/dark.css>; id="dark-css"; media="all"; rel="stylesheet"; type="text/css"',
'<http://www.tamaratemple.com/wp-content/plugins/rtsocial/styles/style.css?ver=3.3.1>; id="styleSheet-css"; media="all"; rel="stylesheet"; type="text/css"',
'<http://www.tamaratemple.com/wp-content/plugins/twitter-blackbird-pie/css/blackbirdpie.css?ver=20110416>; id="blackbirdpie-css-css"; media="all"; rel="stylesheet"; type="text/css"',
'<http://www.tamaratemple.com/xmlrpc.php?rsd>; rel="EditURI"; title="RSD"; type="application/rsd+xml"',
'<http://www.tamaratemple.com/wp-includes/wlwmanifest.xml>; rel="wlwmanifest"; type="application/wlwmanifest+xml"',
'<http://www.tamaratemple.com/visitors/>; rel="prev"; title="Visitors"',
'<http://www.tamaratemple.com/posts/>; rel="next"; title="Posts"',
'<http://www.tamaratemple.com/>; rel="canonical"'
],
'x-powered-by' => 'PHP/5.2.13',
'x-meta-generator' => 'WordPress 3.3.1',
'client-response-num' => 1,
'x-meta-charset' => 'UTF-8',
'x-pingback' => 'http://www.tamaratemple.com/xmlrpc.php',
'title' => 'Tamara Temple\'s Blog | Rumblings and Ramblings of a Well-Travelled Mouse',
'x-meta-google-site-verification' => 'X8LsSEQsjd45g6RMdTR0CCVkl-3pgVAgOb4oG7s5JH0'
}, 'HTTP::Headers' );
Content-type is: text/html; charset=UTF-8
| Tags: | Categories: Articles |
Recent Changes |
Printable View |
Page History |
Edit Page
Page last modified on April 09, 2012, at 11:00 PM by tamara