The XML_Statistics PEAR Package
To analyze XML documents, you can install and use the XML_Statistics package, which provides methods for obtaining statistics about tags, attributes, entities, processing instructions, data blocks, and CDATA sections for any well-formed XML document.
To install the package, use this command (version 0.1 is a beta release):
> pear install --alldeps XML_Statistics-0.1
Authors' Note: Use the --alldeps option when installing, because XML_Statistics depends on the XML_Parser PEAR package, which is an XML parser based on PHP's built-in XML extension. The latest release of XML_Parser is 1.2.8 (stable).
After installing the package, you'll find the code for the base class of the XML_Statistics package in the file XML/Statistics.php. Here are some of its main methods:
- boolean analyzeFile(mixed $file, string $filename): This function analyzes an XML document loaded from a file path or URL. To see the results of the analysis, use the countX() and getX() methods.
- integer countTag([string $tagname = null]): This function returns the number of occurrences of a tag in the entire XML document. You pass the tag name through the $tagname argument; if you omit it, as the example below does, the function returns the total number of tags.
- integer countAttribute([string $attribute = null], [string $tagname = null]): This function returns the number of occurrences of an attribute. You pass the attribute name to the function via the $attribute argument. If you don't specify the second argument, $tagname, then the function searches for the specified attribute in the entire XML document; otherwise, it limits the search range to the specified tag.
- integer getMaxDepth(): This function returns the maximum nesting level in the document, the "depth" of the XML tree.
- integer countTagsInDepth(integer $depth): This function returns the number of tags that "live" at the specified depth. The root tag depth is zero.
- integer countExternalEntity([string $name = null]): This function returns the number of occurrences of external entities. If you don't specify an entity name then the function counts the total number of external entities; otherwise, it counts only the occurrences of the specified entity.
- integer getCDataLength(): This function returns the total length of all CDATA sections.
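To make these statistics more concrete, here is a minimal sketch of how tag counts, attribute counts, and depth figures can be gathered with PHP's built-in XML extension, the same extension that XML_Parser is based on. This is not the XML_Statistics implementation: the collectXmlStats() function, the shape of the array it returns, and the sample document are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, NOT part of XML_Statistics: collects similar
// statistics with PHP's built-in XML extension.
function collectXmlStats(string $xml): array
{
    $stats = [
        'tags'       => [], // tag name => occurrences
        'attributes' => [], // tag name => [attribute name => occurrences]
        'perDepth'   => [], // depth => number of tags (the root is depth 0)
        'maxDepth'   => 0,
    ];
    $depth = 0;

    $parser = xml_parser_create();
    // Keep tag and attribute names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every opening tag.
        function ($p, string $name, array $attrs) use (&$stats, &$depth) {
            $stats['tags'][$name] = ($stats['tags'][$name] ?? 0) + 1;
            $stats['perDepth'][$depth] = ($stats['perDepth'][$depth] ?? 0) + 1;
            $stats['maxDepth'] = max($stats['maxDepth'], $depth);
            foreach (array_keys($attrs) as $attr) {
                $stats['attributes'][$name][$attr] =
                    ($stats['attributes'][$name][$attr] ?? 0) + 1;
            }
            $depth++;
        },
        // Called for every closing tag.
        function ($p, string $name) use (&$depth) {
            $depth--;
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        $message = xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException("XML error: " . $message);
    }

    return $stats;
}

// Hypothetical sample document (not the myxml.xml file from Listing 1).
$xml = <<<XML
<library>
  <book type="novel"><title>A</title></book>
  <book type="essay"><title>B</title><note/></book>
  <magazine type="weekly"/>
</library>
XML;

$stats = collectXmlStats($xml);
echo "Total tags: " . array_sum($stats['tags']) . "\n";
echo "Maximum depth: " . $stats['maxDepth'] . "\n";
echo "Tags at depth 2: " . ($stats['perDepth'][2] ?? 0) . "\n";
echo "'type' on book tags: " . ($stats['attributes']['book']['type'] ?? 0) . "\n";
```

Keeping the attribute counts keyed by tag name mirrors the optional $tagname argument of countAttribute(): summing across all tags gives the document-wide figure, while a single key gives the count restricted to one tag.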
Listing 1 contains an XML document used for test purposes in the example code (see the file myxml.xml in the downloadable code).
The following PHP example uses the XML document in Listing 1 to retrieve some basic statistics about tags, attributes, and CDATA sections:
<?php
// import Statistics.php
require_once 'XML/Statistics.php';

// ignore whitespace
$stat = new XML_Statistics(array("ignoreWhitespace" => true));

// analyze a file or URL
$result = $stat->analyzeFile("myxml.xml");

if ($stat->isError($result)) {
    die("Error: " . $result->getMessage());
} else {
    // total number of tags
    echo "Total tags: " . $stat->countTag() . "<br>";
    // count the occurrences of the 'type' attribute
    echo "Occurences of attribute type: " . $stat->countAttribute("type") . "<br>";
    // get the maximum depth
    echo "Maximum depth: " . $stat->getMaxDepth() . "<br>";
    // count the total number of tags at depth 3
    echo "The number of tags in depth 3: " . $stat->countTagsInDepth(3) . "<br>";
    // count the occurrences of data blocks
    echo "Data chunks: " . $stat->countDataChunks() . "<br>";
    // get the total length of all CDATA sections
    echo "Length of all data chunks: " . $stat->getCDataLength() . "<br>";
}
?>
The output of this example is:
Total tags: 16
Occurences of attribute type: 2
Maximum depth: 3
The number of tags in depth 3: 10
Data chunks: 6
Length of all data chunks: 93
You can combine the XML statistics with statistical functions such as max, min, midrange, sum, variance, and quartiles by installing the Math_Stats PEAR package, which is often used together with XML_Statistics. You install Math_Stats like this:
pear install Math_Stats
The most commonly used functions of the Math_Stats package are:
- mixed calcBasic([boolean $returnErrorObject = true]): This function calculates a basic set of statistics: minimum, maximum, sum, sum of squares, count, mean, standard deviation, and variance.
- mixed calcFull([boolean $returnErrorObject = true]): This function calculates a full set of statistics.
The next example demonstrates both these functions using the myxml.xml document in Listing 1:
<?php
// import Statistics.php and Stats.php
require_once 'XML/Statistics.php';
require_once 'Math/Stats.php';

$stat = new XML_Statistics();
$result = $stat->analyzeFile("myxml.xml");

if ($stat->isError($result)) {
    die("Error: " . $result->getMessage());
} else {
    // get the number of tags per tag name
    $tags = $stat->getTagOccurences();
    // use the Math_Stats class
    $stats = new Math_Stats();
    // set the data
    $stats->setData($tags);
    // calculate a basic set of statistics
    $stats1 = $stats->calcBasic();
    // calculate a full set of statistics
    $stats2 = $stats->calcFull();

    echo "<pre>";
    echo '<b><u>' . "A basic set of statistics" . '</u></b><br /><br />';
    // print the basic set of statistics
    print_r($stats1);
    echo '<br /><b><u>' . "A full set of statistics" . '</u></b><br /><br />';
    // print the full set of statistics
    print_r($stats2);
    echo "</pre>";
}
?>
The "basic statistics" portion of the output of this example is shown below. Listing 2 shows the full output.
A basic set of statistics
Array
(
[min] => 1
[max] => 2
[sum] => 16
[sum2] => 20
[count] => 14
[mean] => 1.1428571428571
[stdev] => 0.36313651960128
[variance] => 0.13186813186813
)
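As a quick sanity check, you can reproduce these figures by hand in plain PHP, without either PEAR package. The sketch below assumes the per-tag counts are twelve 1s and two 2s, which is the only set of whole numbers consistent with the min, max, count, and sum shown above. It also assumes the sample variance (dividing by n - 1), because that choice reproduces the listed value; this is an inference from the output, not a statement about how Math_Stats is implemented.

```php
<?php
// Tag counts reconstructed from the output above: min 1, max 2, count 14 and
// sum 16 imply twelve tags that occur once and two tags that occur twice.
$counts = array_merge(array_fill(0, 12, 1), array_fill(0, 2, 2));

$n    = count($counts);
$sum  = array_sum($counts);
$sum2 = array_sum(array_map(fn($v) => $v * $v, $counts)); // sum of squares
$mean = $sum / $n;

// Sample variance (divide by n - 1); an assumption that matches the output.
$variance = ($sum2 - $sum * $sum / $n) / ($n - 1);
$stdev    = sqrt($variance);

printf("count=%d sum=%d sum2=%d\n", $n, $sum, $sum2);
printf("mean=%.6f variance=%.6f stdev=%.6f\n", $mean, $variance, $stdev);
// prints:
// count=14 sum=16 sum2=20
// mean=1.142857 variance=0.131868 stdev=0.363137
```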
Now that you've seen how to generate reports and statistics from flat files, you can move on to generating them from data stored in relational databases.