How can I make a function keep repeating until it has found all the information?

Stack Overflow user

Asked on 2010-07-18 13:56:46

Answers: 4, Views: 359, Followers: 0, Votes: 2

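I want to create a PHP function that goes through a website's home page, finds all the links on the home page, follows the links it finds, and keeps going until every link on the site has been exhausted. I really need to build something like this so that I can spider my network of sites and provide a "one-stop" search.

Here is what I have so far: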

function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if(empty($current_array)) {
        // Make the request to the original URL
        $session = curl_init($urltospider);
        curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($session);
        curl_close($session);
        if($html != '') {
            $dom = new DOMDocument();
            @$dom->loadHTML($html);
            $xpath = new DOMXPath($dom);
            $hrefs = $xpath->evaluate("/html/body//a");
            for($i = 0; $i < $hrefs->length; $i++) {
                $href = $hrefs->item($i);
                $url = $href->getAttribute('href');
                if(!in_array($url, $ignore_array) && !in_array($url, $current_array)) {
                    // Add this URL to the current spider array
                    $current_array[] = $url;
                }
            }               
        } else {
            die('Failed connection to the URL');
        }
    } else {
        // There are already URLs in the current array
        foreach($current_array as $url) {
            // Connect to this URL

            // Find all the links in this URL

            // Go through each URL and get more links
        }
    }
}

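The only problem is that I can't work out how to continue from here. Can anyone help me? Basically, this function should keep repeating until everything has been found.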

php

loops

web-crawler

4 Answers

Stack Overflow user

Accepted answer

Posted on 2010-07-18 14:11:22

I'm not a PHP expert, but you seem to be overcomplicating this.

function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if(empty($current_array)) {
        $current_array[] = $urltospider;
    }
    $cur_crawl = 0;
    // Don't use foreach here: it can get messed up when the array changes inside the loop.
    while ($cur_crawl < count($current_array)) {
        $links_found = crawl($current_array[$cur_crawl]); // crawl() should return all links found on the given page
        // Append only links that are not ignored and not already queued, so no page is crawled twice.
        foreach ($links_found as $link) {
            if (!in_array($link, $ignore_array) && !in_array($link, $current_array)) {
                $current_array[] = $link;
            }
        }
        $cur_crawl += 1;
    }
    return $current_array;
}
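The loop above relies on a crawl() function that the answer leaves undefined. A minimal sketch of what it might look like follows, reusing the cURL and DOMDocument calls from the question. The helper name extract_links is an assumption made here for illustration, and links are returned exactly as written in the page (relative URLs are not resolved).

```php
<?php
// Hypothetical helper (not from the answer): pull every href out of an HTML string.
function extract_links(string $html): array {
    if (trim($html) === '') {
        return array();
    }
    $dom = new DOMDocument();
    // Suppress the warnings that malformed real-world markup triggers.
    @$dom->loadHTML($html);
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    // Drop duplicates and reindex from zero.
    return array_values(array_unique($links));
}

// Hypothetical crawl(): fetch a page with cURL and return the links it contains.
function crawl(string $url): array {
    $session = curl_init($url);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($session);
    // curl_exec() returns false on failure; treat that as "no links".
    return $html === false ? array() : extract_links($html);
}
```

Keeping the parsing in its own function means it can be checked against a literal HTML string without any network access.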

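Votes: 3

Stack Overflow user

Posted on 2010-07-18 14:12:36

The word you are looking for is recursion. In your foreach loop, you would simply call spider again; it enters the function for each URL and does the spidering recursively.

There is a rather significant problem, though: you have no base case, unless you eventually reach pages that have no links to other pages (dead ends). This function will run forever and never terminate. You need to bound it in a couple of ways.

Use memoization to remember the results for URLs you have already seen, rather than requesting the same page over and over.
Restrict the URLs you will visit to a particular domain, i.e. those starting with "http://www.somedomain.com", so that you do not end up spidering the entire internet.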
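Both bounds can be sketched as small helpers. This is only an illustration: the function names is_same_site and mark_seen are assumptions, and the domain check compares the parsed host rather than a string prefix, so that a host such as www.somedomain.com.evil.example is not accepted by accident.

```php
<?php
// Hypothetical helper: true when $url belongs to the host we want to crawl.
function is_same_site(string $url, string $allowedHost): bool {
    // parse_url() yields null for the host of a relative URL such as "/about".
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $allowedHost) === 0;
}

// Hypothetical helper: returns true the first time a URL is seen and records it.
// URLs are stored as array keys, so the lookup does not scan the whole list.
function mark_seen(string $url, array &$seen): bool {
    if (isset($seen[$url])) {
        return false;
    }
    $seen[$url] = true;
    return true;
}
```

A spider would then follow a link only when is_same_site() and mark_seen() both return true.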

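Votes: 2

Stack Overflow user

Posted on 2010-07-18 14:17:02

What you (probably) want to use is called "recursion".

Web pages form a graph. There are several algorithms for traversing graphs; the easiest to understand is depth-first.

Suppose your site is laid out like this (with the recursion cut off):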

* http://example.com/
  * http://example.com/
    * ...
  * http://example.com/post/1/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/archives/
      * ...
  * http://example.com/post/2/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/archives/
      * ...
  * http://example.com/post/3/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/archives/
      * ...
  * http://example.com/about/
    * http://example.com/
      * ...
    * http://example.com/archives/
  * http://example.com/archives/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/post/1/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/2/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/3/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/4/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/5/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...

When you first hit http://example.com/, you see the following links:

http://example.com/
http://example.com/post/1/
http://example.com/post/2/
http://example.com/post/3/
http://example.com/about/
http://example.com/archives/

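You need to keep track of the pages you have already visited so that you can ignore them. (Otherwise, spidering a page would take forever... literally.) You add to the ignore list every time you visit a page. Right now, the only entry in the ignore list is http://example.com/.

Next, filter out the ignored links, which reduces the list to: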

http://example.com/post/1/
http://example.com/post/2/
http://example.com/post/3/
http://example.com/about/
http://example.com/archives/
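The filtering step can be sketched with array_diff, using the links listed above. The variable names are chosen here for illustration only.

```php
<?php
// Links found on http://example.com/ in the example.
$links = array(
    'http://example.com/',
    'http://example.com/post/1/',
    'http://example.com/post/2/',
    'http://example.com/post/3/',
    'http://example.com/about/',
    'http://example.com/archives/',
);

// Pages visited so far: only the home page.
$ignoredUrls = array('http://example.com/');

// array_diff() keeps the original keys, so reindex with array_values().
$toVisit = array_values(array_diff($links, $ignoredUrls));
```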

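Then run the fetcher again on each of these links. You do this by calling the function again with the current URL and the ignore list: spider($url, $ignoredUrls). (The function takes $ignoredUrls by reference, declared as &$ignoredUrls in its signature, so that newly ignored items are visible to the parent spider calls.)

Looking at http://example.com/post/1/, we see the following links: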

http://example.com/
http://example.com/about/
http://example.com/archives/

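We have already looked at http://example.com/. The next non-ignored link is the about page. From the about page we go to the archives page, where we look through each post. Each post has the same set of links: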

http://example.com/
http://example.com/about/
http://example.com/archives/

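Because we have already visited all of these links, an empty array is returned.

Back in /archives/, we append the /post/2/ link (the first non-ignored link in /archives/) to a local $foundLinks variable, along with the return value of calling spider on /post/2/ (which is an empty array). Then we move on to the next post.

Once we have gone through all the posts, we return $foundLinks. The /about/ page then adds those links to its own $foundLinks, in addition to the /about/ link. Flow returns to /post/1/, which looks at /archives/ (now ignored). The /post/1/ spider is now finished and returns its own $foundLinks. Eventually, the original call receives all the links that were found.

This approach works well for small sites that are completely closed. If you link to Wikipedia, however, you will be spidering all day. You can combat this in at least two ways:

Terminating the spider after a certain depth (e.g. 10 links deep).

Restricting the URLs, e.g. to a certain domain or subdomain (such as example.com).

Here is a quick implementation of spider (untested):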

function get_links($url) {
    // curl/DOM code here; should return an array of the URLs linked from $url
}

define('SPIDER_MAX_DEPTH', 10);

function spider_internal($url, &$ignoredUrls, $depth = 0) {
    $foundUrls = array($url);

    // Mark this URL as visited so it is not crawled again
    $ignoredUrls[] = $url;

    // Stop descending once the depth limit is reached
    if($depth >= SPIDER_MAX_DEPTH) {
        return $foundUrls;
    }

    $links = get_links($url);

    foreach($links as $link) {
        if(array_search($link, $ignoredUrls) !== false) {
            continue;
        }

        $foundUrls = array_merge($foundUrls, spider_internal($link, $ignoredUrls, $depth + 1));
    }

    return $foundUrls;
}

function spider($url) {
    $ignoredUrls = array();

    return spider_internal($url, $ignoredUrls);
}
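To try the depth-first approach without touching the network, the example.com layout from this answer can be modelled as an in-memory array. The sketch below is an assumption-laden variant, not the answer's code: spider_with, $site, and $getLinks are hypothetical names, the link fetcher is passed in as a callable, and visited URLs are stored as array keys.

```php
<?php
// Hypothetical stand-in for the example site: page => links found on that page.
$site = array(
    'http://example.com/' => array(
        'http://example.com/',
        'http://example.com/post/1/',
        'http://example.com/post/2/',
        'http://example.com/post/3/',
        'http://example.com/about/',
        'http://example.com/archives/',
    ),
    'http://example.com/about/' => array(
        'http://example.com/',
        'http://example.com/archives/',
    ),
    'http://example.com/archives/' => array(
        'http://example.com/',
        'http://example.com/about/',
        'http://example.com/post/1/',
        'http://example.com/post/2/',
        'http://example.com/post/3/',
        'http://example.com/post/4/',
        'http://example.com/post/5/',
    ),
);
// Every post page links to the same three pages.
for ($i = 1; $i <= 5; $i++) {
    $site["http://example.com/post/$i/"] = array(
        'http://example.com/',
        'http://example.com/about/',
        'http://example.com/archives/',
    );
}

// Depth-first crawl; $getLinks is any callable mapping a URL to its links.
function spider_with(callable $getLinks, string $url, array &$ignoredUrls, int $depth = 0, int $maxDepth = 10): array {
    $foundUrls = array($url);
    $ignoredUrls[$url] = true; // mark as visited before descending

    if ($depth >= $maxDepth) {
        return $foundUrls;
    }

    foreach ($getLinks($url) as $link) {
        if (isset($ignoredUrls[$link])) {
            continue; // already visited
        }
        $foundUrls = array_merge(
            $foundUrls,
            spider_with($getLinks, $link, $ignoredUrls, $depth + 1, $maxDepth)
        );
    }

    return $foundUrls;
}

$getLinks = function (string $url) use ($site): array {
    return isset($site[$url]) ? $site[$url] : array();
};

$ignoredUrls = array();
$found = spider_with($getLinks, 'http://example.com/', $ignoredUrls);
sort($found);
print_r($found);
```

Swapping $getLinks for a function that fetches and parses real pages would turn the same traversal into an actual crawler.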

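Votes: 2

The original content of this page is provided by Stack Overflow. Translation support is provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: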

https://stackoverflow.com/questions/3274504
