
How to get Wikipedia data for WikiProjects?

Stack Overflow user
Asked on 2019-02-17 10:07:22
3 answers · 633 views · 0 followers · Score 1

I recently discovered that Wikipedia has WikiProjects categorized by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As the link shows, there are 34 disciplines.

I would like to know whether it is possible to get all the Wikipedia articles associated with each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer science related articles on Wikipedia using the WikiProject Computer science category? If so, does a data dump of it exist, or is there any other way to obtain this data?

I am currently using Python (i.e. pywikibot and pymediawiki). However, answers in other languages are also welcome.

I am happy to provide more details if needed.


3 Answers

Stack Overflow user

Accepted answer

Posted on 2019-02-18 00:29:29

Adding to @arash's answer, as I suggested there: you can use the Wikipedia API to get the Wikipedia data. Here is a link describing how to do this: API:Categorymembers#GET_request

Since you said you need to fetch the data programmatically, here is sample code in JavaScript. It fetches the first 500 names from Category:WikiProject_Computer_science_articles and prints them as output. You can port this example to the language of your choice:

Code language: javascript
// Importing the module
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});

To write the data to a file, you can do something like this:

Code language: javascript
//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = [];
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    // Joining the array into one comma-separated string before writing
    fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));
});

Because we collected the titles in a JavaScript array above, the file stores the data comma-separated. If instead you want one title per line, without commas, you need to do this:

Code language: javascript
//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = '';
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});

Because of cmlimit we cannot fetch more than 500 titles at a time, so we need to use cmcontinue to check for and fetch the next page...

Try the code below; it fetches all the titles of a particular category, prints them, and appends the data to a file:

Code language: javascript
//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Method to fetch one page of results and append the titles to a file
var fetchTheData = async (url) => {
    return await fetch(url).then(res => res.json()).then(data => {
        // Getting the length of the returned array
        let len = data.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing the names
            let title = data.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        // Appending to the file
        fs.appendFileSync('pathtotitles\\titles.txt', titles);
        // Returning the continuation token, or undefined once there
        // are no more pages to fetch
        return data.continue ? data.continue.cmcontinue : undefined;
    });
}

// Method which constructs the next page URL from the continuation
// token and keeps fetching until every page has been retrieved
var constructNextPageURL = async (url) => {
    // Getting the first page and its continuation token
    let nextPage = await fetchTheData(url);
    while (nextPage !== undefined) {
        console.log("=> The next page URL is : " + (url + '&cmcontinue=' + nextPage));
        // Constructing the next page URL with the token and fetching it
        nextPage = await fetchTheData(url + '&cmcontinue=' + nextPage);
    }
    console.log("===>>> Finished Fetching...");
}

// Calling to begin extraction
constructNextPageURL(url);

I hope this helps...

Score 3

Stack Overflow user

Posted on 2019-02-17 17:43:45

You can use API:Categorymembers to get the list of subcategories and pages. Set the cmtype parameter to subcat to get subcategories, and set cmnamespace to 0 to get articles.
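
For illustration, here is a minimal Python sketch of those two calls using the requests library; the helper name and the example categories are only illustrative, not something prescribed by this answer:

Code language: python
# A minimal sketch, assuming the 'requests' library; the helper name
# and the example categories below are illustrative.
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def list_members(category, **extra):
    """Yield all members of a category, following API continuation."""
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmlimit': 500,
        **extra,
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        yield from data['query']['categorymembers']
        if 'continue' not in data:
            break
        params.update(data['continue'])

# cmtype=subcat: only the subcategories (here, the 34 disciplines)
for sub in list_members('Category:WikiProjects_by_discipline', cmtype='subcat'):
    print(sub['title'])

# cmnamespace=0: only pages in the article namespace
for page in list_members('Category:Computer science', cmnamespace=0):
    print(page['title'])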

You can also get the lists directly from the database (the category hierarchy is in the categorylinks table, and article information is in the page table).
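
If you have access to a database replica (for example through Wikimedia Toolforge), a rough sketch of that query might look like the following; the host, credentials, and driver choice (pymysql) are placeholders and assumptions, not part of the original answer:

Code language: python
# A rough sketch, assuming access to an enwiki database replica;
# host, user, and password below are placeholders.
import pymysql

conn = pymysql.connect(host='<replica-host>', user='<user>',
                       password='<password>', database='enwiki_p')

# categorylinks.cl_from holds the member's page_id, and cl_to holds the
# category title (no "Category:" prefix, spaces as underscores).
sql = """
    SELECT page.page_title
    FROM categorylinks
    JOIN page ON page.page_id = categorylinks.cl_from
    WHERE categorylinks.cl_to = 'WikiProject_Computer_science_articles'
"""

with conn.cursor() as cur:
    cur.execute(sql)
    for (title,) in cur.fetchall():
        # page_title is stored as binary on the replicas
        print(title.decode('utf-8'))
conn.close()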

Score 2

Stack Overflow user

Posted on 2021-11-26 09:27:59

Found this page among my Google search results, so I'm leaving some working code here for posterity. It interacts with Wikipedia's API directly, without pywikibot or pymediawiki.

Getting the article names is a two-step process, because the members of the category are not the articles themselves but their talk pages. So first we get the talk pages, and then we have to get the parent pages, the actual articles.

(For details on the parameters used in the API requests, see the querying category members and querying page info pages.)

Code language: python
import time
import requests
from datetime import datetime,timezone
import json

utc_time_now = datetime.now(timezone.utc)
utc_time_now_string =\
utc_time_now.replace(microsecond=0).replace(tzinfo=None).isoformat() + 'Z'

api_url = 'https://en.wikipedia.org/w/api.php'
# Identify yourself per
# https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header
headers = {'User-Agent': ('<Your purpose>, owner_name: <Your name>, '
                          'email_id: <Your email id>')}

category = "Category:WikiProject_Computer_science_articles"

combined_category_members = []

params = {
        'action': 'query',
        'format': 'json',
        'list':'categorymembers',
        'cmtitle': category,
        'cmprop': 'ids|title|timestamp',
        'cmlimit': 500,
        'cmstart': utc_time_now_string,
        # you can also put a 'cmend': '20210101000000'
        # (that YYYYMMDDHHMMSS string stands for 12 am UTC on Jan 1, 2021)
        # this then gathers category members added from now back to 'cmend'
        'cmdir': 'older',
        'cmnamespace': '0|1',
        'cmsort': 'timestamp'
}

response = requests.get(api_url, headers=headers, params=params)
data = response.json()
category_members = data['query']['categorymembers']
combined_category_members.extend(category_members)

while 'continue' in data:
    params.update(data['continue'])
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    category_members = data['query']['categorymembers']
    combined_category_members.extend(category_members)

#now we've gotten only the talk page ids so far
#now we have to get the parent page ids from talk page ids

final_dict = {}

talk_page_id_list = []
for member in combined_category_members:
    talk_page_id = member['pageid']
    talk_page_id_list.append(talk_page_id)

while talk_page_id_list: #while not an empty list
    fifty_pageid_batch = talk_page_id_list[0:50]
    fifty_pageid_batch_converted = [str(number) for number in fifty_pageid_batch]
    fifty_pageid_string = '|'.join(fifty_pageid_batch_converted)
    params = {
            'action':   'query',
            'format':   'json',
            'prop':     'info',
            'pageids':  fifty_pageid_string,
            'inprop': 'subjectid|associatedpage'
            }
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    for talk_page_id, talk_page_id_dict in data['query']['pages'].items():
        page_id_raw = talk_page_id_dict['subjectid']
        page_id = str(page_id_raw)
        page_title = talk_page_id_dict['associatedpage']
        final_dict[page_id] = page_title

    del talk_page_id_list[0:50] 

with open('comp_sci_category_members.json', 'w', encoding='utf-8') as filex:
    json.dump(final_dict, filex, ensure_ascii=False)
Score 0
Original question:

https://stackoverflow.com/questions/54729496
