How can I extract information from HTML using Mathematica?

### 2 Answers

```mathematica
(* Import the tables and lists on the page as nested lists *)
tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]
```

```mathematica
(* Keep the three-element rows whose second entry is a number: {country, rate, source/date} *)
tmp1 = Cases[tmp, {_, _?NumberQ, _}, Infinity]
```

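```mathematica
(* Strip the trailing footnote marker (for example " [3]") from string entries in column 3 *)
tmp1[[All, 3]] = Flatten[
   If[StringQ[#],
      StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x],
      #] & /@ tmp1[[All, 3]]]

Grid[tmp1, Frame -> All]
```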
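If a string entry in the third column had no footnote marker, `StringCases` would return an empty list for it and `Flatten` would shorten the column, so the result would no longer fit back into `tmp1[[All, 3]]`. A minimal sketch of a more forgiving variant follows, assuming the markers always sit at the end of the string; `stripRef` is a hypothetical helper name, not part of the original answer.

```mathematica
(* stripRef is a hypothetical helper: drop a trailing " [n]" marker if one is present *)
stripRef[s_String] :=
  StringReplace[s, Whitespace ~~ "[" ~~ __ ~~ EndOfString -> ""]
stripRef[x_] := x  (* numbers and other non-strings pass through unchanged *)

(* Apply it to the third column of the table built above *)
tmp1[[All, 3]] = stripRef /@ tmp1[[All, 3]];
```

Because every entry maps to exactly one result, the column keeps its length whether or not a marker is found.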

```mathematica
(* Show the table again with a header row *)
Grid[Join[{{"Country / Region", "Unemployment rate (%)",
    "Source / date of information"}}, tmp1], Frame -> All]
```

```mathematica
(* Clean the country names: drop a trailing "(...)" qualifier and trim whitespace *)
tmp2 = Flatten[
   If[StringMatchQ[#, __ ~~ "(" ~~ __],
      StringCases[#,
       z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :> StringTrim@z],
      StringTrim[#]] & /@ tmp1[[All, 1]]]
```

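```mathematica
(* Look up each flag; names that CountryData does not recognize stay unevaluated *)
flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]  (* list the lookups that failed *)
```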

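```mathematica
(* Blank out the failed lookups and wrap each flag in a list;
   much faster than rule replacement *)
flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags;
(* Prepend the flag column to the table *)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]
```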

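```mathematica
(* parseText and postProcess are not defined in this excerpt;
   they must be defined before this function is called *)
Clear[findAndParseTables];
findAndParseTables[text_String] :=
  Module[{parsed = postProcess@parseText[text]},
   (* Collect every table, then drop attribute and span nodes *)
   DeleteCases[
     Cases[parsed, _tableContainer, Infinity],
     _attribContainer | _spanContainer, Infinity
     ] //.
    (* Flatten the remaining containers into plain nested lists *)
    {(supContainer | tdContainer | trContainer | thContainer)[x___] :> {x},
     iContainer[x___] :> x,
     aContainer[x_] :> x,
     "\n" :> Sequence[],
     divContainer[] | ulContainer[] | liContainer[] | aContainer[] :> Sequence[]}];
```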

```mathematica
text = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Text"];
myData = First@findAndParseTables[text];
```

```
In[92]:= Short[myData,5]
Out[92]//Short=
tableContainer[{{Country / Region},{Unemployment rate (%)},{Source / date of information}},
{{Afghanistan},{35.0},{2008,{3}}},{{Albania},{13.49},{2010 (Q4),{4}}},
{{Algeria},{10.0},{2010 (September),{5}}},<<188>>,{{West Bank},{17.2},{2010,{43}}},
{{Yemen},{35.0},{2009 (June),{128}}},{{Zambia},{16.0},{2005,{129}}},{{Zimbabwe},{97.0},{2009}}]
```