
Kaggle case study: the worldwide distribution of nuclear power plants

Author: 用户7010445
Published: 2020-03-03 14:37:00

Original notebook: https://www.kaggle.com/jonathanbouchet/nuclear-power-plant-geo-data (Nuclear Power Plant Locations data)

Newly encountered R packages
  • skimr: skimr is designed to provide summary statistics about variables. It is opinionated in its defaults, but easy to modify. In base R, the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors. Put simply, skim() is an upgraded version of summary().
  • Run help(package="skimr") to see the small examples provided in the help documentation.
>summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

>fivenum(iris$Sepal.Length)
[1] 4.3 5.1 5.8 6.4 7.9

>skim(iris)
Skim summary statistics
 n obs: 150 
 n variables: 5 

-- Variable type:factor --------------------------------------------------------
 variable missing complete   n n_unique                       top_counts ordered
  Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0   FALSE

-- Variable type:numeric -------------------------------------------------------
     variable missing complete   n mean   sd  p0 p25  p50 p75 p100     hist
 Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9 ▇▁▁▂▅▅▃▁
  Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5 ▇▁▁▅▃▃▂▂
 Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9 ▂▇▅▇▆▅▂▂
  Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4 ▁▂▅▇▃▂▁▁
>
  • lubridate: Functions to work with date-times and time-spans: fast and user friendly parsing of date-time data, extraction and updating of components of a date-time. Put simply, it provides functions for handling date and time formats.
> ymd("20110604")
[1] "2011-06-04"
> mdy("06-04-2011")
[1] "2011-06-04"
> dmy("04/06/2011")
[1] "2011-06-04"
>
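The parsing functions above cover only the first half of the package description. The "extraction and updating of components" part can be sketched as follows, a minimal example assuming lubridate is installed; the variable name `d` is just for illustration.

```r
library(lubridate)

d <- ymd("20110604")  # parse the string into a Date

# Extract individual components of the date
year(d)   # 2011
month(d)  # 6
day(d)    # 4

# Update a single component in place
year(d) <- 2019
d
```

The same accessor name is used for reading and for assignment, which keeps the code short when only one component of a date has to change.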
  • viridis: color palettes. The viridis color palettes: Use the color scales in this package to make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale.
library(ggplot2)  # scale_color_viridis_d() ships with ggplot2 (3.0.0 or later)
ggplot(mtcars,aes(wt,mpg))+
  geom_point(size=4,aes(colour=factor(cyl)))+
  scale_color_viridis_d()+theme_bw()
  • broom: Convert Statistical Analysis Objects into Tidy Tibbles. It turns the results of statistical computations into data frames.
> lmfit<-lm(mpg~wt,mtcars)
> lmfit

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  

> summary(lmfit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

> broom::tidy(lmfit)
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.3      1.88      19.9  8.24e-19
2 wt             -5.34     0.559     -9.56 1.29e-10
> broom::glance(lmfit)
# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
*     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl>
1     0.753         0.745  3.05      91.4 1.29e-10     2  -80.0  166.
# ... with 3 more variables: BIC <dbl>, deviance <dbl>,
#   df.residual <int>
> broom::augment(lmfit)
# A tibble: 32 x 10
   .rownames   mpg    wt .fitted .se.fit .resid   .hat .sigma .cooksd
 * <chr>     <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1 Mazda RX4  21    2.62    23.3   0.634 -2.28  0.0433   3.07 1.33e-2
 2 Mazda RX~  21    2.88    21.9   0.571 -0.920 0.0352   3.09 1.72e-3
 3 Datsun 7~  22.8  2.32    24.9   0.736 -2.09  0.0584   3.07 1.54e-2
 4 Hornet 4~  21.4  3.22    20.1   0.538  1.30  0.0313   3.09 3.02e-3
 5 Hornet S~  18.7  3.44    18.9   0.553 -0.200 0.0329   3.10 7.60e-5
 6 Valiant    18.1  3.46    18.8   0.555 -0.693 0.0332   3.10 9.21e-4
 7 Duster 3~  14.3  3.57    18.2   0.573 -3.91  0.0354   3.01 3.13e-2
 8 Merc 240D  24.4  3.19    20.2   0.539  4.16  0.0313   3.00 3.11e-2
 9 Merc 230   22.8  3.15    20.5   0.540  2.35  0.0314   3.07 9.96e-3
10 Merc 280   19.2  3.44    18.9   0.553  0.300 0.0329   3.10 1.71e-4
# ... with 22 more rows, and 1 more variable: .std.resid <dbl>
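Newly encountered functions
  • left_join: put simply, it merges two data frames on their shared columns, keeping every row of the left one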
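
A minimal sketch of what left_join does, assuming dplyr is installed. The two toy tables `plants` and `regions` and their contents are made up for illustration and are not taken from the Kaggle data.

```r
library(dplyr)

# Hypothetical toy data: a table of plants and a country lookup table
plants <- data.frame(
  plant   = c("A", "B", "C"),
  country = c("France", "China", "Atlantis"),
  stringsAsFactors = FALSE
)
regions <- data.frame(
  country   = c("France", "China"),
  continent = c("Europe", "Asia"),
  stringsAsFactors = FALSE
)

# Every row of `plants` is kept; rows without a match in `regions`
# get NA in the columns that come from `regions`
joined <- left_join(plants, regions, by = "country")
joined
```

Plant C stays in the result with `continent` set to NA, which is the difference from an inner join, where unmatched rows would be dropped.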

When I used dplyr::rename, it failed with: Error: `petal_length` = Petal.Length must be a symbol or a string, not a formula. Searching for the error led me to one possible fix: https://stackoverflow.com/questions/47755534/dplyr-rename-error-new-name-old-name-must-be-a-symbol-or-a-string-not-fo. After I switched from R-3.4.2 to R-3.5.1, the error no longer appeared.
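
For reference, the call form that rename expects is new_name = old_name. The sketch below assumes a working dplyr installation and uses the built-in iris data; it does not try to reproduce the error above. The base-R variant at the end is an alternative that does not depend on the dplyr version, and `iris2` is only an illustrative name.

```r
library(dplyr)

# dplyr::rename() takes new_name = old_name
renamed <- rename(iris, petal_length = Petal.Length)
names(renamed)

# Base-R alternative: overwrite the matching entry of names()
iris2 <- iris
names(iris2)[names(iris2) == "Petal.Length"] <- "petal_length"
names(iris2)
```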

  • fortify(): I have not yet worked out what this function does. The help documentation says it may be dropped: "fortify may be deprecated in the future. I now recommend using the broom package"
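
As a rough guide, fortify() converts an object into a data frame that ggplot2 can plot. When the input is already a plain data frame, it is handed back unchanged, which can be checked with the minimal sketch below. It assumes ggplot2 is installed; `df` and `same` are illustrative names.

```r
library(ggplot2)

# A plain data frame with made-up coordinates
df <- data.frame(long = c(0, 1, 1), lat = c(0, 0, 1))

# fortify() on a data frame is a pass-through
same <- fortify(df)
identical(same, df)
```

For fitted models, the broom output shown earlier (broom::augment) is the replacement that the help page recommends.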
Reproducing the two maps from the original notebook
  • Drawing a map with ggplot2
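library(rworldmap)
library(ggplot2)
library(ggthemes)  # provides theme_fivethirtyeight()

# map_data("world") requires the maps package and already returns a data frame
worldMap <- fortify(map_data("world"), region = "region")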
ggplot() + 
  geom_map(data = worldMap, 
           map = worldMap,aes(x = long, y = lat,
                              map_id = region, 
                              group = group),
           fill = "white", color = "black", size = 0.1) + 
  theme_fivethirtyeight(10)
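  • Worldwide distribution of nuclear power plants. I am skipping the data-integration part for now and will come back to it when I have time!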
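library(ggplot2)
library(rworldmap)
library(ggthemes)  # provides theme_fivethirtyeight()
# `res` is the map data frame with a totMWe column, built in the skipped data-integration step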
ggplot(res) + 
  geom_polygon(aes(x=long, y=lat,group=group,fill=totMWe),
               color='white', size=.1) + 
  theme_fivethirtyeight() + 
  theme(panel.grid.major = element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank()) + 
  scale_fill_gradientn(name="",
                       colors = rev(viridis::viridis(50))) + 
  guides(fill = guide_colorbar(barwidth = 20, barheight = .5)) + 
  labs(title="Nuclear power plant landscape in 2019", 
       subtitle='energy produced(MWe) by nuclear source from active powerplant')
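
The plot above depends on a data frame `res` that the skipped data-integration step produces. The sketch below shows one way such a table could be assembled, under loud assumptions: the `plants` table, its `Country` and `MWe` columns, and all numbers are invented for illustration, and `world` is a tiny stand-in for map_data("world"). It is not the original notebook's code.

```r
library(dplyr)

# Hypothetical plant-level table; the real Kaggle columns may differ
plants <- data.frame(
  Country = c("France", "France", "China"),
  MWe     = c(900, 1300, 1000),
  stringsAsFactors = FALSE
)

# Toy stand-in for map_data("world"): one row per polygon vertex
world <- data.frame(
  long   = c(2, 3, 3, 104, 105, 105, 127, 128, 128),
  lat    = c(46, 46, 47, 35, 35, 36, 40, 40, 41),
  group  = rep(1:3, each = 3),
  region = rep(c("France", "China", "North Korea"), each = 3),
  stringsAsFactors = FALSE
)

# Total capacity per country
totals <- plants %>%
  group_by(Country) %>%
  summarise(totMWe = sum(MWe))

# Attach the totals to every vertex; countries without plants get NA
res <- left_join(world, totals, by = c("region" = "Country"))
res
```

Because the join keeps every map row, countries with no matching plants remain on the map with totMWe set to NA, and ggplot2 draws them in its missing-value color.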

Conclusions from the map above: the top 3 producers are the United States, France, and China; North Korea has no production.

Originally published 2019-05-04 on the WeChat public account 小明的数据分析笔记本 and shared through the Tencent Cloud self-media sharing program.
