作者 | Kirsten Perry
来源 | Medium
编辑 | 代码医生团队
最近写了一些脚本,自动数据来自公开的FracFocus数据库,这是一个政府运营的数据源,提供了在美国非常规石油和天然气完井工作中泵送的水力压裂化学品的全面列表。该数据库是一个很好的资源 - 不仅适用于公众,也适用于希望对其他运营商的完井策略进行竞争情报分析的石油和天然气公司。随着这个宝库的数据随时可用,想用它做一些分析。在今天的帖子中使用Python中的基本数据可视化包Matplotlib,分析了西德克萨斯州二叠纪盆地的运营商完井信息。
https://github.com/kperry2215/automated_fracfocus_data_pull/blob/master/web_scrape_fracfocus.py
http://fracfocus.org/
首先,关于数据本身的一些背景知识。FracFocus数据库包含几个用于数据输入的可选和非可选字段。字段包括井名,井标识号(称为API),纬度和经度坐标,操作员,完井的总垂直深度(TVD)(钻井下降的距离),压裂作业期间泵送的水量以及数量泵送的非水添加剂。此外,该数据库还包括泵送的不同类型的化学品和压裂砂,以及购买产品的供应商。下面提供了数据的快照:
一些FracFocus数据的快照
在网上抓取的FracFocus数据库是一个超过400万行的大型文件,因此创建了一个Python对象来过滤掉数据库(作为pandas数据帧上传到Python),使用运算符名称等特征,州和县:
class fracfocus_data_search:
"""
This class generates an object that is used the filter the master
fracfocus dataframe so it only contains a certain state/state abbreviation,
county (list), and/or operator
"""
def __init__(self, state=None, state_abbreviation=None, county_list=None,
operator=None):
#All data in initialize def optional depending on what kind of filtering
#we want to do
self.state = state
self.state_abbreviation = state_abbreviation
self.county_list=county_list
self.operator=operator
def filter_dataframe(self, df, column_state,
column_county, column_operator):
"""
Filter the data by parameters
"""
#Filter by all the parameters and return the subset
df_subset=df[
#By state
(((df[column_state].str.lower())==self.state.lower()) |
((df[column_state].str.lower())==self.state_abbreviation.lower())) &
#By county
((df[column_county].str.lower()).isin(map(str.lower,self.county_list))) &
#By operator
((df[column_operator].str.lower()).str.contains(self.operator.lower()))
]
return df_subset
为了本教程的目的,将数据过滤到西德克萨斯州二叠纪盆地(安德鲁斯,博登,克莱恩,道森,埃克特,艾迪,盖恩斯和格拉斯科克)的八个县,为运营商XTO非常规衍生埃克森公司):
#Pull all of the fracfocus data from a csv
fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
#Make all of the state column lowercase
fracfocus_filter=fracfocus_data_search(state='Texas', state_abbreviation='TX',
county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
'Ector', 'Eddy', 'Gaines', 'Glasscock'], operator='XTO')
#Filter dataframe by its parameters
subsetted_df=fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName',
column_county='CountyName', column_operator='OperatorName')
应用过滤器时,返回42,850行数据。显然XTO在二叠纪盆地是一个相当活跃的运营商。
接下来,想要想象一下XTO在每次完井期间泵送多少水。使用以下函数generate_plot()来创建随时间推移的基础水的可视化:
def generate_plot(df, x_variable, y_variables, plot_title):
"""
This function is used to map x- and y-variables against each other
Arguments:
df: Pandas dataframe.
x_variable: String. Name of the column that we want to set as the
x-variable in the plot
y_variables: string (single), or list of strings (multiple). Name(s)
of the column(s) that we want to set as the y-variable in the plot
Outputs:
Scatter plot in console.
"""
#Plot results
df.plot(x=x_variable, y=y_variables, title=plot_title)
plt.show()
"""
Main script execution here
"""
generate_plot(dataframe, x_variable='JobStartDate',
y_variables=['TotalBaseWaterVolume'],
plot_title='Total Base Water Volume for Fracs over Time')
generate_scatter_plot()函数的输出:每个压裂随时间泵送的总水量
那么上面的情节告诉什么呢?主要是随着时间的推移,XTO在每次压裂中泵送的水量变得更加激进,其最大压裂率从2014年泵送的每个压裂不到500万加仑增加到2018年每个压裂超过2500万加仑。
接下来,使用generate_plot()函数生成一个随时间推移压缩的非水添加剂图:
#Plot the 'TotalBaseNonWaterVolume' variable over time
generate_plot(dataframe, x_variable='JobStartDate',
y_variables=['TotalBaseNonWaterVolume'],
plot_title='Total Base Non-Water Volume for Fracs over Time')
每个压裂泵送的非水量总量随时间推移
这里没有明显的减产趋势。除了2019年中期的一个大型异常值外,数据看起来相当稳定。为了显示去除了异常值的非水添加剂数据的分布,使用以下函数生成盒须图:
def generate_boxplot(df, x_variable):
"""
This function generates a basic histogram of a column's data, with
outliers removed
Arguments:
df: Pandas dataframe
x_variable: String. Name of the column that we want to generate
boxplot from
Outputs:
Box plot in console.
"""
plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
plt.show()
...
#Execution in main script
generate_boxplot(dataframe, x_variable='TotalBaseNonWaterVolume')
二叠纪XTO每次压裂泵送的非水添加剂量的盒须图
看看上面的盒须图,看起来平均压裂有大约12,000加仑的非水添加剂,向上倾斜分布。
这个数据集的一个更有趣的方面是供应商信息,可以从中收集各种有趣的见解。例如,XTO从谁那里购买其化学品和压裂砂进行完井?公司是否随着时间推移首选供应商?使用以下代码生成条形图,以深入了解其中一些问题:
def generate_bar_graph(df, title):
"""
This function creates a bar graph from pandas dataframe columns.
Arguments:
df: Pandas dataframe. Index will be x-axis. Categories and
associated amounts are from columns
title: String. Name of the bar graph
Outputs:
Bar graph in console.
"""
df.plot.bar(rot=0)
plt.title(title, color='black')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()
...
#Execution in main script
generate_bar_graph(vendor_data_stacked_bar_graph_pivoted,
title='Number of Times Vendor Was Purchased From Over Quarter')
条形图说明了XTO在2018年从每个供应商处购买的次数
可以从上图中收集一些见解:
虽然上面的条形图受到时间和地理位置的限制,但它仍然提供了XTO完成和供应商战略的快照。有了这些信息,O&G公司可以收集重要的供应商见解,帮助在未来规划完成。
下面显示的本文代码也可以通过Github帐户获得。代码很容易定制,因此可以按资产和操作员进行过滤,只需最少的编辑:
https://github.com/kperry2215/analyze_fracfocus_data
import pandas as pd
import matplotlib.pyplot as plt
class fracfocus_data_search:
"""
This class generates an object that is used the filter the master
fracfocus dataframe so it only contains a certain state/state abbreviation,
county (list), and/or operator
"""
def __init__(self, state=None, state_abbreviation=None, county_list=None,
operator=None):
#All data in initialize def optional depending on what kind of filtering
#we want to do
self.state = state
self.state_abbreviation = state_abbreviation
self.county_list=county_list
self.operator=operator
def filter_dataframe(self, df, column_state,
column_county, column_operator):
"""
Filter the data by parameters
Arguments:
df: Pandas dataframe that we want to to subset by
column_state: String. Name of the column for state
column_county: String. Name of column for county
column_operator: String. Name of column for operator
Outputs:
df_subset: Pandas dataframe. Dataframe after it's been subsetted.
"""
#Filter by all the parameters and return the subset
df_subset=df[
#By state
(((df[column_state].str.lower())==self.state.lower()) |
((df[column_state].str.lower())==self.state_abbreviation.lower())) &
#By county
((df[column_county].str.lower()).isin(map(str.lower,self.county_list))) &
#By operator
((df[column_operator].str.lower()).str.contains(self.operator.lower()))
]
return df_subset
def clean_vendor_data(df, column, column_search_string, column_string_rename):
"""
This function is used to search the vendor data for a specific keyword, and
if it's found, change the row value to that specific keyword. Used to clean up the data
if a vendor is added in multiple different ways.
Arguments:
df: Pandas dataframe.
column: String. Name of the column that we're cleaning up
column_search_string: String. String that we're looking to match in the column
column_string_rename: String. What we want to rename the string to if we find
any string matches in the column
Outputs:
df: Pandas dataframe. Dataframe with returned cleaned up column
"""
df.loc[df[column].str.contains(column_search_string), column] = column_string_rename
return df
def generate_plot(df, x_variable, y_variables, plot_title):
"""
This function is used to map x- and y-variables against each other
Arguments:
df: Pandas dataframe.
x_variable: String. Name of the column that we want to set as the
x-variable in the plot
y_variables: string (single), or list of strings (multiple). Name(s)
of the column(s) that we want to set as the y-variable in the plot
Outputs:
Scatter plot in console.
"""
#Plot results
df.plot(x=x_variable, y=y_variables, title=plot_title)
plt.show()
def generate_boxplot(df, x_variable):
"""
This function generates a basic histogram of a column's data, with
outliers removed
Arguments:
df: Pandas dataframe
x_variable: String. Name of the column that we want to generate
boxplot from
Outputs:
Box plot in console.
"""
plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
plt.show()
def generate_bar_graph(df, title):
"""
This function creates a bar graph from pandas dataframe columns.
Arguments:
df: Pandas dataframe. Index will be x-axis. Categories and
associated amounts are from columns
title: String. Name of the bar graph
Outputs:
Bar graph in console.
"""
df.plot.bar(rot=0)
plt.title(title, color='black')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()
def main():
"""
Main definition, where we subset the data, analyze it, and generate plots from
"""
#Pull all of the fracfocus data from a csv
fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
#Make all of the state column lowercase
fracfocus_filter=fracfocus_data_search(state='Texas', state_abbreviation='TX',
county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
'Ector', 'Eddy', 'Gaines', 'Glasscock'],
operator='XTO')
#Filter dataframe by its parameters
subsetted_df=fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName',
column_county='CountyName', column_operator='OperatorName')
#Convert the data column to a pandas datetime
subsetted_df['JobStartDate']=pd.to_datetime(subsetted_df['JobStartDate'])
#Now that we have our desired dataframe, it's time to analyze the data
#Let's calculate the average amount of liquids that the operator pumps in fracs
#over time
basic_frac_characteristics=subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber',
'TotalBaseNonWaterVolume', 'TVD',
'TotalBaseWaterVolume',
'Latitude', 'Longitude']].drop_duplicates()
#Plot the 'TotalBaseWaterVolume' variable over time
generate_plot(basic_frac_characteristics, x_variable='JobStartDate',
y_variables=['TotalBaseWaterVolume'],
plot_title='Total Base Water Volume for Fracs over Time')
#Plot the 'TotalBaseNonWaterVolume' variable over time
generate_plot(basic_frac_characteristics, x_variable='JobStartDate',
y_variables=['TotalBaseNonWaterVolume'],
plot_title='Total Base Non-Water Volume for Fracs over Time')
#Plot the distribution of TotalBaseNoneWaterVolume using a box-and-whisker plot, with outliers removed
generate_boxplot(basic_frac_characteristics, x_variable='TotalBaseNonWaterVolume')
#PERFORM VENDOR ANALYSIS BELOW
#Subset the data set to get unique rows for vendor data
vendor_data=subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber',
'Latitude', 'Longitude', 'Supplier', 'TradeName']].drop_duplicates()
#PERFORM SOME DATA CLEANING ON THE VENDOR DATA
#Remove NaN supplier values
vendor_data=vendor_data.dropna(subset=['Supplier'])
#Make all Supplier data uppercase
vendor_data['Supplier']=vendor_data['Supplier'].str.upper()
#Use the clean_vendor_data() function to clean up the vendor categories so they're more standardized
vendor_lookup_dict={'RISING STAR': 'RISING STAR',
'CHEMPLEX': 'CHEMPLEX',
'SAN.*TROL': 'SANDTROL',
'MULTI.*CHEM': 'MULTI-CHEM',
'XTO': 'OPERATOR',
'PFP': 'PFP',
'CESI': 'CESI',
'NALCO': 'NALCO',
'FRITZ': 'FRITZ INDUSTRIES',
'ASK': 'ASK',
'ACE': 'ACE',
'BRENNTAG': 'BRENNTAG',
'COIL.*CHEM': 'COILCHEM',
'COOPER': 'COOPER NATURAL RESOURCES',
'ECONOMY': 'ECONOMY POLYMERS',
'FINORIC': 'FINORIC',
'EES': 'ENVIRONMENTAL ENERGY SERVICE',
'PREFERRED': 'PREFERRED SANDS',
'ROCKWATER': 'ROCKWATER',
'SNF': 'SNF',
'MULTIPLE': 'MULTIPLE SUPPLIERS',
'REAGENT': 'REAGENT',
'PRO.*FRAC': 'PROFRAC'}
#Loop through the dictionary and change the name accordingly based on character lookups
for vendor_lookup, vendor_name in vendor_lookup_dict.items():
vendor_data=clean_vendor_data(vendor_data, 'Supplier', vendor_lookup, vendor_name)
#Make a column converting data to monthly (this will be the x-axis for the bar graph)
vendor_data['JobStartDateQuarter'] = vendor_data['JobStartDate'].dt.to_period('Q')
#Subset to only include 2018 data
vendor_data=vendor_data[(vendor_data.JobStartDate>=pd.to_datetime('2018-01-01')) &
(vendor_data.JobStartDate<=pd.to_datetime('2019-01-01'))]
#Count the number of purchases that the operator makes over time from each vendor
vendor_data['NumberTimesSupplierUsed']=vendor_data[['JobStartDateQuarter', 'Supplier']].groupby([
'JobStartDateQuarter', 'Supplier'])['Supplier'].transform('count')
#Subset the data to only include vendor counts by quarter
vendor_data_stacked_bar_graph=vendor_data[['Supplier', 'JobStartDateQuarter',
'NumberTimesSupplierUsed']].drop_duplicates()
#Pivot the data from long to wide format
vendor_data_stacked_bar_graph_pivoted=vendor_data_stacked_bar_graph.pivot(
index='JobStartDateQuarter',
columns='Supplier',
values='NumberTimesSupplierUsed').fillna(0)
#Filter to only include suppliers purchased from more than 20 times
total_col_sums=pd.DataFrame(vendor_data_stacked_bar_graph_pivoted.sum()>=20)
vendor_data_stacked_bar_graph_pivoted=vendor_data_stacked_bar_graph_pivoted.drop(list(
total_col_sums[total_col_sums[0]==False].index), axis=1)
#Plot the vendor data in a bar chart
generate_bar_graph(vendor_data_stacked_bar_graph_pivoted,
title='Number of Times Vendor Was Purchased From Over Quarter')
#Run main
if __name__== "__main__":
main()
一如既往,感谢阅读!