使用公开可用的FracFocus数据和Python的Matplotlib函数可视化二叠纪盆地石油和天然气公司的完井策略

代码医生工作室

发布于 2019-07-16 15:48:37

5990

发布于 2019-07-16 15:48:37

文章被收录于专栏：相约机器人

作者 | Kirsten Perry

来源 | Medium

编辑 | 代码医生团队

最近写了一些脚本，自动数据来自公开的FracFocus数据库，这是一个政府运营的数据源，提供了在美国非常规石油和天然气完井工作中泵送的水力压裂化学品的全面列表。该数据库是一个很好的资源 - 不仅适用于公众，也适用于希望对其他运营商的完井策略进行竞争情报分析的石油和天然气公司。随着这个宝库的数据随时可用，想用它做一些分析。在今天的帖子中使用Python中的基本数据可视化包Matplotlib，分析了西德克萨斯州二叠纪盆地的运营商完井信息。

https://github.com/kperry2215/automated_fracfocus_data_pull/blob/master/web_scrape_fracfocus.py

http://fracfocus.org/

首先，关于数据本身的一些背景知识。FracFocus数据库包含几个用于数据输入的可选和非可选字段。字段包括井名，井标识号（称为API），纬度和经度坐标，操作员，完井的总垂直深度（TVD）（钻井下降的距离），压裂作业期间泵送的水量以及数量泵送的非水添加剂。此外，该数据库还包括泵送的不同类型的化学品和压裂砂，以及购买产品的供应商。下面提供了数据的快照：

一些FracFocus数据的快照

在网上抓取的FracFocus数据库是一个超过400万行的大型文件，因此创建了一个Python对象来过滤掉数据库（作为pandas数据帧上传到Python），使用运算符名称等特征，州和县：

class fracfocus_data_search:
    """
    This class generates an object that is used the filter the master
    fracfocus dataframe so it only contains a certain state/state abbreviation,
    county (list), and/or operator
    """
 
    def __init__(self, state=None, state_abbreviation=None, county_list=None,
                 operator=None):
        #All data in initialize def optional depending on what kind of filtering
        #we want to do
        self.state = state
        self.state_abbreviation = state_abbreviation
        self.county_list=county_list
        self.operator=operator
 
    def filter_dataframe(self, df, column_state,
                        column_county, column_operator):
        """
        Filter the data by parameters
        """
        #Filter by all the parameters and return the subset
        df_subset=df[
                #By state
                (((df[column_state].str.lower())==self.state.lower()) |
                ((df[column_state].str.lower())==self.state_abbreviation.lower())) &
                #By county
                ((df[column_county].str.lower()).isin(map(str.lower,self.county_list))) &
                #By operator
                ((df[column_operator].str.lower()).str.contains(self.operator.lower()))  
                ]
        return df_subset

为了本教程的目的，将数据过滤到西德克萨斯州二叠纪盆地（安德鲁斯，博登，克莱恩，道森，埃克特，艾迪，盖恩斯和格拉斯科克）的八个县，为运营商XTO非常规衍生埃克森公司）：

#Pull all of the fracfocus data from a csv
fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
#Make all of the state column lowercase
fracfocus_filter=fracfocus_data_search(state='Texas', state_abbreviation='TX',
                          county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
                                       'Ector', 'Eddy', 'Gaines', 'Glasscock'], operator='XTO')
#Filter dataframe by its parameters
subsetted_df=fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName',
                        column_county='CountyName', column_operator='OperatorName')

应用过滤器时，返回42,850行数据。显然XTO在二叠纪盆地是一个相当活跃的运营商。

接下来，想要想象一下XTO在每次完井期间泵送多少水。使用以下函数generate_plot（）来创建随时间推移的基础水的可视化：

def generate_plot(df, x_variable, y_variables, plot_title):
    """
    This function is used to map x- and y-variables against each other
    Arguments:
        df: Pandas dataframe.
        x_variable: String. Name of the column that we want to set as the
        x-variable in the plot
        y_variables: string (single), or list of strings (multiple). Name(s)
        of the column(s) that we want to set as the y-variable in the plot
    Outputs:
        Scatter plot in console.
    """
    #Plot results
    df.plot(x=x_variable, y=y_variables, title=plot_title)
    plt.show()
 
"""
Main script execution here
"""
 
generate_plot(dataframe, x_variable='JobStartDate',
              y_variables=['TotalBaseWaterVolume'],
              plot_title='Total Base Water Volume for Fracs over Time')

generate_scatter_plot（）函数的输出：每个压裂随时间泵送的总水量

那么上面的情节告诉什么呢？主要是随着时间的推移，XTO在每次压裂中泵送的水量变得更加激进，其最大压裂率从2014年泵送的每个压裂不到500万加仑增加到2018年每个压裂超过2500万加仑。

接下来，使用generate_plot（）函数生成一个随时间推移压缩的非水添加剂图：

#Plot the 'TotalBaseNonWaterVolume' variable over time
generate_plot(dataframe, x_variable='JobStartDate',
              y_variables=['TotalBaseNonWaterVolume'],
              plot_title='Total Base Non-Water Volume for Fracs over Time')

每个压裂泵送的非水量总量随时间推移

这里没有明显的减产趋势。除了2019年中期的一个大型异常值外，数据看起来相当稳定。为了显示去除了异常值的非水添加剂数据的分布，使用以下函数生成盒须图：

def generate_boxplot(df, x_variable):
    """
    This function generates a basic histogram of a column's data, with
    outliers removed
    Arguments:
        df: Pandas dataframe
        x_variable: String. Name of the column that we want to generate
        boxplot from
    Outputs:
        Box plot in console.
    """  
    plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
    plt.show()
 
...
#Execution in main script
generate_boxplot(dataframe, x_variable='TotalBaseNonWaterVolume')

二叠纪XTO每次压裂泵送的非水添加剂量的盒须图

看看上面的盒须图，看起来平均压裂有大约12,000加仑的非水添加剂，向上倾斜分布。

这个数据集的一个更有趣的方面是供应商信息，可以从中收集各种有趣的见解。例如，XTO从谁那里购买其化学品和压裂砂进行完井？公司是否随着时间推移首选供应商？使用以下代码生成条形图，以深入了解其中一些问题：

def generate_bar_graph(df, title):
    """
    This function creates a bar graph from pandas dataframe columns.
    Arguments:
        df: Pandas dataframe. Index will be x-axis. Categories and
        associated amounts are from columns
        title: String. Name of the bar graph
    Outputs:
        Bar graph in console.
    """
    df.plot.bar(rot=0)
    plt.title(title, color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()
 
...
#Execution in main script
generate_bar_graph(vendor_data_stacked_bar_graph_pivoted,
                       title='Number of Times Vendor Was Purchased From Over Quarter')

条形图说明了XTO在2018年从每个供应商处购买的次数

可以从上图中收集一些见解：

在2018年第二季度，使用P3作为首选供应商，看起来XTO大量转向。在2018年第一季度，P3根本没有使用。
在整个2018年，XTO在购买完井产品时始终依赖以下供应商：Ace，Chemplex，Finoric，Nalco和Sandtrol。
看起来Universal Pressure Pumping在2018年第二季度作为完井供应商进行了试验。但是，它没有在Q2之前或之后使用，因此很可能没有成为数据被过滤的地理位置内的首选供应商。

虽然上面的条形图受到时间和地理位置的限制，但它仍然提供了XTO完成和供应商战略的快照。有了这些信息，O＆G公司可以收集重要的供应商见解，帮助在未来规划完成。

下面显示的本文代码也可以通过Github帐户获得。代码很容易定制，因此可以按资产和操作员进行过滤，只需最少的编辑：

https://github.com/kperry2215/analyze_fracfocus_data
 
import pandas as pd
import matplotlib.pyplot as plt
 
class fracfocus_data_search:
    """
    This class generates an object that is used the filter the master
    fracfocus dataframe so it only contains a certain state/state abbreviation,
    county (list), and/or operator
    """
 
    def __init__(self, state=None, state_abbreviation=None, county_list=None,
                 operator=None):
        #All data in initialize def optional depending on what kind of filtering
        #we want to do
        self.state = state
        self.state_abbreviation = state_abbreviation
        self.county_list=county_list
        self.operator=operator
 
    def filter_dataframe(self, df, column_state,
                        column_county, column_operator):
        """
        Filter the data by parameters
        Arguments:
            df: Pandas dataframe that we want to to subset by
            column_state: String. Name of the column for state
            column_county: String. Name of column for county
            column_operator: String. Name of column for operator
        Outputs:
            df_subset: Pandas dataframe. Dataframe after it's been subsetted.
        """
        #Filter by all the parameters and return the subset
        df_subset=df[
                #By state
                (((df[column_state].str.lower())==self.state.lower()) |
                ((df[column_state].str.lower())==self.state_abbreviation.lower())) &
                #By county
                ((df[column_county].str.lower()).isin(map(str.lower,self.county_list))) &
                #By operator
                ((df[column_operator].str.lower()).str.contains(self.operator.lower()))  
                ]
        return df_subset
 
def clean_vendor_data(df, column, column_search_string, column_string_rename):
    """
    This function is used to search the vendor data for a specific keyword, and
    if it's found, change the row value to that specific keyword. Used to clean up the data
    if a vendor is added in multiple different ways.
    Arguments:
        df: Pandas dataframe.
        column: String. Name of the column that we're cleaning up
        column_search_string: String. String that we're looking to match in the column
        column_string_rename: String. What we want to rename the string to if we find
        any string matches in the column
    Outputs:
        df: Pandas dataframe. Dataframe with returned cleaned up column
    """
    df.loc[df[column].str.contains(column_search_string), column] = column_string_rename
    return df
 
def generate_plot(df, x_variable, y_variables, plot_title):
    """
    This function is used to map x- and y-variables against each other
    Arguments:
        df: Pandas dataframe.
        x_variable: String. Name of the column that we want to set as the
        x-variable in the plot
        y_variables: string (single), or list of strings (multiple). Name(s)
        of the column(s) that we want to set as the y-variable in the plot
    Outputs:
        Scatter plot in console.
    """
    #Plot results
    df.plot(x=x_variable, y=y_variables, title=plot_title)
    plt.show()
 
def generate_boxplot(df, x_variable):
    """
    This function generates a basic histogram of a column's data, with
    outliers removed
    Arguments:
        df: Pandas dataframe
        x_variable: String. Name of the column that we want to generate
        boxplot from
    Outputs:
        Box plot in console.
    """  
    plt.boxplot(list(df[x_variable].dropna()), showfliers=False)
    plt.show()
    
def generate_bar_graph(df, title):
    """
    This function creates a bar graph from pandas dataframe columns.
    Arguments:
        df: Pandas dataframe. Index will be x-axis. Categories and
        associated amounts are from columns
        title: String. Name of the bar graph
    Outputs:
        Bar graph in console.
    """
    df.plot.bar(rot=0)
    plt.title(title, color='black')
    plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
    plt.show()
    
def main():
    """
    Main definition, where we subset the data, analyze it, and generate plots from
    """
    #Pull all of the fracfocus data from a csv
    fracfocus_registry=pd.read_csv('fracfocus_data_example.csv', low_memory=False)
    #Make all of the state column lowercase
    fracfocus_filter=fracfocus_data_search(state='Texas', state_abbreviation='TX',
                          county_list=['Andrews', 'Borden', 'Crane', 'Dawson',
                                       'Ector', 'Eddy', 'Gaines', 'Glasscock'],
                          operator='XTO')
    #Filter dataframe by its parameters
    subsetted_df=fracfocus_filter.filter_dataframe(fracfocus_registry, column_state='StateName',
                        column_county='CountyName', column_operator='OperatorName')
    #Convert the data column to a pandas datetime
    subsetted_df['JobStartDate']=pd.to_datetime(subsetted_df['JobStartDate'])
    #Now that we have our desired dataframe, it's time to analyze the data
    #Let's calculate the average amount of liquids that the operator pumps in fracs
    #over time
    basic_frac_characteristics=subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber',
                                             'TotalBaseNonWaterVolume', 'TVD',
                                             'TotalBaseWaterVolume',
                                             'Latitude', 'Longitude']].drop_duplicates()
    #Plot the 'TotalBaseWaterVolume' variable over time
    generate_plot(basic_frac_characteristics, x_variable='JobStartDate',
              y_variables=['TotalBaseWaterVolume'],
              plot_title='Total Base Water Volume for Fracs over Time')
    #Plot the 'TotalBaseNonWaterVolume' variable over time
    generate_plot(basic_frac_characteristics, x_variable='JobStartDate',
              y_variables=['TotalBaseNonWaterVolume'],
              plot_title='Total Base Non-Water Volume for Fracs over Time')
    #Plot the distribution of TotalBaseNoneWaterVolume using a box-and-whisker plot, with outliers removed
    generate_boxplot(basic_frac_characteristics, x_variable='TotalBaseNonWaterVolume')
    
    #PERFORM VENDOR ANALYSIS BELOW
    #Subset the data set to get unique rows for vendor data
    vendor_data=subsetted_df[['JobStartDate', 'JobEndDate', 'APINumber',
                              'Latitude', 'Longitude', 'Supplier', 'TradeName']].drop_duplicates()
    #PERFORM SOME DATA CLEANING ON THE VENDOR DATA
    #Remove NaN supplier values
    vendor_data=vendor_data.dropna(subset=['Supplier'])
    #Make all Supplier data uppercase
    vendor_data['Supplier']=vendor_data['Supplier'].str.upper()
    #Use the clean_vendor_data() function to clean up the vendor categories so they're more standardized
    vendor_lookup_dict={'RISING STAR': 'RISING STAR',
                             'CHEMPLEX': 'CHEMPLEX',
                             'SAN.*TROL': 'SANDTROL',
                             'MULTI.*CHEM': 'MULTI-CHEM',
                             'XTO': 'OPERATOR',
                             'PFP': 'PFP',
                             'CESI': 'CESI',
                             'NALCO': 'NALCO',
                             'FRITZ': 'FRITZ INDUSTRIES',
                             'ASK': 'ASK',
                             'ACE': 'ACE',
                             'BRENNTAG': 'BRENNTAG',
                             'COIL.*CHEM': 'COILCHEM',
                             'COOPER': 'COOPER NATURAL RESOURCES',
                             'ECONOMY': 'ECONOMY POLYMERS',
                             'FINORIC': 'FINORIC',
                             'EES': 'ENVIRONMENTAL ENERGY SERVICE',
                             'PREFERRED': 'PREFERRED SANDS',
                             'ROCKWATER': 'ROCKWATER',
                             'SNF': 'SNF',
                             'MULTIPLE': 'MULTIPLE SUPPLIERS',
                             'REAGENT': 'REAGENT',
                             'PRO.*FRAC': 'PROFRAC'}
    #Loop through the dictionary and change the name accordingly based on character lookups
    for vendor_lookup, vendor_name in vendor_lookup_dict.items():
        vendor_data=clean_vendor_data(vendor_data, 'Supplier', vendor_lookup, vendor_name)
    #Make a column converting data to monthly (this will be the x-axis for the bar graph)
    vendor_data['JobStartDateQuarter'] = vendor_data['JobStartDate'].dt.to_period('Q')
    #Subset to only include 2018 data
    vendor_data=vendor_data[(vendor_data.JobStartDate>=pd.to_datetime('2018-01-01')) &
                            (vendor_data.JobStartDate<=pd.to_datetime('2019-01-01'))]
    #Count the number of purchases that the operator makes over time from each vendor
    vendor_data['NumberTimesSupplierUsed']=vendor_data[['JobStartDateQuarter', 'Supplier']].groupby([
            'JobStartDateQuarter', 'Supplier'])['Supplier'].transform('count')
    #Subset the data to only include vendor counts by quarter
    vendor_data_stacked_bar_graph=vendor_data[['Supplier', 'JobStartDateQuarter',
                                    'NumberTimesSupplierUsed']].drop_duplicates()
    #Pivot the data from long to wide format
    vendor_data_stacked_bar_graph_pivoted=vendor_data_stacked_bar_graph.pivot(
                                                            index='JobStartDateQuarter',
                                                            columns='Supplier',
                                                            values='NumberTimesSupplierUsed').fillna(0)
    #Filter to only include suppliers purchased from more than 20 times
    total_col_sums=pd.DataFrame(vendor_data_stacked_bar_graph_pivoted.sum()>=20)
    vendor_data_stacked_bar_graph_pivoted=vendor_data_stacked_bar_graph_pivoted.drop(list(
            total_col_sums[total_col_sums[0]==False].index), axis=1)
    #Plot the vendor data in a bar chart
    generate_bar_graph(vendor_data_stacked_bar_graph_pivoted,
                       title='Number of Times Vendor Was Purchased From Over Quarter')
    
#Run main
if __name__== "__main__":
    main()

一如既往，感谢阅读！

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2019-07-15，如有侵权请联系 cloudcommunity@tencent.com 删除

数据库