I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the code below, if for no other reason than the level/depth of chaining I favor. However, I'm learning Ruby, so I'd also love to write beautiful Ruby code. My first simple project is a Craigslist cross-region search script.

The full code is on GitHub, but it breaks down into the problem snippets below.
# Dependencies for the snippets below.
require 'uri'
require 'net/http'
require 'nokogiri'
require 'json'
require 'date'

def get_us_regions()
  # Accumulator for building up the returned object.
  results = {}
  # Important URLs
  sitelist = URI('http://www.craigslist.org/about/sites')
  geospatial = URI('http://www.craigslist.org/about/areas.json')
  # Get a collection of nodes for the US regions out of craigslist's site list.
  usregions = Nokogiri::HTML(Net::HTTP.get(sitelist)).search("a[name=US]").first().parent().next_element().search('a')
  # Parse out the information to build a usable representation.
  usregions.each { |usregion|
    hostname = usregion.attr('href').gsub('http://','').gsub('.craigslist.org','')
    results[hostname] = { name: usregion.content, state: usregion.parent().parent().previous_element().content }
  }
  # Merge that information with craigslist's geographic information.
  areas = JSON.parse(Net::HTTP.get(geospatial))
  areas.each { |area|
    if results[area["hostname"]]
      results[area["hostname"]][:stateabbrev] = area["region"]
      results[area["hostname"]][:latitude] = area["lat"]
      results[area["hostname"]][:longitude] = area["lon"]
    end
  }
  # This is a complete list of the US regions, keyed off of their hostname.
  return results
end
# Perform a search in a particular region.
def search_region(regionhostname, query)
  # In case there are multiple pages of results from a search
  pages = []
  pagecount = false
  # An accumulator for storing what we need to return.
  result = []
  # Make requests for every page.
  while (pages.length != pagecount)
    # End up with a start of "0" on the first time, 100 is craigslist's page length.
    page = pages.length * 100
    # Here is the URL we'll be making the request of.
    url = URI("http://#{regionhostname}.craigslist.org/search/cto?query=#{query}&srchType=T&s=#{page}")
    # Get the response and parse it.
    pages << Nokogiri::HTML(Net::HTTP.get(url))
    # If this is the first time through
    if (pagecount == false)
      # check to make sure there are results.
      if pages.last().search('.resulttotal').length() != 0
        # There are results, and we need to see if additional requests are necessary.
        pagecount = (pages.last().search('.resulttotal').first().content().gsub(/[^0-9]/,'').to_i / 100.0).ceil
      else
        # There are no results, we're done here.
        return []
      end
    end
  end
  # Go through each of the pages of results and process the listings
  pages.each { |page|
    # Go through all of the listings on each page
    page.search('.row').each { |listing|
      # Skip listings from other regions in case there are any ("FEW LOCAL RESULTS FOUND").
      if listing.search('a[href^=http]').length() != 0
        next
      end
      # Parse information out of the listing.
      car = {}
      car["id"] = listing["data-pid"]
      car["date"] = listing.search(".date").length() == 1 ? Date.parse(listing.search(".date").first().content) : nil
      # When Craigslist wraps at the end of the year it doesn't add a year field.
      # Fortunately Craigslist has an approximately one-month time limit that makes it easy to know which year is being referred to.
      # Overshooting by incrementing the month to make sure that timezone differences between this and CL servers don't result in weirdness.
      # (Guarded against nil: the date cell can be missing, in which case car["date"] is nil.)
      if car["date"] && car["date"].month > Date.today.month + 1
        car["date"] = car["date"].prev_year
      end
      car["link"] = "http://#{regionhostname}.craigslist.org/cto/#{car['id']}.html"
      car["description"] = listing.search(".pl > a").length() == 1 ? listing.search(".pl > a").first().content : nil
      car["price"] = listing.search("span.price").length() == 1 ? listing.search("span.price").first().content : nil
      car["location"] = listing.search(".l2 small").length() == 1 ? listing.search(".l2 small").first().content.gsub(/[\(\)]/,'').strip : nil
      car["longitude"] = listing["data-longitude"]
      car["latitude"] = listing["data-latitude"]
      # Pull car model year from description
      # Can be wrong, but likely to be accurate.
      if /(?:\b19[0-9]{2}\b|\b20[0-9]{2}\b|\b[0-9]{2}\b)/.match(car["description"]) { |result|
        # Two digit year
        if result[0].length == 2
          # Not an arbitrary wrapping point like it is in MySQL, etc.
          # Cars have known manufacture dates and can't be too far in the future.
          if result[0].to_i <= Date.today.strftime("%y").to_i + 1
            car["year"] = "20#{result[0]}"
          else
            car["year"] = "19#{result[0]}"
          end
        # Four digit year is easy.
        elsif result[0].length == 4
          car["year"] = result[0]
        end
      }
      else
        car["year"] = nil
      end
      # Store the region lookup key.
      car["regionhostname"] = regionhostname
      result << car
    }
  }
  return result
end
def search(query)
  results = []
  # Get a copy of the regions we're going to search.
  regions = get_us_regions()
  # Divide the requests to each region across the "right" number of threads.
  iterations = 5
  count = (regions.length/iterations.to_f).ceil
  # Spin up the threads!
  (0..(iterations-1)).each { |iteration|
    threads = []
    # Protect against source exhaustion
    if iteration * count > regions.length()
      next
    end
    # Split the requests by region.
    regions.keys.slice(iteration*count,count).each { |regionhostname|
      threads << Thread.new(regionhostname) { |activeregionhostname|
        # New block for proper scoping of regionhostname
        results << search_region(activeregionhostname, query)
      }
    }
    # Wait until all threads are complete before kicking off the next set.
    threads.each { |thread| thread.join }
  }
  # From search_region we return an array, which means we need to flatten(1) to pull everything up to the top level.
  results = results.flatten(1)
  # Sort the search results by date, descending.
  results.sort! { |a,b|
    if a["date"] == b["date"]
      b["id"].to_i <=> a["id"].to_i
    else
      b["date"] <=> a["date"]
    end
  }
  return results
end

puts search("TDI").to_json
The code works; you can use it to get results from every region when searching Craigslist for cars, which is great for rare/hard-to-find vehicles. I wish the threading were better and spread the multiple paginated requests across different threads too, but I'd need some kind of pool to handle that (a sketch of that idea follows). Eventually I'm considering putting Rack in front of this as a simple vehicle-search API. Or it could get smarter and store the results in a database to track prices over time, producing better-educated sellers and buyers, or flagging good deals.
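A minimal sketch of that pooling idea (hypothetical: run_pool and the job lambdas are names introduced here, not part of the script): a fixed set of worker threads pulls jobs off a Queue, so the paginated requests from every region could share the same workers.

require 'thread'

# A minimal fixed-size worker pool; each job is a lambda pushed onto a queue.
def run_pool(jobs, size = 5)
  queue = Queue.new
  jobs.each { |job| queue << job }
  results = Queue.new
  workers = Array.new(size) do
    Thread.new do
      # Queue#pop(true) is non-blocking and raises ThreadError when empty.
      while (job = (queue.pop(true) rescue nil))
        results << job.call
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Hypothetical usage: one job per region, rather than fixed batches of threads.
# all = run_pool(regions.keys.map { |host| lambda { search_region(host, "TDI") } })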
Posted on 2013-06-27 12:55:48
Long, long question! I'll take your first snippet and let others handle the rest. First, some comments on your code:

- `def get_us_regions()`: writing the `()` on a method that takes no parameters is not idiomatic.
- `first()`: neither is writing `()` on calls with no arguments.
- `results = {}`: I've written a lot on CR about this subject, so I'll just give the link: Functional Programming with Ruby. (A minimal sketch of the idea is right below.)
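For illustration (with a made-up links array, not data from your script), building a hash without mutating an accumulator:

require 'facets'

links = [
  { host: 'sfbay',   name: 'SF bay area' },
  { host: 'newyork', name: 'new york' },
]

# Imperative style: mutate an accumulator in place.
results = {}
links.each { |link| results[link[:host]] = link[:name] }

# Functional style: derive the hash as a single expression
# (Facets' Enumerable#mash; in newer Rubies: links.map { ... }.to_h).
results = links.mash { |link| [link[:host], link[:name]] }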
- `# Important URLs`: not sure they're important enough to deserve a comment :-)
- `Nokogiri::HTML(...) ...`: expressions can be chained on and on, but you have to decide when to break them up and give meaningful names. I'd split this one into at least two sub-expressions, e.g. the split sketched below.
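Something along these lines (your same chain, just broken at the parsed document):

doc = Nokogiri::HTML(Net::HTTP.get(sitelist))
usregions = doc.search("a[name=US]").first.parent.next_element.search('a')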
- `gsub('http://','').gsub('.craigslist.org','')`: use the URI module instead of manipulating URLs by hand, as sketched below.
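For example (with a made-up URL):

require 'uri'

url = 'http://seattle.craigslist.org'
# Hand-rolled string surgery, as in the original:
url.gsub('http://', '').gsub('.craigslist.org', '')  # => "seattle"
# Parsing with URI and taking the first label of the host:
URI.parse(url).host.split('.').first                 # => "seattle"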
- `results[area["hostname"]][:stateabbrev]`: again, a functional approach to these expressions makes them more concise and clear.
- `return results`: explicit `return`s are not idiomatic.
- `def get_us_regions`: when it's this trivial to make a method configurable, take the country as a parameter here -> `def get_regions(country_code)`.

Now, here's how I would write this method. First, I'd use Facets, an excellent library with many cool abstractions that are not provided in core:
require 'uri'
require 'nokogiri'
require 'net/http'
require 'json'
require 'facets'

module CraigsList
  SitelistUrl = 'http://www.craigslist.org/about/sites'
  GeospatialUrl = 'http://www.craigslist.org/about/areas.json'

  # Return hash of pairs (hostname, {:name, :state, :stateabbr, :latitude, :longitude})
  # for US regions in craigslist.
  def self.get_regions(country_code)
    doc = Nokogiri::HTML(Net::HTTP.get(URI(SitelistUrl)))
    usregions = doc.search("a[name=#{country_code}]").first.parent.next_element.search('a')
    state_info = usregions.mash do |usregion|
      hostname = URI.parse(usregion.attr('href')).host.split(".").first
      state = usregion.parent.parent.previous_element.content
      info = {name: usregion.content, state: state}
      [hostname, info]
    end
    areas = JSON.parse(Net::HTTP.get(URI(GeospatialUrl)))
    geo_info = areas.slice(*state_info.keys).mash do |area|
      info = {stateabbrev: area["region"], latitude: area["lat"], longitude: area["lon"]}
      [area["hostname"], info]
    end
    state_info.deep_merge(geo_info)
  end
end
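Assuming the nokogiri and facets gems are installed, the rewritten method could be exercised like this (hypothetical usage, not part of the answer above):

regions = CraigsList.get_regions("US")
regions.each do |hostname, info|
  puts "#{hostname}: #{info[:name]}, #{info[:state]}"
end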
You mention that you write JavaScript code. The nice thing about the functional approach is that the code looks pretty much the same in any language (syntax differences aside), provided the language has minimal functional capabilities. In JS (though the FP style is friendlier in CoffeeScript) with Underscore (+ custom abstractions as mixins) you could write this very same code.
https://codereview.stackexchange.com/questions/27832