I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the code below, if for no other reason than the level/depth of chaining I favor. However, I'm learning Ruby, so I'd also love to write beautiful Ruby code. My first simple project is a Craigslist cross-region search script.

The full code is on GitHub, but it breaks down into the problem snippets below.
# Dependencies for the snippets below.
require 'uri'
require 'net/http'
require 'nokogiri'
require 'json'
require 'date'

def get_us_regions()
  # Accumulator for building up the returned object.
  results = {}
  # Important URLs
  sitelist = URI('http://www.craigslist.org/about/sites')
  geospatial = URI('http://www.craigslist.org/about/areas.json')
  # Get a collection of nodes for the US regions out of craigslist's site list.
  usregions = Nokogiri::HTML(Net::HTTP.get(sitelist)).search("a[name=US]").first().parent().next_element().search('a')
  # Parse out the information to build a usable representation.
  usregions.each { |usregion|
    hostname = usregion.attr('href').gsub('http://','').gsub('.craigslist.org','')
    results[hostname] = { name: usregion.content, state: usregion.parent().parent().previous_element().content }
  }
  # Merge that information with craigslist's geographic information.
  areas = JSON.parse(Net::HTTP.get(geospatial))
  areas.each { |area|
    if results[area["hostname"]]
      results[area["hostname"]][:stateabbrev] = area["region"]
      results[area["hostname"]][:latitude] = area["lat"]
      results[area["hostname"]][:longitude] = area["lon"]
    end
  }
  # This is a complete list of the US regions, keyed off of their hostname.
  return results
end
# Perform a search in a particular region.
def search_region(regionhostname, query)
  # In case there are multiple pages of results from a search
  pages = []
  pagecount = false
  # An accumulator for storing what we need to return.
  result = []
  # Make requests for every page.
  while (pages.length != pagecount)
    # End up with a start of "0" on the first time, 100 is craigslist's page length.
    page = pages.length * 100
    # Here is the URL we'll be making the request of.
    url = URI("http://#{regionhostname}.craigslist.org/search/cto?query=#{query}&srchType=T&s=#{page}")
    # Get the response and parse it.
    pages << Nokogiri::HTML(Net::HTTP.get(url))
    # If this is the first time through
    if (pagecount == false)
      # check to make sure there are results.
      if pages.last().search('.resulttotal').length() != 0
        # There are results, and we need to see if additional requests are necessary.
        pagecount = (pages.last().search('.resulttotal').first().content().gsub(/[^0-9]/,'').to_i / 100.0).ceil
      else
        # There are no results, we're done here.
        return []
      end
    end
  end
  # Go through each of the pages of results and process the listings
  pages.each { |page|
    # Go through all of the listings on each page
    page.search('.row').each { |listing|
      # Skip listings from other regions in case there are any ("FEW LOCAL RESULTS FOUND").
      if listing.search('a[href^=http]').length() != 0
        next
      end
      # Parse information out of the listing.
      car = {}
      car["id"] = listing["data-pid"]
      car["date"] = listing.search(".date").length() == 1 ? Date.parse(listing.search(".date").first().content) : nil
      # When Craigslist wraps at the end of the year it doesn't add a year field.
      # Fortunately Craigslist has an approximately one-month time limit that makes it easy to know which year is being referred to.
      # Overshooting by incrementing the month to make sure that timezone differences between this and CL servers don't result in weirdness.
      # (Guarded against nil: the date cell can be missing, in which case car["date"] is nil.)
      if car["date"] && car["date"].month > Date.today.month + 1
        car["date"] = car["date"].prev_year
      end
      car["link"] = "http://#{regionhostname}.craigslist.org/cto/#{car['id']}.html"
      car["description"] = listing.search(".pl > a").length() == 1 ? listing.search(".pl > a").first().content : nil
      car["price"] = listing.search("span.price").length() == 1 ? listing.search("span.price").first().content : nil
      car["location"] = listing.search(".l2 small").length() == 1 ? listing.search(".l2 small").first().content.gsub(/[\(\)]/,'').strip : nil
      car["longitude"] = listing["data-longitude"]
      car["latitude"] = listing["data-latitude"]
      # Pull car model year from description
      # Can be wrong, but likely to be accurate.
      if /(?:\b19[0-9]{2}\b|\b20[0-9]{2}\b|\b[0-9]{2}\b)/.match(car["description"]) { |result|
        # Two digit year
        if result[0].length == 2
          # Not an arbitrary wrapping point like it is in MySQL, etc.
          # Cars have known manufacture dates and can't be too far in the future.
          if result[0].to_i <= Date.today.strftime("%y").to_i + 1
            car["year"] = "20#{result[0]}"
          else
            car["year"] = "19#{result[0]}"
          end
        # Four digit year is easy.
        elsif result[0].length == 4
          car["year"] = result[0]
        end
      }
      else
        car["year"] = nil
      end
      # Store the region lookup key.
      car["regionhostname"] = regionhostname
      result << car
    }
  }
  return result
end
def search(query)
  results = []
  # Get a copy of the regions we're going to search.
  regions = get_us_regions()
  # Divide the requests to each region across the "right" number of threads.
  iterations = 5
  count = (regions.length/iterations.to_f).ceil
  # Spin up the threads!
  (0..(iterations-1)).each { |iteration|
    threads = []
    # Protect against source exhaustion
    if iteration * count > regions.length()
      next
    end
    # Split the requests by region.
    regions.keys.slice(iteration*count,count).each { |regionhostname|
      threads << Thread.new(regionhostname) { |activeregionhostname|
        # New block for proper scoping of regionhostname
        results << search_region(activeregionhostname, query)
      }
    }
    # Wait until all threads are complete before kicking off the next set.
    threads.each { |thread| thread.join }
  }
  # From search_region we return an array, which means we need to flatten(1) to pull everything up to the top level.
  results = results.flatten(1)
  # Sort the search results by date, descending.
  results.sort! { |a,b|
    if a["date"] == b["date"]
      b["id"].to_i <=> a["id"].to_i
    else
      b["date"] <=> a["date"]
    end
  }
  return results
end

puts search("TDI").to_json
The code works; you can use it to get results from every region when searching Craigslist for cars, which is great for rare/hard-to-find vehicles. I wish the threading were better and spread the multiple paginated requests across different threads too, but I'd need some kind of pool to handle that (a sketch of that idea follows). Eventually I'm considering putting Rack in front of this as a simple vehicle-search API. Or it could get smarter and store the results in a database to track prices over time, producing better-educated sellers and buyers, or flagging good deals.
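A minimal sketch of that pooling idea (hypothetical: run_pool and the job lambdas are names introduced here, not part of the script): a fixed set of worker threads pulls jobs off a Queue, so the paginated requests from every region could share the same workers.

require 'thread'

# A minimal fixed-size worker pool; each job is a lambda pushed onto a queue.
def run_pool(jobs, size = 5)
  queue = Queue.new
  jobs.each { |job| queue << job }
  results = Queue.new
  workers = Array.new(size) do
    Thread.new do
      # Queue#pop(true) is non-blocking and raises ThreadError when empty.
      while (job = (queue.pop(true) rescue nil))
        results << job.call
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Hypothetical usage: one job per region, rather than fixed batches of threads.
# all = run_pool(regions.keys.map { |host| lambda { search_region(host, "TDI") } })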
Posted on 2013-06-27 12:55:48
Long, long question! I'll take your first snippet and let others handle the rest. First, some comments on your code:

- `def get_us_regions()`: writing the `()` on a method that takes no parameters is not idiomatic.
- `first()`: neither is writing `()` on calls with no arguments.
- `results = {}`: I've written a lot on CR about this subject, so I'll just give the link: Functional Programming with Ruby. (A minimal sketch of the idea is right below.)
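For illustration (with a made-up links array, not data from your script), building a hash without mutating an accumulator:

require 'facets'

links = [
  { host: 'sfbay',   name: 'SF bay area' },
  { host: 'newyork', name: 'new york' },
]

# Imperative style: mutate an accumulator in place.
results = {}
links.each { |link| results[link[:host]] = link[:name] }

# Functional style: derive the hash as a single expression
# (Facets' Enumerable#mash; in newer Rubies: links.map { ... }.to_h).
results = links.mash { |link| [link[:host], link[:name]] }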
- `# Important URLs`: not sure they're important enough to deserve a comment :-)
- `Nokogiri::HTML(...) ...`: expressions can be chained on and on, but you have to decide when to break them up and give meaningful names. I'd split this one into at least two sub-expressions, e.g. the split sketched below.
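Something along these lines (your same chain, just broken at the parsed document):

doc = Nokogiri::HTML(Net::HTTP.get(sitelist))
usregions = doc.search("a[name=US]").first.parent.next_element.search('a')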
- `gsub('http://','').gsub('.craigslist.org','')`: use the URI module instead of manipulating URLs by hand, as sketched below.
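For example (with a made-up URL):

require 'uri'

url = 'http://seattle.craigslist.org'
# Hand-rolled string surgery, as in the original:
url.gsub('http://', '').gsub('.craigslist.org', '')  # => "seattle"
# Parsing with URI and taking the first label of the host:
URI.parse(url).host.split('.').first                 # => "seattle"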
- `results[area["hostname"]][:stateabbrev]`: again, a functional approach to these expressions makes them more concise and clear.
- `return results`: explicit `return`s are not idiomatic.
- `def get_us_regions`: when it's this trivial to make a method configurable, take the country as a parameter here -> `def get_regions(country_code)`.

Now, here's how I would write this method. First, I'd use Facets, an excellent library with many cool abstractions that are not provided in core:
require 'uri'
require 'nokogiri'
require 'net/http'
require 'json'
require 'facets'

module CraigsList
  SitelistUrl = 'http://www.craigslist.org/about/sites'
  GeospatialUrl = 'http://www.craigslist.org/about/areas.json'

  # Return hash of pairs (hostname, {:name, :state, :stateabbr, :latitude, :longitude})
  # for US regions in craigslist.
  def self.get_regions(country_code)
    doc = Nokogiri::HTML(Net::HTTP.get(URI(SitelistUrl)))
    usregions = doc.search("a[name=#{country_code}]").first.parent.next_element.search('a')
    state_info = usregions.mash do |usregion|
      hostname = URI.parse(usregion.attr('href')).host.split(".").first
      state = usregion.parent.parent.previous_element.content
      info = {name: usregion.content, state: state}
      [hostname, info]
    end
    areas = JSON.parse(Net::HTTP.get(URI(GeospatialUrl)))
    geo_info = areas.slice(*state_info.keys).mash do |area|
      info = {stateabbrev: area["region"], latitude: area["lat"], longitude: area["lon"]}
      [area["hostname"], info]
    end
    state_info.deep_merge(geo_info)
  end
end
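Assuming the nokogiri and facets gems are installed, the rewritten method could be exercised like this (hypothetical usage, not part of the answer above):

regions = CraigsList.get_regions("US")
regions.each do |hostname, info|
  puts "#{hostname}: #{info[:name]}, #{info[:state]}"
end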
You mention that you write JavaScript code. The nice thing about the functional approach is that the code looks pretty much the same in any language (syntax differences aside), provided the language has minimal functional capabilities. In JS (though the FP style is friendlier in CoffeeScript) with Underscore (+ custom abstractions as mixins) you could write this very same code.
https://codereview.stackexchange.com/questions/27832