I'm trying to parse a CSV file, and I need to determine the type of each field from its string value.
For example:
val row: Array[String] = Array("1/1/06 0:00","3108 OCCIDENTAL DR","3","3C","1115")
This is what I would like to get:
row(0) --> Date
row(1) --> String
row(2) --> Int
etc.
How can I do this?
-- SOLUTION
This is the solution I found for recognizing String, Date, Int, Double and Boolean fields. I hope it will be useful to someone in the future.
def typeDetection(x: String): String = {
x match {
// Matches: [12], [-22], [0] Non-Matches: [2.2], [3F]
case int if int.matches("^-?[0-9]+$") => "Int"
// Matches: [2,2], [-2.3], [0.2232323232332] Non-Matches: [.2], [,2], [2.2.2]
case double if double.matches("^-?[0-9]+[,.][0-9]+$") => "Double"
// Matches: [29/02/2004 20:15:27], [29/2/04 8:9:5], [31/3/2004 9:20:17] Non-Matches: [29/02/2003 20:15:15], [2/29/04 20:15:15], [31/3/4 9:20:17]
case d1 if d1.matches("^((((31\\/(0?[13578]|1[02]))|((29|30)\\/(0?[1,3-9]|1[0-2])))\\/(1[6-9]|[2-9]\\d)?\\d{2})|(29\\/0?2\\/(((1[6-9]|[2-9]\\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1\\d|2[0-8])\\/((0?[1-9])|(1[0-2]))\\/((1[6-9]|[2-9]\\d)?\\d{2})) *(?:(?:([01]?\\d|2[0-3])(\\-|:|\\.))?([0-5]?\\d)(\\-|:|\\.))?([0-5]?\\d)")
=> "Date"
// Matches: [01.1.02], [11-30-2001], [2/29/2000] Non-Matches: [02/29/01], [13/01/2002], [11/00/02]
case d2 if d2.matches("^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|(?:(?:0?[1,3-9]|1[0-2])(\\/|-|\\.)(?:29|30)\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\\/|-|\\.)(?:0?[1-9]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$")
=> "Date"
// Matches: [12/01/2002], [12/01/2002 12:32:10] Non-Matches: [32/12/2002], [12/13/2001], [12/02/06]
case d3 if d3.matches("^(([0-2]\\d|[3][0-1])(\\/|-|\\.)([0]\\d|[1][0-2])(\\/|-|\\.)[2][0]\\d{2})$|^(([0-2]\\d|[3][0-1])(\\/|-|\\.)([0]\\d|[1][0-2])(\\/|-|\\.)[2][0]\\d{2}\\s([0-1]\\d|[2][0-3])\\:[0-5]\\d\\:[0-5]\\d)$")
=> "Date"
case boolean if boolean.equalsIgnoreCase("true") || boolean.equalsIgnoreCase("false") => "Boolean"
case _ => "String"
}
}
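In Java regular expressions an unescaped `.` matches any character, so the decimal separator in a Double pattern is safest written as the character class `[,.]`. A small sketch comparing the two spellings (the names `anyCharSep` and `classSep` are made up for illustration):

```scala
// Separator written as (,|.): the bare dot matches ANY single character.
val anyCharSep = "^-?[0-9]+(,|.)[0-9]+$"
// Separator written as a character class: only a comma or a literal dot.
val classSep = "^-?[0-9]+[,.][0-9]+$"

val samples = Seq("2,2", "-2.3", "2a2")

// Evaluate every sample against both patterns.
val withAnyChar = samples.map(_.matches(anyCharSep))
val withClass   = samples.map(_.matches(classSep))

samples.zip(withAnyChar.zip(withClass)).foreach { case (s, (a, c)) =>
  println(s"$s  anyCharSep=$a  classSep=$c")
}
```

Only the character-class version rejects "2a2"; both accept "2,2" and "-2.3".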
Posted on 2014-05-14 14:23:29
val row: Array[String] = Array("1/1/06 0:00", "3108 OCCIDENTAL DR", "3", "3C", "1115")
val types: Array[String] = row.map {
  case string if string.contains("/")     => "Date probably"
  case string if string.matches("[0-9]+") => "Int probably"
  case _                                  => "String probably"
}
types.foreach(println)
Output:
Date probably
String probably
Int probably
String probably
Int probably
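But honestly, I would not use this approach. It is very error-prone, and many things can go wrong that I don't even want to think about. The simplest example: any string that contains a / will be matched by this little piece of code as a Date.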
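To make that failure mode concrete, here is a minimal sketch that wraps the same three rules in a function (the name `guess` and the sample values "N/A" and "A/C UNIT" are assumptions for illustration, not part of the original data):

```scala
// Same heuristics as above, wrapped in a function so they can be reused.
def guess(s: String): String = s match {
  case v if v.contains("/")     => "Date probably"
  case v if v.matches("[0-9]+") => "Int probably"
  case _                        => "String probably"
}

// Neither "N/A" nor "A/C UNIT" is a date, but both contain a slash.
val samples = Seq("N/A", "A/C UNIT", "1/1/06 0:00")
val guesses = samples.map(guess)

samples.zip(guesses).foreach { case (s, g) => println(s"$s -> $g") }
```

All three samples come back as "Date probably", although only the last one is a date.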
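I don't know your use case, but in my experience it is a bad idea to build something that tries to guess types from unsafe data. If you control the data, you could introduce an identifier, for example "1/1/06 0:00 %d%", where %d% marks the value as a date, and so on, and then strip it from the string. Even then, you can never be 100% sure that this won't fail.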
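A rough sketch of that tagging idea, assuming the tag is always a suffix of the field. Only %d% comes from the text above; the %i% tag, the `tags` map and the `splitTag` helper are hypothetical names invented for this example:

```scala
// Hypothetical convention: a trailing tag names the field's type.
// "%d%" is the tag from the answer; "%i%" is an assumed tag for integers.
val tags: Map[String, String] = Map("%d%" -> "Date", "%i%" -> "Int")

// Returns (value with the tag removed, type name).
// Fields without a known tag are treated as plain strings.
def splitTag(field: String): (String, String) =
  tags.collectFirst {
    case (tag, tpe) if field.endsWith(tag) =>
      (field.stripSuffix(tag).trim, tpe)
  }.getOrElse((field, "String"))

val tagged   = splitTag("1/1/06 0:00 %d%")
val untagged = splitTag("3108 OCCIDENTAL DR")

println(tagged)
println(untagged)
```

This still trusts whoever writes the tags: a value tagged %d% that is not actually a date only fails later, when it is parsed.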
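Posted on 2014-05-14 16:08:13
For each string: try to parse it into the desired type. You have to write one function per type. Keep trying until one of them succeeds; the order matters. You can use your favorite date/time library.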
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Try

// Try each parser in order; fall back to the raw string if none succeeds.
def stringdetect(s: String): Any =
  dateFromString(s) orElse intFromString(s) getOrElse s

def arrayDetect(row: Array[String]): Array[Any] = row map stringdetect

// Map each detected value to the name of its type.
def arrayTypes(row: Array[String]): Array[String] =
  arrayDetect(row) map {
    case _: Int    => "Int"
    case _: Date   => "Date"
    case _: String => "String"
    case _         => "?"
  }

// Try only catches non-fatal exceptions, unlike `case _: Throwable`.
def intFromString(s: String): Option[Int] = Try(s.toInt).toOption

def dateFromString(s: String): Option[Date] = {
  // "H" is hour-of-day (0-23), so a time such as "0:00" is valid.
  val formatter = new SimpleDateFormat("d/M/yy H:mm")
  Try(formatter.parse(s)).toOption
}
From the REPL / worksheet:
val row: Array[String] = Array("1/1/06 0:00","3108 OCCIDENTAL DR","3","3C","1115")
//> row : Array[String] = Array(1/1/06 0:00, 3108 OCCIDENTAL DR, 3, 3C, 1115)
arrayDetect(row)
//> res0: Array[Any] = Array(Sun Jan 01 00:00:00 CST 2006, 3108 OCCIDENTAL DR, 3, 3C, 1115)
arrayTypes(row)
//> res1: Array[String] = Array(Date, String, Int, String, Int)
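The same "try each parser in order" idea can also be sketched with java.time instead of SimpleDateFormat. This is only an illustration under assumptions: the `parsers` list, the `detectType` name, the added Double parser and the "d/M/yy H:mm" pattern are choices made for this example, not part of the answer above.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Assumed input format, chosen to match the sample value "1/1/06 0:00".
val fmt = DateTimeFormatter.ofPattern("d/M/yy H:mm")

// Parsers are tried in list order; the first one that succeeds names the type.
// Int must come before Double, because every Int string also parses as a Double.
val parsers: List[(String, String => Option[Any])] = List(
  "Date"   -> ((s: String) => Try(LocalDateTime.parse(s, fmt)).toOption),
  "Int"    -> ((s: String) => Try(s.toInt).toOption),
  "Double" -> ((s: String) => Try(s.toDouble).toOption)
)

// Falls back to "String" when no parser accepts the value.
def detectType(s: String): String =
  parsers.collectFirst { case (name, parse) if parse(s).isDefined => name }
    .getOrElse("String")

val sampleRow = Array("1/1/06 0:00", "3108 OCCIDENTAL DR", "3", "3C", "1115")
val detected  = sampleRow.map(detectType)

println(detected.mkString(", "))
```

Keeping the parsers in a list makes the ordering explicit: swapping the Int and Double entries would classify "3" as a Double.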
https://stackoverflow.com/questions/23656672