Finding the MIME type of a compressed file downloaded from S3

Stack Overflow user
Asked 2018-08-24 05:57:34
1 answer, 2.5K views, 0 followers, 1 vote

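Clients are supposed to upload a compressed file into an S3 folder. The compressed file is then downloaded and decompressed so that various operations can be performed on the files it contains. Initially we told the client to compress its files as a ZIP file, but this proved too hard for our client. Instead, it submitted a RAR file with a ZIP extension... how clever. For obvious reasons, we cannot decompress a RAR file with a ZIP decompression algorithm.

So I am looking for a way to find out the file type of a file downloaded from S3, given that I am working on a Java project on a Linux OS using Amazon's SDK. How to decompress the file according to the detected type is something I will worry about afterwards.

I have looked at many related Stack Overflow questions, but judging by the answers and their comments, none of the proposed approaches seems to be 100% effective.

What is the best way to find out the type of a compressed file?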

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-08-24 05:57:34

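TL;DR

When a file is uploaded to Amazon S3 programmatically, the uploader can specify the object's Content-Type. If none is specified, as @Michael-bot clarified, the value assigned by default is binary/octet-stream. If the file is instead uploaded through Amazon S3's graphical user interface, it gets its Content-Type from its file extension (sadly, not from its content). If you can trust the people uploading the files to set Content-Type correctly, go ahead and look at ObjectMetadata; if you cannot (like me), you will need another solution.

So, if you are looking for a solution that works for the most common compression formats, Files.probeContentType, Apache Tika and SimpleMagic seem to be acceptable choices.

In the end I chose Files.probeContentType because it needs no additional library and works well on a Linux machine, as long as the file does not carry a wrong extension. There is a workaround for that case: remove the file extension and let the method do its magic.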

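Test setup

At first, one would expect the response object obtained when downloading a file from Amazon S3 to contain the file type. It does contain this information, but the problem arises when the file's extension does not match its content.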

import com.amazonaws.services.s3.model.S3Object;

final S3Object s3Object = ...;
final String contentType = s3Object.getObjectMetadata().getContentType();

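This code returns application/zip even when the content of the file is a RAR archive, so this solution did not work for me.

For this reason, I took the time to build a sample project that tests various scenarios with the different approaches and libraries available. By the way, I am using Java 8.

The file types tested are:

  • Zip with Zip extension and without extension
  • Rar with Rar extension, with Zip extension and without extension
  • 7z with 7z extension, with Zip extension and without extension
  • Tar.xz with Tar.xz extension, with Zip extension and without extension
  • Tar.gz with Tar.gz extension, with Zip extension and without extension

Please note that the implementations shown here are for testing purposes only. They are in no way endorsed for use in production code, since they do not account for file locking and other issues my imagination did not consider. =)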
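To make the result listings below easier to read, here is a minimal sketch of how such a test run could be driven. It is not the sample project itself: the class name, the file names and the placeholder detector are assumptions made for illustration. Only the line layout (label padded to 32 characters, then the detected type) is taken from the listings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class DetectionReport {

    // Formats one result line: the label is left-aligned and padded
    // to 32 characters, followed by a space and the detected type.
    static String formatLine(final String scenario, final String mimeType) {
        return String.format("%-32s %s", scenario + " is:", mimeType);
    }

    // Applies the detector to every scenario and collects the report lines.
    // scenario[0] is the label, scenario[1] a (hypothetical) file name.
    static List<String> run(final String[][] scenarios,
                            final Function<String, String> detector) {
        final List<String> lines = new ArrayList<>();
        for (final String[] scenario : scenarios) {
            lines.add(formatLine(scenario[0], detector.apply(scenario[1])));
        }
        return lines;
    }

    public static void main(final String[] args) {
        // Hypothetical file names; the real test files are not part of this post.
        final String[][] scenarios = {
            {"Rar with Rar extension", "sample.rar"},
            {"Rar with Zip extension", "rar-sample.zip"},
            {"Rar without extension", "rar-sample"},
        };
        // Placeholder detector; swap in any of the implementations under test.
        run(scenarios, fileName -> "application/octet-stream")
            .forEach(System.out::println);
    }
}
```

Each approach below can be plugged in as the detector function, so every approach is exercised against the same list of files.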

MimetypesFileTypeMap

Implementation

import java.io.File;
import javax.activation.MimetypesFileTypeMap;

final File file = new File(basePath + "/" + fileName);
try {
    return MimetypesFileTypeMap.getDefaultFileTypeMap().getContentType(file);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/octet-stream
Rar with Zip extension is:       application/octet-stream
Zip with Zip extension is:       application/octet-stream
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/octet-stream
Rar without extension is:        application/octet-stream
Zip without extension is:        application/octet-stream
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/octet-stream

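Conclusion

The value returned by this method when the file type is not recognized is application/octet-stream. It fails in every scenario, so this approach should be discarded.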

URLConnection.guessContentTypeFromStream

Implementation

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URLConnection;

final File file = new File(basePath + "/" + fileName);
// try-with-resources closes the stream; BufferedInputStream provides the
// mark/reset support that guessContentTypeFromStream relies on.
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
    return URLConnection.guessContentTypeFromStream(inputStream);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       null
Rar with Zip extension is:       null
Zip with Zip extension is:       null
7z with 7z extension is:         null
7z with Zip extension is:        null
Tar.xz with Tar.xz extension is: null
Tar.xz with Zip extension is:    null
Tar.gz with Tar.gz extension is: null
Tar.gz with Zip extension is:    null
Rar without extension is:        null
Zip without extension is:        null
7z without extension is:         null
Tar.xz without extension is:     null
Tar.gz without extension is:     null

Conclusion

Again, this approach fails in every scenario. Its support seems to be very limited.
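To illustrate what "limited" means here, the following sketch feeds hand-written header bytes to the method instead of real files. The class name and the byte arrays are assumptions for illustration. A GIF header is one of the few signatures the method recognizes, while the ZIP header is printed without a stated expectation, since the listing above only shows what was observed on Java 8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

final class GuessFromStreamDemo {

    // Passes the given leading bytes to guessContentTypeFromStream.
    // ByteArrayInputStream supports mark/reset, which the method needs.
    static String guess(final byte[] header) throws IOException {
        try (InputStream in = new ByteArrayInputStream(header)) {
            return URLConnection.guessContentTypeFromStream(in);
        }
    }

    public static void main(final String[] args) throws IOException {
        // "GIF89a": an image signature the method knows about.
        final byte[] gif = {'G', 'I', 'F', '8', '9', 'a'};
        // "PK\3\4": the usual start of a ZIP archive.
        final byte[] zip = {'P', 'K', 3, 4};

        System.out.println("gif header: " + guess(gif));
        System.out.println("zip header: " + guess(zip));
    }
}
```

The method targets a small set of web-oriented formats, which would explain why none of the archive formats in the test set is detected.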

Files.probeContentType

Implementation

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

try {
    final Path path = Paths.get(basePath + "/" + fileName);
    return Files.probeContentType(path);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/vnd.rar
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: application/x-xz-compressed-tar
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/x-compressed-tar
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        application/vnd.rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

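Conclusion

This approach works surprisingly well, but do not be fooled: there is one situation in which it always fails. If the file has a wrong extension (one that does not match its content), the method reports the file type that corresponds to the extension. This should not happen often, but if you need to be strict about it, this approach should not be used.

Also, some warn that this approach does not work well on Windows.

Workaround: if the extension is removed from the file name, this approach returns the correct value in all of the given scenarios.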
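One way to apply the workaround without touching the downloaded file is to probe an extension-less temporary copy. The sketch below assumes that copying the archive is affordable. The class and method names are hypothetical, and the value returned by Files.probeContentType still depends on the detectors installed on the platform.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class ProbeWithoutExtension {

    // Copies the file to a temporary file whose name has no extension.
    static Path copyWithoutExtension(final Path original) throws IOException {
        // An empty suffix makes createTempFile omit the default ".tmp".
        final Path copy = Files.createTempFile("probe", "");
        Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
        return copy;
    }

    // Probes the extension-less copy, then removes it again.
    static String probe(final Path original) throws IOException {
        final Path copy = copyWithoutExtension(original);
        try {
            return Files.probeContentType(copy);
        } finally {
            Files.deleteIfExists(copy);
        }
    }

    public static void main(final String[] args) throws IOException {
        // Usage: java ProbeWithoutExtension.java /path/to/upload.zip
        System.out.println(probe(Paths.get(args[0])));
    }
}
```

For large archives, renaming the downloaded file before probing avoids the extra copy, at the cost of modifying the original.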

Apache Tika (tika-eval 1.18)

There seem to be many flavors of this library (app, server, eval, etc.), but many people on the web complain that it is somewhat "dependency-heavy".

Implementation

import java.io.File;

import org.apache.tika.Tika;

try {
    return new Tika().detect(new File(basePath + "/" + fileName));
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar-compressed
Rar with Zip extension is:       application/x-rar-compressed
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar-compressed
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

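Conclusion

All files were identified correctly, but the library has its pros and cons.

Pros:

  • Maintained by Apache.
  • Is not fooled by the extension.

Cons:

  • Really heavy, especially when all you want is to check a file type. The tika-eval JAR weighs over 40 MB.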

URLConnection

Implementation

import java.net.URL;
import java.net.URLConnection;

try {
    final URL url = new URL("file://" + basePath + "/" + fileName);
    final URLConnection urlConnection = url.openConnection();
    return urlConnection.getContentType();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       content/unknown
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         content/unknown
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: content/unknown
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        content/unknown
Zip without extension is:        content/unknown
7z without extension is:         content/unknown
Tar.xz without extension is:     content/unknown
Tar.gz without extension is:     content/unknown

Conclusion

It barely recognizes any compression format, and it goes by the extension rather than by the content.

SimpleMagic 1.14

This project seems to be updated at least once a year.

Implementation

import com.j256.simplemagic.ContentInfo;
import com.j256.simplemagic.ContentInfoUtil;

try {
    final ContentInfoUtil util = new ContentInfoUtil();
    final ContentInfo info = util.findMatch(basePath + "/" + fileName);

    return info.getMimeType();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: <EXCEPTION: null>
Tar.xz with Zip extension is:    <EXCEPTION: null>
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     <EXCEPTION: null>
Tar.gz without extension is:     application/x-gzip

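Conclusion

It works for almost all of our scenarios, but for the more "obscure" compression formats such as Tar.xz it fails to detect the type (and the test code throws an exception in the process).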

MimeUtil 2.1.3

This project has not been modified since 2010, so do not expect support or updates. It is listed here only for completeness.

Implementation

import eu.medsea.mimeutil.MimeUtil2;

try {
    final MimeUtil2 mimeUtil = new MimeUtil2();
    // Register the detector that inspects the file content (magic numbers).
    mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");

    return MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(basePath + "/" + fileName)).toString();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/x-gzip

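Conclusion

It recognizes some of the most popular file types but fails on Tar.xz and 7z.

File - Command Line

This is not the best solution, but it had to be tried: the Ubuntu file command.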

Implementation

import java.io.BufferedReader;
import java.io.InputStreamReader;

try {
    // Arguments are passed separately so that paths containing spaces are not
    // split; --brief omits the "<file name>: " prefix from the output.
    final Process process = new ProcessBuilder(
            "file", "--brief", "--mime-type", basePath + "/" + fileName).start();

    try (BufferedReader stdInput = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
        // The first output line is the MIME type.
        return stdInput.readLine();
    }
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

Conclusion

It works for all of our scenarios but, again, it depends on the file command being present on the system where the code runs.
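The content-based tools above (Tika, SimpleMagic, the file command) are not fooled by the extension because they inspect the leading bytes of the file. When only a handful of archive formats matter, the same idea can be sketched with the standard library alone. This is an illustrative assumption, not part of the tested project: the class and enum names are hypothetical, and the signatures are the commonly documented magic numbers for each format. Like the results above, it only sees the outer layer, so a Tar.gz is reported as GZIP and a Tar.xz as XZ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class ArchiveSniffer {

    enum Format { ZIP, RAR, SEVEN_ZIP, GZIP, XZ, UNKNOWN }

    // Returns true when data begins with the given signature bytes.
    private static boolean startsWith(final byte[] data, final int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies a file by the first bytes of its content.
    static Format sniff(final byte[] header) {
        if (startsWith(header, 0x50, 0x4B, 0x03, 0x04)) return Format.ZIP;             // "PK\3\4"
        if (startsWith(header, 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07)) return Format.RAR; // "Rar!" 1A 07
        if (startsWith(header, 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C)) return Format.SEVEN_ZIP;
        if (startsWith(header, 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00)) return Format.XZ;
        if (startsWith(header, 0x1F, 0x8B)) return Format.GZIP;
        return Format.UNKNOWN;
    }

    // Reads at most eight bytes from the file and classifies them.
    static Format sniff(final Path file) throws IOException {
        final byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while (total < header.length
                    && (read = in.read(header, total, header.length - total)) != -1) {
                total += read;
            }
        }
        return sniff(Arrays.copyOf(header, total));
    }

    public static void main(final String[] args) throws IOException {
        // Usage: java ArchiveSniffer.java /path/to/upload.zip
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

A sketch like this covers far fewer formats and edge cases than a maintained library (an empty ZIP archive, for example, starts with a different signature), so it is a fallback rather than a replacement.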

4 votes
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/51994837
