I have a list of URLs; some directories contain several files with different extensions, and so on. Example:
List = [
"http://www.example.com/folder1",
"http://www.example.com/folder1",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/image1.png",
"http://www.example.com/folder1/folder2/image2.png",
"http://www.example.com/folder1/folder2/file.txt",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2/folder3/file1.txt",
"http://www.example.com/folder1/folder2/folder3/file2.txt",
"http://www.example.com/folder1/folder2/folder3/file3.txt",
...
]
What I'm trying to achieve is to filter these URLs so that I end up with a list containing only the folder URLs plus one URL per distinct file extension in each folder. Like this:
List = [
"http://www.example.com/folder1",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/image1.png",
"http://www.example.com/folder1/folder2/file.txt",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2/folder3/file1.txt",
...
]
At the moment I'm also looking into how to build some kind of tree out of these URLs, so that I can walk it and remove the duplicate files.
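To show the kind of tree I mean, here is a minimal sketch using a nested dict (build_tree is just an illustrative name, assuming plain string URLs like the ones above):

from urllib.parse import urlparse

def build_tree(urls):
    # nested dict: every key is one path segment, the value is its subtree
    tree = {}
    for url in urls:
        parts = urlparse(url)
        node = tree.setdefault(parts.netloc, {})
        for segment in parts.path.strip('/').split('/'):
            if segment:  # skip empty segments
                node = node.setdefault(segment, {})
    return tree

Calling build_tree(List) would collapse the duplicate entries automatically, since identical segments share the same dict key.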
I've tried a few different approaches, but I'm still fairly new to Python.
Thanks :)
Posted on 2019-08-09 19:13:05
If your URLs follow this simple format, you can filter the list with a dict that keeps track of which extensions have already been seen in each directory:
import os

List = [
"http://www.example.com/folder1",
"http://www.example.com/folder1",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2",
"http://www.example.com/folder1/folder2/image1.png",
"http://www.example.com/folder1/folder2/image2.png",
"http://www.example.com/folder1/folder2/file.txt",
"http://www.example.com/folder1/folder2/folder3",
"http://www.example.com/folder1/folder2/folder3/file1.txt",
"http://www.example.com/folder1/folder2/folder3/file2.txt",
"http://www.example.com/folder1/folder2/folder3/file3.txt",
# ... more URLs
]
dirnames = {}
filtered = []
for url in List:
    dirname = os.path.dirname(url)
    # make sure there is a set of seen extensions for this directory
    dirnames.setdefault(dirname, {})
    # '' for folder URLs, '.png', '.txt', etc. for files
    extension = os.path.splitext(url)[1]
    if extension not in dirnames[dirname]:
        # first time this extension shows up in this directory: keep the URL
        dirnames[dirname][extension] = True
        filtered.append(url)
print(filtered)
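One caveat with the snippet above: os.path.splitext works on the raw string, so a URL with a query string or fragment (file.txt?download=1) would fold that into the "extension". If your real URLs can look like that, parsing them first may be safer. A minimal sketch of that variant; the urlsplit call and the (host, directory, extension) set key are my own additions, not part of the original answer:

from urllib.parse import urlsplit
import os

seen = set()
filtered = []
for url in List:
    parts = urlsplit(url)  # separates the path from query and fragment
    key = (parts.netloc,
           os.path.dirname(parts.path),
           os.path.splitext(parts.path)[1])
    if key not in seen:    # keep one URL per (host, directory, extension)
        seen.add(key)
        filtered.append(url)
print(filtered)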
Posted on 2019-08-09 19:12:58
You can use itertools.groupby in a recursion:
import itertools
import re

data = ['http://www.example.com/folder1', 'http://www.example.com/folder1',
        'http://www.example.com/folder1/folder2', 'http://www.example.com/folder1/folder2/folder3',
        'http://www.example.com/folder1/folder2', 'http://www.example.com/folder1/folder2/image1.png',
        'http://www.example.com/folder1/folder2/image2.png', 'http://www.example.com/folder1/folder2/file.txt',
        'http://www.example.com/folder1/folder2/folder3', 'http://www.example.com/folder1/folder2/folder3/file1.txt',
        'http://www.example.com/folder1/folder2/folder3/file2.txt', 'http://www.example.com/folder1/folder2/folder3/file3.txt']

def group(d, path=[]):
    # group the split URLs by their first path segment, keeping only the tails
    new_d = [[a, [j for _, *j in b]]
             for a, b in itertools.groupby(sorted(d, key=lambda x: x[0]), key=lambda x: x[0])]
    for a, c in new_d:
        # tails of length 1 are the direct children of this level
        _d, _fold, _path = [i[0] for i in c if len(i) == 1], [], []
        for i in _d:
            if not re.findall(r'\.\w+$', i):
                # no extension -> folder: emit each folder name once
                if i not in _fold:
                    yield '/'.join(path + [a] + [i])
                    _fold.append(i)
            else:
                # file: emit only the first URL per extension
                if i.split('.')[-1] not in _path:
                    yield '/'.join(path + [a] + [i])
                    _path.append(i.split('.')[-1])
        # longer tails are deeper paths: recurse into them
        r = [i for i in c if len(i) != 1]
        yield from group(r, path + [a])

# split each URL into [scheme + host, segment, segment, ...]
_data = [[a, *b.split('/')] for a, b in map(lambda x: re.split(r'(?<=\.com)/', x), data)]
print(list(group(_data)))
Output:
['http://www.example.com/folder1',
'http://www.example.com/folder1/folder2',
'http://www.example.com/folder1/folder2/folder3',
'http://www.example.com/folder1/folder2/image1.png',
'http://www.example.com/folder1/folder2/file.txt',
'http://www.example.com/folder1/folder2/folder3/file1.txt']
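A side note on the sorted() call inside group: itertools.groupby only merges consecutive equal keys, so without sorting you would get one group per run of equal first segments instead of one group per segment. A quick illustration (not from the original answer):

import itertools

print([k for k, _ in itertools.groupby('aabab')])           # ['a', 'b', 'a', 'b'] (runs, not unique keys)
print([k for k, _ in itertools.groupby(sorted('aabab'))])   # ['a', 'b']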
https://stackoverflow.com/questions/57435627