Q: Why is pandas not removing duplicates?

Stack Overflow user

Asked 2022-11-12 23:14:36

Answers: 1 · Views: 39 · Followers: 0 · Votes: 0

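I'm very new to Python and pandas, but the issue is that my final output file isn't excluding any duplicates on "Customer Number". Any suggestions on why this is happening would be greatly appreciated!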

import pandas as pd
import numpy as np #numpy is the module which can replace errors from huge datasets 
from openpyxl import load_workbook
from openpyxl.styles import Font

df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export') #this is reading the Excel document shifts and looks at sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export') #this reads the same Excel document but looks at a different sheet
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export') #this reads a different Excel file, and only has one sheet so no need to have it read a sheet

df_all = pd.concat([df_1, df_2, df_3], sort=False) #this combines the sheets from 1,2,3 and the sort function as false so our columns stay in the same order

to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet

df_all = df_all.replace(np.nan, 'N/A', regex=True) #replaces errors with N/A

remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$'] #this will remove all unwanted columns
df_all.drop(columns=remove, inplace=True)

df_all.drop_duplicates(subset=['Customer Number'], keep=False) #this will remove all duplicates from the tracing number syntax with pandas module

to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet

wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active

wb.save('Combined_PRT_Tracings.xlsx')

Tags: python, pandas, numpy

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-11-12 23:22:32

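You should either assign the return value of df_all.drop_duplicates to a variable, or set inplace=True to overwrite the DataFrame's contents. By default the method returns a new DataFrame and leaves the original untouched, which protects the original data from unwanted changes.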
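To see why the original call appeared to do nothing, here is a minimal sketch. The sample data and the names df, deduped, rows_after_discarded_call, and rows_after_assignment are made up for illustration; only the "Customer Number" column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name mirrors the question.
df = pd.DataFrame({
    "Customer Number": [101, 102, 101, 103],
    "Region": ["East", "West", "East", "North"],
})

# The result of this call is discarded, so df itself is unchanged.
df.drop_duplicates(subset="Customer Number", keep=False)
rows_after_discarded_call = len(df)

# Assigning the result captures the de-duplicated DataFrame.
deduped = df.drop_duplicates(subset="Customer Number", keep=False)
rows_after_assignment = len(deduped)

print(rows_after_discarded_call, rows_after_assignment)
```

The first count is still the full row count, while the second reflects only the customers that appear exactly once.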

Try:

df_all = df_all.drop_duplicates(subset='Customer Number', keep=False)

Or, equivalently:

df_all.drop_duplicates(subset='Customer Number', keep=False, inplace=True)

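With keep=False, this removes every row whose Customer Number appears more than once, so none of the duplicated rows survive. If you want to keep the first or last row of each group of duplicates instead, change keep to 'first' or 'last'.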
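As a rough illustration of how the three keep settings differ, the sketch below runs each one on a small made-up DataFrame. The data, the "Year" column, and the variable names none_kept, first_kept, and last_kept are assumptions for the example, not part of the question's files.

```python
import pandas as pd

# Hypothetical data: customers 101 and 102 each appear twice.
df = pd.DataFrame({
    "Customer Number": [101, 102, 101, 103, 102],
    "Year": [2020, 2020, 2021, 2021, 2022],
})

# keep=False drops every row that has a duplicate, leaving only customer 103.
none_kept = df.drop_duplicates(subset="Customer Number", keep=False)

# keep="first" retains the earliest row for each customer.
first_kept = df.drop_duplicates(subset="Customer Number", keep="first")

# keep="last" retains the latest row for each customer.
last_kept = df.drop_duplicates(subset="Customer Number", keep="last")

print(none_kept, first_kept, last_kept, sep="\n\n")
```

If the goal is one row per customer in the combined file, keep='first' or keep='last' is probably closer to what you want than keep=False.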

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/74417286

