python - How to identify duplicate datetime entries from a .csv file where pandas does not consider time down to the second?

admin•2025-04-19 19:21:07•questions•Views: 2


I am working with a pandas DataFrame where one of the columns contains datetime values, and I need to identify duplicate entries in the "Date" column. The datetime values include both the date and the exact time (hours, minutes, and seconds). However, I noticed an issue when I read the data from a .csv file: pandas does not seem to consider the time down to the second when identifying duplicates.

Interestingly, when I create synthetic data directly in pandas (as in the example below), the output is correct and the duplicates are identified as I would expect. But when I read the same kind of data from a .csv file, it marks datetime values that differ by an hour as duplicates, which is not what I want.

Here is an example of my synthetic DataFrame:

import pandas as pd

# Creating synthetic data with random IDs and names
data = {
    'ID': ['ID-1001', 'ID-1002', 'ID-1003', 'ID-1004', 'ID-1005', 'ID-1006', 'ID-1007', 'ID-1008', 'ID-1009', 'ID-1010'],
    'Name': ['Sensor-A', 'Sensor-B', 'Sensor-C', 'Sensor-D', 'Sensor-E', 'Sensor-F', 'Sensor-G', 'Sensor-H', 'Sensor-I', 'Sensor-J'],
    'Code': [330735, 330736, 330737, 330738, 330739, 330740, 330741, 330742, 330743, 330744],
    'Date': [
        '2022-01-01 12:00:00', '2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00', 
        '2022-01-02 12:00:00', '2022-01-02 13:00:00', '2022-01-02 14:00:00', '2022-01-02 15:00:00', 
        '2022-01-03 12:00:00', '2022-01-03 13:00:00'
    ]
}

# Convert to DataFrame
dd_csv = pd.DataFrame(data)

# Ensure 'Date' is in datetime format
dd_csv['Date'] = pd.to_datetime(dd_csv['Date'])

In this dataset, the following rows have exact duplicate datetime values (same date and time):

2022-01-01 12:00:00 for Sensor-A and Sensor-B (these are duplicates).

Now, I want to check for duplicates in the "Date" column based on the exact datetime value, including both date and time. This works for the synthetic data above:

duplicates_all = dd_csv['Date'].duplicated(keep=False)
print(dd_csv[duplicates_all])

      ID      Name    Code                Date
0  ID-1001  Sensor-A  330735  2022-01-01 12:00:00
1  ID-1002  Sensor-B  330736  2022-01-01 12:00:00

However, when the data is read from a .csv file (real data), the time is not correctly recognized down to the second. This results in pandas marking entries with the same date but different times (down to the hour) as duplicates, even if I set the format before:

import pandas as pd

# URL of the CSV file in the GitHub repository
url = 'https://raw.githubusercontent.com/jc-barreto/Data/main/test_data.csv'

# Read the CSV file directly from the URL
real_data = pd.read_csv(url)

# Convert the 'Date' column to datetime format
real_data['Date'] = pd.to_datetime(real_data['Date'], format="%Y-%m-%d %H:%M:%S", errors='coerce')

# Identify rows with duplicate dates
duplicates_all = real_data['Date'].duplicated(keep=False)

# Print the rows with duplicate dates
print(real_data[duplicates_all])

and the output is:


        Unnamed: 0 ID                Date         T
11774        11774  A 2017-05-25 12:00:00  20.55000
11775        11775  A 2017-05-25 13:00:00  20.56000
11776        11776  A 2017-05-25 14:00:00  20.56000
11777        11777  A 2017-05-25 15:00:00  20.57000
11778        11778  A 2017-05-25 16:00:00  20.57000

where the dates are clearly not repeated, since they have different times.

I have tried the suggestion from the answer below, but it didn't work either:

real_data['date_only'] = [x.date() for x in real_data['Date']]
real_data['time_only'] = [x.time() for x in real_data['Date']]

duplicates_all2 = real_data[['date_only', 'time_only']].duplicated(keep=False)
print(real_data[duplicates_all2])

How do I fix that? I need to fix it because I'm going to use ID + Date as a key for a database update, to make sure I only insert data that is not already in the database.


asked Mar 21 at 18:21 by JCV, edited Mar 24 at 10:17

Please provide a sample real_data.csv that reproduces the problem, because when I write out the sample data provided with dd_csv.to_csv('real_data.csv',index=None) and read it back in with the second code shown, the output is the same as the first example. Please provide a minimal reproducible example. – Mark Tolonen Commented Mar 22 at 1:42
@MarkTolonen I don't know how to reproduce the real data, so I have put it here: github.com/jc-barreto/Data.git, so now if you read the data with the code above you will see what I mean. Thank you – JCV Commented Mar 24 at 10:11


1 Answer


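Your data does have the duplicated date/times shown, but they aren't consecutive. duplicated(keep=False) returns the rows in their original order, so each row in your output has a matching timestamp somewhere else in the file. Sort the duplicated data if you want to see the duplicated dates together.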

Example:

import pandas as pd

# Synthetic data with non-consecutive duplicated dates.
data = {
    'ID': ['ID-1001', 'ID-1002', 'ID-1003', 'ID-1004', 'ID-1005', 'ID-1006', 'ID-1007', 'ID-1008', 'ID-1009', 'ID-1010'],
    'Name': ['Sensor-A', 'Sensor-B', 'Sensor-C', 'Sensor-D', 'Sensor-E', 'Sensor-F', 'Sensor-G', 'Sensor-H', 'Sensor-I', 'Sensor-J'],
    'Code': [330735, 330736, 330737, 330738, 330739, 330740, 330741, 330742, 330743, 330744],
    'Date': [
        '2022-01-01 12:00:00', '2022-01-01 11:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00', 
        '2022-01-02 12:00:00', '2022-01-02 13:00:00', '2022-01-02 14:00:00', '2022-01-02 15:00:00', 
        '2022-01-01 12:00:00', '2022-01-02 15:00:00'
    ]
}

# Convert to DataFrame
dd_csv = pd.DataFrame(data)

# Ensure 'Date' is in datetime format
dd_csv['Date'] = pd.to_datetime(dd_csv['Date'])

duplicates_all = dd_csv['Date'].duplicated(keep=False)
print(dd_csv[duplicates_all])
print()
print(dd_csv[duplicates_all].sort_values(by=['Date']))  # sort the Dates

Output below. Note that in the first instance, duplicates are listed but not together.

        ID      Name    Code                Date
0  ID-1001  Sensor-A  330735 2022-01-01 12:00:00
7  ID-1008  Sensor-H  330742 2022-01-02 15:00:00
8  ID-1009  Sensor-I  330743 2022-01-01 12:00:00
9  ID-1010  Sensor-J  330744 2022-01-02 15:00:00

        ID      Name    Code                Date
0  ID-1001  Sensor-A  330735 2022-01-01 12:00:00
8  ID-1009  Sensor-I  330743 2022-01-01 12:00:00
7  ID-1008  Sensor-H  330742 2022-01-02 15:00:00
9  ID-1010  Sensor-J  330744 2022-01-02 15:00:00
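
Since the question mentions using ID + Date as the database key, it may also help to check for duplicates on that pair rather than on "Date" alone. The following is only a minimal sketch on a small hypothetical frame (the values are made up for illustration and are not the asker's real file); it assumes the real data has the same 'ID' and 'Date' columns shown in the question's output.

```python
import pandas as pd

# Hypothetical sample: two IDs reporting at the same timestamps,
# plus one genuine repeat of an (ID, Date) pair.
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B', 'A'],
    'Date': pd.to_datetime([
        '2017-05-25 12:00:00', '2017-05-25 13:00:00',
        '2017-05-25 12:00:00', '2017-05-25 13:00:00',
        '2017-05-25 12:00:00',
    ]),
    'T': [20.55, 20.56, 21.10, 21.12, 20.55],
})

# Count how many rows share each exact timestamp (down to the second).
counts = df.groupby('Date').size()
print(counts[counts > 1])

# Duplicates on 'Date' alone: rows from different IDs are flagged too.
by_date = df['Date'].duplicated(keep=False)

# Duplicates on the composite key (ID, Date): only true repeats are flagged.
by_key = df.duplicated(subset=['ID', 'Date'], keep=False)
print(df[by_key].sort_values(['ID', 'Date']))
```

If the per-timestamp counts are greater than 1 while the composite-key check flags few or no rows, the "duplicates" are most likely the same timestamp recorded under different IDs, which would be harmless for an ID + Date key.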

Posted by: admin. Please credit the source when reposting: http://www.yc00.com/questions/1744339610a4569325.html
