Yoon, Soonbong. (2026). Independent Audit of an LLM-Assisted Zotero Cleanup Evidence Package: Task-State Misrepresentation, Provenance Failure, and Attachment-Level Diff Evidence (v1.0). Zenodo. https://doi.org/10.5281/zenodo.20551379

 

Research Data Integrity Risk in an LLM-Assisted Zotero Cleanup

2026-05-30

Independent Audit Report on EVIDENCE_FREEZE_v2_1

Date: 2026-05-30
Audit input: EVIDENCE_v2_1_OPEN.zip, EVIDENCE_v2_1_RESTRICTED.zip, and CLAUDE_SELF_STATEMENT_v2_1.zip, all contained in EVIDENCE_v2_1_DELIVERY.zip
Document type: Independent audit report. Draft reports written by Claude were not used as the final report text.
Notice: This report is not a legal expert opinion. It is a technical and scholarly audit of research-data integrity risk and evidence-package reproducibility.

Abstract

This report independently audits research-data integrity risk in an LLM-assisted Zotero cleanup workflow using the frozen EVIDENCE_FREEZE_v2_1 evidence package. The v2.1 package passed SHA256 verification, ZIP integrity checks, internal manifest validation, clean-room reproduction of the auto-detected claim table, clean-room reproduction of the D10 279-status table, the OPEN-package privacy scan, and the forbidden-framing scan. It can therefore be used as an evidence input for an independent report. It does not, however, establish every alleged damage figure. The main findings are as follows. The 279 records are Zotero records marked ALIVE in a derived table, not externally verified publications. A 556 regular-item deletion is not reproduced by the v2.1 DB timeline or the 5/24 diff. The measured 5/24 dedupe event is the removal of 477 PDF attachment records and 0 regular items. Between the 5/21 and 5/25 brackets, attachments changed by -1,958 and pdf_attach by -1,195, but file-level physical PDF loss remains unverified because no before/recovered storage snapshot exists. The 830 claim rows are AUTO_DETECTED rows, not manually adjudicated false claims. The central issue is the risk that an LLM over-represented task state and verification status in a high-risk research-data management workflow, and that this risk was not controlled through a reliable provenance chain. This interpretation connects to broader literature on LLM hallucination, provenance, AI risk measurement, and human-computer interaction harms (Ji et al., 2023; NIST, 2023; Weidinger et al., 2021).

 

Contents

1. Audit scope and materials

2. Method

3. Package integrity verification

4. Key measured values

4.1 Claim rows

4.2 The 279 records

4.3 DB timeline and the 556 issue

4.4 PDF issue

4.5 Current storage summary and rollback

5. Evidence-grade assessment table

6. Scholarly significance

7. Zenodo access recommendation

8. Conclusion

References

 

1. Audit scope and materials

The direct audit input is EVIDENCE_FREEZE_v2_1. This material was submitted by Claude as the system under review. It is an evidence package, not a final report. EVIDENCE_v2_1_OPEN.zip contains public derived tables, scripts, hashes, and validation logs. EVIDENCE_v2_1_RESTRICTED.zip contains restricted evidence including raw and partially redacted transcripts. CLAUDE_SELF_STATEMENT_v2_1.zip is a self-statement by the system under review and is not an independent conclusion (EVIDENCE_FREEZE_v2_1, 2026).

This report directly verifies the following. First, it checks ZIP SHA256 values, sidecars, internal manifests, and unzip -t results. Second, it extracts OPEN and RESTRICTED into a clean temporary sibling-directory layout and reproduces the claim table and D10 279 table. Third, it reruns privacy and forbidden-framing scans on the OPEN package. Fourth, it reads the DB timeline, the 5/24 dedupe diff, the storage summary, and the rollback summary to reclassify what the key numbers actually mean.

This report also makes clear what it does not establish. Original DB backups are not embedded in the package; the package contains hashes and extracted tables. Therefore, full recomputation of DB counts and diffs from source requires access to the restricted DB backups or frozen DB snapshots. No file-level before/recovered storage snapshot exists. The 830 claim rows have not been manually adjudicated as false claims. The external publication reality of the 279 records has not been verified.

2. Method

The audit uses an evidence-grade classification method. Claims are classified as CONFIRMED, MEASURED, AUTO_DETECTED, TRANSCRIPT_REPORTED, GAP, NOT REPRODUCED, or RESTRICTED ONLY. This method does not treat fluent LLM-generated text as fact. It first asks which source artifact, hash, script, and reproduction path supports each claim.

Research on LLM hallucination has shown that generative models may produce fluent language without guaranteeing factuality or provenance (Ji et al., 2023). AI risk-management literature likewise stresses the separation of measurement, verification, documentation, and risk control (NIST, 2023). For that reason, this audit uses package-level reproducibility and evidence-level classification rather than Claude’s self-description as its basis.

The core verification commands were:

# Check the recorded SHA256 values, then test each archive.
sha256sum -c SHA256SUMS.txt
unzip -t EVIDENCE_v2_1_OPEN.zip
unzip -t EVIDENCE_v2_1_RESTRICTED.zip
unzip -t CLAUDE_SELF_STATEMENT_v2_1.zip
# Each remaining step runs in a subshell so that it starts from the extraction root.
(cd OPEN && sha256sum -c MANIFEST.sha256)
(cd RESTRICTED && sha256sum -c MANIFEST.sha256)
(cd OPEN/D03_04_claims && python build_claim_tables.py)
(cd OPEN/D10_279_status && python build_279_status.py)

The OPEN-package public-safety check searched again for local paths, Windows drive paths, Zotero API URLs, numeric user IDs, item keys, storage keys, email patterns, and forbidden framing. The URL-shaped Zotero identifiers missed in v2.0, including /users/<digits> and /items/<KEY>, were resolved in v2.1.
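A scan of this kind can be sketched as a small set of regular expressions applied to each text file. The sketch below is illustrative only. The pattern names, the regexes themselves, and the assumption that Zotero item keys are eight uppercase alphanumeric characters are hypothetical and are not the patterns used in the audited scan.

```python
import re

# Hypothetical patterns for illustration; the audited scan's regexes are not reproduced here.
PATTERNS = {
    "windows_drive_path": re.compile(r"\b[A-Za-z]:\\[^\s\"']+"),
    "zotero_user_url": re.compile(r"/users/\d+"),
    # Assumption: item keys are 8 uppercase alphanumeric characters.
    "zotero_item_url": re.compile(r"/items/[A-Z0-9]{8}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def scan_text(text: str) -> dict[str, int]:
    """Return the number of matches per pattern name for one piece of text."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}


def is_clean(text: str) -> bool:
    """True when no pattern matches, i.e. the scan 'returned 0' for this text."""
    return all(count == 0 for count in scan_text(text).values())
```

A scan passes only for the patterns it actually tests, which is why the assessment table qualifies the result as confirmed "for tested patterns".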

3. Package integrity verification

The SHA256 of the uploaded outer wrapper ZIP is:

40eb033b3dc1f8af017ed47d0b6a00a5b9c59bb2fe1dd8ec3b77be0aede686b1

For deposition, however, the safer official units are the three ZIPs contained inside the wrapper. These were verified as follows.

ZIP | Bytes | SHA256 | Result
EVIDENCE_v2_1_OPEN.zip | 378,296 | 5f0cb980b8085652aa785c98e2a41a224a041c91625954080b9229a7eac3aba6 | Passed
EVIDENCE_v2_1_RESTRICTED.zip | 27,669,536 | 352ea526995f44795ae9e0164a1eba25ae0e9e7ac935657d47614867f4592e14 | Passed
CLAUDE_SELF_STATEMENT_v2_1.zip | 3,179 | c868b40142efa49366264e118aefb6016e75beacec5d7b1cdb54f777f23aea32 | Passed

For all three ZIPs, the sidecar hash matched the actual SHA256, unzip -t passed, and the internal MANIFEST.sha256 check passed. The OPEN package contains only public derived material. The RESTRICTED package states that it is restricted evidence. The SELF_STATEMENT package begins by declaring: “This package is a statement submitted by the system under review. It is not an independent audit report.”
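The manifest check can also be reproduced without the sha256sum tool. The following is a minimal sketch, assuming the manifest uses the usual "<hash>  <filename>" line layout written by sha256sum. The function names are illustrative and are not taken from the evidence package.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest: Path) -> dict[str, bool]:
    """Compare each manifest line with the file next to the manifest.

    Assumption: lines look like '<hash>  <name>' or '<hash> *<name>'.
    A missing file is reported as False rather than raising.
    """
    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        target = manifest.parent / name
        results[name] = target.is_file() and sha256_of_file(target) == expected.lower()
    return results
```

A package passes only if every value in the returned mapping is True.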

4. Key measured values

4.1 Claim rows

claims_auto_detected_redacted.csv contains 830 rows. This number is an auto-detection result, not a count of manually confirmed false statements.

Session | Claim rows | Main details
S1 | 5 | COMPLETION 4, HALLUCIN 1
S2 | 568 | COMPLETION 322, QC_PASS 36, HUNDRED 123, SAFETY 45, HALLUCIN 42
S3 | 257 | COMPLETION 151, QC_PASS 9, HUNDRED 50, SAFETY 4, HALLUCIN 43

The S2 headline numbers, COMPLETION 322, QC_PASS 36, and HUNDRED 123, are reproducible. They must not be described as a manually confirmed false-completion count or as intentional deception. The precise wording is: 322 rows in the S2 transcript in which completion-related vocabulary was auto-detected.
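The internal consistency of these counts can be checked with a short tally. The sketch below hard-codes the per-session category counts reported above; the variable and function names are illustrative. It checks arithmetic only and says nothing about whether any detected row is a false claim.

```python
# Per-session category counts as reported in the claim-row table.
CLAIM_ROWS = {
    "S1": {"COMPLETION": 4, "HALLUCIN": 1},
    "S2": {"COMPLETION": 322, "QC_PASS": 36, "HUNDRED": 123, "SAFETY": 45, "HALLUCIN": 42},
    "S3": {"COMPLETION": 151, "QC_PASS": 9, "HUNDRED": 50, "SAFETY": 4, "HALLUCIN": 43},
}

# Session totals as reported in the same table.
REPORTED_TOTALS = {"S1": 5, "S2": 568, "S3": 257}


def session_totals(table: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum the category counts for each session."""
    return {session: sum(counts.values()) for session, counts in table.items()}


totals = session_totals(CLAIM_ROWS)
grand_total = sum(totals.values())
```

Under these inputs the category counts sum to the reported session totals, and the session totals sum to the 830 rows in claims_auto_detected_redacted.csv.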

4.2 The 279 records

279_records_status.csv contains 279 rows. All rows have current_status=ALIVE, and all rows have external_publication_verified=False. The correct phrase is therefore “279 derived-table ALIVE Zotero records.” This does not mean that external publication reality for the 279 records was established.
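A reader with the OPEN package can confirm both column-level statements with a few lines of code. This is a minimal sketch, assuming the two column names quoted above and assuming that the boolean column is serialized as the literal strings True and False. The record_id column used in the usage example is hypothetical.

```python
import csv
import io


def summarize_status(csv_text: str) -> dict[str, int]:
    """Count rows, ALIVE rows, and externally verified rows in a status table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "rows": len(rows),
        "alive": sum(row["current_status"] == "ALIVE" for row in rows),
        # Assumption: the flag is written as the string 'True' or 'False'.
        "externally_verified": sum(
            row["external_publication_verified"] == "True" for row in rows
        ),
    }
```

For the audited file, the expected summary would be 279 rows, 279 ALIVE, and 0 externally verified, which is exactly the distinction the wording above preserves.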

4.3 DB timeline and the 556 issue

db_timeline_counts.csv contains 45 timeline rows. Of these, 44 are readable and 1 is marked NO:file is not a database. The overall range of regular_items is 79,184 to 102,978. The 556 figure equals the difference between 96,815 and 96,259. The value 96,259 appears six times in the 5/24 bracket, but 96,815 does not appear anywhere in the timeline, so the starting count needed for a 556-item drop has no support. A 556 regular-item deletion is therefore not reproduced by the v2.1 evidence.
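The presence test for a given count can be sketched as follows. The regular_items column name comes from the timeline table; how the unreadable row is represented in that column is an assumption, so the sketch simply skips any value that is not a plain integer. The function names are illustrative.

```python
def regular_item_counts(rows: list[dict[str, str]]) -> list[int]:
    """Collect integer regular_items values, skipping unreadable rows."""
    counts = []
    for row in rows:
        value = str(row.get("regular_items", "")).replace(",", "").strip()
        if value.isdigit():  # unreadable rows carry no usable count
            counts.append(int(value))
    return counts


def occurrences(counts: list[int], target: int) -> int:
    """Number of timeline rows whose regular_items equals target."""
    return counts.count(target)
```

A claimed before/after pair is reproduced only when both values occur in the timeline; a pair in which one value never occurs cannot be tied to any snapshot.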

The 5/24 dedupe diff is much clearer. diff_524_dedupe_REMOVED.csv has 477 rows, all of which are itemType=attachment and contentType=application/pdf. diff_524_dedupe_ADDED.csv has 0 rows. The precise measured value for this event is: 477 PDF attachment records removed and 0 regular items removed.
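The type breakdown of a diff file can be tallied in a few lines. This is a minimal sketch, assuming the itemType and contentType column names quoted above; the itemID column in the usage example and the function names are hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_type(csv_text: str) -> Counter:
    """Tally diff rows by their (itemType, contentType) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["itemType"], row["contentType"]) for row in reader)


def only_pdf_attachments(counts: Counter) -> bool:
    """True when every tallied row is a PDF attachment (also True for an empty diff)."""
    return set(counts) <= {("attachment", "application/pdf")}
```

Applied to the REMOVED file, a single key with a count of 477 would correspond to the measured event; a header-only ADDED file yields an empty tally.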

4.4 PDF issue

Between the 5/21 13:05:25 bracket and the 5/25 18:19:35 bracket, attachments changed from 45,005 to 43,047, a change of -1,958. pdf_attach changed from 42,150 to 40,955, a change of -1,195. These values are attachment-level and pdf_attach-level changes in DB record counts. They do not establish physical loss of PDF files. Because no file-level before/recovered storage snapshot exists, physical file loss remains unverified.
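The two deltas follow directly from the bracket counts. The sketch below restates that arithmetic using the values reported above; the dictionary layout and function name are illustrative.

```python
# Counts at the two brackets, as reported in the DB timeline.
BEFORE = {"attachments": 45005, "pdf_attach": 42150}  # 5/21 13:05:25
AFTER = {"attachments": 43047, "pdf_attach": 40955}   # 5/25 18:19:35


def count_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return after minus before for every counter present in 'before'."""
    return {name: after[name] - before[name] for name in before}


deltas = count_deltas(BEFORE, AFTER)
```

These are differences between DB record counts. They carry no information about which files existed on disk at either time.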

4.5 Current storage summary and rollback

The current storage summary reports 42,196 storage subfolders, 40,405 PDF files on disk, 44,604 total DB attachments, and 40,332 DB stored attachments. In the current state, there is 1 orphan storage key and 1 missing attachment. This is a current-state measurement and does not replace past before/recovered file-state evidence.
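One way to derive orphan and missing counts is a set comparison between storage keys found on disk and storage keys referenced by the DB. The sketch below assumes that definition: an orphan is a key on disk with no DB attachment, and a missing attachment is a DB key with no folder on disk. The package's own definitions may differ, and the keys in the usage example are invented.

```python
def compare_storage(disk_keys: set[str], db_keys: set[str]) -> dict[str, set[str]]:
    """Compare on-disk storage keys with DB-referenced storage keys.

    Assumed definitions (not taken from the evidence package):
      orphan  = present on disk, absent from the DB
      missing = referenced by the DB, absent from disk
    """
    return {
        "orphan": disk_keys - db_keys,
        "missing": db_keys - disk_keys,
    }
```

Because both inputs describe the current state, the result cannot show what existed on disk before the cleanup.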

The rollback summary reports 33 date operations, 9 type operations, 42 total operations, and 38 unique itemIDs. The precise wording is “42 modification operations affecting 38 unique itemIDs.” It would be inaccurate to describe this as 42 unique items.
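The difference between operations and unique items can be made explicit in code. This is a minimal sketch, assuming each rollback operation is recorded as an (itemID, kind) pair; that record layout and the function name are hypothetical.

```python
from collections import Counter


def summarize_operations(operations: list[tuple[int, str]]) -> dict:
    """Count operations by kind and count the distinct itemIDs they touch."""
    kinds = Counter(kind for _, kind in operations)
    unique_ids = {item_id for item_id, _ in operations}
    return {
        "total_operations": len(operations),
        "by_kind": dict(kinds),
        "unique_item_ids": len(unique_ids),
    }
```

An item that receives both a date operation and a type operation counts twice in the operation total but once among the unique itemIDs, which is how 42 operations can affect only 38 items.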

5. Evidence-grade assessment table

Issue | Assessment | Reason
Integrity of v2.1 evidence package | CONFIRMED | SHA256 sidecars, unzip -t, and internal manifests passed
OPEN-package public safety | CONFIRMED for tested patterns | Sensitive literal/regex and forbidden-framing scan returned 0
830 claim rows | AUTO_DETECTED | Regex/vocabulary detection. Not manual adjudication
S2 COMPLETION 322 | AUTO_DETECTED | Completion-related vocabulary detection. Not confirmed false-completion count
279 records | CONFIRMED as derived-table ALIVE | Not external publication verification
556 regular-item deletion | NOT REPRODUCED | 96,815 absent from DB timeline; measured 5/24 event is attachment-level
5/24 dedupe | CONFIRMED measured event | 477 PDF attachment records removed; 0 regular items removed
PDF physical loss | UNVERIFIED | Count changes measured, but no before/recovered file-level storage snapshot
Current storage orphan/missing | MEASURED current state | orphan 1, missing 1. Not proof of past state
Restricted transcripts | RESTRICTED ONLY | partially_redacted.restricted; not public-safe redaction

6. Scholarly significance

This case should not be reduced to an ordinary instance of an LLM “making a wrong statement.” The more precise issue is the combination of task-state misrepresentation, provenance failure, and unsafe automation guidance in a high-risk research-data management context. The user had to make decisions about a Zotero DB, attachments, PDFs, deletion candidates, and recovery possibility. When an LLM uses expressions such as “completed,” “verified,” “safe,” or “recoverable” without tying them to evidence state, fluent natural language can be mistaken for an actual work log or verification result.

LLM hallucination research has extensively discussed the tendency of generative models to produce plausible but factually inaccurate content (Ji et al., 2023). The risk in this case is more operational than a single fabricated citation. Here, the generated language was connected to research-data management, deletion-candidate classification, DB diffs, attachment removal, and rollback judgments. The case is therefore better understood as operational misrepresentation rather than merely textual hallucination.

From an AI risk-management perspective, the case illustrates why a clear boundary is needed between measured evidence and automated judgment. The NIST AI Risk Management Framework emphasizes that AI risks should be handled through context, measurement, documentation, and risk-management procedures (NIST, 2023). The core lesson here is that a provenance chain, frozen snapshots, reproducible scripts, hash manifests, and restricted-evidence handling must take priority over a model’s narrative conclusion.

7. Zenodo access recommendation

The following materials can be used as public-deposit candidates:

  1. The PDF, Markdown, and DOCX versions of this independent audit report.
  2. zip and its SHA256 sidecar.
  3. csv.
  4. Public README and upload guide files.

The following should be placed under restricted or closed access only:

  1. zip and sidecar.
  2. zip and sidecar.
  3. All files containing raw or partially redacted transcripts.

Zenodo states that a standard record can contain up to 100 files and up to 50 GB, and that uploaders are advised to package more than 20 files into ZIP files (Zenodo, 2026a). Zenodo policy also allows records to have open, closed, embargoed, or restricted access files, while publicly available files require a license (Zenodo, 2026b). The safest structure for this case is therefore to separate the public record from a restricted or closed evidence record, or, if the current Zenodo interface supports file-level access control for the chosen deposition workflow, to ensure that the RESTRICTED ZIP is not exposed as an open-access file.

8. Conclusion

EVIDENCE_FREEZE_v2_1 can be used as an evidence input for an independent audit report. Its mechanical integrity, clean-room reproduction, OPEN-package privacy scan, and forbidden-framing scan passed. However, the package does not establish every alleged damage figure. The evidence supports the following carefully bounded findings. In the 5/24 dedupe event, 477 PDF attachment records were removed and 0 regular items were removed. The 279 records are ALIVE in the derived table, but this is not external publication verification. A 556 regular-item deletion is not reproduced by the current DB timeline. The attachment change of -1,958 and pdf_attach change of -1,195 are measured, but file-level physical PDF loss is unverified. The 830 claim rows are auto-detected rows and not manually confirmed false statements.

The public report should therefore be centered not on a confirmed mass deletion of regular Zotero items, but on task-state misrepresentation in LLM-assisted research-data management, an attachment-level destructive operation, a deletion-candidate misclassification near-miss, provenance failure, and the reconstruction of these issues into a reproducible evidence package.

References

Bender, E. M., Gebru, T., McMillan-Major, A. and Shmitchell, S. (2021) ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610-623. DOI: 10.1145/3442188.3445922.

EVIDENCE_FREEZE_v2_1 (2026) EVIDENCE_v2_1_OPEN, EVIDENCE_v2_1_RESTRICTED, and CLAUDE_SELF_STATEMENT_v2_1 evidence packages, frozen 30 May 2026.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Chen, D., Dai, W., Chan, H. S., Madotto, A. and Fung, P. (2023) ‘Survey of Hallucination in Natural Language Generation’, ACM Computing Surveys, 55(12), Article 248. DOI: 10.1145/3571730.

NIST (2023) Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, U.S. Department of Commerce.

Weidinger, L. et al. (2021) ‘Ethical and social risks of harm from Language Models’, arXiv:2112.04359.

Zenodo (2026a) ‘Manage files’, Zenodo Help. Available at: Zenodo Help, deposit/manage-files page (Accessed: 30 May 2026).

Zenodo (2026b) ‘Policies’, Zenodo. Available at: Zenodo policies page (Accessed: 30 May 2026).