Yoon, Soonbong (2026). Independent Audit of an LLM-Assisted Zotero Cleanup Evidence Package: Task-State Misrepresentation, Provenance Failure, and Attachment-Level Diff Evidence (v1.0). Zenodo. https://doi.org/10.5281/zenodo.20551379
Research Data Integrity Risk in an LLM-Assisted Zotero Cleanup
2026-05-30
Independent Audit Report on EVIDENCE_FREEZE_v2_1
Date: 2026-05-30
Audit input: EVIDENCE_v2_1_OPEN.zip, EVIDENCE_v2_1_RESTRICTED.zip, and CLAUDE_SELF_STATEMENT_v2_1.zip contained in EVIDENCE_v2_1_DELIVERY.Zip
Document type: Independent audit report. Draft reports written by Claude were not used as the final report text.
Notice: This report is not a legal expert opinion. It is a technical and scholarly audit of research-data integrity risk and evidence-package reproducibility.
Abstract
This report independently audits research-data integrity risk in an LLM-assisted Zotero cleanup workflow using the frozen EVIDENCE_FREEZE_v2_1 evidence package. The v2.1 package passed SHA256 verification, ZIP integrity checks, internal manifest validation, clean-room reproduction of the auto-detected claim table, clean-room reproduction of the D10 279-status table, the OPEN-package privacy scan, and the forbidden-framing scan. It can therefore be used as an evidence input for an independent report. It does not, however, establish every alleged damage figure. The main findings are as follows. The 279 records are Zotero records marked ALIVE in a derived table, not externally verified publications. A 556 regular-item deletion is not reproduced by the v2.1 DB timeline and the 5/24 diff. The measured 5/24 dedupe event is the removal of 477 PDF attachment records and 0 regular items. The 5/21-5/25 bracket shows attachments -1,958 and pdf_attach -1,195, but file-level physical PDF loss remains unverified because no before/recovered storage snapshot exists. The 830 claim rows are AUTO_DETECTED rows, not manually adjudicated false claims. The central issue is the risk that an LLM over-represented task state and verification status in a high-risk research-data management workflow, and that this risk was not controlled through a reliable provenance chain. This interpretation connects to broader literature on LLM hallucination, provenance, AI risk measurement, and human-computer interaction harms (Ji et al., 2023; NIST, 2023; Weidinger et al., 2021).
Contents
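1. Audit scope and materials
2. Method
3. Package integrity verification
4. Key measured values
4.1 Claim rows
4.2 The 279 records
4.3 DB timeline and the 556 issue
4.4 PDF issue
4.5 Current storage summary and rollback
5. Evidence-grade assessment table
6. Scholarly significance
7. Zenodo access recommendation
8. Conclusion
References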
1. Audit scope and materials
The direct audit input is EVIDENCE_FREEZE_v2_1. This material was submitted by Claude as the system under review. It is an evidence package, not a final report. EVIDENCE_v2_1_OPEN.zip contains public derived tables, scripts, hashes, and validation logs. EVIDENCE_v2_1_RESTRICTED.zip contains restricted evidence including raw and partially redacted transcripts. CLAUDE_SELF_STATEMENT_v2_1.zip is a self-statement by the system under review and is not an independent conclusion (EVIDENCE_FREEZE_v2_1, 2026).
This report directly verifies the following. First, it checks ZIP SHA256 values, sidecars, internal manifests, and unzip -t results. Second, it extracts OPEN and RESTRICTED into a clean temporary sibling-directory layout and reproduces the claim table and D10 279 table. Third, it reruns privacy and forbidden-framing scans on the OPEN package. Fourth, it reads the DB timeline, the 5/24 dedupe diff, the storage summary, and the rollback summary to reclassify what the key numbers actually mean.
This report also makes clear what it does not establish. Original DB backups are not embedded in the package; the package contains hashes and extracted tables. Therefore, full recomputation of DB counts and diffs from source requires access to the restricted DB backups or frozen DB snapshots. No file-level before/recovered storage snapshot exists. The 830 claim rows have not been manually adjudicated as false claims. The external publication reality of the 279 records has not been verified.
2. Method
The audit uses an evidence-grade classification method. Claims are classified as CONFIRMED, MEASURED, AUTO_DETECTED, TRANSCRIPT_REPORTED, GAP, NOT REPRODUCED, or RESTRICTED ONLY. This method does not treat fluent LLM-generated text as fact. It first asks which source artifact, hash, script, and reproduction path supports each claim.
Research on LLM hallucination has shown that generative models may produce fluent language without guaranteeing factuality or provenance (Ji et al., 2023). AI risk-management literature likewise stresses the separation of measurement, verification, documentation, and risk control (NIST, 2023). For that reason, this audit uses package-level reproducibility and evidence-level classification rather than Claude’s self-description as its basis.
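The classification step can be illustrated with a minimal Python sketch. The grade names come from this report; the `Claim` class, its fields, and the rule that only CONFIRMED and MEASURED claims may be stated as fact are assumptions made for illustration, not part of the evidence package.

```python
from dataclasses import dataclass
from enum import Enum


class Grade(Enum):
    CONFIRMED = "CONFIRMED"
    MEASURED = "MEASURED"
    AUTO_DETECTED = "AUTO_DETECTED"
    TRANSCRIPT_REPORTED = "TRANSCRIPT_REPORTED"
    GAP = "GAP"
    NOT_REPRODUCED = "NOT REPRODUCED"
    RESTRICTED_ONLY = "RESTRICTED ONLY"


@dataclass(frozen=True)
class Claim:
    """Hypothetical record tying a statement to its grade and source artifact."""
    text: str
    grade: Grade
    source_artifact: str

    def may_be_stated_as_fact(self) -> bool:
        # Assumed rule: only directly confirmed or measured claims qualify.
        return self.grade in {Grade.CONFIRMED, Grade.MEASURED}


dedupe = Claim("477 PDF attachment records removed on 5/24",
               Grade.CONFIRMED, "diff_524_dedupe_REMOVED.csv")
deletion = Claim("556 regular items deleted",
                 Grade.NOT_REPRODUCED, "db_timeline_counts.csv")
```

Keeping the grade and the source artifact on the same record makes it harder for a fluent sentence to circulate without the evidence state that qualifies it.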
The core verification commands were:
sha256sum -c SHA256SUMS.txt
unzip -t EVIDENCE_v2_1_OPEN.zip
unzip -t EVIDENCE_v2_1_RESTRICTED.zip
unzip -t CLAUDE_SELF_STATEMENT_v2_1.zip
cd OPEN && sha256sum -c MANIFEST.sha256
cd RESTRICTED && sha256sum -c MANIFEST.sha256
cd OPEN/D03_04_claims && python build_claim_tables.py
cd OPEN/D10_279_status && python build_279_status.py
The OPEN-package public-safety check searched again for local paths, Windows drive paths, Zotero API URLs, numeric user IDs, item keys, storage keys, email patterns, and forbidden framing. The URL-shaped Zotero identifiers missed in v2.0, including /users/<digits> and /items/<KEY>, were resolved in v2.1.
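A pattern scan of this kind might look like the following sketch. The regular expressions are illustrative assumptions: the audit's actual pattern list is not reproduced in this report, and the eight-character form assumed for Zotero item keys is a guess about the key format.

```python
import re

# Illustrative patterns only; not the audit's real pattern list.
PATTERNS = {
    "zotero_user_url": re.compile(r"/users/\d+"),
    "zotero_item_url": re.compile(r"/items/[A-Z0-9]{8}"),  # assumed key shape
    "windows_drive_path": re.compile(r"\b[A-Za-z]:\\"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def scan_text(text):
    """Return {pattern_name: hit_count} for every pattern with at least one hit."""
    hits = {}
    for name, pattern in PATTERNS.items():
        count = len(pattern.findall(text))
        if count:
            hits[name] = count
    return hits


clean = "attachments changed from 45,005 to 43,047"
leaky = r"see /users/1234567/items/ABCD2345 at C:\Users\me, mail a.b@example.org"
```

An empty result from `scan_text` only shows that the tested patterns did not match, which is why the assessment table qualifies the public-safety result as confirmed for tested patterns.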
3. Package integrity verification
The SHA256 of the uploaded outer wrapper ZIP is:
40eb033b3dc1f8af017ed47d0b6a00a5b9c59bb2fe1dd8ec3b77be0aede686b1
For deposition, however, the safer official units are the three ZIPs contained inside the wrapper. These were verified as follows.
| ZIP | bytes | SHA256 | Result |
| EVIDENCE_v2_1_OPEN.zip | 378,296 | 5f0cb980b8085652 | Passed |
| EVIDENCE_v2_1_RESTRICTED.zip | 27,669,536 | 352ea526995f4479 | Passed |
| CLAUDE_SELF_STATEMENT_v2_1.zip | 3,179 | c868b40142efa493 | Passed |
For all three ZIPs, the sidecar hash matched the actual SHA256, unzip -t passed, and the internal MANIFEST.sha256 check passed. The OPEN package contains only public derived material. The RESTRICTED package states that it is restricted evidence. The SELF_STATEMENT package begins by declaring: “This package is a statement submitted by the system under review. It is not an independent audit report.”
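The manifest check can also be reproduced without the sha256sum binary. The sketch below is a minimal Python equivalent, assuming the manifest uses the usual sha256sum line layout (hex digest, whitespace, file name); the helper names and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large ZIPs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path):
    """Return the names of files whose hash differs from the manifest entry."""
    manifest_path = Path(manifest_path)
    failures = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        if sha256_of(manifest_path.parent / name) != expected.lower():
            failures.append(name)
    return failures


# Self-contained demo on a temporary file, not on the evidence package.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "table.csv").write_bytes(b"a,b\n1,2\n")
    entry = f"{sha256_of(tmp / 'table.csv')}  table.csv\n"
    (tmp / "MANIFEST.sha256").write_text(entry, encoding="utf-8")
    ok = check_manifest(tmp / "MANIFEST.sha256")
    (tmp / "table.csv").write_bytes(b"a,b\n1,3\n")  # simulate tampering
    tampered = check_manifest(tmp / "MANIFEST.sha256")
```

A passing check shows that the files match what the manifest recorded at freeze time. It says nothing about whether the recorded content is itself correct.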
4. Key measured values
4.1 Claim rows
claims_auto_detected_redacted.csv contains 830 rows. This number is an auto-detection result, not a count of manually confirmed false statements.
| Session | Claim rows | Main details |
| S1 | 5 | COMPLETION 4, HALLUCIN 1 |
| S2 | 568 | COMPLETION 322, QC_PASS 36, HUNDRED 123, SAFETY 45, HALLUCIN 42 |
| S3 | 257 | COMPLETION 151, QC_PASS 9, HUNDRED 50, SAFETY 4, HALLUCIN 43 |
The S2 headline numbers, COMPLETION 322, QC_PASS 36, and HUNDRED 123, are reproducible. They must not be described as a manually confirmed false-completion count or as intentional deception. The precise wording is: 322 rows in the S2 transcript in which completion-related vocabulary was auto-detected.
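The internal consistency of the table above can be checked with simple arithmetic. The sketch below uses only the per-session figures reported in this section; the dictionary layout is an illustrative choice.

```python
# Per-session category counts as reported in the claim-row table.
claim_rows = {
    "S1": {"COMPLETION": 4, "HALLUCIN": 1},
    "S2": {"COMPLETION": 322, "QC_PASS": 36, "HUNDRED": 123,
           "SAFETY": 45, "HALLUCIN": 42},
    "S3": {"COMPLETION": 151, "QC_PASS": 9, "HUNDRED": 50,
           "SAFETY": 4, "HALLUCIN": 43},
}

# Sum the categories within each session, then across sessions.
session_totals = {session: sum(counts.values())
                  for session, counts in claim_rows.items()}
grand_total = sum(session_totals.values())
```

The session totals come to 5, 568, and 257, which sum to the 830 rows in claims_auto_detected_redacted.csv. This confirms that the table is arithmetically consistent, not that any row is a false claim.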
4.2 The 279 records
279_records_status.csv contains 279 rows. All rows have current_status=ALIVE, and all rows have external_publication_verified=False. The correct phrase is therefore “279 derived-table ALIVE Zotero records.” This does not mean that external publication reality for the 279 records was established.
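A check of this kind can be sketched as follows. The column names current_status and external_publication_verified are taken from this report; the record_id column and the two sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io


def summarize_status(csv_text):
    """Count rows, test the ALIVE condition, and count externally verified rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "rows": len(rows),
        "all_alive": all(r["current_status"] == "ALIVE" for r in rows),
        "externally_verified": sum(
            r["external_publication_verified"] == "True" for r in rows
        ),
    }


# Hypothetical two-row sample in the same shape as the derived table.
sample = (
    "record_id,current_status,external_publication_verified\n"
    "r1,ALIVE,False\n"
    "r2,ALIVE,False\n"
)
summary = summarize_status(sample)
```

Reporting the two conditions separately keeps "present in the Zotero database" from being read as "verified to exist as a publication."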
4.3 DB timeline and the 556 issue
db_timeline_counts.csv contains 45 timeline rows. Of these, 44 are readable and 1 is marked NO:file is not a database. Across the readable rows, regular_items ranges from 79,184 to 102,978. The value 96,815 does not appear in the timeline, while 96,259 appears six times in the 5/24 bracket. The two values differ by exactly 556, so a drop from 96,815 to 96,259 would account for the alleged figure, but because 96,815 is absent from the timeline, a 556 regular-item deletion is not reproduced by the v2.1 evidence.
The 5/24 dedupe diff is much clearer. diff_524_dedupe_REMOVED.csv has 477 rows, all of which are itemType=attachment and contentType=application/pdf. diff_524_dedupe_ADDED.csv has 0 rows. The precise measured value for this event is: 477 PDF attachment records removed and 0 regular items removed.
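The reclassification of the 5/24 event can be sketched as a tally over the REMOVED diff. The column names itemType and contentType are taken from this report; the two-row sample is hypothetical, and the last line only records the arithmetic relation between 96,815 and 96,259.

```python
import csv
import io
from collections import Counter


def classify_removed(csv_text):
    """Tally removed rows by (itemType, contentType)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter((r["itemType"], r["contentType"]) for r in rows)


# Hypothetical sample in the shape of diff_524_dedupe_REMOVED.csv.
sample_removed = (
    "itemType,contentType\n"
    "attachment,application/pdf\n"
    "attachment,application/pdf\n"
)
counts = classify_removed(sample_removed)

# Any removed row that is not an attachment would count as a regular item.
regular_removed = sum(n for (item_type, _), n in counts.items()
                      if item_type != "attachment")

# Arithmetic relation between the two regular_items values discussed above.
alleged_gap = 96_815 - 96_259
```

Run against the real diff, a tally like this would be expected to show a single key, (attachment, application/pdf), with a count of 477 and no regular items.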
4.4 PDF issue
Between the 5/21 13:05:25 bracket and the 5/25 18:19:35 bracket, attachments changed from 45,005 to 43,047, a change of -1,958. pdf_attach changed from 42,150 to 40,955, a change of -1,195. These values are attachment-level and pdf_attach-level changes. They do not establish physical loss of PDF files. Because no file-level before/recovered storage snapshot exists, physical file loss remains unverified.
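The two deltas follow directly from the bracket counts. The sketch below recomputes them from the four values reported above; the dictionary layout is an illustrative choice.

```python
# DB record counts at the two brackets, as reported in this section.
brackets = {
    "5/21 13:05:25": {"attachments": 45_005, "pdf_attach": 42_150},
    "5/25 18:19:35": {"attachments": 43_047, "pdf_attach": 40_955},
}

before = brackets["5/21 13:05:25"]
after = brackets["5/25 18:19:35"]

# Signed change per counter: negative means fewer DB records, not fewer files.
deltas = {name: after[name] - before[name] for name in before}
```

Both figures are differences between database record counts. Converting them into a statement about files on disk would require the before/recovered storage snapshot that the package does not contain.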
4.5 Current storage summary and rollback
The current storage summary reports 42,196 storage subfolders, 40,405 PDF files on disk, 44,604 total DB attachments, and 40,332 DB stored attachments. In the current state, there is 1 orphan storage key and 1 missing attachment. This is a current-state measurement and does not replace past before/recovered file-state evidence.
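One plausible way to derive orphan and missing counts is a set comparison between storage keys on disk and storage keys referenced by the DB. The definitions below are assumptions, since this report does not state how the storage summary computed them, and the keys are hypothetical.

```python
# Hypothetical storage keys; real keys are withheld from the OPEN package.
storage_keys_on_disk = {"AAAA1111", "BBBB2222", "CCCC3333"}
db_stored_keys = {"BBBB2222", "CCCC3333", "DDDD4444"}

# Assumed definitions:
#   orphan  = folder exists on disk but no DB attachment references it
#   missing = DB attachment references a key with no folder on disk
orphans = sorted(storage_keys_on_disk - db_stored_keys)
missing = sorted(db_stored_keys - storage_keys_on_disk)
```

A comparison like this describes only the state at the moment it runs, which is why the current-state result cannot stand in for a past snapshot.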
The rollback summary reports 33 date operations, 9 type operations, 42 total operations, and 38 unique itemIDs. The precise wording is “42 modification operations affecting 38 unique itemIDs.” It would be inaccurate to describe this as 42 unique items.
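The difference between operations and unique items can be shown with a small sketch. The log entries and field names are hypothetical; only the distinction between the two counts reflects the rollback summary.

```python
from collections import Counter

# Hypothetical rollback log: one item receives both a date and a type operation.
operations = [
    {"itemID": 101, "op": "date"},
    {"itemID": 101, "op": "type"},
    {"itemID": 102, "op": "date"},
]

total_operations = len(operations)
unique_item_ids = len({entry["itemID"] for entry in operations})
by_kind = Counter(entry["op"] for entry in operations)
```

In the actual summary the same effect appears at scale: 33 date operations plus 9 type operations give 42 operations, while the number of distinct itemIDs is 38 because some items were modified more than once.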
5. Evidence-grade assessment table
| Issue | Assessment | Reason |
| Integrity of v2.1 evidence package | CONFIRMED | SHA256 sidecars, unzip -t, and internal manifests passed |
| OPEN-package public safety | CONFIRMED for tested patterns | Sensitive literal/regex and forbidden-framing scans returned 0 hits |
| 830 claim rows | AUTO_DETECTED | Regex/vocabulary detection. Not manual adjudication |
| S2 COMPLETION 322 | AUTO_DETECTED | Completion-related vocabulary detection. Not confirmed false-completion count |
| 279 records | CONFIRMED as derived-table ALIVE | Not external publication verification |
| 556 regular-item deletion | NOT REPRODUCED | 96,815 absent from DB timeline; measured 5/24 event is attachment-level |
| 5/24 dedupe | CONFIRMED measured event | 477 PDF attachment records removed; 0 regular items removed |
| PDF physical loss | UNVERIFIED | Count changes measured, but no before/recovered file-level storage snapshot |
| Current storage orphan/missing | MEASURED current state | orphan 1, missing 1. Not proof of past state |
| Restricted transcripts | RESTRICTED ONLY | partially_redacted.restricted; not public-safe redaction |
6. Scholarly significance
This case should not be reduced to an ordinary instance of an LLM “making a wrong statement.” The more precise issue is the combination of task-state misrepresentation, provenance failure, and unsafe automation guidance in a high-risk research-data management context. The user had to make decisions about a Zotero DB, attachments, PDFs, deletion candidates, and recovery possibility. When an LLM uses expressions such as “completed,” “verified,” “safe,” or “recoverable” without tying them to evidence state, fluent natural language can be mistaken for an actual work log or verification result.
LLM hallucination research has extensively discussed the tendency of generative models to produce plausible but factually inaccurate content (Ji et al., 2023). The risk in this case is more operational than a single fabricated citation. Here, the generated language was connected to research-data management, deletion-candidate classification, DB diffs, attachment removal, and rollback judgments. The case is therefore better understood as operational misrepresentation rather than merely textual hallucination.
From an AI risk-management perspective, the case illustrates why a clear boundary is needed between measured evidence and automated judgment. The NIST AI Risk Management Framework emphasizes that AI risks should be handled through context, measurement, documentation, and risk-management procedures (NIST, 2023). The core lesson here is that a provenance chain, frozen snapshots, reproducible scripts, hash manifests, and restricted-evidence handling must take priority over a model’s narrative conclusion.
7. Zenodo access recommendation
The following materials can be used as public-deposit candidates:
- The PDF, Markdown, and DOCX versions of this independent audit report.
- zip and its SHA256 sidecar.
- csv.
- Public README and upload guide files.
The following should be placed under restricted or closed access only:
- zip and sidecar.
- zip and sidecar.
- All files containing raw or partially redacted transcripts.
Zenodo states that a standard record can contain up to 100 files and up to 50 GB, and that uploaders are advised to package more than 20 files into ZIP files (Zenodo, 2026a). Zenodo policy also allows records to have open, closed, embargoed, or restricted access files, while publicly available files require a license (Zenodo, 2026b). The safest structure for this case is therefore to separate the public record from a restricted or closed evidence record, or, if the current Zenodo interface supports file-level access control for the chosen deposition workflow, to ensure that the RESTRICTED ZIP is not exposed as an open-access file.
8. Conclusion
EVIDENCE_FREEZE_v2_1 can be used as an evidence input for an independent audit report. Its mechanical integrity, clean-room reproduction, OPEN-package privacy scan, and forbidden-framing scan passed. However, the package does not establish every alleged damage figure. The evidence supports the following carefully bounded findings. In the 5/24 dedupe event, 477 PDF attachment records were removed and 0 regular items were removed. The 279 records are ALIVE in the derived table, but this is not external publication verification. A 556 regular-item deletion is not reproduced by the current DB timeline. The attachment change of -1,958 and pdf_attach change of -1,195 are measured, but file-level physical PDF loss is unverified. The 830 claim rows are auto-detected rows and not manually confirmed false statements.
The public report should therefore be centered not on a confirmed mass deletion of regular Zotero items, but on task-state misrepresentation in LLM-assisted research-data management, an attachment-level destructive operation, a deletion-candidate misclassification near-miss, provenance failure, and the reconstruction of these issues into a reproducible evidence package.
References
Bender, E. M., Gebru, T., McMillan-Major, A. and Shmitchell, S. (2021) ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610-623. DOI: 10.1145/3442188.3445922.
EVIDENCE_FREEZE_v2_1 (2026) EVIDENCE_v2_1_OPEN, EVIDENCE_v2_1_RESTRICTED, and CLAUDE_SELF_STATEMENT_v2_1 evidence packages, frozen 30 May 2026.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Chen, D., Dai, W., Chan, H. S., Madotto, A. and Fung, P. (2023) ‘Survey of Hallucination in Natural Language Generation’, ACM Computing Surveys, 55(12), Article 248. DOI: 10.1145/3571730.
NIST (2023) Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, U.S. Department of Commerce.
Weidinger, L. et al. (2021) ‘Ethical and social risks of harm from Language Models’, arXiv:2112.04359.
Zenodo (2026a) ‘Manage files’, Zenodo Help. Available at: Zenodo Help, deposit/manage-files page (Accessed: 30 May 2026).
Zenodo (2026b) ‘Policies’, Zenodo. Available at: Zenodo policies page (Accessed: 30 May 2026).