Secrets Exposed in arXiv LaTeX Sources

When researchers write academic papers, they often use a formatting system called LaTeX — think of it like a word processor but for scientists. When they submit these papers to arXiv (a free, public website where millions of research papers are shared), they also upload the raw "source files" — essentially the behind-the-scenes drafts of their work. The problem? Those source files frequently contain things never meant for public eyes: private notes between colleagues, login credentials, secret API keys that unlock cloud systems, and even GPS coordinates of researchers' home addresses. It's a bit like publishing your book manuscript and accidentally including all your private sticky notes, passwords, and handwritten arguments with your co-author — permanently and publicly, for anyone in the world to read.

Introduction

Here is an uncomfortable irony worth sitting with: the researchers most trained to spot information exposure are quietly leaking their own. A study from RWTH Aachen University, scheduled to appear at the 2026 IEEE Symposium on Security and Privacy, examined every arXiv submission with source files going back to 1991 — 2.7 million papers in total, covering 93% of all arXiv submissions. What they found was not a fringe problem.

In 88 percent of those submissions, the researchers found content that does nothing for the compiled PDF and was never meant for public distribution. The exposed material ranged from mundane build artifacts to genuinely dangerous secrets: working API tokens, private SSH links, editable Google Docs filled with PII, and passwords committed directly into LaTeX comments. The finding that cuts deepest is this: papers from top-tier security conferences leak more hidden information on average than submissions from any other field. This post breaks down what is leaking, why existing tools fail to stop it, and what researchers and institutions need to do before their next submission.

What Exactly Is Hiding in arXiv Source Files

The Three Categories of Unintended Disclosure

The leaks fall into three groups. First, files bundled into the upload that the paper does not need to compile — backup folders, old drafts, complete Git repositories with editing histories, configuration files, and leftover .nfs files created when something is deleted on a network filesystem. Second, metadata embedded in images and PDFs, which can include usernames, software versions, hardware identifiers, and GPS coordinates. Third, comments and commented-out text inside the LaTeX files themselves.
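As a rough illustration of the first category, the sketch below walks a project directory and flags files whose names match the kinds of leftovers described above. The directory and suffix lists are assumptions chosen for this example, not the study's detection method, and a name-based check cannot tell whether a file is actually needed to compile the paper.

```python
import os

# Hypothetical name lists for illustration; extend them for your own projects.
SUSPICIOUS_DIRS = {".git", ".svn", "backup", "old"}
SUSPICIOUS_SUFFIXES = (".bak", ".orig", ".swp", "~")


def flag_leftover_files(root):
    """Return sorted relative paths of files that look like leftovers."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        parts = [] if rel_dir == "." else rel_dir.split(os.sep)
        # Anything below a version-control or backup folder is suspect.
        inside_suspicious_dir = any(part in SUSPICIOUS_DIRS for part in parts)
        for name in filenames:
            looks_leftover = (
                inside_suspicious_dir
                or name.startswith(".nfs")  # stale handles left by network filesystems
                or name.endswith(SUSPICIOUS_SUFFIXES)
            )
            if looks_leftover:
                flagged.append("/".join(parts + [name]))
    return sorted(flagged)
```

Calling `flag_leftover_files(".")` from the folder you are about to zip gives a short review list. Anything it reports still needs a human decision, and a clean result does not prove the upload is minimal.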

The third category deserves particular attention from a threat intelligence perspective. LaTeX comments — lines prefixed with % that are invisible in the compiled PDF — are a natural workspace for collaborators. Authors use them to flag disagreements, leave revision notes, include temporary test credentials, or paste API endpoints during development. None of that appears in the published paper. All of it appears in the source archive that arXiv makes publicly available.
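To see what a reader of the source archive sees, a minimal sketch along these lines lists every comment in a LaTeX source string. It treats % as the start of a comment unless it is escaped as \%, which is standard LaTeX behavior. It does not handle verbatim environments or \verb, so treat it as an approximation; the function name is illustrative.

```python
def extract_comments(tex):
    """Return (line_number, comment_text) pairs for non-empty LaTeX comments."""
    comments = []
    for lineno, line in enumerate(tex.splitlines(), start=1):
        backslashes = 0  # length of the backslash run directly before the current char
        for i, ch in enumerate(line):
            if ch == "\\":
                backslashes += 1
                continue
            # An odd run of backslashes escapes the '%'; an even run does not.
            if ch == "%" and backslashes % 2 == 0:
                text = line[i + 1:].strip()
                if text:
                    comments.append((lineno, text))
                break
            backslashes = 0
    return comments
```

Running this over each .tex file before upload, for example with `extract_comments(open("main.tex", encoding="utf-8").read())`, shows the text that will be public even though it never reaches the PDF.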

What the Researchers Actually Found

Analysis of over 1.2 TB of source data uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, cloud API keys, confidential author communications, internal disagreements, and conference submission credentials. Taken together, this material poses serious reputational risks to both researchers and institutions.

The specifics from the RWTH Aachen study sharpen that picture further: 699 links to Google Docs that granted edit access to anyone who clicked — at least 200 exposing material from the authoring process such as reviews, rebuttals, cover letters, and meeting minutes. Eighteen of those documents contained survey data with personally identifiable information about study participants. The researchers also recovered API tokens, private keys, and passwords in 82 submissions, along with links to FTP and SSH servers, and GPS coordinates that in some cases mapped both research buildings and authors' residential addresses.
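For the GPS case, a minimal check is possible with the standard library alone. The sketch below reports whether a JPEG carries an EXIF block whose first directory contains the GPS pointer tag (0x8825). It is an assumption-laden illustration, not the study's tooling: it only covers JPEG, it does not decode the coordinates, and it ignores less common layouts, so a maintained metadata tool is the safer choice for real cleaning.

```python
import struct

GPS_IFD_TAG = 0x8825  # EXIF tag that points to the GPS information block


def tiff_has_gps_pointer(tiff):
    """Check the first image directory of a TIFF/EXIF block for the GPS tag."""
    if len(tiff) < 8 or tiff[:2] not in (b"II", b"MM"):
        return False
    endian = "<" if tiff[:2] == b"II" else ">"  # byte order declared in the header
    (ifd_offset,) = struct.unpack(endian + "I", tiff[4:8])
    if ifd_offset + 2 > len(tiff):
        return False
    (entry_count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    for index in range(entry_count):
        start = ifd_offset + 2 + 12 * index  # each directory entry is 12 bytes
        entry = tiff[start:start + 12]
        if len(entry) < 12:
            break
        (tag,) = struct.unpack(endian + "H", entry[:2])
        if tag == GPS_IFD_TAG:
            return True
    return False


def jpeg_has_gps_exif(data):
    """Return True if JPEG bytes contain an EXIF segment with a GPS pointer."""
    if not data.startswith(b"\xff\xd8"):
        return False  # not a JPEG
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # unexpected layout; give up rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            return False
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return tiff_has_gps_pointer(payload[6:])
        pos += 2 + length
    return False
```

A figure can be checked with `jpeg_has_gps_exif(open("figure.jpg", "rb").read())`, where the file name is a placeholder. A True result means the image should be re-exported or stripped of metadata before upload.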

Important: An exposed API token in a LaTeX comment is not a theoretical risk. It is an active credential that may still be valid years after the paper was published. arXiv never removes old versions of submissions, so a source file published in 2022 with a working cloud key is still downloadable today.

Leak Category	Examples Found	Risk Classification	Likely Exploit Path
Embedded credentials	API keys, passwords, SSH links	Critical	Direct unauthorized access
Editable cloud documents	Google Docs with edit access	High	Data exfiltration, PII theft
GPS-tagged image metadata	EXIF coordinates in figures	Medium	Physical reconnaissance
LaTeX comments	Author disputes, internal notes	Medium–High	Social engineering, reputational damage
Bundled Git repositories	Full editing history with diffs	High	Intellectual property exposure
Leftover draft files	Old versions, review rebuttals	Medium	Process intelligence, competitive espionage

Security Researchers Leak More Than Anyone Else

The Finding That Should Sting

The finding that should sting most: papers connected to top-tier security conferences leak more hidden information on average than papers from other fields. Submissions matched to A-star and A-ranked security venues showed statistically significant increases in dangling files, metadata, and unique comments compared to other computer science work and to the rest of arXiv.

The researchers offer a partial explanation: security papers tend to have larger and more complex source trees — more scripts, more auxiliary tools, more experimental apparatus — which creates more surface area for accidental inclusion. That explanation is technically sound and operationally unsatisfying. A researcher who publishes work on credential hygiene and simultaneously leaks an FTP password in their LaTeX comment block has a credibility problem that no complexity argument fully resolves.

Consider a realistic scenario: a research team submits a paper on cloud API security. During development, they pasted a test key into a LaTeX comment as a placeholder. They forgot to remove it before submission. The compiled PDF looks pristine. The source file — publicly downloadable by anyone — contains a valid credential for the team's AWS testing environment. That key may never expire unless someone explicitly rotates it. This maps directly to MITRE ATT&CK T1552.001 (Credentials in Files) and represents the kind of initial access vector that threat actors scan for systematically using OSINT tooling.

Pro Tip: Before submitting any paper to arXiv, run a targeted search across your entire source directory for patterns like key=, password=, token=, https://docs.google.com/, and @gmail.com. A basic grep pass catches the most dangerous exposures in under a minute.
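The same pass can be scripted so it runs the same way every time. The following is a minimal sketch that uses the patterns named in the tip; allowing whitespace around the equals sign and matching case-insensitively are assumptions added here, and the function names are illustrative. Expect false positives (any word ending in "key" followed by "=" will match) and treat a clean result as a starting point, not a guarantee.

```python
import os
import re

# Patterns from the tip above; whitespace tolerance and case-insensitivity
# are additions for this sketch.
PATTERNS = {
    "key assignment": re.compile(r"key\s*=", re.IGNORECASE),
    "password assignment": re.compile(r"password\s*=", re.IGNORECASE),
    "token assignment": re.compile(r"token\s*=", re.IGNORECASE),
    "Google Docs link": re.compile(r"https://docs\.google\.com/"),
    "Gmail address": re.compile(r"@gmail\.com", re.IGNORECASE),
}


def scan_text(text):
    """Return (line_number, pattern_label) pairs for every match in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


def scan_directory(root):
    """Scan every file under root and map relative path -> list of hits."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Binary files are read too; undecodable bytes are dropped.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                hits = scan_text(handle.read())
            if hits:
                results[os.path.relpath(path, root)] = hits
    return results
```

Calling `scan_directory(".")` in the submission folder returns a dictionary of files and line numbers to inspect by hand.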

Why Awareness Has Been Absent

Only 41% of the researchers surveyed already knew, before receiving the disclosure notice, that arXiv publishes source files and that this carries risks, and 43% agreed that the disclosed content is sensitive. In other words, most respondents, security professionals included, did not realize their source files were fully public. arXiv does warn users in its FAQ that LaTeX comments may be public, but evidently some users either did not read the warning or did not consider its broader implications.

This is a classic security awareness gap: the warning exists, but it sits in documentation that few users read and never surfaces in the submission workflow, where it could prompt a second look.

Why the Existing Sanitization Tools Fail

The Tool Landscape Is Broken

arXiv recommends Google's arxiv_latex_cleaner, the most widely cited tool for this job. The study tested six sanitization tools against a set of standard cases and found none that handles all of them. Some crash on basic LaTeX comment environments. Some remove files the paper needs in order to compile. Most rely on regular expressions, which produce both false positives and false negatives. Metadata cleaning is missing from every existing tool the researchers evaluated.

Regular expression-based secret detection is a fundamentally fragile approach in complex source trees. It works for simple patterns — AWS_SECRET_ACCESS_KEY= typed in plain text — and fails for anything more creative: base64-encoded credentials, credentials split across comment lines, or API endpoints embedded in documentation strings. This is the same limitation that affects secret-scanning tools in CI/CD pipelines when not tuned correctly.
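A small experiment makes the limitation concrete. In the sketch below, the pattern and the sample strings are made up for illustration (the key value is a placeholder, not a real credential format): the plain-text form is caught, while the base64-encoded and line-split forms of the same secret pass through unnoticed.

```python
import base64
import re

# A typical single-line pattern for a plainly written secret.
SECRET_RE = re.compile(r"AWS_SECRET_ACCESS_KEY\s*=\s*\S+")

# Three ways the same placeholder secret might sit in LaTeX comments.
plain = "% AWS_SECRET_ACCESS_KEY=EXAMPLEKEY123"
encoded = "% " + base64.b64encode(b"AWS_SECRET_ACCESS_KEY=EXAMPLEKEY123").decode("ascii")
split = "% AWS_SECRET_ACCESS\n% _KEY=EXAMPLEKEY123"


def detects(sample):
    """Return True if the regex flags the sample."""
    return bool(SECRET_RE.search(sample))


for label, sample in [("plain", plain), ("base64", encoded), ("split", split)]:
    print(f"{label:7s} detected: {detects(sample)}")
```

Only the first sample is flagged. Catching the other two requires decoding or joining the text before matching, which is where purely line-oriented regex scanning runs out.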

The Proposed Replacement

The researchers built their own replacement, ALC-NG, which combines structured parsing of LaTeX source with metadata stripping and a more reliable method for identifying which files the paper uses. Their evaluation shows it covers more cases than any prior tool, and it produces visually identical PDFs in 87 percent of the submissions tested.

ALC-NG is available as open source. For any researcher or institution managing regular arXiv submissions, adopting it over arxiv_latex_cleaner is a straightforward upgrade with meaningful security impact.

Tool	Comment Removal	Metadata Stripping	Dangling File Detection	Compilation Safety
arxiv_latex_cleaner (Google)	Partial	No	Partial	Generally stable
Most regex-based tools	Unreliable	No	Limited	Risk of breakage
ALC-NG (RWTH Aachen)	Yes	Yes	Yes	87% identical PDFs

The Disclosure Problem: Who Is Responsible for Fixing This at Scale?

arXiv's Refusal to Coordinate

The study's authors notified 2,660 researchers across 1,141 papers about specific findings in their submissions. They first asked arXiv to coordinate the disclosure. arXiv declined, citing author identity protection, and assigned the responsibility back to the authors of affected papers.

The practical consequence: the research team manually collected contact addresses and notified authors individually. Only 18 affected submissions were updated after disclosure. The original versions remain permanently accessible — arXiv never removes old paper versions.

Juan Mathews Rebello Santos, a cybersecurity analyst, said the right model is shared but asymmetrical responsibility: "The platform should handle systemic risk reduction at scale, while users remain accountable for final verification." This maps to NIST CSF 2.0's Govern and Protect functions: platforms like arXiv are responsible for building controls into the submission pipeline, and individual researchers stay accountable for checking what they upload.

For institutions operating under GDPR or handling research data that includes participant PII, this is not an abstract governance discussion. Survey data with personally identifiable information about study participants, found in 18 of the exposed Google Docs, can amount to a personal data breach that is reportable under Article 33 of GDPR if it involves EU residents. Eighteen documents is not a large number in absolute terms, but the legal exposure they carry is significant.

Important: If you have already submitted a paper to arXiv containing active credentials or editable links to sensitive documents, cleaning the source file is not enough. You must revoke the credential or remove the sharing permission at the source. The published version of the source file is permanent.

🔑 Key Takeaways

Audit every file in your LaTeX project directory before submission — not just the .tex files. Check for bundled Git repositories, configuration files, old drafts, and any file your paper does not actively reference
Run ALC-NG before submission rather than relying on arxiv_latex_cleaner alone: the currently recommended tool does not strip metadata, and the study found no existing tool that handles every standard test case
Grep your source files for credential patterns — key=, token=, password=, docs.google.com/, ftp://, and ssh:// take 60 seconds to scan and catch the highest-risk exposures
Treat LaTeX comments as public — every % comment line in your source file is visible to anyone who downloads the submission archive; write comments accordingly
Revoke at the source, ideally before you submit: once a source file is published, uploading a cleaned version does not remove the risk, because earlier versions stay downloadable; only rotating the credential or removing edit access from the Google Doc does
Institutions managing research data subject to GDPR or HIPAA should establish a pre-submission checklist as a mandatory part of their research data management policy, aligned with ISO 27001 Annex A.8.10 (Information Deletion) and CIS Control 3 (Data Protection)

Conclusion

The arXiv LaTeX source file problem is a clean demonstration of how security failures happen: not through sophisticated attacks, but through the accumulation of small, unconsidered decisions. A researcher pastes a test key into a comment. A collaborator links to a shared doc without checking permissions. A build script gets bundled because no one cleaned the folder before zipping. None of these decisions feels dangerous in the moment. All of them become permanently public the instant the submission is processed.

What makes this incident category particularly pointed is the population affected. Security researchers, people whose professional work involves finding exactly these kinds of exposure patterns, leak more hidden information on average than authors in other academic disciplines. The knowledge gap is real, and it exists even among experts.

The mitigation path is not complex: adopt ALC-NG, build a pre-submission checklist into your workflow, and treat every LaTeX comment as something a stranger will read. For institutions, this belongs in your research data management policy, not as a footnote in a department email. Start with a one-time audit of your past arXiv submissions and revoke any credentials or shared document links you find.
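For the one-time audit, something like the following sketch could help locate shared-document links in a source archive you have already downloaded. It assumes the archive is a tar file (optionally gzip-compressed) and looks only for three link prefixes chosen for this example. The function name and the pattern are illustrative, and the script only finds links: it cannot tell you who has access to them.

```python
import io
import re
import tarfile

# Illustrative pattern: Google Docs, Google Drive, and Dropbox links only.
SHARE_LINK_RE = re.compile(
    r"https?://(?:docs\.google\.com|drive\.google\.com|www\.dropbox\.com)/[^\s}%]+"
)


def find_share_links(archive_bytes):
    """Return (member_name, url) pairs for share links inside a tar archive."""
    found = []
    # "r:*" lets tarfile detect gzip or other supported compression itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="ignore")
            for match in SHARE_LINK_RE.finditer(text):
                found.append((member.name, match.group(0)))
    return found
```

Each link it reports should be opened in a private browser window to check what an anonymous visitor can see or edit, and then locked down in the sharing settings of the service itself.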

Frequently Asked Questions

Q: Are my old arXiv papers affected if I submitted them years ago? Yes, and this is the most important practical point in the entire story. arXiv never removes old versions of submissions. A source file uploaded in 2018 containing an API key or editable Google Doc link is still publicly downloadable today. The fix is not to update the paper — it is to revoke the credential or remove the sharing permission directly in the external service.

Q: What kinds of secrets are most commonly found in LaTeX source files? The sensitive items reported include API tokens, private keys, and passwords in 82 submissions; links to Google Docs and Google Drive folders with edit access granted to anyone with the link; GPS coordinates embedded in image metadata; leftover Git repository histories containing full code and commit logs; and LaTeX comments containing passwords, internal system paths, or frank assessments of peer reviewers.

Q: Does arXiv scan submissions for sensitive information before publishing? No. arXiv publishes source files as submitted. The platform does warn authors in its FAQ that LaTeX comments are public, but it does not perform automated credential scanning or metadata sanitization on incoming submissions. When approached for coordinated disclosure by the RWTH Aachen research team, arXiv declined to participate and referred responsibility back to individual authors.

Q: Is arxiv_latex_cleaner sufficient to remove sensitive content before submitting? No. The RWTH Aachen study evaluated six sanitization tools, including arxiv_latex_cleaner, and found none that handled all exposure categories. arxiv_latex_cleaner does not strip metadata from images and produces false positives and false negatives on LaTeX comment patterns. The research team's replacement tool, ALC-NG, is available as open source and outperforms every existing tool evaluated.

Q: Does this create a compliance problem for institutions? Potentially, yes. Research that includes survey responses, participant data, or any PII from EU residents falls under GDPR. If that data ended up in an editable Google Doc linked from a LaTeX source file, as it did in 18 of the documents identified, that can constitute a personal data breach as defined in GDPR Article 4(12) and may trigger notification obligations under Article 33. Institutions with active IRB-approved studies or research programs handling health data under HIPAA should treat this as an immediate compliance review item.