ArXiv Declares Independence from Cornell

(science.org)

333 points | by bookstore-romeo 6 hours ago

22 comments

  • frankling_ 3 hours ago
    The recent announcement to reject review articles and position papers already smelled like a shift towards a more "opinionated" stance, and this move smells worse.

    The vacuum that arXiv originally filled was one of a glorified PDF hosting service with just enough of a reputation to allow some preprints to be cited in a formally published paper, and with just enough moderation to not devolve into spam and chaos. It has also been instrumental in pushing publishers towards open access (i.e., to finally give up).

    Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.

    In my view, arXiv fulfills its function better the less power it has as an institution, and I thus have exactly zero trust that the split from Cornell is driven by that function. We've seen the kind of appeasement prose from their statement and FAQ [1] countless times before, and it's now time for the usual routine of snapshotting the site to watch the inevitable amendments to the mission statement.

    "What positive changes should users expect to see?" - I guess the negative ones we'll have to see for ourselves.

    [1] https://tech.cornell.edu/arxiv/

    • hijodelsol 2 hours ago
      I came here to say something similar. As someone who works in a field that applies machine learning but is not purely focused on it, I interact with people who think that arXiv is the only relevant platform and that they don't need to submit their work to any journal, as well as people who still think that preprints don't count at all and that data isn't published until it's printed in an academic journal. It can feel like a clash of worlds.

      I think both sides could learn from the other. In the case of ML, I understand the desire to move fast and that average time to publication of 250-300 days in some of the top-tier journals can feel like an unnecessary burden. But having been on both sides of peer review, there is value to the system and it has made for better work.

      Not doing any of it follows the same spirit as not benchmarking your approach against more than maybe one alternative and that already as an after-thought. Or benchmaxxing but not exploring the actual real-world consequences, time and cost trade offs, etc.

      Now, is academic publishing perfect? Of course not, very very far from it. It desperately needs to be reformed to keep it economically accessible, time efficient for both authors, editors and peer reviewers and to prevent the "hot topic of the day" from dominating journals and making sure that peer review aligns with the needs of the community and actually improves the quality of the work, rather than having "malicious peer review" to get some citations or pet peeves in.

      Given the power that the ML field holds and the interesting experiments with open review, I would wish for the field to engage more with the scientific system at large and perhaps try to drive reforms and improve it, rather than completely abandoning it and treating a PDF hosting service as a journal (ofc, preprints would still be desirable and are important, but they can not carry the entire field alone).

      • bonoboTP 1 hour ago
        Simply anticipating basic push backs from reviewers makes sure that you do a somewhat thorough job. Not 100% thorough and the reviews are sometimes frivolous and lazy and stupid. But just knowing that what you put out there has to pass the admittedly noisily gatekept gate of peer review overall improves papers in my estimation. There is also a negative side because people try to hide limitations and honest assessments and cherry pick and curate their tables more in anticipation of knee jerk reviewers but overall I think without any peer review, author culture would become much more lax and bombastic and generally trend toward engagement bait and social media attention optimized stuff.

        The current balance where people write a paper with reviewers in mind, upload it to Arxiv before the review concludes and keep it on Arxiv even if rejected is a nice balance. People get to form their own opinion on it but there is also enough self-imposed quality control on it just due to wanting it to pass peer review, that even if it doesn't pass peer review, it is still better than if people write it in a way that doesn't care or anticipate peer review. And this works because people are somewhat incentivized to get peer reviewed official publications too. But being rejected is not the end of the world either because people can already read it and build on it based on Arxiv.

    • stared 1 hour ago
      > arXiv fulfills its function better the less power it has as an institution

      It is an interesting instance of the rule of least power, https://en.wikipedia.org/wiki/Rule_of_least_power.

    • light_hue_1 43 minutes ago
      > Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.

      This just isn't true. arXiv is not a venue. There's no place that gives you credit for arXiv papers. No one cares if you cite an arXiv paper or some random website. The vast vast majority of papers that have any kind of attention or citations are published in another venue.

    • ph4rsikal 2 hours ago
      My observation is that research, especially in AI has left universities, which are now focusing their research to a lesser degree on STEM. It appears research is now done by companies like Meta, OpenAI, Anthropic, Tencent, Alibaba, among many others.
      • bonoboTP 1 hour ago
        Universities (outside a few) just have much weaker PR machines so you never hear what they do. Also their work is not user facing products so regular people, even tech power users won't see them.
  • swiftcoder 1 hour ago
    > raised concerns about the proposed $300,000 salary for arXiv’s new CEO, saying it seemed high

    Is a mid-to-high engineering salary outlandish for a CEO of what is likely to be a fairly major non-profit? Even non-profits have to be somewhat competitive when it comes to salary, and the ideal candidate is likely someone who would be balancing this against a tenured position at a major university

    • mort96 49 minutes ago
      Salaries in the US are so bonkers. Everywhere else outside of the US, $300,000 is an outlandishly high salary. To call it "mid to high" is insane.
      • swiftcoder 27 minutes ago
        Even in the states, it’s more a distortion caused by the big tech centres. A software engineer in Ohio doesn’t command that kind of salary, but in San Francisco or Seattle that’ll buy you a moderately-senior engineer.

        And while academic salaries are generally not great, tenured professors at big universities tend to make a fair bit (plus a lot more vacation time and perks than is normal in the US)

      • HappyPanacea 40 minutes ago
        Yes the obvious play is to move human labor to cheaper countries like France (including CEO of course).
        • renewiltord 2 minutes ago
          The reason the French can’t build these things is the same reason they shouldn’t be allowed to be in charge. It’s a preprint PDF host. Just make your own if you can run this one.
    • Hendrikto 7 minutes ago
      For anybody outside the SV, and especially outside the US, this seems high, yes.

      arXiv does not need to and should not optimize for “shareholder value”, which is at least nominally the justification for outlandish CEO pay packages.

    • HappyPanacea 42 minutes ago
      arXiv's CEO doesn't need to be a tenured professor equivalent; it is a preprint repository ffs.
  • halperter 5 hours ago
    • reed1234 5 hours ago
      Should be the main link. The original article is based on the CEO job posting.
  • psalminen 5 hours ago
    I might be missing something, but I still don't get the why. I don't see any "problem" that needs to be solved.
    • kolinko 4 hours ago
      The article lists the reasons quite clearly.
      • binsquare 3 hours ago
        For everyone else,

        The reason is because arxiv is growing significantly, leading to a 297,000 deficit in operating costs for 2025 alone. Cornell has helped with donations, along with other organizations that pay membership fees.

        As a result, donors + leaders of arxiv think it's best to spin off to increase funding.

        • vl 2 hours ago
          What is unclear is why they need a staff of 27 and 6.7 million to operate an essentially static hosting website in 2026.
          • swiftcoder 1 hour ago
            The "essentially static hosting" isn't the cost centre (although with 5 million MAU, it's nothing to sneeze at). The real costs are on the input side - they have an ingestion pipeline that ensures standardised paper formatting and so on, plus at least some degree of human review.
            • bonoboTP 1 hour ago
              Do you mean that the CPU compute cost of turning latex into pdf/HTML is the main cost?
              • swiftcoder 1 hour ago
                No, I mean that the pipeline requires software engineers to build/maintain, and salaries are (as in basically every tech organisation) the dominant cost
                • bonoboTP 1 hour ago
                  Then drop it and make people upload a pdf and a zip of the latex sources.

                  Most people I talk to hate that pipeline and spend a lot of debug hours on it when Arxiv can't compile what overleaf and your local latex install can.

            • lou1306 1 hour ago
              The PDF formatting is anything but standardised. They ingest LaTeX sources, which are formatted according to the authors' whims (most likely, according to whatever journal or conference they just submitted the manuscript to). I'll concede that the (relatively novel) HTML formatter gives papers a more uniform appearance. They also integrate a bunch of external services for e.g., citation metrics and cross-references. Still hard to justify such a high cost to operate, but eh.

              Also, the "human review" is a simple moderation process [1]. It usually does not dig into the submission's scientific merits.

              [1] https://info.arxiv.org/help/moderation/index.html

          • OtherShrezzing 37 minutes ago
            I don't see it as an especially exuberant structure or budget. I've seen larger teams with bigger budgets struggle to maintain smaller applications.

            I've contracted into some consultancy teams which you could uncharitably describe as "15 people and $4mn/yr to create one PDF per month".

    • u1hcw9nx 3 hours ago
      I think the problem described in 6th paragraph needs to be solved.
  • vedantxn 4 minutes ago
    we got this before gta 6
  • ACCount37 21 minutes ago
    Frankly, the only beef I have with arXiv as is: its insistence on blocking AI access.

    I had to tell my AI to set up an MCP for "fetch while bypassing arXiv's rate limit" so that it doesn't burn 40k tokens looking for workarounds every time it wants to look at a paper and gets hit with a "sorry, meatbags only" wall.

    Very annoying, given how relevant arXiv papers are for ML specifically, and how many papers there are. Can't "human flesh search" through all of them to pick the relevant ones for your work, and they just had to insist on making it harder for AIs to do it too.

  • dataflow 5 hours ago
    This sounds terrible. Of course there's a huge risk of it becoming made for-profit. It almost makes you wonder if the academic publishers are behind this push somehow.

    Could they not have made it into some legal structure that puts universities at the top? Say, with a bunch of universities owning shares that comprise the entirety of the ownership of arXiv, but that would allow arXiv to independently raise funds?

    • gucci-on-fleek 5 hours ago
      > Of course there's a huge risk of it becoming made for-profit.

      The article says that "it will become an independent nonprofit corporation", and as OpenAI's failed attempt showed, converting a non-profit to a for-profit organization is either really hard or impossible.

      > Could they not have made it into some legal structure that puts universities at the top?

      As a corporation (even a non-profit one), it will have a board of directors. I have no idea what their charter will look like, but I would be surprised if at least one seat wasn't reserved for a university representative, and more than that seems quite likely as well.

      • MostlyStable 5 hours ago
        OpenAI didn't get everything that they wanted, but I very much disagree with calling it a "failed attempt". The non-profit went from owning the entirety of OpenAI to having ~25% stake.
        • ronsor 5 hours ago
          Sam Altman is a special kind of person; not many could pull off the schemes he does.
          • gentleman11 4 hours ago
            I doubt it was him who architected it. A team of lawful evil lawyers more likely
        • cbolton 2 hours ago
          The non-profit still controls the board doesn't it?
          • weedhopper 1 hour ago
            As shown by Altman, not really.
        • gucci-on-fleek 5 hours ago
          Ah, thanks for the correction.
      • mort96 19 minutes ago
        Is your argument really that "OpenAI was an independent nonprofit corporation and it worked out great, Arxiv will remain just as non-profit as OpenAI"?
        • gucci-on-fleek 9 minutes ago
          No, my argument is that OpenAI could make billions of dollars if they converted from a non-profit to a for-profit, and they only succeeded after years of effort and because they had already structured the company into separate for-profit and non-profit entities. And even after all this, the non-profit still controls the majority of the for-profit entity.

          So if OpenAI with billions of dollars only partially succeeded at converting to a for-profit business, then that suggests that organizations with fewer resources (like arXiv) have much worse odds.

  • asimpleusecase 3 hours ago
    I wonder if there are plans to licence the content for AI training
  • Aerolfos 2 hours ago
    And they hired a LinkedIn business idiot to run the new organization - so the aim is for an infinite growth tech startup in terms of governance, despite the technical legal status of non-profit. It shows in the language they use in the announcement, too ("improved financial viability in the long run")

    OpenAI shows exactly how well that works and what that kind of governance does to a company and to its support of science and the commons.

    TL;DR, it's fucked.

  • bonoboTP 1 hour ago
    I fear their Mozilla-ification and Wikipedia-ification. Scope creep, various outreach feel-good programs, ballooning costs, lost focus etc. And other types of enshittification.

    Any change to the basic premise will be a negative step.

    They should just be boring quiet unopinionated neutral background infrastructure.

    • kergonath 35 minutes ago
      > They should just be quiet unopinionated neutral background infrastructure.

      Exactly. It should be a utility. Not quite dumb pipe, but not too far either.

  • tornikeo 5 hours ago
    Now the question is, will arxiv wage a decade long bloody war with Cornell, using heavy infantry (PhD students), archers (reviewers) and field artillery (AI slop papers), or will the independence be mostly peaceful? Only time can tell.
    • alansaber 5 hours ago
      PhD students are levy infantry at best with Postdocs being the armoured levies.
      • dmos62 3 hours ago
        Is this Gondor or Mordor?
  • Peteragain 4 hours ago
    .. and soon to be dependent on US military funding? Controlled by someone who has run-ins with universities? This'll end in tears.
  • Garlef 3 hours ago
    Maybe they should implement a graph based trust system:

    You need your favourite academic gatekeeper (= thesis advisor) to vouch for you in order to be allowed to upload.

    Then AI slop gets flagged and the shame spreads through the graph. And flaggings need to have evidence attached that can again be flagged.
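
    A minimal sketch of how such a scheme might look, purely for illustration. Everything here is an assumption rather than anything arXiv does or plans: the names `TrustGraph`, `vouch` and `flag` are hypothetical, each author has a single voucher, and shame is simply halved at every hop up the chain. Flagging the evidence itself is not modelled.

```python
from collections import defaultdict


class TrustGraph:
    """Toy vouching graph: each author has at most one voucher (hypothetical model)."""

    def __init__(self):
        self.voucher_of = {}             # author -> voucher (None for a root)
        self.shame = defaultdict(float)  # author -> accumulated shame score
        self.flags = []                  # (author, evidence) records

    def add_root(self, author):
        """Register a trusted starting node, e.g. an established advisor."""
        self.voucher_of[author] = None

    def vouch(self, voucher, newcomer):
        """Let an existing member vouch for someone not yet in the graph."""
        if voucher not in self.voucher_of:
            raise ValueError(f"{voucher} is not in the graph and cannot vouch")
        if newcomer in self.voucher_of:
            raise ValueError(f"{newcomer} already has upload rights")
        self.voucher_of[newcomer] = voucher

    def can_upload(self, author):
        """Only authors who were vouched for (or are roots) may upload."""
        return author in self.voucher_of

    def flag(self, author, evidence, decay=0.5):
        """Flag an author; shame spreads up the vouching chain, shrinking each hop."""
        if author not in self.voucher_of:
            raise ValueError(f"{author} is not in the graph")
        if not evidence:
            raise ValueError("a flag must carry evidence")
        self.flags.append((author, evidence))
        penalty = 1.0
        node = author
        while node is not None:
            self.shame[node] += penalty
            penalty *= decay  # assumed decay factor, not a tuned value
            node = self.voucher_of[node]
```

    The single-voucher chain keeps the propagation a simple walk to the root; a real system would probably need multiple vouchers per author and some way to contest flags.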

  • shevy-java 2 hours ago
    "Recently arXiv’s growth has accelerated. Since 2022, it has expanded its staff to 27, in large part to deal with a 50% increase in submitted manuscripts."

    I am wary of that. IMO the business model is damaged therein. You can say in 2022 we had 27; bankrupt in 2030.

  • OutOfHere 4 hours ago
    With 300K for the CEO, its enshittification will commence imminently. It will now serve to maximize revenue. Just wait and watch while they issue a premium membership, payment requirements for authors, and other revenue generators to please their investors.
    • exe34 4 hours ago
      they'll just turn into a shitty journal at this point, they just need to introduce peer review and they can start competing with the real journals on price point.

      another will need to rise to take its place.

  • adamnemecek 5 hours ago
    Good call, ArXiv seems like one of the most important institutions out there right now.
    • kergonath 28 minutes ago
      The French government put a bit of money on the table to help researchers fulfil their open science requirements for government and EU grants, and funded the HAL repository ( https://hal.science/ ). It’s much smaller than arXiv, but it exists. In other countries like the UK there are clusters of smaller repositories as well, but it’s not as well centralised.
    • p-e-w 5 hours ago
      It’s so important, in fact, that there should be more than one such institution.

      People keep falling into the same trap. They love monopolies, then are shocked when those monopolies jerk them around.

      • freehorse 3 hours ago
        It is just a preprint repository. It is pretty open (the stories where a preprint was rejected or delayed unreasonably are extremely rare). It offers the basic services for a math/compsci/physics themed preprint repository.

        I don't see much of a monopoly, nor any "moat" apart from it being recognised. You can already post preprints on a personal website or on github, and there are "alternatives" such as researchgate that can also host preprints, or zenodo. There are even some lesser known alternatives. I do not see anything special in hosting preprints online apart from the convenience of being able to have a centralised place to place them and search for them (which you call "monopoly"). If anything, the recognisability and centrality of arxiv helped a lot in the old, darker days to establish open access to papers. There was a time when many journals would not let you publish a preprint, or had all kinds of weird rules about when you can and when you can't. Probably still to some degree.

      • auggierose 5 hours ago
        I have been using Zenodo for a while now instead. It is more user friendly, as well.
        • Al-Khwarizmi 2 hours ago
          I like it as well, it works great. But I wonder if it would scale if at some point there were a massive exodus from arXiv.
          • auggierose 1 hour ago
            I think it already hosts much more data than arXiv, given that they also host large datasets.
        • mastermage 3 hours ago
          Zenodo is more for IT Papers and also datasets isn't it?
          • auggierose 3 hours ago
            It can host large datasets as well, yes. It is hosted by CERN, so it is not specifically IT in any way. It also allows you to restrict access to the files of your submission. It has no requirements to submit your LaTeX sources, any PDF will be fine. There are also no restrictions on who can publish. You'll get a DOI, of course.

            Everything published on arXiv could also be published on Zenodo, but not the other way around.

      • andbberger 5 hours ago
        there is. bioRxiv.
    • koakuma-chan 4 hours ago
      it just hosts pdfs, no?
      • aragilar 3 hours ago
        It does do a fair amount of filtering of submissions, and it's a long term archive (e.g. for the next 100+ years). I suspect both (but with the former dominating) are the issue.
        • bonoboTP 1 hour ago
          Just put out a torrent and people of the sort at r/DataHoarder will keep it alive for longer than bureaucrats.
      • freehorse 3 hours ago
        Well, technically, it can also compile your tex file if you upload the tex file instead of the pdf directly, which helps a lot in standardizing the stylistic structure between preprints. Most other repositories are wild west and inconsistent. I really appreciate the similarity in style applied to most preprints there. Moreover, this means you can also download not just the pdf, but the source tex file too, which can be very useful.
        • bonoboTP 1 hour ago
          The similarity in style comes from conference and journal templates, not from Arxiv. You can style your paper with latex in any style, Arxiv doesn't care. On Arxiv you mostly see preprints that people submit to conferences and journals and they enforce the style.
      • pfortuny 3 hours ago
        Also the sources and has a very tame but useful pre-acceptance process.
      • IshKebab 2 hours ago
        Technically yes, socially no.
  • davnicwil 4 hours ago
    Very unrelated to the article, but I think 'arXiv' as a brand is bad, and really detrimental to what the institution aims to accomplish.

    That is, it's not readily parseable, it really gives an insider term vibe - like this isn't for you if you don't already know what it means or how you should read or say it. It sort of reminds me of the overuse of latin and latinate terms generally in the old professions and, well, the academy.

    Just always struck me as being somewhat at odds with the goal.

    • john-titor 4 hours ago
      I wonder what makes you feel that. I've been publishing preprints for close to a decade on arxiv now and never had any particular feelings about it.

      To me it's just a way to get out your work fast, so that there is already a trace of it on the Internets - nothing more and nothing less.

      > That is, it's not readily parseable, it really gives an insider term vibe...

      Isn't that normal with highly specialized research fields? I agree many papers could benefit from clearer wording, but working in a niche means you sometimes don't reach a broader audience

      • davnicwil 4 hours ago
        It's an opinion, and you feeling no particular way about it is equally valid.

        But I did justify and maybe to reword slightly, surely if one of the main drivers is opening up research, the brand name should be something that's less obscure and more accessible / understandable as to what it is on first sight?

        Maybe arXiv evoking the word 'archive' with an ancient Greek twist does that for some, but it's clearly a bit cryptic for many, and if the point is to open up probably the brand should just be something much plainer.

        • Cordiali 40 minutes ago
          I've never even connected the 'X' to the Greek letter chi. I just kinda accepted it as one of many groovy web 2.0 misspellings in search of a domain and trademark.
        • aragilar 3 hours ago
          No, it's to be a pre-print server. If someone doesn't know what that means, then they shouldn't be using arXiv.
          • davnicwil 3 hours ago
            everyone has a first time they see a thing and don't yet know what it is.

            Using a brand as a filter where you have to already know what it means to get it is exactly the opposite of what it's supposed to achieve.

            Consider the most exclusive (successful) brands that exist. Even there, where exclusivity is a brand goal, none of them have this property of being obscure on first contact.

            • bonoboTP 51 minutes ago
              You usually get introduced to it by your academic supervisor or collaborators as a masters or PhD student. If you're a solo researcher who has made a significant contribution on the frontier of science, I'm sure you'll be able to understand how Arxiv works as well. Because I assume you have had some conversations with other experts in the field. If you're a full on autodidact with no contact to any other researchers in the field, well, maybe it's better if you chat with some other people in that field.

              It's reasonable to have a tradeoff here to avoid cranks and now AI psychosis slop. You can still post on research gate and academia.edu or your own github page or webhosting.

    • jltsiren 3 hours ago
      It's a classic story of someone having to pick a name quickly, which then gets established long before anyone who cares about branding is aware of its existence.

      The original service didn't even have a name, only a description, and it was amusingly hosted at xxx.lanl.gov. But LANL wasn't really interested in it, and the founder eventually left for Cornell. At that point, the service needed a domain name, but archive.org was already taken.

      And besides, the name has Ancient Greek influences. A similar Latinate term might be something like "archive".

      • bonoboTP 49 minutes ago
        I thought the X was an allusion to LaTeX.
      • davnicwil 3 hours ago
        Interesting, thanks for the context! Makes it more understandable as a choice.
    • nixon_why69 4 hours ago
      > like this isn't for you if you don't already know what it means

      Isn't that actually kind of a good brand signal for a repo of very specialized papers? "Fun with learning" in comic sans wouldn't help credibility.

    • vasco 4 hours ago
      This is the type of guy that will suggest paper.ly as a better name with a straight face and then we wonder why the internet is turning to shit