A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty—e.g., more complex geometries and higher Reynolds numbers—along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier-Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination.
Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously, pre-generating) many low- and medium-difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low- and high-difficulty data, we can spend 8.9× less compute on pre-generating a dataset to achieve the same error as using only high-difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers.
Replacing all but a small fraction of the hard training examples with lower-difficulty ones recovers most of the performance when training neural PDE solvers. Mixing in easy and medium difficulty data dramatically improves performance on hard examples while reducing data generation costs.
Neural PDE solvers promise to accelerate classical numerical methods, but require training data generated by those same classical solvers. This creates a fundamental challenge: the hardest problems we want to solve are exactly those for which it's most expensive to generate training data.
We identify two primary axes along which PDE problems become more difficult:
Number and placement of obstacles in the flow domain
Reynolds number (Re): the ratio of inertial to viscous forces (illustrated in the sketch below)
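As a rough illustration of this second axis, here is a minimal Python sketch of the Reynolds number. The formula Re = U·L/ν is standard; the numerical values are placeholders, not settings from our dataset.

# Minimal sketch of the Reynolds number Re = U * L / nu, the ratio of
# inertial to viscous forces. The values below are illustrative placeholders,
# not the actual settings used to generate the dataset.
def reynolds_number(inflow_speed, length_scale, kinematic_viscosity):
    return inflow_speed * length_scale / kinematic_viscosity

U = 1.0     # characteristic inflow speed (m/s)
L = 0.1     # obstacle length scale, e.g. cylinder diameter (m)
nu = 1e-4   # kinematic viscosity (m^2/s)
print(f"Re = {reynolds_number(U, L, nu):.0f}")  # Re = 1000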
We study two canonical incompressible Navier-Stokes problem types.
Adding easy-to-medium difficulty data substantially improves performance on hard distributions. For Poseidon-B fine-tuned on hard FPO data, replacing 90% of the hard examples with easier ones reduces data generation time by 8.9× while maintaining similar accuracy.
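A minimal sketch of how such a mixed-difficulty training set could be assembled is shown below. The 10%/90% split mirrors the replacement described above, but the function, the even easy/medium division, and the variable names are our own illustration, not the paper's training code.

import random

def mixed_difficulty_subset(easy, medium, hard, n_total, hard_frac=0.10, seed=0):
    # Keep only a small fraction of expensive hard trajectories and fill the
    # remaining budget with cheaper easy/medium ones (split evenly here,
    # which is an assumption of this sketch).
    rng = random.Random(seed)
    n_hard = int(round(hard_frac * n_total))
    n_easy = (n_total - n_hard) // 2
    n_medium = n_total - n_hard - n_easy
    subset = (rng.sample(hard, n_hard)
              + rng.sample(easy, n_easy)
              + rng.sample(medium, n_medium))
    rng.shuffle(subset)
    return subset

For instance, mixed_difficulty_subset(easy, medium, hard, n_total=1000) would keep 100 hard trajectories and draw 450 each from the easy and medium pools.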
The benefits of difficulty mixing extend to geometric complexity. Training on mixtures of simple and complex geometries improves generalization to complex multi-obstacle configurations.
When varying both physics and geometry difficulty simultaneously, the benefits of mixed-difficulty training compound, demonstrating the generality of the approach.
Medium difficulty data is more sample-efficient than easy data. For most pre-generation budgets, training on fewer medium-difficulty examples outperforms training on more easy examples.
Classical solver costs increase dramatically with difficulty, making strategic data curation essential for efficient neural PDE solver training.
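To make the trade-off concrete, the sketch below compares the solver compute needed for a hard-only dataset versus a mixed one under made-up per-trajectory costs. Real OpenFOAM runtimes depend on mesh resolution and Reynolds number, so these numbers are placeholders rather than measurements from our runs.

# Hypothetical per-trajectory solver costs (CPU-hours); placeholders only.
COST_PER_TRAJ = {"easy": 1.0, "medium": 3.0, "hard": 20.0}

def generation_cost(counts):
    # Total classical-solver compute for a dataset composition such as
    # {"easy": 450, "medium": 450, "hard": 100}.
    return sum(COST_PER_TRAJ[level] * n for level, n in counts.items())

hard_only = {"hard": 1000}
mixed = {"easy": 450, "medium": 450, "hard": 100}
print(generation_cost(hard_only))  # 20000.0
print(generation_cost(mixed))      # 3800.0
# Under these placeholder costs the mixed dataset is several times cheaper to
# generate; the 8.9x figure reported above comes from measured solver times.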
We are releasing a comprehensive dataset of 2D incompressible Navier-Stokes simulations with systematic variation across difficulty axes. Access the dataset on Hugging Face. The dataset enables research on few-shot learning, transfer learning, and foundation models for neural PDE solvers.
All simulations were generated using OpenFOAM, a leading open-source CFD toolkit, and passed through our preprocessing pipeline before release.
The dataset is provided in HDF5 format.
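For a first look at a downloaded file, the short h5py sketch below prints each dataset's path, shape, and dtype. The file name is a placeholder for a file obtained from the Hugging Face release.

import h5py

def print_entry(name, obj):
    # Print each dataset's path, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

# "example.h5" is a placeholder for a file downloaded from the dataset repo.
with h5py.File("example.h5", "r") as f:
    f.visititems(print_entry)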
If you find our work useful, please consider citing:
@article{pregen2025,
  title={Pre-Generating Multi-Difficulty PDE Data For Few-Shot Neural PDE Solvers},
  author={Choudhary, Naman and Singh, Vedant and Talwalkar, Ameet and Boffi, Nicholas Matthew and Khodak, Mikhail and Marwah, Tanya},
  journal={arXiv preprint},
  year={2025}
}
This research was conducted at Carnegie Mellon University's Machine Learning Department. We are grateful for CMU computing resources and thank the broader scientific machine learning community for their support and feedback.