A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty—e.g., more complex geometries and higher Reynolds numbers—along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier-Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination.
Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously, pre-generating) many low- and medium-difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low- and high-difficulty data, we can spend 8.9× less compute on pre-generating a dataset to achieve the same error as using only high-difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers.
Replacing all but a small fraction of the hard training examples with lower-difficulty ones recovers most of the performance when training neural PDE solvers. Mixing in easy and medium difficulty data dramatically improves performance on hard examples while reducing data generation costs.
Neural PDE solvers promise to accelerate classical numerical methods, but require training data generated by those same classical solvers. This creates a fundamental challenge: the hardest problems we want to solve are exactly those for which it's most expensive to generate training data.
We identify two primary axes along which PDE problems become more difficult:
Number and placement of obstacles in the flow domain
Reynolds number (Re): the ratio of inertial to viscous forces (illustrated in the sketch below)
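As a rough illustration of this second axis, here is a minimal Python sketch of the Reynolds number. The formula Re = U·L/ν is standard; the numerical values are placeholders, not settings from our dataset.

# Minimal sketch of the Reynolds number Re = U * L / nu, the ratio of
# inertial to viscous forces. The values below are illustrative placeholders,
# not the actual settings used to generate the dataset.
def reynolds_number(inflow_speed, length_scale, kinematic_viscosity):
    return inflow_speed * length_scale / kinematic_viscosity

U = 1.0     # characteristic inflow speed (m/s)
L = 0.1     # obstacle length scale, e.g. cylinder diameter (m)
nu = 1e-4   # kinematic viscosity (m^2/s)
print(f"Re = {reynolds_number(U, L, nu):.0f}")  # Re = 1000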
We study two canonical incompressible Navier-Stokes problem types.
Adding easy-to-medium difficulty data substantially improves performance on hard distributions. For Poseidon-B fine-tuned on hard FPO data, replacing 90% of the hard examples with easier ones reduces data generation time by 8.9× while maintaining similar accuracy.
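A minimal sketch of how such a mixed-difficulty training set could be assembled is shown below. The 10%/90% split mirrors the replacement described above, but the function, the even easy/medium division, and the variable names are our own illustration, not the paper's training code.

import random

def mixed_difficulty_subset(easy, medium, hard, n_total, hard_frac=0.10, seed=0):
    # Keep only a small fraction of expensive hard trajectories and fill the
    # remaining budget with cheaper easy/medium ones (split evenly here,
    # which is an assumption of this sketch).
    rng = random.Random(seed)
    n_hard = int(round(hard_frac * n_total))
    n_easy = (n_total - n_hard) // 2
    n_medium = n_total - n_hard - n_easy
    subset = (rng.sample(hard, n_hard)
              + rng.sample(easy, n_easy)
              + rng.sample(medium, n_medium))
    rng.shuffle(subset)
    return subset

For instance, mixed_difficulty_subset(easy, medium, hard, n_total=1000) would keep 100 hard trajectories and draw 450 each from the easy and medium pools.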
The benefits of difficulty mixing extend to geometric complexity. Training on mixtures of simple and complex geometries improves generalization to complex multi-obstacle configurations.
When varying both physics and geometry difficulty simultaneously, the benefits of mixed-difficulty training compound, demonstrating the generality of the approach.
Medium difficulty data is more sample-efficient than easy data. For most pre-generation budgets, training on fewer medium-difficulty examples outperforms training on more easy examples.
Classical solver costs increase dramatically with difficulty, making strategic data curation essential for efficient neural PDE solver training.
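To make the trade-off concrete, the sketch below compares the solver compute needed for a hard-only dataset versus a mixed one under made-up per-trajectory costs. Real OpenFOAM runtimes depend on mesh resolution and Reynolds number, so these numbers are placeholders rather than measurements from our runs.

# Hypothetical per-trajectory solver costs (CPU-hours); placeholders only.
COST_PER_TRAJ = {"easy": 1.0, "medium": 3.0, "hard": 20.0}

def generation_cost(counts):
    # Total classical-solver compute for a dataset composition such as
    # {"easy": 450, "medium": 450, "hard": 100}.
    return sum(COST_PER_TRAJ[level] * n for level, n in counts.items())

hard_only = {"hard": 1000}
mixed = {"easy": 450, "medium": 450, "hard": 100}
print(generation_cost(hard_only))  # 20000.0
print(generation_cost(mixed))      # 3800.0
# Under these placeholder costs the mixed dataset is several times cheaper to
# generate; the 8.9x figure reported above comes from measured solver times.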
We are releasing a comprehensive dataset of 2D incompressible Navier-Stokes simulations with systematic variation across difficulty axes. Access the dataset on Hugging Face. The dataset enables research on few-shot learning, transfer learning, and foundation models for neural PDE solvers.
All simulations were generated using OpenFOAM, a leading open-source CFD toolkit, and passed through our preprocessing pipeline before release.
The dataset is provided in HDF5 format.
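For a first look at a downloaded file, the short h5py sketch below prints each dataset's path, shape, and dtype. The file name is a placeholder for a file obtained from the Hugging Face release.

import h5py

def print_entry(name, obj):
    # Print each dataset's path, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

# "example.h5" is a placeholder for a file downloaded from the dataset repo.
with h5py.File("example.h5", "r") as f:
    f.visititems(print_entry)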
If you find our work useful, please consider citing:
@article{pregen2025,
  title={Pre-Generating Multi-Difficulty PDE Data For Few-Shot Neural PDE Solvers},
  author={Choudhary, Naman and Singh, Vedant and Talwalkar, Ameet and Boffi, Nicholas Matthew and Khodak, Mikhail and Marwah, Tanya},
  journal={arXiv preprint},
  year={2025}
}
This research was conducted at Carnegie Mellon University's Machine Learning Department. We are grateful for CMU computing resources and thank the broader scientific machine learning community for their support and feedback.