Pre-Generating Multi-Difficulty PDE Data
For Few-Shot Neural PDE Solvers

Machine Learning Department, Carnegie Mellon University
*Equal contribution. †Equal advising. Author order determined alphabetically.
Achieve 8.9× compute reduction in neural PDE solver training by strategically pre-generating multi-difficulty data

Abstract

A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty—e.g., more complex geometries and higher Reynolds numbers—along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier-Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination.

Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously pre-generating) many low and medium difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low and high difficulty data, we can spend 8.9× less compute on pre-generating a dataset to achieve the same error as using only high difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers.

Key Insight

Just a small fraction of high-difficulty examples, supplemented with cheaper lower-difficulty data, recovers most performance when training neural PDE solvers. Mixing in easy and medium difficulty data dramatically improves performance on hard examples while reducing data generation costs.

Figure: Performance on hard (high Reynolds number) examples while varying data composition (alpha mixing for physics difficulty; left: FPO, right: LDC). We fix the total number of training examples to 800 and vary the fraction consisting of high-Re examples. Adding lower-difficulty examples substantially improves performance while reducing expensive data generation.
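The fixed-budget mixing above can be sketched in a few lines. `N_TOTAL`, the pool structure, and uniform sampling are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

# Hypothetical sketch of the fixed-budget mixing setup: keep the total
# number of training examples fixed and vary the fraction `alpha` drawn
# from the hard (high-Re) pool.
N_TOTAL = 800

def mix_difficulties(hard_pool, easy_pool, alpha, rng=None):
    """Build a training set of N_TOTAL examples where a fraction `alpha`
    comes from the hard pool and the remainder from the easy pool."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_hard = round(alpha * N_TOTAL)
    hard_idx = rng.choice(len(hard_pool), size=n_hard, replace=False)
    easy_idx = rng.choice(len(easy_pool), size=N_TOTAL - n_hard, replace=False)
    return [hard_pool[i] for i in hard_idx] + [easy_pool[i] for i in easy_idx]
```

Sweeping `alpha` from 0 to 1 then traces out the data-composition curves in the figure.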

Problem Setup

The Chicken-and-Egg Challenge

Neural PDE solvers promise to accelerate classical numerical methods, but require training data generated by those same classical solvers. This creates a fundamental challenge: the hardest problems we want to solve are exactly those for which it's most expensive to generate training data.

Difficulty Axes

We identify two primary axes along which PDE problems become more difficult:

Geometry Complexity

Number and placement of obstacles in the flow domain

  • Easy: No obstacles (regular channel/cavity)
  • Medium: Single obstacle
  • Hard: Multiple obstacles with complex arrangements

Physics Complexity

Reynolds number (Re = UL/ν) - the ratio of inertial to viscous forces

  • Easy: Re ∈ [100, 1000] - laminar flow
  • Medium: Re ∈ [2000, 4000] - transitional
  • Hard: Re ∈ [8000, 10000] - turbulent
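As a concrete sketch, the regimes above can be expressed in a few lines; the exact tier cutoffs are assumptions here, since the stated ranges leave gaps (e.g. Re between 1000 and 2000):

```python
def reynolds_number(velocity, length, kinematic_viscosity):
    """Re = U * L / nu, the ratio of inertial to viscous forces."""
    return velocity * length / kinematic_viscosity

def difficulty_bucket(re):
    """Bucket a Reynolds number into the difficulty tiers above.
    Cutoffs between tiers are illustrative assumptions."""
    if re <= 1000:
        return "easy"    # laminar
    elif re <= 4000:
        return "medium"  # transitional
    else:
        return "hard"    # turbulent
```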
Figure: Visualizations showing increasing geometry complexity (left) and physics complexity via Reynolds number (right) in Flow Past Object (FPO) simulations.

Problem Families

We study two canonical incompressible Navier-Stokes problem types:

  • Flow Past Object (FPO): External flow around obstacles in a channel
  • Lid-Driven Cavity (LDC): Cavity flow with moving top wall

Main Results

1. Mixing Lower Difficulty Data Works

Adding easy-to-medium difficulty data substantially improves performance on hard distributions. For Poseidon-B fine-tuned on hard FPO data, replacing 90% of the hard examples with easier ones reduces data generation time by 8.9× while maintaining similar accuracy.
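The saving can be framed as a simple time ratio. The per-example solver times below are hypothetical placeholders; the 8.9× figure reported above comes from the paper's actual measured timings:

```python
def generation_speedup(frac_hard, t_hard, t_easy):
    """Time to generate an all-hard dataset divided by the time to
    generate a same-size dataset with only `frac_hard` hard examples."""
    mixed_time_per_example = frac_hard * t_hard + (1 - frac_hard) * t_easy
    return t_hard / mixed_time_per_example

# Example with hypothetical per-example solver times (seconds):
speedup = generation_speedup(frac_hard=0.1, t_hard=100.0, t_easy=10.0)
```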

Figure: Performance on multi-obstacle FPO with varying physics difficulty mixing. Left: CNO and FFNO. Right: Poseidon variants.

2. Geometry Difficulty Transfer

The benefits of difficulty mixing extend to geometric complexity. Training on mixtures of simple and complex geometries improves generalization to complex multi-obstacle configurations.

Figure: Performance on multi-obstacle examples with varying geometry difficulty mixing. Including simpler geometries improves performance on complex configurations.

3. Combined Physics and Geometry Difficulty

When varying both physics and geometry difficulty simultaneously, the benefits of mixed-difficulty training compound, demonstrating the generality of the approach.

Figure: Performance with both physics and geometry difficulty varying. The benefits of difficulty transfer hold across multiple axes simultaneously.

4. Data Scaling Laws

Medium difficulty data is more sample-efficient than easy data. For most pre-generation budgets, training on fewer medium-difficulty examples outperforms training on more easy examples.

Figure: Scaling behavior with fixed compute budget. Medium difficulty examples are more efficient than easy ones for achieving target performance on hard distributions.

5. Simulation Cost Analysis

Classical solver costs increase dramatically with difficulty, making strategic data curation essential for efficient neural PDE solver training.

Figure: Simulation times for different difficulty levels. High-difficulty examples can be 10× more expensive to generate, motivating careful data composition strategies.

Dataset

We are releasing a comprehensive dataset of 2D incompressible Navier-Stokes simulations with systematic variation across difficulty axes. Access the dataset on Hugging Face. The dataset enables research on few-shot learning, transfer learning, and foundation models for neural PDE solvers.

Dataset Overview

  • Problem families: 2 (Flow Past Object (FPO) & Lid-Driven Cavity (LDC))
  • Difficulty levels: 3×3 (Easy, Medium, Hard across both Geometry & Physics)
  • Simulations: 1000s, with diverse initial conditions and configurations
  • Resolution: 128×128 high-fidelity spatial discretization

Data Generation

All simulations were generated using OpenFOAM, a leading open-source CFD toolkit. Our preprocessing pipeline includes:

  • NURBS-based obstacle generation for smooth, varied geometries
  • Signed distance field (SDF) computation for obstacle representation
  • Physical channel extraction (velocity, pressure, vorticity)
  • Temporal trajectory recording with 50+ timesteps per simulation
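As an illustration of the SDF step, here is a minimal sketch for a circular obstacle on a 128×128 grid; the released dataset uses NURBS-based shapes, so treat this as illustrative only:

```python
import numpy as np

def circle_sdf(nx, ny, center, radius):
    """Signed distance field for a circular obstacle on a unit square:
    negative inside the obstacle, positive outside."""
    xs = np.linspace(0.0, 1.0, nx)
    ys = np.linspace(0.0, 1.0, ny)
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    dist = np.sqrt((X - center[0]) ** 2 + (Y - center[1]) ** 2)
    return dist - radius

sdf = circle_sdf(128, 128, center=(0.5, 0.5), radius=0.2)
mask = (sdf <= 0.0).astype(np.float32)  # geometry mask: 1 inside obstacle
```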
Figure: Example physical channels from our dataset: velocity components, pressure, and vorticity fields with geometry mask and SDF.

Data Format

The dataset is provided in HDF5 format with the following structure:

  • Input channels: Initial velocity (u, v), pressure, geometry mask, SDF
  • Output channels: Velocity evolution (u, v), pressure evolution
  • Metadata: Reynolds number, obstacle configurations, boundary conditions
  • Splits: Pre-defined train/validation/test splits for reproducibility

Citation

If you find our work useful, please consider citing:

@article{pregen2025,
  title={Pre-Generating Multi-Difficulty PDE Data For Few-Shot Neural PDE Solvers},
  author={Choudhary, Naman and Singh, Vedant and Talwalkar, Ameet and Boffi, Nicholas Matthew and Khodak, Mikhail and Marwah, Tanya},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgments

This research was conducted at Carnegie Mellon University's Machine Learning Department. We thank the CMU computing resources and the broader scientific machine learning community for their support and feedback.