2025-09-17 21:58:14,196 - root - INFO - --------------- Versions --------------- 2025-09-17 21:58:14,313 - root - INFO - git branch: b'* multicheckpoint' 2025-09-17 21:58:14,331 - root - INFO - git hash: b'4b7dcddea9084b60c41957440a9f3e14f7d2567d' 2025-09-17 21:58:14,331 - root - INFO - Torch: 2.2.0a0+6a974be 2025-09-17 21:58:14,331 - root - INFO - ---------------------------------------- 2025-09-17 21:58:14,331 - root - INFO - ------------------ Configuration ------------------ 2025-09-17 21:58:14,331 - root - INFO - Configuration file: /global/u2/a/amahesh/ms_finetune/modulus-makani-fork/config/sfnonet.yaml 2025-09-17 21:58:14,331 - root - INFO - Configuration name: multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2 2025-09-17 21:58:14,331 - root - INFO - wandb_group multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2-0.1.0 2025-09-17 21:58:14,331 - root - INFO - scheduler CosineAnnealingLR 2025-09-17 21:58:14,331 - root - INFO - max_epochs 20 2025-09-17 21:58:14,331 - root - INFO - scheduler_T_max 20 2025-09-17 21:58:14,331 - root - INFO - lr 0.0001 2025-09-17 21:58:14,331 - root - INFO - load_counters False 2025-09-17 21:58:14,331 - root - INFO - load_optimizer False 2025-09-17 21:58:14,332 - root - INFO - load_scheduler False 2025-09-17 21:58:14,332 - root - INFO - finetune True 2025-09-17 21:58:14,332 - root - INFO - pretrained_checkpoint_path /pscratch/sd/a/amahesh/recovered_fcn_training/modulus-makani_runs-0.1.0-fcndev_stats/sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp0.tar 2025-09-17 21:58:14,332 - root - INFO - embed_dim 620 2025-09-17 21:58:14,332 - root - INFO - num_layers 8 2025-09-17 21:58:14,332 - root - INFO - scale_factor 2 2025-09-17 21:58:14,332 - root - INFO - hard_thresholding_fraction 1.0 2025-09-17 21:58:14,332 - root - INFO - loss weighted squared temp-std geometric l2 2025-09-17 21:58:14,332 - root - INFO - valid_autoreg_steps 1 2025-09-17 21:58:14,332 - root - INFO - metadata_json_path 
/pscratch/sd/p/pharring/74var-6hourly/staging/data.json 2025-09-17 21:58:14,332 - root - INFO - train_data_path /pscratch/sd/p/pharring/74var-6hourly/staging/train 2025-09-17 21:58:14,332 - root - INFO - valid_data_path /pscratch/sd/p/pharring/74var-6hourly/staging/valid 2025-09-17 21:58:14,332 - root - INFO - exp_dir /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/ 2025-09-17 21:58:14,332 - root - INFO - n_years 1 2025-09-17 21:58:14,332 - root - INFO - img_shape_x 721 2025-09-17 21:58:14,332 - root - INFO - img_shape_y 1440 2025-09-17 21:58:14,332 - root - INFO - min_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats/mins.npy 2025-09-17 21:58:14,332 - root - INFO - max_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats/maxs.npy 2025-09-17 21:58:14,332 - root - INFO - time_means_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats_fcndev/time_means.npy 2025-09-17 21:58:14,332 - root - INFO - global_means_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats_fcndev/global_means.npy 2025-09-17 21:58:14,332 - root - INFO - global_stds_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats_fcndev/global_stds.npy 2025-09-17 21:58:14,332 - root - INFO - time_diff_means_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats_fcndev/time_diff_means.npy 2025-09-17 21:58:14,332 - root - INFO - time_diff_stds_path /pscratch/sd/p/pharring/74var-6hourly/staging/stats_fcndev/time_diff_stds.npy 2025-09-17 21:58:14,333 - root - INFO - nettype SFNO 2025-09-17 21:58:14,333 - root - INFO - model_grid_type equiangular 2025-09-17 21:58:14,333 - root - INFO - sht_grid_type legendre-gauss 2025-09-17 21:58:14,333 - root - INFO - filter_type linear 2025-09-17 21:58:14,333 - root - INFO - complex_activation real 2025-09-17 21:58:14,333 - root - INFO - normalization_layer instance_norm 2025-09-17 21:58:14,333 - root - INFO - use_mlp True 2025-09-17 21:58:14,333 - root - INFO - mlp_mode serial 2025-09-17 21:58:14,333 - root - INFO - 
mlp_ratio 2 2025-09-17 21:58:14,333 - root - INFO - separable False 2025-09-17 21:58:14,333 - root - INFO - operator_type dhconv 2025-09-17 21:58:14,333 - root - INFO - activation_function gelu 2025-09-17 21:58:14,333 - root - INFO - pos_embed none 2025-09-17 21:58:14,333 - root - INFO - channel_weights auto 2025-09-17 21:58:14,333 - root - INFO - n_eval_samples 8760 2025-09-17 21:58:14,333 - root - INFO - batch_size 1 2025-09-17 21:58:14,333 - root - INFO - weight_decay 0.0 2025-09-17 21:58:14,333 - root - INFO - scheduler_factor 0.1 2025-09-17 21:58:14,333 - root - INFO - scheduler_patience 10 2025-09-17 21:58:14,333 - root - INFO - scheduler_step_size 100 2025-09-17 21:58:14,333 - root - INFO - scheduler_gamma 0.5 2025-09-17 21:58:14,333 - root - INFO - lr_warmup_steps 0 2025-09-17 21:58:14,334 - root - INFO - verbose False 2025-09-17 21:58:14,334 - root - INFO - wireup_info mpi 2025-09-17 21:58:14,334 - root - INFO - wireup_store tcp 2025-09-17 21:58:14,334 - root - INFO - num_data_workers 2 2025-09-17 21:58:14,334 - root - INFO - num_visualization_workers 2 2025-09-17 21:58:14,334 - root - INFO - dt 1 2025-09-17 21:58:14,334 - root - INFO - n_history 0 2025-09-17 21:58:14,334 - root - INFO - prediction_type iterative 2025-09-17 21:58:14,334 - root - INFO - prediction_length 35 2025-09-17 21:58:14,334 - root - INFO - n_initial_conditions 5 2025-09-17 21:58:14,334 - root - INFO - n_train_samples_per_epoch 54000 2025-09-17 21:58:14,334 - root - INFO - ics_type specify_number 2025-09-17 21:58:14,334 - root - INFO - save_raw_forecasts True 2025-09-17 21:58:14,334 - root - INFO - save_channel False 2025-09-17 21:58:14,334 - root - INFO - masked_acc False 2025-09-17 21:58:14,334 - root - INFO - maskpath None 2025-09-17 21:58:14,334 - root - INFO - perturb False 2025-09-17 21:58:14,334 - root - INFO - add_noise False 2025-09-17 21:58:14,334 - root - INFO - noise_std 0.0 2025-09-17 21:58:14,334 - root - INFO - target default 2025-09-17 21:58:14,334 - root - INFO - 
normalize_residual False 2025-09-17 21:58:14,334 - root - INFO - channel_names ['u10m', 'v10m', 'u100m', 'v100m', 't2m', 'sp', 'msl', 'tcwv', 'd2m', 'u50', 'u100', 'u150', 'u200', 'u250', 'u300', 'u400', 'u500', 'u600', 'u700', 'u850', 'u925', 'u1000', 'v50', 'v100', 'v150', 'v200', 'v250', 'v300', 'v400', 'v500', 'v600', 'v700', 'v850', 'v925', 'v1000', 'z50', 'z100', 'z150', 'z200', 'z250', 'z300', 'z400', 'z500', 'z600', 'z700', 'z850', 'z925', 'z1000', 't50', 't100', 't150', 't200', 't250', 't300', 't400', 't500', 't600', 't700', 't850', 't925', 't1000', 'q50', 'q100', 'q150', 'q200', 'q250', 'q300', 'q400', 'q500', 'q600', 'q700', 'q850', 'q925', 'q1000'] 2025-09-17 21:58:14,334 - root - INFO - normalization zscore 2025-09-17 21:58:14,334 - root - INFO - add_grid True 2025-09-17 21:58:14,335 - root - INFO - gridtype sinusoidal 2025-09-17 21:58:14,335 - root - INFO - grid_num_frequencies 16 2025-09-17 21:58:14,335 - root - INFO - roll False 2025-09-17 21:58:14,335 - root - INFO - add_zenith True 2025-09-17 21:58:14,335 - root - INFO - add_orography True 2025-09-17 21:58:14,335 - root - INFO - orography_path /global/cfs/cdirs/m3522/cmip6/ERA5/e5.oper.invariant/197901/e5.oper.invariant.128_129_z.ll025sc.1979010100_1979010100.nc 2025-09-17 21:58:14,335 - root - INFO - add_landmask True 2025-09-17 21:58:14,335 - root - INFO - landmask_path /global/cfs/cdirs/m3522/cmip6/ERA5/e5.oper.invariant/197901/e5.oper.invariant.128_172_lsm.ll025sc.1979010100_1979010100.nc 2025-09-17 21:58:14,335 - root - INFO - log_to_screen True 2025-09-17 21:58:14,335 - root - INFO - log_to_wandb True 2025-09-17 21:58:14,335 - root - INFO - log_video 20 2025-09-17 21:58:14,335 - root - INFO - save_checkpoint legacy 2025-09-17 21:58:14,335 - root - INFO - optimizer_type AdamW 2025-09-17 21:58:14,335 - root - INFO - optimizer_beta1 0.9 2025-09-17 21:58:14,335 - root - INFO - optimizer_beta2 0.95 2025-09-17 21:58:14,335 - root - INFO - optimizer_max_grad_norm 32 2025-09-17 21:58:14,335 - root - 
INFO - crop_size_x None 2025-09-17 21:58:14,335 - root - INFO - crop_size_y None 2025-09-17 21:58:14,335 - root - INFO - inf_data_path /pscratch/sd/p/pharring/74var-6hourly/staging/out_of_sample 2025-09-17 21:58:14,335 - root - INFO - wandb_name None 2025-09-17 21:58:14,335 - root - INFO - wandb_project ERA5_sfno 2025-09-17 21:58:14,335 - root - INFO - wandb_entity weatherbenching 2025-09-17 21:58:14,336 - root - INFO - pos_drop_rate 0.1 2025-09-17 21:58:14,336 - root - INFO - initialization_seed 77 2025-09-17 21:58:14,336 - root - INFO - epsilon_factor 0 2025-09-17 21:58:14,336 - root - INFO - fin_parallel_size 1 2025-09-17 21:58:14,336 - root - INFO - fout_parallel_size 1 2025-09-17 21:58:14,336 - root - INFO - h_parallel_size 4 2025-09-17 21:58:14,336 - root - INFO - w_parallel_size 1 2025-09-17 21:58:14,336 - root - INFO - model_parallel_sizes [4, 1, 1, 1] 2025-09-17 21:58:14,336 - root - INFO - model_parallel_names ['h', 'w', 'fin', 'fout'] 2025-09-17 21:58:14,336 - root - INFO - parameters_reduction_buffer_count 1 2025-09-17 21:58:14,336 - root - INFO - load_checkpoint legacy 2025-09-17 21:58:14,336 - root - INFO - world_size 128 2025-09-17 21:58:14,336 - root - INFO - global_batch_size 32 2025-09-17 21:58:14,336 - root - INFO - experiment_dir /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77 2025-09-17 21:58:14,336 - root - INFO - checkpoint_path /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar 2025-09-17 21:58:14,336 - root - INFO - best_checkpoint_path /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar 2025-09-17 21:58:14,336 - root - INFO - resuming False 2025-09-17 21:58:14,336 - 
root - INFO - amp_mode bf16 2025-09-17 21:58:14,336 - root - INFO - jit_mode none 2025-09-17 21:58:14,336 - root - INFO - cuda_graph_mode none 2025-09-17 21:58:14,336 - root - INFO - skip_validation False 2025-09-17 21:58:14,336 - root - INFO - enable_odirect False 2025-09-17 21:58:14,336 - root - INFO - checkpointing 0 2025-09-17 21:58:14,337 - root - INFO - enable_synthetic_data False 2025-09-17 21:58:14,337 - root - INFO - split_data_channels False 2025-09-17 21:58:14,337 - root - INFO - print_timings_frequency -1 2025-09-17 21:58:14,337 - root - INFO - multistep_count 2 2025-09-17 21:58:14,337 - root - INFO - n_future 1 2025-09-17 21:58:14,337 - root - INFO - enable_benchy False 2025-09-17 21:58:14,337 - root - INFO - disable_ddp False 2025-09-17 21:58:14,337 - root - INFO - enable_grad_anomaly_detection False 2025-09-17 21:58:14,337 - root - INFO - wandb_dir /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77 2025-09-17 21:58:14,337 - root - INFO - _yaml_filename /global/u2/a/amahesh/ms_finetune/modulus-makani-fork/config/sfnonet.yaml 2025-09-17 21:58:14,337 - root - INFO - _config_name multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2 2025-09-17 21:58:14,337 - root - INFO - --------------------------------------------------- 2025-09-17 21:58:14,338 - root - INFO - Using seed 77 2025-09-17 21:58:15,535 - root - INFO - Enabling automatic mixed precision in bf16. 
2025-09-17 21:58:19,627 - root - INFO - Using channel names: ['u10m', 'v10m', 'u100m', 'v100m', 't2m', 'sp', 'msl', 'tcwv', 'd2m', 'u50', 'u100', 'u150', 'u200', 'u250', 'u300', 'u400', 'u500', 'u600', 'u700', 'u850', 'u925', 'u1000', 'v50', 'v100', 'v150', 'v200', 'v250', 'v300', 'v400', 'v500', 'v600', 'v700', 'v850', 'v925', 'v1000', 'z50', 'z100', 'z150', 'z200', 'z250', 'z300', 'z400', 'z500', 'z600', 'z700', 'z850', 'z925', 'z1000', 't50', 't100', 't150', 't200', 't250', 't300', 't400', 't500', 't600', 't700', 't850', 't925', 't1000', 'q50', 'q100', 'q150', 'q200', 'q250', 'q300', 'q400', 'q500', 'q600', 'q700', 'q850', 'q925', 'q1000'] 2025-09-17 21:58:19,628 - root - INFO - initializing data loader 2025-09-17 21:58:22,129 - root - INFO - Getting file stats from /pscratch/sd/p/pharring/74var-6hourly/staging/train/1979.h5 2025-09-17 21:58:22,171 - root - INFO - Average number of samples per year: 1461.0 2025-09-17 21:58:22,171 - root - INFO - Found data at path ['/pscratch/sd/p/pharring/74var-6hourly/staging/train']. Number of examples: 54056. Full image Shape: 721 x 1440 x 74. Read Shape: 181 x 1440 x 74 2025-09-17 21:58:22,171 - root - INFO - Using 54056 from the total number of available samples with 54000 samples per epoch (corresponds to 1687 steps for 32 shards with local batch size 1) 2025-09-17 21:58:22,172 - root - INFO - Delta t: 6 hours 2025-09-17 21:58:22,172 - root - INFO - Including 6 hours of past history in training at a frequency of 6 hours 2025-09-17 21:58:22,172 - root - INFO - Including 12 hours of future targets in training at a frequency of 6 hours 2025-09-17 21:58:52,246 - root - INFO - Getting file stats from /pscratch/sd/p/pharring/74var-6hourly/staging/valid/2016.h5 2025-09-17 21:58:52,248 - root - INFO - Average number of samples per year: 1462.0 2025-09-17 21:58:52,248 - root - INFO - Found data at path ['/pscratch/sd/p/pharring/74var-6hourly/staging/valid']. Number of examples: 2924. Full image Shape: 721 x 1440 x 74. 
Read Shape: 181 x 1440 x 74 2025-09-17 21:58:52,249 - root - INFO - Using 2924 from the total number of available samples with 2924 samples per epoch (corresponds to 91 steps for 32 shards with local batch size 1) 2025-09-17 21:58:52,249 - root - INFO - Delta t: 6 hours 2025-09-17 21:58:52,249 - root - INFO - Including 6 hours of past history in training at a frequency of 6 hours 2025-09-17 21:58:52,249 - root - INFO - Including 12 hours of future targets in training at a frequency of 6 hours 2025-09-17 21:59:23,221 - root - INFO - data loader initialized 2025-09-17 21:59:30,000 - root - INFO - Auxiliary channel names: ['xzen', 'xgrlat', 'xgrlon', 'xoro', 'xlsml', 'xlsms'] 2025-09-17 21:59:33,643 - root - INFO - MultiStepWrapper( (preprocessor): Preprocessor2D() (model): SphericalFourierNeuralOperatorNet( (trans_down): DistributedRealSHT( nlat=721, nlon=1440, lmax=360, mmax=361, grid=equiangular, csphase=True ) (itrans_up): DistributedInverseRealSHT( nlat=721, nlon=1440, lmax=360, mmax=361, grid=equiangular, csphase=True ) (trans): DistributedRealSHT( nlat=360, nlon=720, lmax=360, mmax=361, grid=legendre-gauss, csphase=True ) (itrans): DistributedInverseRealSHT( nlat=360, nlon=720, lmax=360, mmax=361, grid=legendre-gauss, csphase=True ) (encoder): EncoderDecoder( (fwd): Sequential( (0): Conv2d(110, 620, kernel_size=(1, 1), stride=(1, 1)) (1): GELU(approximate='none') (2): Conv2d(620, 620, kernel_size=(1, 1), stride=(1, 1), bias=False) ) ) (pos_drop): Dropout(p=0.1, inplace=False) (blocks): ModuleList( (0): FourierNeuralOperatorBlock( (norm0): DistributedInstanceNorm2d() (filter): SpectralFilterLayer( (filter): SpectralConv( (forward_transform): DistributedRealSHT( nlat=721, nlon=1440, lmax=360, mmax=361, grid=equiangular, csphase=True ) (inverse_transform): DistributedInverseRealSHT( nlat=360, nlon=720, lmax=360, mmax=361, grid=legendre-gauss, csphase=True ) ) ) (act_layer0): GELU(approximate='none') (norm1): DistributedInstanceNorm2d() (outer_skip): Conv2d(620, 
620, kernel_size=(1, 1), stride=(1, 1), bias=False) (mlp): MLP( (fwd): Sequential( (0): Conv2d(620, 1240, kernel_size=(1, 1), stride=(1, 1)) (1): GELU(approximate='none') (2): Identity() (3): Conv2d(1240, 620, kernel_size=(1, 1), stride=(1, 1)) (4): Identity() ) ) (drop_path): Identity() ) (1-6): 6 x FourierNeuralOperatorBlock( (norm0): DistributedInstanceNorm2d() (filter): SpectralFilterLayer( (filter): SpectralConv( (forward_transform): DistributedRealSHT( nlat=360, nlon=720, lmax=360, mmax=361, grid=legendre-gauss, csphase=True ) (inverse_transform): DistributedInverseRealSHT( nlat=360, nlon=720, lmax=360, mmax=361, grid=legendre-gauss, csphase=True ) ) ) (act_layer0): GELU(approximate='none') (norm1): DistributedInstanceNorm2d() (outer_skip): Conv2d(620, 620, kernel_size=(1, 1), stride=(1, 1), bias=False) (mlp): MLP( (fwd): Sequential( (0): Conv2d(620, 1240, kernel_size=(1, 1), stride=(1, 1)) (1): GELU(approximate='none') (2): Identity() (3): Conv2d(1240, 620, kernel_size=(1, 1), stride=(1, 1)) (4): Identity() ) ) (drop_path): Identity() ) (7): FourierNeuralOperatorBlock( (norm0): DistributedInstanceNorm2d() (filter): SpectralFilterLayer( (filter): SpectralConv( (forward_transform): DistributedRealSHT( nlat=360, nlon=720, lmax=360, mmax=361, grid=legendre-gauss, csphase=True ) (inverse_transform): DistributedInverseRealSHT( nlat=721, nlon=1440, lmax=360, mmax=361, grid=equiangular, csphase=True ) ) ) (act_layer0): GELU(approximate='none') (norm1): DistributedInstanceNorm2d() (outer_skip): Conv2d(620, 620, kernel_size=(1, 1), stride=(1, 1), bias=False) (mlp): MLP( (fwd): Sequential( (0): Conv2d(620, 1240, kernel_size=(1, 1), stride=(1, 1)) (1): GELU(approximate='none') (2): Identity() (3): Conv2d(1240, 620, kernel_size=(1, 1), stride=(1, 1)) (4): Identity() ) ) (drop_path): Identity() ) ) (decoder): EncoderDecoder( (fwd): Sequential( (0): Conv2d(620, 620, kernel_size=(1, 1), stride=(1, 1)) (1): GELU(approximate='none') (2): Conv2d(620, 74, kernel_size=(1, 1), 
stride=(1, 1), bias=False) ) ) (residual_transform): Conv2d(110, 74, kernel_size=(1, 1), stride=(1, 1), bias=False) ) ) 2025-09-17 21:59:37,302 - root - INFO - using AdamW 2025-09-17 21:59:39,219 - root - INFO - Loading checkpoint /pscratch/sd/a/amahesh/recovered_fcn_training/modulus-makani_runs-0.1.0-fcndev_stats/sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp0.tar in legacy mode 2025-09-17 21:59:47,338 - root - INFO - Number of trainable model parameters: 1123374980 2025-09-17 21:59:47,340 - root - INFO - Scaffolding memory high watermark: 12.662841796875 GB (6.0722336769104 GB for pytorch) 2025-09-17 21:59:47,340 - root - INFO - Starting Training Loop... 2025-09-17 22:23:03,442 - py.warnings - WARNING - /usr/local/lib/python3.10/dist-packages/wandb/wandb_torch.py:191: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/Copy.cpp:299.) flat = flat.type(torch.cuda.FloatTensor) 2025-09-17 22:39:36,626 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-17 22:39:47,166 - root - INFO - Save checkpoint (legacy): 10.54 sec (3.3527612686157227e-07) GB 2025-09-17 22:39:47,167 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-17 22:39:57,342 - root - INFO - Save checkpoint (legacy): 10.17 sec (3.3527612686157227e-07) GB 2025-09-17 22:39:57,343 - root - INFO - -------------------------------------------------- 2025-09-17 22:39:57,343 - root - INFO - Epoch 1 summary: 2025-09-17 22:39:57,344 - root - INFO - Performance Parameters: 2025-09-17 
22:39:57,344 - root - INFO - training steps: 1686 2025-09-17 22:39:57,344 - root - INFO - validation steps: 91 2025-09-17 22:39:57,344 - root - INFO - memory footprint [GB]: 29.36 2025-09-17 22:39:57,344 - root - INFO - epoch time [s]: 2408.96 2025-09-17 22:39:57,345 - root - INFO - training time [s]: 2344.08 2025-09-17 22:39:57,345 - root - INFO - validation time [s]: 43.88 2025-09-17 22:39:57,345 - root - INFO - visualization time [s]: 0.41 2025-09-17 22:39:57,345 - root - INFO - training step time [ms]: 1390.32 2025-09-17 22:39:57,345 - root - INFO - minimal IO rate [GB/s]: 19.84 2025-09-17 22:39:57,346 - root - INFO - Metrics: 2025-09-17 22:39:57,346 - root - INFO - training loss: 0.15044062875608388 2025-09-17 22:39:57,346 - root - INFO - validation loss: 0.12296061217784882 2025-09-17 22:39:57,346 - root - INFO - validation L1: 0.07514476776123047 2025-09-17 22:39:57,346 - root - INFO - validation u10m: 0.7479037046432495 2025-09-17 22:39:57,347 - root - INFO - validation t2m: 0.6713765859603882 2025-09-17 22:39:57,347 - root - INFO - validation u500: 1.3867844343185425 2025-09-17 22:39:57,347 - root - INFO - validation z500: 34.04854202270508 2025-09-17 22:39:57,347 - root - INFO - validation q500: 0.0002480344264768064 2025-09-17 22:39:57,347 - root - INFO - ACC AUC u10m: 0.48983699083328247 2025-09-17 22:39:57,347 - root - INFO - ACC AUC t2m: 0.4950104355812073 2025-09-17 22:39:57,348 - root - INFO - ACC AUC u500: 0.4928547441959381 2025-09-17 22:39:57,348 - root - INFO - ACC AUC z500: 0.4995981454849243 2025-09-17 22:39:57,348 - root - INFO - ACC AUC q500: 0.4800533652305603 2025-09-17 22:39:57,349 - root - INFO - -------------------------------------------------- 2025-09-17 23:19:28,526 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-17 23:19:39,363 - 
root - INFO - Save checkpoint (legacy): 10.84 sec (3.3527612686157227e-07) GB 2025-09-17 23:19:39,364 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-17 23:19:50,008 - root - INFO - Save checkpoint (legacy): 10.64 sec (3.3527612686157227e-07) GB 2025-09-17 23:19:50,010 - root - INFO - -------------------------------------------------- 2025-09-17 23:19:50,011 - root - INFO - Epoch 2 summary: 2025-09-17 23:19:50,011 - root - INFO - Performance Parameters: 2025-09-17 23:19:50,011 - root - INFO - training steps: 1687 2025-09-17 23:19:50,012 - root - INFO - validation steps: 91 2025-09-17 23:19:50,012 - root - INFO - memory footprint [GB]: 28.88 2025-09-17 23:19:50,012 - root - INFO - epoch time [s]: 2392.14 2025-09-17 23:19:50,013 - root - INFO - training time [s]: 2328.73 2025-09-17 23:19:50,013 - root - INFO - validation time [s]: 41.68 2025-09-17 23:19:50,013 - root - INFO - visualization time [s]: 0.00 2025-09-17 23:19:50,013 - root - INFO - training step time [ms]: 1380.40 2025-09-17 23:19:50,014 - root - INFO - minimal IO rate [GB/s]: 19.99 2025-09-17 23:19:50,014 - root - INFO - Metrics: 2025-09-17 23:19:50,014 - root - INFO - training loss: 0.12428157719183838 2025-09-17 23:19:50,014 - root - INFO - validation loss: 0.1060069352388382 2025-09-17 23:19:50,014 - root - INFO - validation L1: 0.07054098695516586 2025-09-17 23:19:50,015 - root - INFO - validation u10m: 0.6762495636940002 2025-09-17 23:19:50,015 - root - INFO - validation t2m: 0.6309836506843567 2025-09-17 23:19:50,015 - root - INFO - validation u500: 1.311978816986084 2025-09-17 23:19:50,015 - root - INFO - validation z500: 29.244613647460938 2025-09-17 23:19:50,016 - root - INFO - validation q500: 0.00023181294091045856 2025-09-17 23:19:50,016 - root - INFO - ACC AUC u10m: 0.49162358045578003 
2025-09-17 23:19:50,016 - root - INFO - ACC AUC t2m: 0.49572324752807617 2025-09-17 23:19:50,016 - root - INFO - ACC AUC u500: 0.4936563968658447 2025-09-17 23:19:50,016 - root - INFO - ACC AUC z500: 0.49972766637802124 2025-09-17 23:19:50,017 - root - INFO - ACC AUC q500: 0.4823915362358093 2025-09-17 23:19:50,017 - root - INFO - -------------------------------------------------- 2025-09-17 23:59:27,565 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-17 23:59:38,315 - root - INFO - Save checkpoint (legacy): 10.75 sec (3.3527612686157227e-07) GB 2025-09-17 23:59:38,317 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-17 23:59:48,813 - root - INFO - Save checkpoint (legacy): 10.50 sec (3.3527612686157227e-07) GB 2025-09-17 23:59:49,334 - root - INFO - -------------------------------------------------- 2025-09-17 23:59:49,334 - root - INFO - Epoch 3 summary: 2025-09-17 23:59:49,335 - root - INFO - Performance Parameters: 2025-09-17 23:59:49,335 - root - INFO - training steps: 1687 2025-09-17 23:59:49,335 - root - INFO - validation steps: 91 2025-09-17 23:59:49,335 - root - INFO - memory footprint [GB]: 28.88 2025-09-17 23:59:49,336 - root - INFO - epoch time [s]: 2399.01 2025-09-17 23:59:49,336 - root - INFO - training time [s]: 2335.76 2025-09-17 23:59:49,336 - root - INFO - validation time [s]: 41.29 2025-09-17 23:59:49,336 - root - INFO - visualization time [s]: 0.00 2025-09-17 23:59:49,336 - root - INFO - training step time [ms]: 1384.56 2025-09-17 23:59:49,337 - root - INFO - minimal IO rate [GB/s]: 19.93 2025-09-17 23:59:49,337 - root - INFO - Metrics: 
2025-09-17 23:59:49,337 - root - INFO - training loss: 0.11152564129232427 2025-09-17 23:59:49,337 - root - INFO - validation loss: 0.09961038827896118 2025-09-17 23:59:49,337 - root - INFO - validation L1: 0.0686555951833725 2025-09-17 23:59:49,338 - root - INFO - validation u10m: 0.6436118483543396 2025-09-17 23:59:49,338 - root - INFO - validation t2m: 0.6147364974021912 2025-09-17 23:59:49,338 - root - INFO - validation u500: 1.2801556587219238 2025-09-17 23:59:49,338 - root - INFO - validation z500: 28.882726669311523 2025-09-17 23:59:49,338 - root - INFO - validation q500: 0.0002243449998786673 2025-09-17 23:59:49,338 - root - INFO - ACC AUC u10m: 0.4923299252986908 2025-09-17 23:59:49,339 - root - INFO - ACC AUC t2m: 0.4959581792354584 2025-09-17 23:59:49,339 - root - INFO - ACC AUC u500: 0.4939085841178894 2025-09-17 23:59:49,339 - root - INFO - ACC AUC z500: 0.49972957372665405 2025-09-17 23:59:49,339 - root - INFO - ACC AUC q500: 0.48319730162620544 2025-09-17 23:59:49,339 - root - INFO - -------------------------------------------------- 2025-09-18 00:39:35,108 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 00:39:45,846 - root - INFO - Save checkpoint (legacy): 10.74 sec (3.3527612686157227e-07) GB 2025-09-18 00:39:45,849 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 00:39:57,633 - root - INFO - Save checkpoint (legacy): 11.78 sec (3.3527612686157227e-07) GB 2025-09-18 00:39:57,634 - root - INFO - -------------------------------------------------- 2025-09-18 00:39:57,635 - root - INFO - Epoch 4 summary: 2025-09-18 00:39:57,635 - root - INFO - 
Performance Parameters: 2025-09-18 00:39:57,635 - root - INFO - training steps: 1687 2025-09-18 00:39:57,635 - root - INFO - validation steps: 91 2025-09-18 00:39:57,635 - root - INFO - memory footprint [GB]: 28.88 2025-09-18 00:39:57,636 - root - INFO - epoch time [s]: 2406.35 2025-09-18 00:39:57,636 - root - INFO - training time [s]: 2342.38 2025-09-18 00:39:57,636 - root - INFO - validation time [s]: 41.32 2025-09-18 00:39:57,636 - root - INFO - visualization time [s]: 0.00 2025-09-18 00:39:57,636 - root - INFO - training step time [ms]: 1388.49 2025-09-18 00:39:57,637 - root - INFO - minimal IO rate [GB/s]: 19.87 2025-09-18 00:39:57,637 - root - INFO - Metrics: 2025-09-18 00:39:57,637 - root - INFO - training loss: 0.1091360353181534 2025-09-18 00:39:57,637 - root - INFO - validation loss: 0.09674004465341568 2025-09-18 00:39:57,637 - root - INFO - validation L1: 0.06783125549554825 2025-09-18 00:39:57,638 - root - INFO - validation u10m: 0.6307315826416016 2025-09-18 00:39:57,638 - root - INFO - validation t2m: 0.6084843873977661 2025-09-18 00:39:57,638 - root - INFO - validation u500: 1.2606196403503418 2025-09-18 00:39:57,638 - root - INFO - validation z500: 28.273496627807617 2025-09-18 00:39:57,639 - root - INFO - validation q500: 0.00022008558153174818 2025-09-18 00:39:57,639 - root - INFO - ACC AUC u10m: 0.49256986379623413 2025-09-18 00:39:57,639 - root - INFO - ACC AUC t2m: 0.4960508346557617 2025-09-18 00:39:57,639 - root - INFO - ACC AUC u500: 0.49406835436820984 2025-09-18 00:39:57,639 - root - INFO - ACC AUC z500: 0.499744713306427 2025-09-18 00:39:57,639 - root - INFO - ACC AUC q500: 0.4836815595626831 2025-09-18 00:39:57,640 - root - INFO - -------------------------------------------------- 2025-09-18 01:20:02,046 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy 
format) 2025-09-18 01:20:12,888 - root - INFO - Save checkpoint (legacy): 10.84 sec (3.3527612686157227e-07) GB 2025-09-18 01:20:12,890 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 01:20:24,645 - root - INFO - Save checkpoint (legacy): 11.75 sec (3.3527612686157227e-07) GB 2025-09-18 01:20:24,809 - root - INFO - -------------------------------------------------- 2025-09-18 01:20:24,810 - root - INFO - Epoch 5 summary: 2025-09-18 01:20:24,810 - root - INFO - Performance Parameters: 2025-09-18 01:20:24,810 - root - INFO - training steps: 1687 2025-09-18 01:20:24,810 - root - INFO - validation steps: 91 2025-09-18 01:20:24,811 - root - INFO - memory footprint [GB]: 28.11 2025-09-18 01:20:24,811 - root - INFO - epoch time [s]: 2426.86 2025-09-18 01:20:24,811 - root - INFO - training time [s]: 2345.46 2025-09-18 01:20:24,811 - root - INFO - validation time [s]: 58.32 2025-09-18 01:20:24,811 - root - INFO - visualization time [s]: 0.00 2025-09-18 01:20:24,812 - root - INFO - training step time [ms]: 1390.31 2025-09-18 01:20:24,812 - root - INFO - minimal IO rate [GB/s]: 19.85 2025-09-18 01:20:24,812 - root - INFO - Metrics: 2025-09-18 01:20:24,812 - root - INFO - training loss: 0.10862162365588932 2025-09-18 01:20:24,812 - root - INFO - validation loss: 0.09491350501775742 2025-09-18 01:20:24,813 - root - INFO - validation L1: 0.06727277487516403 2025-09-18 01:20:24,813 - root - INFO - validation u10m: 0.6217071413993835 2025-09-18 01:20:24,813 - root - INFO - validation t2m: 0.6042304635047913 2025-09-18 01:20:24,813 - root - INFO - validation u500: 1.2478368282318115 2025-09-18 01:20:24,813 - root - INFO - validation z500: 28.118196487426758 2025-09-18 01:20:24,814 - root - INFO - validation q500: 0.00021792660118080676 2025-09-18 01:20:24,814 - root - INFO - 
ACC AUC u10m: 0.49276506900787354 2025-09-18 01:20:24,814 - root - INFO - ACC AUC t2m: 0.4961049556732178 2025-09-18 01:20:24,814 - root - INFO - ACC AUC u500: 0.49417078495025635 2025-09-18 01:20:24,814 - root - INFO - ACC AUC z500: 0.49974820017814636 2025-09-18 01:20:24,815 - root - INFO - ACC AUC q500: 0.4839036166667938 2025-09-18 01:20:24,815 - root - INFO - -------------------------------------------------- 2025-09-18 02:00:07,851 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 02:00:19,658 - root - INFO - Save checkpoint (legacy): 11.81 sec (3.3527612686157227e-07) GB 2025-09-18 02:00:19,660 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 02:00:31,313 - root - INFO - Save checkpoint (legacy): 11.65 sec (3.3527612686157227e-07) GB 2025-09-18 02:00:31,314 - root - INFO - -------------------------------------------------- 2025-09-18 02:00:31,314 - root - INFO - Epoch 6 summary: 2025-09-18 02:00:31,314 - root - INFO - Performance Parameters: 2025-09-18 02:00:31,315 - root - INFO - training steps: 1687 2025-09-18 02:00:31,315 - root - INFO - validation steps: 91 2025-09-18 02:00:31,315 - root - INFO - memory footprint [GB]: 28.49 2025-09-18 02:00:31,315 - root - INFO - epoch time [s]: 2406.21 2025-09-18 02:00:31,315 - root - INFO - training time [s]: 2338.22 2025-09-18 02:00:31,316 - root - INFO - validation time [s]: 44.31 2025-09-18 02:00:31,316 - root - INFO - visualization time [s]: 0.00 2025-09-18 02:00:31,316 - root - INFO - training step time [ms]: 1386.02 2025-09-18 02:00:31,316 - root - INFO - minimal IO rate [GB/s]: 19.91 2025-09-18 
02:00:31,316 - root - INFO - Metrics: 2025-09-18 02:00:31,317 - root - INFO - training loss: 0.10371404199272369 2025-09-18 02:00:31,317 - root - INFO - validation loss: 0.0937279537320137 2025-09-18 02:00:31,317 - root - INFO - validation L1: 0.06691930443048477 2025-09-18 02:00:31,317 - root - INFO - validation u10m: 0.615944504737854 2025-09-18 02:00:31,317 - root - INFO - validation t2m: 0.5993876457214355 2025-09-18 02:00:31,318 - root - INFO - validation u500: 1.2378160953521729 2025-09-18 02:00:31,318 - root - INFO - validation z500: 28.159040451049805 2025-09-18 02:00:31,318 - root - INFO - validation q500: 0.0002160001895390451 2025-09-18 02:00:31,318 - root - INFO - ACC AUC u10m: 0.49288541078567505 2025-09-18 02:00:31,319 - root - INFO - ACC AUC t2m: 0.4961746335029602 2025-09-18 02:00:31,319 - root - INFO - ACC AUC u500: 0.49425560235977173 2025-09-18 02:00:31,319 - root - INFO - ACC AUC z500: 0.4997476637363434 2025-09-18 02:00:31,319 - root - INFO - ACC AUC q500: 0.48413407802581787 2025-09-18 02:00:31,319 - root - INFO - -------------------------------------------------- 2025-09-18 02:40:12,519 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 02:40:26,600 - root - INFO - Save checkpoint (legacy): 14.08 sec (3.3527612686157227e-07) GB 2025-09-18 02:40:26,602 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 02:40:38,117 - root - INFO - Save checkpoint (legacy): 11.52 sec (3.3527612686157227e-07) GB 2025-09-18 02:40:38,119 - root - INFO - -------------------------------------------------- 2025-09-18 02:40:38,119 - root - INFO - Epoch 7 summary: 
2025-09-18 02:40:38,119 - root - INFO - Performance Parameters: 2025-09-18 02:40:38,119 - root - INFO - training steps: 1687 2025-09-18 02:40:38,120 - root - INFO - validation steps: 91 2025-09-18 02:40:38,120 - root - INFO - memory footprint [GB]: 27.69 2025-09-18 02:40:38,120 - root - INFO - epoch time [s]: 2406.51 2025-09-18 02:40:38,120 - root - INFO - training time [s]: 2339.40 2025-09-18 02:40:38,121 - root - INFO - validation time [s]: 41.13 2025-09-18 02:40:38,121 - root - INFO - visualization time [s]: 0.00 2025-09-18 02:40:38,121 - root - INFO - training step time [ms]: 1386.72 2025-09-18 02:40:38,121 - root - INFO - minimal IO rate [GB/s]: 19.90 2025-09-18 02:40:38,121 - root - INFO - Metrics: 2025-09-18 02:40:38,122 - root - INFO - training loss: 0.1020412554105188 2025-09-18 02:40:38,122 - root - INFO - validation loss: 0.09286925196647644 2025-09-18 02:40:38,122 - root - INFO - validation L1: 0.06662742793560028 2025-09-18 02:40:38,122 - root - INFO - validation u10m: 0.6123279333114624 2025-09-18 02:40:38,123 - root - INFO - validation t2m: 0.5979657173156738 2025-09-18 02:40:38,123 - root - INFO - validation u500: 1.2313544750213623 2025-09-18 02:40:38,123 - root - INFO - validation z500: 28.252819061279297 2025-09-18 02:40:38,123 - root - INFO - validation q500: 0.0002150951768271625 2025-09-18 02:40:38,123 - root - INFO - ACC AUC u10m: 0.4929358959197998 2025-09-18 02:40:38,124 - root - INFO - ACC AUC t2m: 0.4961931109428406 2025-09-18 02:40:38,124 - root - INFO - ACC AUC u500: 0.49429306387901306 2025-09-18 02:40:38,124 - root - INFO - ACC AUC z500: 0.49974435567855835 2025-09-18 02:40:38,124 - root - INFO - ACC AUC q500: 0.48420870304107666 2025-09-18 02:40:38,124 - root - INFO - -------------------------------------------------- 2025-09-18 03:20:18,544 - root - INFO - Writing checkpoint to 
/pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 03:20:32,634 - root - INFO - Save checkpoint (legacy): 14.09 sec (3.3527612686157227e-07) GB 2025-09-18 03:20:32,635 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 03:20:46,658 - root - INFO - Save checkpoint (legacy): 14.02 sec (3.3527612686157227e-07) GB 2025-09-18 03:20:46,659 - root - INFO - -------------------------------------------------- 2025-09-18 03:20:46,660 - root - INFO - Epoch 8 summary: 2025-09-18 03:20:46,660 - root - INFO - Performance Parameters: 2025-09-18 03:20:46,661 - root - INFO - training steps: 1687 2025-09-18 03:20:46,661 - root - INFO - validation steps: 91 2025-09-18 03:20:46,661 - root - INFO - memory footprint [GB]: 28.88 2025-09-18 03:20:46,662 - root - INFO - epoch time [s]: 2408.23 2025-09-18 03:20:46,662 - root - INFO - training time [s]: 2336.69 2025-09-18 03:20:46,662 - root - INFO - validation time [s]: 43.30 2025-09-18 03:20:46,663 - root - INFO - visualization time [s]: 0.00 2025-09-18 03:20:46,663 - root - INFO - training step time [ms]: 1385.11 2025-09-18 03:20:46,663 - root - INFO - minimal IO rate [GB/s]: 19.92 2025-09-18 03:20:46,664 - root - INFO - Metrics: 2025-09-18 03:20:46,664 - root - INFO - training loss: 0.09904173568106216 2025-09-18 03:20:46,665 - root - INFO - validation loss: 0.09222850203514099 2025-09-18 03:20:46,665 - root - INFO - validation L1: 0.06641560047864914 2025-09-18 03:20:46,666 - root - INFO - validation u10m: 0.6090171933174133 2025-09-18 03:20:46,666 - root - INFO - validation t2m: 0.5970665812492371 2025-09-18 03:20:46,666 - root - INFO - validation u500: 
1.2254042625427246 2025-09-18 03:20:46,667 - root - INFO - validation z500: 28.22374725341797 2025-09-18 03:20:46,667 - root - INFO - validation q500: 0.00021421107521746308 2025-09-18 03:20:46,667 - root - INFO - ACC AUC u10m: 0.4930022656917572 2025-09-18 03:20:46,668 - root - INFO - ACC AUC t2m: 0.49619948863983154 2025-09-18 03:20:46,668 - root - INFO - ACC AUC u500: 0.494342178106308 2025-09-18 03:20:46,668 - root - INFO - ACC AUC z500: 0.4997457265853882 2025-09-18 03:20:46,669 - root - INFO - ACC AUC q500: 0.4842933714389801 2025-09-18 03:20:46,669 - root - INFO - -------------------------------------------------- 2025-09-18 04:00:23,811 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 04:00:35,727 - root - INFO - Save checkpoint (legacy): 11.92 sec (3.3527612686157227e-07) GB 2025-09-18 04:00:35,728 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 04:00:47,087 - root - INFO - Save checkpoint (legacy): 11.36 sec (3.3527612686157227e-07) GB 2025-09-18 04:00:47,088 - root - INFO - -------------------------------------------------- 2025-09-18 04:00:47,088 - root - INFO - Epoch 9 summary: 2025-09-18 04:00:47,088 - root - INFO - Performance Parameters: 2025-09-18 04:00:47,088 - root - INFO - training steps: 1687 2025-09-18 04:00:47,089 - root - INFO - validation steps: 91 2025-09-18 04:00:47,089 - root - INFO - memory footprint [GB]: 28.77 2025-09-18 04:00:47,089 - root - INFO - epoch time [s]: 2400.13 2025-09-18 04:00:47,089 - root - INFO - training time [s]: 2335.61 2025-09-18 04:00:47,090 - root - INFO - validation time [s]: 41.04 2025-09-18 
04:00:47,090 - root - INFO - visualization time [s]: 0.00 2025-09-18 04:00:47,090 - root - INFO - training step time [ms]: 1384.48 2025-09-18 04:00:47,090 - root - INFO - minimal IO rate [GB/s]: 19.93 2025-09-18 04:00:47,090 - root - INFO - Metrics: 2025-09-18 04:00:47,091 - root - INFO - training loss: 0.09855784919109123 2025-09-18 04:00:47,091 - root - INFO - validation loss: 0.09157219529151917 2025-09-18 04:00:47,091 - root - INFO - validation L1: 0.06621820479631424 2025-09-18 04:00:47,091 - root - INFO - validation u10m: 0.606751561164856 2025-09-18 04:00:47,091 - root - INFO - validation t2m: 0.5951929092407227 2025-09-18 04:00:47,092 - root - INFO - validation u500: 1.219635248184204 2025-09-18 04:00:47,092 - root - INFO - validation z500: 27.80276870727539 2025-09-18 04:00:47,092 - root - INFO - validation q500: 0.00021331339667085558 2025-09-18 04:00:47,092 - root - INFO - ACC AUC u10m: 0.49304747581481934 2025-09-18 04:00:47,092 - root - INFO - ACC AUC t2m: 0.49623435735702515 2025-09-18 04:00:47,093 - root - INFO - ACC AUC u500: 0.4943862855434418 2025-09-18 04:00:47,093 - root - INFO - ACC AUC z500: 0.4997541904449463 2025-09-18 04:00:47,093 - root - INFO - ACC AUC q500: 0.4844130277633667 2025-09-18 04:00:47,093 - root - INFO - -------------------------------------------------- 2025-09-18 04:40:28,200 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 04:40:40,043 - root - INFO - Save checkpoint (legacy): 11.84 sec (3.3527612686157227e-07) GB 2025-09-18 04:40:40,045 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 04:40:50,940 - root - INFO - 
Save checkpoint (legacy): 10.89 sec (3.3527612686157227e-07) GB 2025-09-18 04:40:51,572 - root - INFO - -------------------------------------------------- 2025-09-18 04:40:51,572 - root - INFO - Epoch 10 summary: 2025-09-18 04:40:51,572 - root - INFO - Performance Parameters: 2025-09-18 04:40:51,573 - root - INFO - training steps: 1687 2025-09-18 04:40:51,573 - root - INFO - validation steps: 91 2025-09-18 04:40:51,573 - root - INFO - memory footprint [GB]: 28.41 2025-09-18 04:40:51,573 - root - INFO - epoch time [s]: 2404.18 2025-09-18 04:40:51,574 - root - INFO - training time [s]: 2334.61 2025-09-18 04:40:51,574 - root - INFO - validation time [s]: 45.93 2025-09-18 04:40:51,574 - root - INFO - visualization time [s]: 0.00 2025-09-18 04:40:51,574 - root - INFO - training step time [ms]: 1383.88 2025-09-18 04:40:51,574 - root - INFO - minimal IO rate [GB/s]: 19.94 2025-09-18 04:40:51,574 - root - INFO - Metrics: 2025-09-18 04:40:51,575 - root - INFO - training loss: 0.09929041014860611 2025-09-18 04:40:51,575 - root - INFO - validation loss: 0.09105746448040009 2025-09-18 04:40:51,575 - root - INFO - validation L1: 0.0660429373383522 2025-09-18 04:40:51,575 - root - INFO - validation u10m: 0.6045577526092529 2025-09-18 04:40:51,576 - root - INFO - validation t2m: 0.5937237739562988 2025-09-18 04:40:51,576 - root - INFO - validation u500: 1.2150465250015259 2025-09-18 04:40:51,576 - root - INFO - validation z500: 27.803077697753906 2025-09-18 04:40:51,576 - root - INFO - validation q500: 0.00021270185243338346 2025-09-18 04:40:51,576 - root - INFO - ACC AUC u10m: 0.49308812618255615 2025-09-18 04:40:51,577 - root - INFO - ACC AUC t2m: 0.4962506890296936 2025-09-18 04:40:51,577 - root - INFO - ACC AUC u500: 0.49442556500434875 2025-09-18 04:40:51,577 - root - INFO - ACC AUC z500: 0.49975329637527466 2025-09-18 04:40:51,577 - root - INFO - ACC AUC q500: 0.48448070883750916 2025-09-18 04:40:51,577 - root - INFO - -------------------------------------------------- 
2025-09-18 05:20:43,968 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 05:20:55,916 - root - INFO - Save checkpoint (legacy): 11.95 sec (3.3527612686157227e-07) GB 2025-09-18 05:20:55,918 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 05:21:06,530 - root - INFO - Save checkpoint (legacy): 10.61 sec (3.3527612686157227e-07) GB 2025-09-18 05:21:08,444 - root - INFO - -------------------------------------------------- 2025-09-18 05:21:08,444 - root - INFO - Epoch 11 summary: 2025-09-18 05:21:08,445 - root - INFO - Performance Parameters: 2025-09-18 05:21:08,445 - root - INFO - training steps: 1687 2025-09-18 05:21:08,445 - root - INFO - validation steps: 91 2025-09-18 05:21:08,445 - root - INFO - memory footprint [GB]: 28.06 2025-09-18 05:21:08,446 - root - INFO - epoch time [s]: 2416.59 2025-09-18 05:21:08,446 - root - INFO - training time [s]: 2341.24 2025-09-18 05:21:08,446 - root - INFO - validation time [s]: 50.65 2025-09-18 05:21:08,446 - root - INFO - visualization time [s]: 0.00 2025-09-18 05:21:08,446 - root - INFO - training step time [ms]: 1387.81 2025-09-18 05:21:08,447 - root - INFO - minimal IO rate [GB/s]: 19.88 2025-09-18 05:21:08,447 - root - INFO - Metrics: 2025-09-18 05:21:08,447 - root - INFO - training loss: 0.09780498898815987 2025-09-18 05:21:08,447 - root - INFO - validation loss: 0.09065824002027512 2025-09-18 05:21:08,447 - root - INFO - validation L1: 0.0658884048461914 2025-09-18 05:21:08,448 - root - INFO - validation u10m: 0.6026096343994141 2025-09-18 05:21:08,448 - root - INFO - validation t2m: 0.592208206653595 2025-09-18 
05:21:08,448 - root - INFO - validation u500: 1.211417555809021 2025-09-18 05:21:08,448 - root - INFO - validation z500: 28.215715408325195 2025-09-18 05:21:08,448 - root - INFO - validation q500: 0.00021235937310848385 2025-09-18 05:21:08,449 - root - INFO - ACC AUC u10m: 0.49312904477119446 2025-09-18 05:21:08,449 - root - INFO - ACC AUC t2m: 0.49627119302749634 2025-09-18 05:21:08,449 - root - INFO - ACC AUC u500: 0.49445101618766785 2025-09-18 05:21:08,449 - root - INFO - ACC AUC z500: 0.4997497797012329 2025-09-18 05:21:08,449 - root - INFO - ACC AUC q500: 0.4845098555088043 2025-09-18 05:21:08,449 - root - INFO - -------------------------------------------------- 2025-09-18 06:00:46,211 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 06:00:57,237 - root - INFO - Save checkpoint (legacy): 11.03 sec (3.3527612686157227e-07) GB 2025-09-18 06:00:57,239 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 06:01:08,951 - root - INFO - Save checkpoint (legacy): 11.71 sec (3.3527612686157227e-07) GB 2025-09-18 06:01:08,953 - root - INFO - -------------------------------------------------- 2025-09-18 06:01:08,953 - root - INFO - Epoch 12 summary: 2025-09-18 06:01:08,953 - root - INFO - Performance Parameters: 2025-09-18 06:01:08,953 - root - INFO - training steps: 1687 2025-09-18 06:01:08,953 - root - INFO - validation steps: 91 2025-09-18 06:01:08,954 - root - INFO - memory footprint [GB]: 28.88 2025-09-18 06:01:08,954 - root - INFO - epoch time [s]: 2400.08 2025-09-18 06:01:08,954 - root - INFO - training time [s]: 2335.78 2025-09-18 06:01:08,954 - root - INFO - 
validation time [s]: 41.36 2025-09-18 06:01:08,954 - root - INFO - visualization time [s]: 0.00 2025-09-18 06:01:08,955 - root - INFO - training step time [ms]: 1384.58 2025-09-18 06:01:08,955 - root - INFO - minimal IO rate [GB/s]: 19.93 2025-09-18 06:01:08,955 - root - INFO - Metrics: 2025-09-18 06:01:08,955 - root - INFO - training loss: 0.10193682922025317 2025-09-18 06:01:08,955 - root - INFO - validation loss: 0.09033828973770142 2025-09-18 06:01:08,956 - root - INFO - validation L1: 0.06579142063856125 2025-09-18 06:01:08,956 - root - INFO - validation u10m: 0.601457417011261 2025-09-18 06:01:08,956 - root - INFO - validation t2m: 0.5925062298774719 2025-09-18 06:01:08,956 - root - INFO - validation u500: 1.208477258682251 2025-09-18 06:01:08,956 - root - INFO - validation z500: 27.78553009033203 2025-09-18 06:01:08,956 - root - INFO - validation q500: 0.0002118794509442523 2025-09-18 06:01:08,957 - root - INFO - ACC AUC u10m: 0.4931487441062927 2025-09-18 06:01:08,957 - root - INFO - ACC AUC t2m: 0.4962646961212158 2025-09-18 06:01:08,957 - root - INFO - ACC AUC u500: 0.4944727420806885 2025-09-18 06:01:08,957 - root - INFO - ACC AUC z500: 0.49975454807281494 2025-09-18 06:01:08,957 - root - INFO - ACC AUC q500: 0.48457983136177063 2025-09-18 06:01:08,958 - root - INFO - -------------------------------------------------- 2025-09-18 06:40:54,688 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 06:41:05,586 - root - INFO - Save checkpoint (legacy): 10.90 sec (3.3527612686157227e-07) GB 2025-09-18 06:41:05,588 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 
2025-09-18 06:41:17,350 - root - INFO - Save checkpoint (legacy): 11.76 sec (3.3527612686157227e-07) GB 2025-09-18 06:41:17,709 - root - INFO - -------------------------------------------------- 2025-09-18 06:41:17,710 - root - INFO - Epoch 13 summary: 2025-09-18 06:41:17,710 - root - INFO - Performance Parameters: 2025-09-18 06:41:17,710 - root - INFO - training steps: 1687 2025-09-18 06:41:17,710 - root - INFO - validation steps: 91 2025-09-18 06:41:17,711 - root - INFO - memory footprint [GB]: 28.29 2025-09-18 06:41:17,711 - root - INFO - epoch time [s]: 2408.43 2025-09-18 06:41:17,711 - root - INFO - training time [s]: 2344.03 2025-09-18 06:41:17,711 - root - INFO - validation time [s]: 41.10 2025-09-18 06:41:17,712 - root - INFO - visualization time [s]: 0.00 2025-09-18 06:41:17,712 - root - INFO - training step time [ms]: 1389.47 2025-09-18 06:41:17,712 - root - INFO - minimal IO rate [GB/s]: 19.86 2025-09-18 06:41:17,712 - root - INFO - Metrics: 2025-09-18 06:41:17,712 - root - INFO - training loss: 0.09963397687581212 2025-09-18 06:41:17,713 - root - INFO - validation loss: 0.09017101675271988 2025-09-18 06:41:17,713 - root - INFO - validation L1: 0.06571623682975769 2025-09-18 06:41:17,713 - root - INFO - validation u10m: 0.6004416942596436 2025-09-18 06:41:17,713 - root - INFO - validation t2m: 0.5918598771095276 2025-09-18 06:41:17,714 - root - INFO - validation u500: 1.2065443992614746 2025-09-18 06:41:17,714 - root - INFO - validation z500: 27.779489517211914 2025-09-18 06:41:17,714 - root - INFO - validation q500: 0.0002118429692927748 2025-09-18 06:41:17,714 - root - INFO - ACC AUC u10m: 0.4931648373603821 2025-09-18 06:41:17,715 - root - INFO - ACC AUC t2m: 0.4962761402130127 2025-09-18 06:41:17,715 - root - INFO - ACC AUC u500: 0.49448859691619873 2025-09-18 06:41:17,715 - root - INFO - ACC AUC z500: 0.49975496530532837 2025-09-18 06:41:17,715 - root - INFO - ACC AUC q500: 0.4845523238182068 2025-09-18 06:41:17,716 - root - INFO - 
-------------------------------------------------- 2025-09-18 07:20:56,655 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 07:21:07,748 - root - INFO - Save checkpoint (legacy): 11.09 sec (3.3527612686157227e-07) GB 2025-09-18 07:21:07,750 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 07:21:18,193 - root - INFO - Save checkpoint (legacy): 10.44 sec (3.3527612686157227e-07) GB 2025-09-18 07:21:18,683 - root - INFO - -------------------------------------------------- 2025-09-18 07:21:18,684 - root - INFO - Epoch 14 summary: 2025-09-18 07:21:18,684 - root - INFO - Performance Parameters: 2025-09-18 07:21:18,684 - root - INFO - training steps: 1687 2025-09-18 07:21:18,685 - root - INFO - validation steps: 91 2025-09-18 07:21:18,685 - root - INFO - memory footprint [GB]: 29.02 2025-09-18 07:21:18,685 - root - INFO - epoch time [s]: 2400.61 2025-09-18 07:21:18,685 - root - INFO - training time [s]: 2336.95 2025-09-18 07:21:18,686 - root - INFO - validation time [s]: 41.46 2025-09-18 07:21:18,686 - root - INFO - visualization time [s]: 0.00 2025-09-18 07:21:18,686 - root - INFO - training step time [ms]: 1385.27 2025-09-18 07:21:18,686 - root - INFO - minimal IO rate [GB/s]: 19.92 2025-09-18 07:21:18,686 - root - INFO - Metrics: 2025-09-18 07:21:18,687 - root - INFO - training loss: 0.09681213091192313 2025-09-18 07:21:18,687 - root - INFO - validation loss: 0.0898866206407547 2025-09-18 07:21:18,687 - root - INFO - validation L1: 0.06563083082437515 2025-09-18 07:21:18,687 - root - INFO - validation u10m: 0.5995796322822571 2025-09-18 07:21:18,687 - root - 
INFO - validation t2m: 0.5913733839988708 2025-09-18 07:21:18,688 - root - INFO - validation u500: 1.2045432329177856 2025-09-18 07:21:18,688 - root - INFO - validation z500: 27.62720489501953 2025-09-18 07:21:18,688 - root - INFO - validation q500: 0.00021140927856322378 2025-09-18 07:21:18,688 - root - INFO - ACC AUC u10m: 0.4931819438934326 2025-09-18 07:21:18,688 - root - INFO - ACC AUC t2m: 0.49627935886383057 2025-09-18 07:21:18,689 - root - INFO - ACC AUC u500: 0.4945030212402344 2025-09-18 07:21:18,689 - root - INFO - ACC AUC z500: 0.49975788593292236 2025-09-18 07:21:18,689 - root - INFO - ACC AUC q500: 0.48462191224098206 2025-09-18 07:21:18,689 - root - INFO - -------------------------------------------------- 2025-09-18 08:00:57,755 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 08:01:08,664 - root - INFO - Save checkpoint (legacy): 10.91 sec (3.3527612686157227e-07) GB 2025-09-18 08:01:08,666 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 08:01:22,638 - root - INFO - Save checkpoint (legacy): 13.97 sec (3.3527612686157227e-07) GB 2025-09-18 08:01:22,639 - root - INFO - -------------------------------------------------- 2025-09-18 08:01:22,639 - root - INFO - Epoch 15 summary: 2025-09-18 08:01:22,639 - root - INFO - Performance Parameters: 2025-09-18 08:01:22,640 - root - INFO - training steps: 1687 2025-09-18 08:01:22,640 - root - INFO - validation steps: 91 2025-09-18 08:01:22,640 - root - INFO - memory footprint [GB]: 28.67 2025-09-18 08:01:22,641 - root - INFO - epoch time [s]: 2403.65 2025-09-18 08:01:22,641 - root - INFO - training time 
[s]: 2337.35 2025-09-18 08:01:22,641 - root - INFO - validation time [s]: 41.11 2025-09-18 08:01:22,641 - root - INFO - visualization time [s]: 0.00 2025-09-18 08:01:22,642 - root - INFO - training step time [ms]: 1385.51 2025-09-18 08:01:22,642 - root - INFO - minimal IO rate [GB/s]: 19.91 2025-09-18 08:01:22,642 - root - INFO - Metrics: 2025-09-18 08:01:22,642 - root - INFO - training loss: 0.0987971585749107 2025-09-18 08:01:22,643 - root - INFO - validation loss: 0.08970776200294495 2025-09-18 08:01:22,643 - root - INFO - validation L1: 0.06555227190256119 2025-09-18 08:01:22,643 - root - INFO - validation u10m: 0.5990988612174988 2025-09-18 08:01:22,643 - root - INFO - validation t2m: 0.5901705026626587 2025-09-18 08:01:22,644 - root - INFO - validation u500: 1.202216386795044 2025-09-18 08:01:22,644 - root - INFO - validation z500: 27.656984329223633 2025-09-18 08:01:22,644 - root - INFO - validation q500: 0.0002111522771883756 2025-09-18 08:01:22,644 - root - INFO - ACC AUC u10m: 0.4931897521018982 2025-09-18 08:01:22,645 - root - INFO - ACC AUC t2m: 0.49630147218704224 2025-09-18 08:01:22,645 - root - INFO - ACC AUC u500: 0.49452435970306396 2025-09-18 08:01:22,645 - root - INFO - ACC AUC z500: 0.4997579753398895 2025-09-18 08:01:22,645 - root - INFO - ACC AUC q500: 0.484661340713501 2025-09-18 08:01:22,646 - root - INFO - -------------------------------------------------- 2025-09-18 08:41:01,531 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 08:41:15,679 - root - INFO - Save checkpoint (legacy): 14.15 sec (3.3527612686157227e-07) GB 2025-09-18 08:41:15,681 - root - INFO - Writing checkpoint to 
/pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 08:41:28,672 - root - INFO - Save checkpoint (legacy): 12.99 sec (3.3527612686157227e-07) GB 2025-09-18 08:41:28,673 - root - INFO - -------------------------------------------------- 2025-09-18 08:41:28,674 - root - INFO - Epoch 16 summary: 2025-09-18 08:41:28,674 - root - INFO - Performance Parameters: 2025-09-18 08:41:28,674 - root - INFO - training steps: 1687 2025-09-18 08:41:28,674 - root - INFO - validation steps: 91 2025-09-18 08:41:28,675 - root - INFO - memory footprint [GB]: 28.06 2025-09-18 08:41:28,675 - root - INFO - epoch time [s]: 2405.70 2025-09-18 08:41:28,675 - root - INFO - training time [s]: 2337.08 2025-09-18 08:41:28,675 - root - INFO - validation time [s]: 41.28 2025-09-18 08:41:28,675 - root - INFO - visualization time [s]: 0.00 2025-09-18 08:41:28,676 - root - INFO - training step time [ms]: 1385.35 2025-09-18 08:41:28,676 - root - INFO - minimal IO rate [GB/s]: 19.92 2025-09-18 08:41:28,676 - root - INFO - Metrics: 2025-09-18 08:41:28,676 - root - INFO - training loss: 0.0985690209645616 2025-09-18 08:41:28,676 - root - INFO - validation loss: 0.0895804762840271 2025-09-18 08:41:28,677 - root - INFO - validation L1: 0.06551310420036316 2025-09-18 08:41:28,677 - root - INFO - validation u10m: 0.5984992384910583 2025-09-18 08:41:28,677 - root - INFO - validation t2m: 0.5905047655105591 2025-09-18 08:41:28,677 - root - INFO - validation u500: 1.2009402513504028 2025-09-18 08:41:28,677 - root - INFO - validation z500: 27.650171279907227 2025-09-18 08:41:28,678 - root - INFO - validation q500: 0.0002109717024723068 2025-09-18 08:41:28,678 - root - INFO - ACC AUC u10m: 0.49320343136787415 2025-09-18 08:41:28,678 - root - INFO - ACC AUC t2m: 0.4962948262691498 2025-09-18 08:41:28,678 - root - INFO - ACC AUC u500: 
0.4945349097251892 2025-09-18 08:41:28,678 - root - INFO - ACC AUC z500: 0.49975818395614624 2025-09-18 08:41:28,679 - root - INFO - ACC AUC q500: 0.48468446731567383 2025-09-18 08:41:28,679 - root - INFO - -------------------------------------------------- 2025-09-18 09:21:08,077 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 09:21:19,821 - root - INFO - Save checkpoint (legacy): 11.74 sec (3.3527612686157227e-07) GB 2025-09-18 09:21:19,823 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format) 2025-09-18 09:21:33,944 - root - INFO - Save checkpoint (legacy): 14.12 sec (3.3527612686157227e-07) GB 2025-09-18 09:21:33,945 - root - INFO - -------------------------------------------------- 2025-09-18 09:21:33,945 - root - INFO - Epoch 17 summary: 2025-09-18 09:21:33,946 - root - INFO - Performance Parameters: 2025-09-18 09:21:33,946 - root - INFO - training steps: 1687 2025-09-18 09:21:33,946 - root - INFO - validation steps: 91 2025-09-18 09:21:33,946 - root - INFO - memory footprint [GB]: 28.67 2025-09-18 09:21:33,947 - root - INFO - epoch time [s]: 2404.94 2025-09-18 09:21:33,947 - root - INFO - training time [s]: 2337.83 2025-09-18 09:21:33,947 - root - INFO - validation time [s]: 41.10 2025-09-18 09:21:33,947 - root - INFO - visualization time [s]: 0.00 2025-09-18 09:21:33,947 - root - INFO - training step time [ms]: 1385.79 2025-09-18 09:21:33,948 - root - INFO - minimal IO rate [GB/s]: 19.91 2025-09-18 09:21:33,948 - root - INFO - Metrics: 2025-09-18 09:21:33,948 - root - INFO - training loss: 0.09588575871899346 2025-09-18 09:21:33,948 - root - INFO - validation loss: 
0.08950256556272507
2025-09-18 09:21:33,948 - root - INFO - validation L1: 0.06547261029481888
2025-09-18 09:21:33,949 - root - INFO - validation u10m: 0.5980568528175354
2025-09-18 09:21:33,949 - root - INFO - validation t2m: 0.5902761816978455
2025-09-18 09:21:33,949 - root - INFO - validation u500: 1.2003997564315796
2025-09-18 09:21:33,949 - root - INFO - validation z500: 27.689577102661133
2025-09-18 09:21:33,949 - root - INFO - validation q500: 0.00021097184799145907
2025-09-18 09:21:33,950 - root - INFO - ACC AUC u10m: 0.4932096600532532
2025-09-18 09:21:33,950 - root - INFO - ACC AUC t2m: 0.49629926681518555
2025-09-18 09:21:33,950 - root - INFO - ACC AUC u500: 0.4945389926433563
2025-09-18 09:21:33,950 - root - INFO - ACC AUC z500: 0.49975645542144775
2025-09-18 09:21:33,950 - root - INFO - ACC AUC q500: 0.48467838764190674
2025-09-18 09:21:33,951 - root - INFO - --------------------------------------------------
2025-09-18 10:01:13,585 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format)
2025-09-18 10:01:25,479 - root - INFO - Save checkpoint (legacy): 11.89 sec (3.3527612686157227e-07) GB
2025-09-18 10:01:25,482 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format)
2025-09-18 10:01:37,107 - root - INFO - Save checkpoint (legacy): 11.63 sec (3.3527612686157227e-07) GB
2025-09-18 10:01:37,108 - root - INFO - --------------------------------------------------
2025-09-18 10:01:37,108 - root - INFO - Epoch 18 summary:
2025-09-18 10:01:37,109 - root - INFO - Performance Parameters:
2025-09-18 10:01:37,109 - root - INFO - training steps: 1687
2025-09-18 10:01:37,109 - root - INFO - validation steps: 91
2025-09-18 10:01:37,110 - root - INFO - memory footprint [GB]: 28.67
2025-09-18 10:01:37,110 - root - INFO - epoch time [s]: 2402.85
2025-09-18 10:01:37,110 - root - INFO - training time [s]: 2336.55
2025-09-18 10:01:37,111 - root - INFO - validation time [s]: 42.09
2025-09-18 10:01:37,111 - root - INFO - visualization time [s]: 0.00
2025-09-18 10:01:37,111 - root - INFO - training step time [ms]: 1385.03
2025-09-18 10:01:37,111 - root - INFO - minimal IO rate [GB/s]: 19.92
2025-09-18 10:01:37,112 - root - INFO - Metrics:
2025-09-18 10:01:37,112 - root - INFO - training loss: 0.09558347896282877
2025-09-18 10:01:37,112 - root - INFO - validation loss: 0.0894259661436081
2025-09-18 10:01:37,112 - root - INFO - validation L1: 0.06544680893421173
2025-09-18 10:01:37,113 - root - INFO - validation u10m: 0.5978723168373108
2025-09-18 10:01:37,113 - root - INFO - validation t2m: 0.590385377407074
2025-09-18 10:01:37,113 - root - INFO - validation u500: 1.1996562480926514
2025-09-18 10:01:37,113 - root - INFO - validation z500: 27.550186157226562
2025-09-18 10:01:37,113 - root - INFO - validation q500: 0.00021085851767566055
2025-09-18 10:01:37,114 - root - INFO - ACC AUC u10m: 0.4932142496109009
2025-09-18 10:01:37,114 - root - INFO - ACC AUC t2m: 0.4962989389896393
2025-09-18 10:01:37,114 - root - INFO - ACC AUC u500: 0.49454644322395325
2025-09-18 10:01:37,114 - root - INFO - ACC AUC z500: 0.4997599124908447
2025-09-18 10:01:37,114 - root - INFO - ACC AUC q500: 0.4846946895122528
2025-09-18 10:01:37,115 - root - INFO - --------------------------------------------------
2025-09-18 10:41:22,543 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format)
2025-09-18 10:41:34,355 - root - INFO - Save checkpoint (legacy): 11.81 sec (3.3527612686157227e-07) GB
2025-09-18 10:41:34,357 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format)
2025-09-18 10:41:46,254 - root - INFO - Save checkpoint (legacy): 11.90 sec (3.3527612686157227e-07) GB
2025-09-18 10:41:46,388 - root - INFO - --------------------------------------------------
2025-09-18 10:41:46,389 - root - INFO - Epoch 19 summary:
2025-09-18 10:41:46,389 - root - INFO - Performance Parameters:
2025-09-18 10:41:46,389 - root - INFO - training steps: 1687
2025-09-18 10:41:46,389 - root - INFO - validation steps: 91
2025-09-18 10:41:46,390 - root - INFO - memory footprint [GB]: 28.67
2025-09-18 10:41:46,390 - root - INFO - epoch time [s]: 2408.96
2025-09-18 10:41:46,390 - root - INFO - training time [s]: 2337.39
2025-09-18 10:41:46,390 - root - INFO - validation time [s]: 47.57
2025-09-18 10:41:46,390 - root - INFO - visualization time [s]: 0.00
2025-09-18 10:41:46,391 - root - INFO - training step time [ms]: 1385.53
2025-09-18 10:41:46,391 - root - INFO - minimal IO rate [GB/s]: 19.91
2025-09-18 10:41:46,391 - root - INFO - Metrics:
2025-09-18 10:41:46,391 - root - INFO - training loss: 0.09774000651391136
2025-09-18 10:41:46,391 - root - INFO - validation loss: 0.0893937200307846
2025-09-18 10:41:46,392 - root - INFO - validation L1: 0.06542228162288666
2025-09-18 10:41:46,392 - root - INFO - validation u10m: 0.5976753830909729
2025-09-18 10:41:46,392 - root - INFO - validation t2m: 0.5899984836578369
2025-09-18 10:41:46,392 - root - INFO - validation u500: 1.1993545293807983
2025-09-18 10:41:46,392 - root - INFO - validation z500: 27.551380157470703
2025-09-18 10:41:46,392 - root - INFO - validation q500: 0.0002108381304424256
2025-09-18 10:41:46,393 - root - INFO - ACC AUC u10m: 0.49321871995925903
2025-09-18 10:41:46,393 - root - INFO - ACC AUC t2m: 0.4963035583496094
2025-09-18 10:41:46,393 - root - INFO - ACC AUC u500: 0.4945492744445801
2025-09-18 10:41:46,393 - root - INFO - ACC AUC z500: 0.49975964426994324
2025-09-18 10:41:46,394 - root - INFO - ACC AUC q500: 0.48469778895378113
2025-09-18 10:41:46,394 - root - INFO - --------------------------------------------------
2025-09-18 11:21:26,765 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/ckpt_mp{mp_rank}.tar (legacy format)
2025-09-18 11:21:38,816 - root - INFO - Save checkpoint (legacy): 12.05 sec (3.3527612686157227e-07) GB
2025-09-18 11:21:38,818 - root - INFO - Writing checkpoint to /pscratch/sd/a/amahesh/fcn_training/modulus-makani_runs-0.1.0gmd-fcndev_stats/multistep_sfno_linear_74chq_sc2_layers8_edim620_wstgl2/v0.1.0-seed77/training_checkpoints/best_ckpt_mp{mp_rank}.tar (legacy format)
2025-09-18 11:21:52,938 - root - INFO - Save checkpoint (legacy): 14.12 sec (3.3527612686157227e-07) GB
2025-09-18 11:21:52,939 - root - INFO - --------------------------------------------------
2025-09-18 11:21:52,939 - root - INFO - Epoch 20 summary:
2025-09-18 11:21:52,939 - root - INFO - Performance Parameters:
2025-09-18 11:21:52,939 - root - INFO - training steps: 1687
2025-09-18 11:21:52,940 - root - INFO - validation steps: 91
2025-09-18 11:21:52,940 - root - INFO - memory footprint [GB]: 29.16
2025-09-18 11:21:52,940 - root - INFO - epoch time [s]: 2406.22
2025-09-18 11:21:52,940 - root - INFO - training time [s]: 2335.98
2025-09-18 11:21:52,941 - root - INFO - validation time [s]: 43.88
2025-09-18 11:21:52,941 - root - INFO - visualization time [s]: 0.00
2025-09-18 11:21:52,941 - root - INFO - training step time [ms]: 1384.69
2025-09-18 11:21:52,941 - root - INFO - minimal IO rate [GB/s]: 19.93
2025-09-18 11:21:52,941 - root - INFO - Metrics:
2025-09-18 11:21:52,942 - root - INFO - training loss: 0.09706210553959851
2025-09-18 11:21:52,942 - root - INFO - validation loss: 0.08937997370958328
2025-09-18 11:21:52,942 - root - INFO - validation L1: 0.06542158871889114
2025-09-18 11:21:52,942 - root - INFO - validation u10m: 0.5976012349128723
2025-09-18 11:21:52,942 - root - INFO - validation t2m: 0.5899701714515686
2025-09-18 11:21:52,943 - root - INFO - validation u500: 1.1992051601409912
2025-09-18 11:21:52,943 - root - INFO - validation z500: 27.532339096069336
2025-09-18 11:21:52,943 - root - INFO - validation q500: 0.00021083370666019619
2025-09-18 11:21:52,943 - root - INFO - ACC AUC u10m: 0.4932200312614441
2025-09-18 11:21:52,943 - root - INFO - ACC AUC t2m: 0.4963042736053467
2025-09-18 11:21:52,943 - root - INFO - ACC AUC u500: 0.494550883769989
2025-09-18 11:21:52,944 - root - INFO - ACC AUC z500: 0.49976015090942383
2025-09-18 11:21:52,944 - root - INFO - ACC AUC q500: 0.48469603061676025
2025-09-18 11:21:52,944 - root - INFO - --------------------------------------------------
2025-09-18 11:21:53,345 - root - INFO - Total training time is 48124.96 sec