2022-02-13

Snakefile: how to mix wildcards and variables

I want to write a rule that converts files from one directory and format to another directory and format, one file per job, with the jobs running in parallel across a given number of threads. Some elements of the path come from variables and others are wildcards: phase, sample and ext should be wildcards, while stage, challenge and language should be taken from the Python variables defined in the Snakefile. The copy operation should work file to file; it should not receive the entire list of files as input. That is why I'm not using expand here: with expand, snakemake would pass the entire list of inputs as {input} and the entire list of outputs as {output} to the function. Here is the Snakefile:

from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

stage = "/media/catskills/interspeech22"
challenge = "openasr21"
language = "farsi"
sample_rate = 16000

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule sph_to_wav:
     input:
         "{stage}/{challenge}_{language}/{phase}/audio/{sample}.{ext}"
     output:
         "{stage}/{challenge}_{language}/{phase}/wav_{sample_rate}/{sample}.wav"
     run:
         # run: is plain Python, so use input[0]/output[0] rather than {input}/{output}
         copy_sph_to_wav(input[0], output[0], sample_rate)

When I run it I get this error:

$ snakemake -c16 
Building DAG of jobs...
WildcardError in line 11 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'stage'

Is there a way to do this in snakemake?

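UPDATE: I found a partial solution: use f-strings for the patterns and double the curly braces around the wildcards, so that Python fills in the variables and leaves the wildcards as literal {phase}, {sample} and {ext} for snakemake.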
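
As a quick check of what the doubled braces do, here is a minimal sketch in plain Python (nothing Snakemake-specific; the values just mirror the constants in my Snakefile). Single braces are substituted immediately, while doubled braces collapse to literal single braces:

```python
# Values mirroring the constants used in the Snakefile.
STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"

# {STAGE}, {CHALLENGE} and {LANGUAGE} are filled in by Python right away.
# {{phase}}, {{sample}} and {{ext}} become literal {phase}, {sample}, {ext},
# which is the form Snakemake would later treat as wildcards.
pattern = f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/audio/{{sample}}.{{ext}}"

print(pattern)
# /media/catskills/interspeech22/openasr21_farsi/{phase}/audio/{sample}.{ext}
```

So the string that reaches the rule contains only the three intended wildcards.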

from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule sph_to_wav:
     input:
         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/audio/{{sample}}.{{ext}}"
     output:
         f"{STAGE}/{CHALLENGE}_{LANGUAGE}/{{phase}}/wav_{{SAMPLE_RATE}}/{{sample}}.wav"
     run:
         # run: is plain Python, so use input[0]/output[0] rather than {input}/{output}
         copy_sph_to_wav(input[0], output[0], SAMPLE_RATE)

However, the phase wildcard is still not matching the subdirectory name. I still get an error, though a slightly different one:

$ snakemake -c16 
Building DAG of jobs...
WildcardError in line 11 of /home/catskills/is22/interspeech22/scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'phase'

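That led me to this version, which builds the fixed part of the path with pathlib and appends the wildcard part as a plain string: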

from pathlib import Path
from glob import glob
from copy_sph_to_wav import copy_sph_to_wav

STAGE = "/media/catskills/interspeech22"
CHALLENGE = "openasr21"
LANGUAGE = "farsi"
SAMPLE_RATE = 16000

# Resample WAV and SPH files to 16000 Hz (16 kHz) WAV

rule sph_to_wav:
     input:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/audio/{sample}.{ext}" )
     output:
         str( Path(f"{STAGE}/{CHALLENGE}_{LANGUAGE}") / "{phase}/wav_{SAMPLE_RATE}/{sample}.wav" )
     run:
         # run: is plain Python, so use input[0]/output[0] rather than {input}/{output}
         copy_sph_to_wav(input[0], output[0], SAMPLE_RATE)

However, I'm still not done:

$ snakemake -c16 
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
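
Going by that message, the workflow seems to need a wildcard-free first rule that lists concrete output files. A minimal sketch of how that list might be built in plain Python follows. It assumes the directory layout <root>/<phase>/audio/<sample>.<ext> from the rule above, and the helper name wav_targets is hypothetical:

```python
from pathlib import Path


def wav_targets(root, sample_rate):
    """Return one target .wav path per audio file found under root.

    Assumes the layout <root>/<phase>/audio/<sample>.<ext>;
    the function name is made up for this sketch.
    """
    targets = []
    for src in sorted(Path(root).glob("*/audio/*.*")):
        if not src.is_file():
            continue  # skip directories that happen to contain a dot
        phase_dir = src.parent.parent  # <root>/<phase>
        targets.append(str(phase_dir / f"wav_{sample_rate}" / f"{src.stem}.wav"))
    return targets


# In the Snakefile, the list could then feed a rule without wildcards
# placed at the top (untested assumption about how I would wire it up):
#
# rule all:
#     input:
#         wav_targets(f"{STAGE}/{CHALLENGE}_{LANGUAGE}", SAMPLE_RATE)
```

This only addresses the missing concrete targets. The {ext} wildcard still appears in the input pattern but not in the output pattern, so the rule itself may need a further change, for example an input function that looks up the source file for each sample.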


from Recent Questions - Stack Overflow https://ift.tt/E6PMpS8
https://ift.tt/7vxO08M
