I assume you're dealing with audio data. If so, your problem is that phonemes in actual speech are not discrete units, and there is no single right answer to the question of where the boundary between a consonant and a following vowel lies. For example, in the syllable /pa/, the articulators move from the most consonant-like position, where airflow is completely blocked (the lips are closed), to a fully open position at the midpoint of /a/. The articulators need time for that, so there is a range of intermediate points between the two sounds that you might want to consider as potential boundaries.
When dealing with a sequence of an obstruent (a consonant such as /p,t,f,z/) and a vowel, common criteria are:
- onset/offset of stable formant pattern in the vowel
- rapid in-/decrease in intensity
- and sometimes onset/offset of voicing
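As a rough illustration of the intensity criterion, here is a minimal sketch that computes a short-time RMS intensity contour and picks the point of steepest change as a candidate boundary. The function names, the 10 ms frame and the 5 ms hop are my own assumptions, and the signal is synthetic (low-level noise standing in for a closure, followed by a sine wave standing in for a vowel), so treat it as a starting point rather than a finished segmenter.

```python
import numpy as np

def short_time_intensity_db(signal, sample_rate, frame_s=0.010, hop_s=0.005):
    """Return frame centre times (s) and RMS intensity (dB) of a mono signal."""
    frame = int(round(frame_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    starts = np.arange(0, len(signal) - frame + 1, hop)
    rms = np.array([np.sqrt(np.mean(signal[s:s + frame] ** 2)) for s in starts])
    times = (starts + frame / 2) / sample_rate
    # Small constant avoids log(0) in silent frames.
    return times, 20 * np.log10(rms + 1e-10)

def steepest_intensity_change(signal, sample_rate):
    """Candidate boundary: time of the largest frame-to-frame intensity change."""
    times, db = short_time_intensity_db(signal, sample_rate)
    delta = np.diff(db)
    i = int(np.argmax(np.abs(delta)))
    # Place the boundary halfway between the two frames involved.
    return (times[i] + times[i + 1]) / 2

# Synthetic stand-in for an obstruent + vowel sequence (not real speech):
# 100 ms of very quiet noise, then 200 ms of a 120 Hz tone.
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(int(0.2 * sr)) / sr
closure = 0.001 * rng.standard_normal(int(0.1 * sr))
vowel = 0.5 * np.sin(2 * np.pi * 120 * t)
signal = np.concatenate([closure, vowel])

boundary = steepest_intensity_change(signal, sr)
print(f"candidate boundary at {boundary:.3f} s")
```

On real recordings you would probably restrict the search to a window around the expected transition and combine the result with the formant and voicing criteria, since a single intensity jump is rarely decisive on its own.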
When dealing with a sequence of a sonorant (a consonant such as /n,m,w/) and a vowel, these criteria might be less useful or not useful at all. A change in intensity and a change in formant pattern will usually still be observable, though. For example, approximant /r/ (as in standard American and British English) usually has a low third formant, so the boundary could be set at the midpoint of the third formant's trajectory in an /r/ + vowel sequence.
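The F3 midpoint idea can be sketched as follows, assuming you already have an F3 track (frame times and values in Hz) from whatever formant tracker you use. The function name and the example values are made up for illustration, and the sketch assumes F3 rises from the /r/ into the vowel.

```python
import numpy as np

def f3_midpoint_boundary(times, f3):
    """Time at which a rising F3 first crosses the value halfway between
    its minimum (in the /r/) and its following maximum (in the vowel)."""
    times = np.asarray(times, dtype=float)
    f3 = np.asarray(f3, dtype=float)
    i_min = int(np.argmin(f3))
    i_max = i_min + int(np.argmax(f3[i_min:]))
    half = (f3[i_min] + f3[i_max]) / 2
    # First frame at or after the minimum whose F3 reaches the halfway value.
    k = i_min + int(np.argmax(f3[i_min:i_max + 1] >= half))
    if k == i_min:
        # Flat track: there is no trajectory to split.
        return times[i_min]
    # Linear interpolation between the two frames that straddle the crossing.
    frac = (half - f3[k - 1]) / (f3[k] - f3[k - 1])
    return times[k - 1] + frac * (times[k] - times[k - 1])

# Illustrative, invented F3 values (Hz) for an /r/ + vowel sequence,
# sampled every 10 ms.
times = np.linspace(0.0, 0.1, 11)
f3 = [1600, 1600, 1600, 1700, 1900, 2100, 2300, 2500, 2500, 2500, 2500]
boundary = f3_midpoint_boundary(times, f3)
print(f"F3 midpoint boundary at {boundary:.4f} s")
```

Real formant tracks are noisy and can contain tracking errors, so you would likely want to smooth the track and limit it to the region around the transition before applying something like this.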
Here are two references you might find useful:
- Machac, Pavel and Radek Skarnitzl (2009). Principles of Phonetic Segmentation. Prague: Epocha.
- Wiget, Klaus, Laurence White, Barbara Schuppler, Izabelle Grenon, Oleysa Rauch, and Sven L. Mattys (2010). How stable are acoustic metrics of contrastive speech rhythm? Journal of the Acoustical Society of America 127.3:1559-1569.
The latter is not primarily about your topic, but it gives a good description of segmentation criteria and points to further references.
Before you start writing your own program, you might also want to consider using or adapting existing solutions. Some tools perform phonemic forced alignment, for example with HTK: given an acoustic model of the language you are working on and an orthographic transcription of the text, this produces a time-aligned phonemic transcription of an audio recording. I have achieved good results with HTK together with P2FA, which provides a wrapper and an acoustic model of American English and outputs Praat TextGrids, even for other varieties of English. You could also take a look at MAUS, which provides a web interface for a small number of languages and also produces Praat TextGrids.
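If you go the forced-alignment route, you will probably want to read the resulting boundaries back into your own code. Below is a minimal sketch that pulls the intervals out of a TextGrid, assuming the file is in Praat's long (verbose) text format. The function name, the tier names and the sample content are invented for illustration; your aligner's output may differ (short format, other encodings), in which case a dedicated TextGrid library is the safer choice.

```python
import re

# One interval in Praat's long text format: xmin, xmax, then the label.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[-+.\deE]+)\s*'
    r'xmax = (?P<xmax>[-+.\deE]+)\s*'
    r'text = "(?P<text>(?:[^"]|"")*)"'
)
TIER_NAME_RE = re.compile(r'name = "(?P<name>[^"]*)"')

def read_intervals(textgrid_text):
    """Return {tier name: [(start, end, label), ...]} for a long-format TextGrid."""
    tiers = {}
    # Each tier starts with a header such as "item [1]:"; the first chunk
    # is the file header and is skipped.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name_match = TIER_NAME_RE.search(chunk)
        if name_match is None:
            continue
        tiers[name_match['name']] = [
            # Praat escapes a double quote inside a label by doubling it.
            (float(m['xmin']), float(m['xmax']), m['text'].replace('""', '"'))
            for m in INTERVAL_RE.finditer(chunk)
        ]
    return tiers

# Invented sample content, only to show the expected layout.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.35
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0
        xmax = 0.35
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.12
            text = "P"
        intervals [2]:
            xmin = 0.12
            xmax = 0.35
            text = "AA1"
'''

phones = read_intervals(SAMPLE)["phone"]
for start, end, label in phones:
    print(f"{label}: {start:.2f}-{end:.2f} s")
```

The end time of one interval is the boundary with the next, so the consonant-vowel boundaries you are after fall out of this list directly.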