A unit of work is a behavioral pattern in software development. Martin Fowler has defined it as everything one does during a business transaction which can affect the database. When the unit of work is finished, it will provide everything that needs to be done to change the database as a result of the work. A unit of work encapsulates one or more code repositories[de] and a list of actions to be performed which are necessary for the successful implementation of self-contained and consistent data change. A unit of work is also responsible for handling concurrency issues, and can be used for transactions and stability patterns.[de]
Speech segmentation
Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing. In the field of automatic pronunciation assessment, the process of segmenting an utterance against expected word(s) is called forced alignment. Speech segmentation is a subfield of general speech perception and an important subproblem of the technologically focused field of speech recognition, and cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division (statistically based on likelihood) rather than a categorical one. Though it seems that coarticulation—a phenomenon which may happen between adjacent words just as easily as within a single word—presents the main challenge in speech segmentation across languages, some other problems and strategies employed in solving those problems can be seen in the following sections. This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese, compared to writing systems which indicate speech segmentation between words by a word divider, such as the space. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese). == Lexical recognition == In natural languages, the meaning of a complex spoken sentence can be understood by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and combining those meanings according to the grammar rules of the language. Though lexical recognition is not thought to be used by infants in their first year, due to their highly limited vocabularies, it is one of the major processes involved in speech segmentation for adults. Three main models of lexical recognition exist in current research: first, whole-word access, which argues that words have a whole-word representation in the lexicon; second, decomposition, which argues that morphologically complex words are broken down into their morphemes (roots, stems, inflections, etc.) and then interpreted and; third, the view that whole-word and decomposition models are both used, but that the whole-word model provides some computational advantages and is therefore dominant in lexical recognition. To give an example, in a whole-word model, the word "cats" might be stored and searched for by letter, first "c", then "ca", "cat", and finally "cats". The same word, in a decompositional model, would likely be stored under the root word "cat" and could be searched for after removing the "s" suffix. "Falling", similarly, would be stored as "fall" and suffixed with the "ing" inflection. Though proponents of the decompositional model recognize that a morpheme-by-morpheme analysis may require significantly more computation, they argue that the unpacking of morphological information is necessary for other processes (such as syntactic structure) which may occur parallel to lexical searches. As a whole, research into systems of human lexical recognition is limited due to little experimental evidence that fully discriminates between the three main models. In any case, lexical recognition likely contributes significantly to speech segmentation through the contextual clues it provides, given that it is a heavily probabilistic system—based on the statistical likelihood of certain words or constituents occurring together. For example, one can imagine a situation where a person might say "I bought my dog at a ____ shop" and the missing word's vowel is pronounced as in "net", "sweat", or "pet". While the probability of "netshop" is extremely low, since "netshop" isn't currently a compound or phrase in English, and "sweatshop" also seems contextually improbable, "pet shop" is a good fit because it is a common phrase and is also related to the word "dog". Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics which draws on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer. Lexical recognition is of particular value in the field of computer speech recognition, since the ability to build and search a network of semantically connected ideas would greatly increase the effectiveness of speech-recognition software. Statistical models can be used to segment and align recorded speech to words or phones. Applications include automatic lip-synch timing for cartoon animation, follow-the-bouncing-ball video sub-titling, and linguistic research. Automatic segmentation and alignment software is commercially available. == Phonotactic cues == For most spoken languages, the boundaries between lexical units are difficult to identify; phonotactics are one answer to this issue. One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word. The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, may be a relic of alphabetic heritage for some language communities. In fact, the way vowels are produced depends on the surrounding consonants just as consonants are affected by surrounding vowels; this is called coarticulation. For example, in the word "kit", the [k] is farther forward than when we say 'caught'. But also, the vowel in "kick" is phonetically different from the vowel in "kit", though we normally do not hear this. In addition, there are language-specific changes which occur in casual speech which makes it quite different from spelling. For example, in English, the phrase "hit you" could often be more appropriately spelled "hitcha". From a decompositional perspective, in many cases, phonotactics play a part in letting speakers know where to draw word boundaries. In English, the word "strawberry" is perceived by speakers as consisting (phonetically) of two parts: "straw" and "berry". Other interpretations such as "stra" and "wberry" are inhibited by English phonotactics, which does not allow the cluster "wb" word-initially. Other such examples are "day/dream" and "mile/stone" which are unlikely to be interpreted as "da/ydream" or "mil/estone" due to the phonotactic probability or improbability of certain clusters. The sentence "Five women left", which could be phonetically transcribed as [faɪvwɪmɘnlɛft], is marked since neither /vw/ in /faɪvwɪmɘn/ nor /nl/ in /wɪmɘnlɛft/ are allowed as syllable onsets or codas in English phonotactics. These phonotactic cues often allow speakers to easily distinguish the boundaries in words. Vowel harmony in languages like Finnish can also serve to provide phonotactic cues. While the system does not allow front vowels and back vowels to exist together within one morpheme, compounds allow two morphemes to maintain their own vowel harmony while coexisting in a word. Therefore, in compounds such as "selkä/ongelma" ('back problem') where vowel harmony is distinct between two constituents in a compound, the boundary will be wherever the switch in harmony takes place—between the "ä" and the "ö" in this case. Still, there are instances where phonotactics may not aid in segmentation. Words with unclear clusters or uncontrasted vowel harmony as in "opinto/uudistus" ('student reform') do not offer phonotactic clues as to how they are segmented. From the perspective of the whole-word model, however, these words are thought be stored as full words, so the constituent parts would not necessarily be relevant to lexical recognition. == In infants and non-natives == Infants are one major focus of research in speech segmentation. Since infants have not yet acquired a lexicon capable of providing extensive contextual clues or probability-based word searches within their first year, as mentioned above, they must often rely primarily upon phonotactic and rhythmic cues (with prosody being the dominant cue), all
Structured kNN
Structured k-nearest neighbours (SkNN) is a machine learning algorithm that generalizes k-nearest neighbors (k-NN). k-NN supports binary classification, multiclass classification, and regression, whereas SkNN allows training of a classifier for general structured output. For instance, a data sample might be a natural language sentence, and the output could be an annotated parse tree. Training a classifier consists of showing many instances of ground truth sample-output pairs. After training, the SkNN model is able to predict the corresponding output for new, unseen sample instances; that is, given a natural language sentence, the classifier can produce the most likely parse tree. == Training == As a training set, SkNN accepts sequences of elements with class labels. The type of element does not matter; the only requirement is a defined metric function that gives a distance between each pair of elements of a set. SkNN is based on idea of creating a graph, with each node representing a class label. There is an edge between a pair of nodes if there is a sequence of two elements in the training set with corresponding classes. The first step of SkNN training is the construction of such a graph from training sequences. There are two special nodes in the graph corresponding to sentence beginnings and ends: if a sequence starts with class C, the edge between node START and node C should be created. Like regular k-NN, the second part of SkNN training consists of storing the elements of a training sequence in a certain way. Each element of the training sequences is stored in the node related to the class of the previous element in the sequence. Every first element is stored in the START node. == Inference == Labelling input sequences by SkNN consists of finding the sequence of transitions in the graph, starting from node START. Each transition corresponds to a single element of the input sequence. As a result, the label of each element is determined as the target node label of the transition. The cost of the path is defined as the sum of all transitions, with the cost of transition from node A to node B being the distance from the current input sequence element to the nearest element of class B, stored in node A. Determining an optimal path may be performed using a modified Viterbi algorithm (where the sum of the distances is minimized, unlike the original algorithm which maximizes the product of probabilities).
Radial basis function kernel
In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification. The RBF kernel on two samples x , x ′ ∈ R k {\displaystyle \mathbf {x} ,\mathbf {x'} \in \mathbb {R} ^{k}} , represented as feature vectors in some input space, is defined as K ( x , x ′ ) = exp ( − ‖ x − x ′ ‖ 2 2 σ 2 ) {\displaystyle K(\mathbf {x} ,\mathbf {x'} )=\exp \left(-{\frac {\|\mathbf {x} -\mathbf {x'} \|^{2}}{2\sigma ^{2}}}\right)} ‖ x − x ′ ‖ 2 {\displaystyle \textstyle \|\mathbf {x} -\mathbf {x'} \|^{2}} may be recognized as the squared Euclidean distance between the two feature vectors. σ {\displaystyle \sigma } is a free parameter. An equivalent definition involves a parameter γ = 1 2 σ 2 {\displaystyle \textstyle \gamma ={\tfrac {1}{2\sigma ^{2}}}} : K ( x , x ′ ) = exp ( − γ ‖ x − x ′ ‖ 2 ) {\displaystyle K(\mathbf {x} ,\mathbf {x'} )=\exp(-\gamma \|\mathbf {x} -\mathbf {x'} \|^{2})} Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when x = x'), it has a ready interpretation as a similarity measure. The feature space of the kernel has an infinite number of dimensions; for σ = 1 {\displaystyle \sigma =1} , its expansion using the multinomial theorem is: exp ( − 1 2 ‖ x − x ′ ‖ 2 ) = exp ( 2 2 x ⊤ x ′ − 1 2 ‖ x ‖ 2 − 1 2 ‖ x ′ ‖ 2 ) = exp ( x ⊤ x ′ ) exp ( − 1 2 ‖ x ‖ 2 ) exp ( − 1 2 ‖ x ′ ‖ 2 ) = ∑ j = 0 ∞ ( x ⊤ x ′ ) j j ! exp ( − 1 2 ‖ x ‖ 2 ) exp ( − 1 2 ‖ x ′ ‖ 2 ) = ∑ j = 0 ∞ ∑ n 1 + n 2 + ⋯ + n k = j exp ( − 1 2 ‖ x ‖ 2 ) x 1 n 1 ⋯ x k n k n 1 ! ⋯ n k ! exp ( − 1 2 ‖ x ′ ‖ 2 ) x ′ 1 n 1 ⋯ x ′ k n k n 1 ! ⋯ n k ! = ⟨ φ ( x ) , φ ( x ′ ) ⟩ {\displaystyle {\begin{alignedat}{2}\exp \left(-{\frac {1}{2}}\|\mathbf {x} -\mathbf {x'} \|^{2}\right)&=\exp \left({\frac {2}{2}}\mathbf {x} ^{\top }\mathbf {x'} -{\frac {1}{2}}\|\mathbf {x} \|^{2}-{\frac {1}{2}}\|\mathbf {x'} \|^{2}\right)\\[5pt]&=\exp \left(\mathbf {x} ^{\top }\mathbf {x'} \right)\exp \left(-{\frac {1}{2}}\|\mathbf {x} \|^{2}\right)\exp \left(-{\frac {1}{2}}\|\mathbf {x'} \|^{2}\right)\\[5pt]&=\sum _{j=0}^{\infty }{\frac {(\mathbf {x} ^{\top }\mathbf {x'} )^{j}}{j!}}\exp \left(-{\frac {1}{2}}\|\mathbf {x} \|^{2}\right)\exp \left(-{\frac {1}{2}}\|\mathbf {x'} \|^{2}\right)\\[5pt]&=\sum _{j=0}^{\infty }\quad \sum _{n_{1}+n_{2}+\dots +n_{k}=j}\exp \left(-{\frac {1}{2}}\|\mathbf {x} \|^{2}\right){\frac {x_{1}^{n_{1}}\cdots x_{k}^{n_{k}}}{\sqrt {n_{1}!\cdots n_{k}!}}}\exp \left(-{\frac {1}{2}}\|\mathbf {x'} \|^{2}\right){\frac {{x'}_{1}^{n_{1}}\cdots {x'}_{k}^{n_{k}}}{\sqrt {n_{1}!\cdots n_{k}!}}}\\[5pt]&=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {x'} )\rangle \end{alignedat}}} φ ( x ) = exp ( − 1 2 ‖ x ‖ 2 ) ( a ℓ 0 ( 0 ) , a 1 ( 1 ) , … , a ℓ 1 ( 1 ) , … , a 1 ( j ) , … , a ℓ j ( j ) , … ) {\displaystyle \varphi (\mathbf {x} )=\exp \left(-{\frac {1}{2}}\|\mathbf {x} \|^{2}\right)\left(a_{\ell _{0}}^{(0)},a_{1}^{(1)},\dots ,a_{\ell _{1}}^{(1)},\dots ,a_{1}^{(j)},\dots ,a_{\ell _{j}}^{(j)},\dots \right)} where ℓ j = ( k + j − 1 j ) {\displaystyle \ell _{j}={\tbinom {k+j-1}{j}}} , a ℓ ( j ) = x 1 n 1 ⋯ x k n k n 1 ! ⋯ n k ! | n 1 + n 2 + ⋯ + n k = j ∧ 1 ≤ ℓ ≤ ℓ j {\displaystyle a_{\ell }^{(j)}={\frac {x_{1}^{n_{1}}\cdots x_{k}^{n_{k}}}{\sqrt {n_{1}!\cdots n_{k}!}}}\quad |\quad n_{1}+n_{2}+\dots +n_{k}=j\wedge 1\leq \ell \leq \ell _{j}} == Approximations == Because support vector machines and other models employing the kernel trick do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced. Typically, these take the form of a function z that maps a single vector to a vector of higher dimensionality, approximating the kernel: ⟨ z ( x ) , z ( x ′ ) ⟩ ≈ ⟨ φ ( x ) , φ ( x ′ ) ⟩ = K ( x , x ′ ) {\displaystyle \langle z(\mathbf {x} ),z(\mathbf {x'} )\rangle \approx \langle \varphi (\mathbf {x} ),\varphi (\mathbf {x'} )\rangle =K(\mathbf {x} ,\mathbf {x'} )} where φ {\displaystyle \textstyle \varphi } is the implicit mapping embedded in the RBF kernel. === Fourier random features === One way to construct such a z is to randomly sample from the Fourier transformation of the kernel φ ( x ) = 1 D [ cos ⟨ w 1 , x ⟩ , sin ⟨ w 1 , x ⟩ , … , cos ⟨ w D , x ⟩ , sin ⟨ w D , x ⟩ ] T {\displaystyle \varphi (x)={\frac {1}{\sqrt {D}}}[\cos \langle w_{1},x\rangle ,\sin \langle w_{1},x\rangle ,\ldots ,\cos \langle w_{D},x\rangle ,\sin \langle w_{D},x\rangle ]^{T}} where w 1 , . . . , w D {\displaystyle w_{1},...,w_{D}} are independent samples from the normal distribution N ( 0 , σ − 2 I ) {\displaystyle N(0,\sigma ^{-2}I)} . Theorem: E [ ⟨ φ ( x ) , φ ( y ) ⟩ ] = e ‖ x − y ‖ 2 / ( 2 σ 2 ) . {\displaystyle \operatorname {E} [\langle \varphi (x),\varphi (y)\rangle ]=e^{\|x-y\|^{2}/(2\sigma ^{2})}.} Proof: It suffices to prove the case of D = 1 {\displaystyle D=1} . Use the trigonometric identity cos ( a − b ) = cos ( a ) cos ( b ) + sin ( a ) sin ( b ) {\displaystyle \cos(a-b)=\cos(a)\cos(b)+\sin(a)\sin(b)} , the spherical symmetry of Gaussian distribution, then evaluate the integral ∫ − ∞ ∞ cos ( k x ) e − x 2 / 2 2 π d x = e − k 2 / 2 . {\displaystyle \int _{-\infty }^{\infty }{\frac {\cos(kx)e^{-x^{2}/2}}{\sqrt {2\pi }}}dx=e^{-k^{2}/2}.} Theorem: Var [ ⟨ φ ( x ) , φ ( y ) ⟩ ] = O ( D − 1 ) {\displaystyle \operatorname {Var} [\langle \varphi (x),\varphi (y)\rangle ]=O(D^{-1})} . (Appendix A.2). === Nyström method === Another approach uses the Nyström method to approximate the eigendecomposition of the Gram matrix K, using only a random sample of the training set.
VIGRA
VIGRA is the abbreviation for "Vision with Generic Algorithms". It is a free open-source computer vision library which focuses on customizable algorithms and data structures. VIGRA component can be easily adapted to specific needs of target application without compromising execution speed, by using template techniques similar to those in the C++ Standard Template Library. == Features == VIGRA is cross-platform, with working builds on Microsoft Windows, Mac OS X, Linux, and OpenBSD. Since version 1.7.1, VIGRA provides Python bindings based on numpy framework. == History == VIGRA was originally designed and implemented by scientists at University of Hamburg faculty of computer science; its core maintainers are now working at Heidelberg Collaboratory for Image Processing (HCI) University of Heidelberg. In the meantime, many developers have contributed to the project. == Application == CellCognition and ilastik uses VIGRA computer vision library. OpenOffice.org uses VIGRA as part of its headless software rendering backend; LibreOffice does so until version 5.2.
Scan line
A scan line (also scanline) is one line, or row, in a raster scanning pattern, such as a line of video on a cathode-ray tube (CRT) display of a television set or computer monitor. On CRT screens the horizontal scan lines are visually discernible, even when viewed from a distance, as alternating colored lines and black lines, especially when a progressive scan signal with below maximum vertical resolution is displayed. This is sometimes used today as a visual effect in computer graphics. The term is used, by analogy, for a single row of pixels in a raster graphics image. Scan lines are important in representations of image data, because many image file formats have special rules for data at the end of a scan line. For example, there may be a rule that each scan line starts on a particular boundary (such as a byte or word; see for example BMP file format). This means that even otherwise compatible raster data may need to be analyzed at the level of scan lines in order to convert between formats.
Persian Speech Corpus
The Persian Speech Corpus is a Modern Persian speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of about 2.5 hours of Persian speech aligned with recorded speech on the phoneme level, including annotations of word boundaries. Previous spoken corpora of Persian include FARSDAT, which consists of read aloud speech from newspaper texts from 100 Persian speakers and the Telephone FARsi Spoken language DATabase (TFARSDAT) which comprises seven hours of read and spontaneous speech produced by 60 native speakers of Persian from ten regions of Iran. The Persian Speech Corpus was built using the same methodologies laid out in the doctoral project on Modern Standard Arabic of Nawar Halabi at the University of Southampton. The work was funded by MicroLinkPC, who own an exclusive license to commercialise the corpus, though the corpus is available for non-commercial use through the corpus' website. It is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The corpus was built for speech synthesis purposes, but has been used for building HMM based voices in Persian. It can also be used to automatically align other speech corpora with their phonetic transcript and could be used as part of a larger corpus for training speech recognition systems. == Contents == The corpus is downloadable from its website, and contains the following: 396 .wav files containing spoken utterances 396 .lab files containing text utterances 396 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files. phonetic-transcript.txt which has the form "[wav_filename]" "[Phoneme Sequence]" in every line orthographic-transcript.txt which has the form "[wav_filename]" "[Orthographic Transcript]" in every line