DuPont Enterprise Financial Analysis

In today's fast-paced economy, competition between companies is becoming increasingly fierce. To keep pace with financial developments, enterprises need rational analysis to understand their financial position and operational efficiency. On the basis of such analyses, entrepreneurs can judge their enterprises' competitive position in the industry and their capacity for sustainable development. DuPont analysis and factor analysis have been widely applied in enterprise financial analysis. These methods can quantify the direction and extent of each factor's influence on financial indicators, help enterprises plan in advance, support in-process control and after-the-fact supervision, promote management by objectives and improve the overall level of enterprise management (Casella & Berger, 2002). Which analysis method is more informative for analysing corporate financial information? Admittedly, DuPont analysis plays a necessary role in financial analysis, but some experts argue that factor analysis has a wider range of applications. This essay explores the application of DuPont analysis in corporate financial management and asks whether it is more feasible than factor analysis from the perspective of enterprise development.

To assess its feasibility, DuPont analysis is introduced first. Taking the inner links between financial indicators into account, DuPont analysis uses the relationships between several major financial ratios to analyse the financial position of an enterprise as a whole. It is a classical method for evaluating a company's profitability and the return on shareholders' equity, and for assessing enterprise performance from a financial perspective (Angelico & Nikbakht, 2000). The basic idea of DuPont analysis is to decompose return on equity into the product of several financial ratios, which supports an in-depth analysis of business performance. The most significant feature of the DuPont model is that it connects the ratios used to evaluate corporate efficiency and financial condition according to their inner links, forms a complete index system, and finally summarises the enterprise comprehensively through return on equity (Angelico & Nikbakht, 2000). This makes financial ratio analysis clearer, better organised and more prominent, and presents the operating and profitability picture of the enterprise to financial statement analysts. DuPont analysis arranges the related values in a DuPont chart according to their inner links, with return on equity as the core value. Three points should be noted when applying DuPont analysis (Bartholomew, Steele, et al., 2008): first, the net profit margin reflects the relationship between net profit and sales income, and depends on sales revenue and total cost. Second, total assets are an important factor influencing the asset turnover ratio and the return on equity. Third, the equity multiplier is driven by the asset-liability ratio. In sum, the DuPont analysis system can explain the causes and trends of changes in these factors.
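To make the decomposition concrete, the following minimal sketch computes return on equity as the product of net profit margin, total asset turnover and the equity multiplier; the figures are hypothetical and are not taken from any cited source.

```python
# Minimal sketch of the DuPont decomposition with hypothetical figures.
# ROE = net profit margin x total asset turnover x equity multiplier.

net_profit = 120.0           # hypothetical net profit
sales_revenue = 1500.0       # hypothetical sales revenue
total_assets = 2000.0        # hypothetical average total assets
shareholders_equity = 800.0  # hypothetical shareholders' equity

net_profit_margin = net_profit / sales_revenue          # profitability of sales
asset_turnover = sales_revenue / total_assets           # efficiency of asset use
equity_multiplier = total_assets / shareholders_equity  # financial leverage

roe = net_profit_margin * asset_turnover * equity_multiplier

# The product equals net_profit / shareholders_equity, the direct ROE figure.
assert abs(roe - net_profit / shareholders_equity) < 1e-12
print(f"ROE = {roe:.2%} (margin {net_profit_margin:.2%}, "
      f"turnover {asset_turnover:.2f}, leverage {equity_multiplier:.2f})")
```

Reading the three ratios side by side shows whether a change in return on equity comes from profitability, asset efficiency or leverage, which is the analytical benefit described above.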

Although DuPont analysis has many advantages and is widely applied, it also has limitations. From the perspective of performance evaluation, DuPont analysis reflects only financial information and cannot capture the broader strength of an enterprise (Harman, 1976). Primarily, it focuses on short-term financial results and ignores long-term value creation. Moreover, financial indicators describe past operating performance, which suited the requirements of the industrial era; in the current information age, however, customers, suppliers, employees and technology innovators have a growing influence on operating performance, and DuPont analysis is powerless in these respects. In addition, DuPont analysis cannot address the valuation of intangible assets, which is very important for enhancing the competitiveness of an enterprise in the long term.

Despite these drawbacks, DuPont analysis remains one of the most prevalent approaches in enterprises around the world. The main reason is that enterprises now combine the classical DuPont framework with modern financial management goals, designing new DuPont analysis methods around the combined goals of maximising enterprise value and maximising stakeholders' interests. Under this view, stakeholders include not only the shareholders of an enterprise but also creditors, business operators, customers, suppliers, employees and government. All of these parties matter for corporate financial management: damage to the interests of any stakeholder group is conducive neither to the sustainable development of the company nor to the maximisation of enterprise value. In other words, the ultimate aim of the new DuPont analysis is, within the framework of law and morality and under the premise of harmonious development, to balance the interests of corporate stakeholders effectively and realise the maximisation of enterprise value. On top of that, new DuPont

However, factor analysis is feasible in areas where DuPont analysis is not. Factor analysis is mainly used to determine the direction and degree of influence of each factor on the total change in an economic phenomenon that is affected by many factors (Bartholomew, Steele, et al., 2008). It is an application and development of the index method principle. When analysing a change driven by many factors, the effect of one factor is observed by holding the other factors fixed and then substituting and analysing the factors item by item, which is why the method is also known as the sequential (chain) substitution method (Harman, 1976). Built on comparative analysis, factor analysis is frequently used to find differences in the comparison process and then to explore their causes further (Larsen & Warne, 2010). The first step is to study how the object of analysis is formed and to identify its constituent factors; the factors are then compared with the corresponding criteria item by item to determine how much each difference contributes, which helps to locate the main contradiction and indicates the direction for solving the problem in the next step. For instance, the relationship between a financial value and its related factors can be represented as: actual value P1 = A1 x B1 x C1; standard value P2 = A2 x B2 x C2. The overall variance between the actual and standard values is P1 - P2, and it is affected by three factors, namely A, B and C. The influence of each factor can be calculated as: influence of factor A: (A1 - A2) x B2 x C2; influence of factor B: A1 x (B1 - B2) x C2; influence of factor C: A1 x B1 x (C1 - C2). Adding these influence values together gives the overall variance P1 - P2. From this it can be seen that factor analysis can quantify the degree of influence in detail and can therefore better guide decision makers in finding financial issues and proposing solutions.
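A short numerical sketch of the chain (sequential) substitution described above, using hypothetical actual and standard values for factors A, B and C, shows that the three influences sum exactly to the overall variance.

```python
# Sequential (chain) substitution with hypothetical actual and standard values.
A1, B1, C1 = 1.10, 0.95, 2.00   # actual values of factors A, B, C
A2, B2, C2 = 1.00, 1.00, 2.00   # standard (baseline) values

P1 = A1 * B1 * C1   # actual value
P2 = A2 * B2 * C2   # standard value

influence_A = (A1 - A2) * B2 * C2   # substitute A first
influence_B = A1 * (B1 - B2) * C2   # then B, with A already at its actual value
influence_C = A1 * B1 * (C1 - C2)   # finally C

total = influence_A + influence_B + influence_C
# The three influences sum (up to rounding) to the overall variance P1 - P2.
assert abs(total - (P1 - P2)) < 1e-12
print(f"Overall variance {P1 - P2:+.4f} = "
      f"A {influence_A:+.4f} + B {influence_B:+.4f} + C {influence_C:+.4f}")
```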

In conclusion, DuPont analysis and factor analysis each have their own range of application. Although the DuPont system is better at explaining the causes and trends of changes in financial indicators, factor analysis is more useful for an enterprise's financial analysis: it can quantify the degree of influence of each factor in more detail and therefore gives decision makers better guidance in identifying financial issues and proposing fundamental solutions. In sum, the factor analysis method has the wider scope of application.

Design Thinking and Decision Analysis

Topic:

How can decision analysis support the decision making process in design thinking in selecting the most promising properties during the transition from divergent to convergent thinking phases?

Executive Summary

Table of Contents

Executive Summary

List of Figures

List of Tables

Index of Abbreviations

1. Introduction

2. Overview of Design Thinking and Decision Analysis

2.1. A New Approach to Problem Solving

2.1.1. What is a Design Thinker?

2.1.2. The Iterative Steps of Design Thinking

2.2. Decision Analysis

2.2.1. Decision Analysis Process

2.2.2. Multi Attribute Decision Making

3. Application Based on a Case Study

3.1. The Design Challenge

3.2. The Static Model

3.2.1. The Alternatives

3.2.2. Objectives and Measures of Effectiveness

3.2.3. Multi Attribute Decision Making

3.2.4. Sensitivity Analysis of the SAW Method

3.3. The Case Study's Solution

4. Conclusion

List of Literature

Statutory Declaration

Appendix

List of Figures

Figure 1: The IDEO process – a five step method (Kelley and Littman, 2004: 6-7)

Figure 2: The HPI process – a six step model (Plattner et al., 2009)

Figure 3: Fundamentals of Decision Analysis (Ralph L. Keeney, 1982)

Figure 4: Schematic form of a decision tree (Keeney and Raiffa, 1993)

Figure 5: A choice problem between two lotteries (Keeney and Raiffa, 1993)

Figure 6: MADM Decision Matrix

Figure 7: The three main idea clusters

Figure 8: Decision Making Matrix

Figure 9: Decision Maker Matrix for the Design Challenge

List of Tables

Table 1: Different ways of describing design thinking (Lucy Kimbell, 2011)

Table 2: Realization of attributes in alternatives scale

Index of Abbreviations

DA Decision Analysis

DC Design Challenge

DM Decision Maker

DT Design Thinking

HPI Hasso-Plattner-Institute

HMW How Might We

IWITMI I Wonder If That Means

MADM Multi Attribute Decision Making

MCDM Multi Criteria Decision Making

SAW Simple Additive Weighting

1. Introduction

Everyone needs to make decisions every day. Those decisions may be shaped by an outstanding problem that simply needs to be solved, or they may just be the question of whether or not to buy a new pair of shoes. Moreover, the problem may be solvable by a simple equation, or it may first be necessary to formulate the problem at all because the difficulty is too diffuse to be grasped. Given the huge variety of problems our society faces every day, each with different needs for a solution process, there is a constant need to draft and identify methods that support people in making decisions. Undoubtedly, there are many methodologies and approaches that support the decision making process, from small daily decisions to life-changing ones. Decision analysis (DA), one of the formal methods, and design thinking (DT), one of the more innovative methodologies, are two instances of such problem solving methods.

Both methods have been applied in similar fields, such as business, technology, and personal life, but with divergent intentions. On the one hand, there is DT, one of the more recent methodologies, which helps to get from a problem to a solution through a finite number of iterative steps that the design thinker follows. Brown, the CEO of IDEO, describes DT as a method so powerful and pervasive that it can be used by teams across the globe to create impactful, innovative ideas that can be realized by society or companies (Brown, 2010: 3). On the other hand, there is DA, an approach that comprises a variety of procedures that help to find a formal solution to an identified problem and create a more structured solution procedure. Howard shaped the term DA in 1964 and has been indispensable to its development (Roebuck, 2011: 99).

This paper combines DA and DT to investigate whether DA can leverage the DT process in order to find the most viable solution to a problem. Moreover, this paper examines whether or not those two approaches can profit from each other. Selected procedures of DA will be integrated into the DT process by reference to a case study. In addition, the solution generated by the DA technique will be compared with the alternative chosen in the case study, which followed the regular DT process. By comparing those two outcomes, this paper will work out whether or not DA can support the DT process.

The second chapter describes the fundamentals of DA and DT. After the outline of the foundations, the third chapter applies chosen DA procedures to the DT process on the basis of a case study. Moreover, the alternative chosen by the design thinking team in the case study will be analysed. In the final chapter, the major findings will be summarized and evaluated.

2. Overview of Design Thinking and Decision Analysis
2.1. A New Approach to Problem Solving

Design Thinking is an iterative and innovative approach to solving problems of all kinds that society is facing. Moreover, it is a human-centred and at the same time investigatory process that puts its emphasis on collaboration, prototyping, and field research (Lockwood, 2010: xi). It is a set of fundamentals that can be applied by different people and to a huge range of problems (Brown, 2010: 7). DT is not a linear but an iterative process in which the designers constantly learn from mistakes and improve their ideas. Designers hope to find a linear model that will help them understand the logic behind a design process; hence the constant search for decision making protocols that would support designers' processes (Buchanan, 1992). In sum, DT is a user-centred approach to solving a variety of problems, with the aim of integrating people from various fields, ranging from consumers and business people to designers.

There are a variety of ways to describe DT, as illustrated in Table 1. According to Brown, DT is an organisational resource with the goal of creating innovation. Cross describes the method as a cognitive style with the purpose of problem solving. Another well-known definition concludes that "Design Thinking means thinking like a designer would" (Roger, 2009). However, the purpose and aim of DT are at their core identical, whether one applies the process modified by Cross or by Brown (Plattner et al., 2009: 113).

Design thinking as a cognitive style
Key texts: Cross 1982; Schon 1986; Rowe [1987] 1998; Lawson 1997; Cross 2006; Dorst 2016
Focus: Individual designers, especially experts
Design's purposes: Problem solving
Key concepts: Design ability as a form of intelligence; reflection-in-action, abductive thinking
Nature of design problems: Design problems are ill-structured, problem and solution co-evolve
Sites of design expertise and activity: Traditional design disciplines

Design Thinking as a general theory of design
Key texts: Buchanan 1992
Focus: Design as a field or discipline
Design's purposes: Taming wicked problems
Key concepts: Design has no special subject matter of its own
Nature of design problems: Design problems are wicked problems
Sites of design expertise and activity: Four orders of design

Design Thinking as an organizational resource
Key texts: Dunne and Martin 2006; Bauer and Eagan 2008; Brown 2009; Martin 2009
Focus: Businesses and other organizations in need of innovation
Design's purposes: Innovations
Key concepts: Visualization, prototyping, empathy, integrative thinking, abductive thinking
Nature of design problems: Organizational problems are design problems
Sites of design expertise and activity: Any context from healthcare to access to clean water (Brown and Wyatt 2010)

Table 1: Different ways of describing design thinking (Lucy Kimbell, 2011)

Over the last five years, the term DT has become very present in our society. DT is a comparatively new term in design and management circles, which shows the demand for creative and innovative methods across various sectors (Kimbell, 2011). Nevertheless, the method is still underdeveloped when it comes to applying design methods at the management level (Dunne and Roger, 2006). But why is interest in design growing, and why has the term become ubiquitous? Society is facing many challenges, from educational problems to global warming and economic crises. Brown sees DT as a powerful approach that can be applied to a huge variety of problems and consequently creates impactful solutions to these challenges. On top of that, he argues that design has become nothing short of a strategy for viability (Brown, 2010: 3). The method is not limited to the creation and design of a physical product; it can also result in the conception of a process, communication tools, or a service (Brown, 2010: 7). It is therefore a method that helps to learn from mistakes and to find impactful and sustainable solutions.

2.1.1. What is a Design Thinker?

Many individuals have their own personal picture of what a designer is and mostly would not associate themselves with such a term. Nevertheless, the expression designer is not limited to creative graphic designers working in agencies. Many professionals fall under the term designer, from people working in corporations who are trying to implement a new, innovative way of thinking to people who are creating a new customer experience (Porcini, 2009). Mauro Porcini puts a lot of emphasis on the fact that describing design is a huge challenge, since design can be anything from recognizing impactful solutions to shaping the personal experience that those solutions give rise to (Porcini, 2009).

According to Brown design thinkers have four characteristics in common (Brown, 2008):

Empathy

Design thinkers have the ability to walk in the shoes of someone else; they view situations from the perspective of other people. This talent allows them to see many things that others are not able to observe, which leads to solutions that are tailor-made for the users.

Integrative thinking

Integrative thinking allows the design thinker to go beyond simple solutions by seeing and assembling all the noticeable connections into a solution. The ability not to rely on processes characterized by an A-or-B choice allows them to incorporate even antithetical solutions.

Optimism and Experimentalism

Design thinkers are individuals who are confident that for each existing solution there is another one that is more impactful and feasible for the corresponding stakeholders. By experimenting with new information and the existing circumstances, and moreover by asking the most powerful questions, design thinkers are able to arrive at long-lasting innovations.

Collaboration

Another key aspect of the design thinking process is the ability to collaborate with experts from a variety of fields. This talent makes it possible to integrate not only the designers and producers but also the end users. Moreover, a design thinker himself or herself has experience in many different fields and is not only an expert in DT.

2.1.2. The Iterative Steps of Design Thinking

As already mentioned above, there are many ways to describe DT. On top of that, the process is sometimes described in three, five or six steps in the literature. For example, at IDEO, which is one of the leading design consultancies in the world, the designers work with a five step model (Kelley and Littman, 2004: 6-7).

Figure 1: The IDEO process – a five step method (Kelley and Littman, 2004: 6-7)

However, at the Hasso-Plattner-Institute in Potsdam, the process consists of six steps. The two processes, despite their different number of steps, differ only in their emphasis within the overall process and in their descriptions, not in their principles (Plattner et al., 2009: 113). In order to describe the process that will later be applied to a case study, this thesis focuses on the six-step process described by Plattner et al. (Plattner et al., 2009: 114).

Figure 2: The HPI process – a six step model (Plattner et al., 2009)

Understand

The iterative DT process starts with a phase called understanding, which includes defining the problem and explaining the scope. Defining the so called Design Challenge (DC) is crucial for the success of the method since the whole team working on the challenge needs to have the same understanding of the problem to be solved. Moreover, the target group needs to be identified by the team in order to be able to move to the next phase. In the first phase, the emphasis is put on obtaining the knowledge that is required to solve the formulated DC.

Observe

The aim of the second phase is to become an expert. The DT team observes all the existing solutions to the identified problem and challenges them; more specifically, the team tries to improve its understanding of why there has not been an adequate solution up to that point. The team tries to get a 360-degree view of the problem, integrating all participants and people affected. One of the main activities in this phase is direct contact with the future users or clients of the product or service for the intended solution. It is crucial to involve the future users, since these people form the target group and know what their wishes, requirements, ways of behaviour and needs are. In addition, the team needs to examine the processes and ways of behaviour carefully; in order to do so, the team needs to walk in the shoes of the end users. In sum, the second phase emphasises the need to reproduce the end users' ways of behaviour while being able to fully understand the end users' perspective.

Point of View

The third phase, called point of view, is the stage where all the findings from the previous phases are interpreted and evaluated. Since in most cases the team has branched out in the second phase, this phase brings everyone together in order to exchange findings. The team separates the relevant facts from the dispensable information. This separation helps to define the point of view more precisely, which will ease the fourth phase for the whole team. A method often used at this stage is the creation of a persona, a fictive, ideal-typical end user of the product or service. During this exercise the whole team draws on its findings from the second stage, the observing phase, with the aim of finding the right viewing angle on the DC. For the purpose of finding the right perspective, it is important to question and realign the problem from a wide variety of viewing angles. In summary, during the third phase the team assembles the key aspects from the end users in order to be able to start finding ideas in the next phase.

Ideate

The ideation phase is characterized by the reorientation of the team’s thinking process from divergent to convergent thinking. In the beginning of the phase, the team is still in a divergent thought process – the group of people is generating as many ideas for a solution as possible. All these concepts should contain a potential solution to the DC and should not be debated by the team in the beginning. It is a phase during which the team experiments with a variety of ideas and invests in the creative thinking process by leaving as much room as possible for everyone to generate constructive ideas.

In contrast to the first half of the ideation phase, the second half is shaped by the convergent method. During the convergent thought process, the team's goal is to identify the one solution, or the best solutions, to the DC. This process consists of logical steps towards identifying the solution or solutions. There are some creative techniques for narrowing down the ideas in the ideation phase, for example (Center for Care Innovation, 2013-2014):

Sticky note voting: Every team member gets three stickers and places those next to the ideas that are most viable and feasible to him/her. The ideas with the most stickers will be prototyped in the next phase.
Idea morphing: Each idea will be presented in front of the whole team. After each presentation the team is looking out for synergies to merge some ideas or mixing some elements.

In sum, during this phase the team generates ideas for the exploration of solutions with the help of the information gathered during the last three phases.

Prototype

This phase appears to many people to be really different from what they have been used to in solution-oriented processes. The aim of this phase is to visualize the ideas for the users; thereby, the users are able to give feedback more easily and may also be able to test the idea. The prototype does not have to be the perfect visible idea, but the preproduction model should be able to convey the message, show the strengths and weaknesses of the idea and, moreover, help the team to improve the idea even further. It is a visualization of the idea with the use of, for example, modelling clay, paper, Lego figures, or any material that might be within reach. If the solution is a service, the prototype might be a theatrical performance. Moreover, some teams create a virtual prototype if the idea cannot be visualized in a physical model. All in all, the intention of the phase is to make an idea come alive and visible to the users.

Test

During the testing phase the idea will be tried out with the user. The most important part of this step is that the idea is tested with the end users and not only within the DT team itself. The testing phase is about identifying the idea's strengths and weaknesses together with the end user. It is about identifying mistakes, because only from these misconceptions can the team learn and further improve the idea, since it is all about the user who will be making use of it. Therefore, the team has to put a lot of emphasis on learning from that experience.

2.2. Decision Analysis

Every human being constantly makes decisions throughout the day. On the one hand, there are many minor decisions, from the daily choice of food and the question of whether to stay in bed or not to the colour of clothes someone wants to wear. On the other hand, people face situations where they have to decide whether to take a job or which car they would like to purchase. Some decisions have a larger and more significant impact than others; therefore, it is important to understand the consequences of the decisions that are being made (Gregory, 1988: 2).

Decision Analysis is designed to help with difficult decisions by offering more structure and guidance (Clemen, 1996: 4). DA supports the decision making process: it helps to better and more fully understand the obstacles connected with having to make a decision and, on top of that, helps to make better choices (Clemen, 1996: 3). Moreover, DA permits the decision maker (DM) to make decisions in a more effective and consistent way (Clemen, 1996: 4). In consequence, DA is a framework as well as a tool kit for approaching various decisions. Nevertheless, judgement differs from person to person. One DM may have preferences that manifest themselves in the chosen attributes and alternatives, another may not, and on top of that judgement skills vary from DM to DM as well (Hwang and Yoon, 1981: 8).

According to Keeney, the DA approach concentrates on five fundamental issues that are elementary for all decisions (Keeney, 1982):

Figure 3: Fundamentals of Decision Analysis (Ralph L. Keeney, 1982)

In order to be able to address multidisciplinary problems, the decision problem is divided into several parts which are analysed and integrated during the DA process (Keeney, 1982). Over the last years, various approaches have been developed, such as the DA process shaped by Keeney or the Multi Attribute Decision Making (MADM) method. The latter supports decision making when a finite number of alternatives with various, mostly conflicting, attributes has been identified.
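To make the MADM idea more tangible, the following minimal sketch applies Simple Additive Weighting (SAW), the method abbreviated in the index and revisited in the case study sections; the alternatives, attribute ratings and weights below are hypothetical and are not taken from the case study.

```python
# Minimal Simple Additive Weighting (SAW) sketch with hypothetical data.
# Rows are alternatives, columns are benefit attributes rated on a 1-10 scale.
alternatives = ["Idea A", "Idea B", "Idea C"]
attributes = ["feasibility", "user value", "cost efficiency"]
ratings = [
    [7, 9, 4],   # Idea A
    [8, 6, 7],   # Idea B
    [5, 8, 9],   # Idea C
]
weights = [0.5, 0.3, 0.2]   # hypothetical DM weights, summing to 1

def saw_scores(matrix, weights):
    # Normalise each benefit attribute by its column maximum, then take
    # the weighted sum per alternative.
    col_max = [max(col) for col in zip(*matrix)]
    scores = []
    for row in matrix:
        normalised = [value / best for value, best in zip(row, col_max)]
        scores.append(sum(w * v for w, v in zip(weights, normalised)))
    return scores

ranking = sorted(zip(alternatives, saw_scores(ratings, weights)),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The alternative with the highest weighted score is recommended; changing the weights is exactly the lever examined later in the sensitivity analysis of the SAW method.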

2.2.1. Decision Analysis Process

Over the last decades, many analysts have been working on modifying and improving the DA steps included in the process; therefore, there are many procedures out there with a common purpose: Choosing the best alternative. Keeney describes the DA process in five major steps (Keeney and Raiffa, 1993: 5-6):

Preanalysis

During the first phase the focus is on gathering the alternatives and clarifying the objectives. The DM faces a situation in which it is not yet clear which steps are relevant to solving the problem; at this stage the problem itself is already at hand.

Structural analysis

At this stage the DM is confronted with structuring the problem. There are several questions that the DM will need to answer; for example: which decisions can be made now? Which decisions can be delayed? Is there specific information that supports the choices that could be made? Figure 4 shows a decision tree in which the abovementioned questions are systematically put into place. The decision nodes, displayed as 1 and 3 (squares), are the nodes controlled by the DM, and the chance nodes, shown as 2 and 4 (circles), are the nodes that are not fully controlled by the DM.

Figure 4: Schematic form of a decision tree (Keeney and Raiffa, 1993)

Uncertainty Analysis

The third phase, called the Uncertainty Analysis, starts with assigning probabilities to each path that branches off from the chance nodes (in Figure 4, these are the paths left and right of points 2 and 4). The assignment of probabilities to the branches of the decision tree is a subjective procedure (Keeney and Raiffa, 1993: 6; Gregory, 1988: 172). Nevertheless, the DM makes the assignments using a variety of techniques based on experimental data, and these assignments are then checked for consistency.
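As an illustration of what such probability assignments feed into, the brief sketch below rolls back two hypothetical chance nodes to an expected value per alternative; the probabilities and consequence values are invented for the example and are not taken from Keeney and Raiffa.

```python
# Hypothetical rollback of the chance nodes in a simple decision tree:
# each decision alternative leads to a chance node with subjectively
# assigned branch probabilities and consequence values.
decision_tree = {
    "alternative 1": [(0.6, 100.0), (0.4, -20.0)],   # (probability, consequence)
    "alternative 2": [(0.9, 40.0), (0.1, 10.0)],
}

def expected_value(branches):
    # Probabilities on the branches of one chance node must sum to 1.
    assert abs(sum(p for p, _ in branches) - 1.0) < 1e-9
    return sum(p * value for p, value in branches)

best = max(decision_tree, key=lambda alt: expected_value(decision_tree[alt]))
for alt, branches in decision_tree.items():
    print(f"{alt}: expected value {expected_value(branches):.1f}")
print(f"Choose: {best}")
```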

Utility or Value Analysis

The objective of the fourth step is the assignment of so called utility values to each path of the decision tree, whereas these represent the consequences connected to that path. The decision path that is shown in Figure 4 represents only one plausible path. In a real problem, many factors will be associated with the path; such as economical costs, psychological costs as well as benefits that the DM considers r

Construction of a Research Questionnaire

Construction of appropriate questionnaire items

Section 2, Question 3

Describe what is involved in testing and validating a research questionnaire. (The answer to question 3 should be no fewer than 6 pages, including references)

The following criteria will be used in assessing question 3:

Construction of appropriate questionnaire items
Sophistication of understanding of crucial design issues
Plan for use of appropriate sampling method and sample
Plan to address validity and reliability in a manner appropriate to methodology

In order to construct an appropriate research questionnaire, it is imperative to first have a clear understanding of the scope of the research project. It would be most beneficial to solidify these research goals in written form, and then focus the direction of the study to address the research questions. After developing the research questions, the researcher would further read the related literature regarding the research topic, specifically searching for ideas and theories based on the analysis of the construct(s) to be measured. Constructs are essentially “mathematical descriptions or theories of how our test behavior is either likely to change following or during certain situations” (Kubiszyn & Borich, 2007, p. 311). It is important to know what the literature says about these construct(s) and the most accurate, concise ways to measure them. Constructs are psychological in nature and are not tangible, concrete variables because they cannot be observed directly (Gay & Airasian, 2003). Hopkins (1998) explains that “psychological constructs are unobservable, postulated variables that have evolved either informally or from psychological theory” (p. 99). Hopkins also maintains that when developing the items to measure the construct(s), it is imperative to ask multiple items per construct to ensure they are being adequately measured. Another important aspect in developing items for a questionnaire is to find an appropriate scale for all the items to be measured (Gay & Airasian, 2003). Again, this requires researching survey instruments similar to the one being developed for the current study and also determining what the literature says about how to best measure these constructs.

The next step in designing the research questionnaire is to validate it, that is, to ensure it is measuring what it is intended to measure. In this case, the researcher would first establish construct validity evidence, which means ensuring that the research questionnaire is measuring the ideas and theories related to the research project. An instrument has construct validity evidence if "its relationship to other information corresponds well with some theory" (Kubiszyn & Borich, 2007, p. 309). Another reason to go through the validation process is to minimize factors that can weaken the validity of a research instrument, including unclear test directions, confusing and/or ambiguous test items, and vocabulary and sentence structures too difficult for test takers (Gay & Airasian, 2003).

After developing a rough draft of the questionnaire, including the items that measure the construct(s) for this study, the researcher should then gather a small focus group that is representative of the population to be studied (Johnson, 2007). The purpose of this focus group is to discuss the research topic, to gain additional perspectives about the study, and to consider new ideas about how to improve the research questionnaire so it is measuring the constructs accurately. This focus group provides the researcher with insight on what questions to revise and what questions should be added or deleted, if any. The focus group can also provide important information as to what type of language and vocabulary is appropriate for the group to be studied and how to best approach them (Krueger & Casey, 2009). All of this group’s feedback would be recorded and used to make changes, edits, and revisions to the research questionnaire.

Another step in the validation process is to let a panel of experts (fellow researchers, professors, those who have expertise in the field of study) read and review the survey instrument, checking it for grammatical errors, wording issues, unclear items (loaded questions, biased questions), and offer their feedback. Also, their input regarding the validity of the items is vital. As with the other focus group, any feedback should be recorded and used to make changes, edits, and revisions to the research questionnaire (Johnson, 2007).

The next step entails referring to the feedback received from the focus group and panel of experts. Any issues detected by the groups must be addressed so the research questionnaire can serve its purpose (Johnson, 2007). Next, the researcher should revise the questions and research questionnaire, considering all the input obtained and make any other changes that would improve the instrument. Any feedback obtained regarding the wording of items must be carefully considered, because the participants in the study must understand exactly what the questions are asking so they can respond accurately and honestly. It is also imperative to consider the feedback regarding the directions and wording of the research questionnaire. The directions of the questionnaire should be clear and concise, leaving nothing to personal interpretation (Suskie, 1996). The goal is that all participants should be able to read the directions and know precisely how to respond and complete the questionnaire. To better ensure honesty of responses, it is imperative to state in the directions that answers are anonymous (if applicable), and if they mistakenly write any identifying marks on the questionnaire, those marks will be immediately erased. If that type of scenario is not possible in the design of the study, the researcher should still communicate the confidentiality of the information obtained in this study and how their personal answers and other information will not be shared with anyone. Whatever the case or research design, the idea is to have participants answer the questions honestly so the most accurate results are obtained. Assuring anonymity and/or confidentiality to participants is another way to help ensure that valid data are collected.

The next phase entails pilot-testing the research questionnaire on a sample of people similar to the population on which the survey will ultimately be administered. This group should be comprised of approximately 20 people (Johnson, 2007), and the instrument should be administered under similar conditions as it will be during the actual study. The purpose of this pilot-test is two-fold; the first reason is to once again check the validity of the instrument by obtaining feedback from this group, and the second reason is to do a reliability analysis. Reliability is basically “the degree to which a test consistently measure whatever it is measuring” (Gay & Airasian, 2003, p. 141). A reliability analysis is essential when developing a research questionnaire because a research instrument lacking reliability cannot measure any variable better than chance alone (Hopkins, 1998). Hopkins goes on to say that reliability is an essential prerequisite to validity because a research instrument must consistently yield reliable scores to have any confidence in validity. After administering the research questionnaire to this small group, a reliability analysis of the results must be done. The reliability analysis to be used is Cronbach’s alpha (Hopkins, 1998), which allows an overall reliability coefficient to be calculated, as well as coefficients for each of the sub-constructs (if any). The overall instrument, as well as the sub-constructs, should yield alpha statistics greater than .70 (Johnson, 2007). This analysis would decide if the researcher needs to revise the items or proceed with administering the instrument to the target population. The researcher should also use the feedback obtained from this group to ensure that the questions are clear and present no ambiguity. Any other feedback obtained should be used to address any problems with the research questionnaire. Should there be any problems with particular items, then necessary changes would be made to ensure the item is measuring what it is supposed to be measuring. However, should there be issues with an entire construct(s) that is yielding reliability and/or validity problems, then the instrument would have to be revised, reviewed again by the panel of experts, and retested on another small group. After the instrument goes through this process and has been corrected and refined with acceptable validity and reliability, it is time to begin planning to administer it to the target population.
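To make the reliability step concrete, the short sketch below computes Cronbach's alpha from first principles on a small set of hypothetical pilot responses; in practice this would usually be done in a statistics package, and the data shown are purely illustrative.

```python
# Cronbach's alpha for a set of questionnaire items, computed from scratch.
# Rows are respondents, columns are items measuring the same construct
# (hypothetical 5-point Likert responses).
responses = [
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 5],
]

def variance(values):
    # Sample variance (n - 1 denominator), as used in the alpha formula.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(data):
    k = len(data[0])                           # number of items
    item_variances = [variance(col) for col in zip(*data)]
    total_scores = [sum(row) for row in data]  # each respondent's total score
    return (k / (k - 1)) * (1 - sum(item_variances) / variance(total_scores))

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.3f}")  # values above .70 are usually acceptable
```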

After the research questionnaire has established validity and reliability, the next step is to begin planning how to administer it to the participants of the study. To begin this process, it is imperative to define who the target population of the study is. Unfortunately, it is often impossible to gather data from everyone in a population due to feasibility and costs. Therefore, sampling must be used to collect data. According to Gay and Airasian (2003), “Sampling is the process of selecting a number of participants for a study in such a way that they represent the larger group from which they were selected” (p. 101). This larger group that the authors refer to is the population, and the population is the group to which the results will ideally generalize. However, out of any population, the researcher will have to determine those who are accessible or available. In most studies, the chosen population for study is usually a realistic choice and not always the target one (Gay & Airasian, 2003). After choosing the population to be studied, it is important to define that population so the reader will know how to apply the findings to that population.

The next step in the research study is to select a sample, and the quality of this sample will ultimately determine the integrity and generalizability of the results. The researcher should aim for a sample that is representative of the defined population to be studied. Ideally, the researcher wants to minimize sampling error by using random sampling techniques. Random sampling techniques include simple random sampling, stratified sampling, cluster sampling, and systematic sampling (Gay & Airasian, 2003). According to the authors, these sampling techniques operate just as they are named: simple random sampling uses some means to randomly select an adequate sample of participants from a population; stratified random sampling allows a researcher to sample subgroups so that they are proportional to the way they exist in the population; and cluster sampling randomly selects groups from a larger population (Gay & Airasian, 2003). Systematic sampling is a form of simple random sampling where the researcher simply selects, for example, every tenth person. These four random sampling techniques, or variations thereof, are the most widely used random sampling procedures. While random sampling gives the best chance of obtaining unbiased samples, it is not always possible. In such cases the researcher resorts to nonrandom sampling techniques, which include convenience sampling, purposive sampling, and quota sampling (Gay & Airasian, 2003). Convenience sampling is simply sampling whoever happens to be available, while purposive sampling is where the researcher selects a sample based on knowledge of the group to be sampled (Gay & Airasian, 2003). Lastly, quota sampling is a technique used in large-scale surveys when a population of interest is too large to define. With quota sampling, the researcher usually targets a specific number of participants with specific demographics (Gay & Airasian, 2003).
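As a rough illustration of the difference between several of these techniques (using an invented population frame and group labels, not data from any study), the following sketch draws a simple random, a systematic and a proportionally stratified sample:

```python
import random

# Hypothetical sampling frame: 1,000 people, each labelled with a subgroup.
random.seed(42)
population = [{"id": i, "group": random.choice(["A", "B", "B", "C"])}
              for i in range(1000)]

# Simple random sampling: every member has an equal chance of selection.
simple_sample = random.sample(population, 100)

# Systematic sampling: take every k-th person after a random start.
k = len(population) // 100
start = random.randrange(k)
systematic_sample = population[start::k][:100]

# Stratified sampling: sample each subgroup proportionally to its size.
stratified_sample = []
for group in {"A", "B", "C"}:
    members = [p for p in population if p["group"] == group]
    share = round(100 * len(members) / len(population))
    stratified_sample.extend(random.sample(members, share))

print(len(simple_sample), len(systematic_sample), len(stratified_sample))
```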

The sampling method ultimately chosen will depend upon the population determined to be studied. In an ideal scenario, random sampling would be employed, which improves the strength and generalizability of the results. However, should random sampling not be possible, the researcher would most likely resort to convenience sampling. Although not as powerful as random sampling, convenience sampling is used quite a bit and can be useful in educational research (Johnson, 2007). Of course, whatever sampling means is employed, it is imperative to have an adequate sample size. As a general rule, the larger the population size, the smaller the percentage of the population required to get a representative sample (Gay & Airasian, 2003). The researcher would determine the size of the population being studied (if possible) and then determine an adequate sample size (Krejcie & Morgan, 1970, p. 608). Ultimately, it is desirable to obtain as many participants as possible and not merely to achieve a minimum (Gay & Airasian, 2003). After an adequate sample size for the study has been determined, the researcher should proceed with the administration of the research questionnaire until the desired sample size is obtained. The research questionnaire should be administered under similar conditions, and potential participants should know and understand that they are not obligated in any way to participate and that they will not be penalized for not participating (Suskie, 1996). Also, participants should know how to contact the researcher should they have questions about the research project, including the ultimate dissemination of the data and the results of the study. The researcher should exhaust all efforts to ensure participants understand what is being asked so they can make a clear judgment regarding their consent to participate in the study. Should any of the potential participants be under the age of 18, the researcher would need to obtain parental permission in order for them to participate. Lastly, it is imperative that the researcher obtain approval from the Institutional Review Board (IRB) before the instrument is field-tested and administered to the participants. People who participate in the study should understand that the research project has been approved through the university's IRB process.
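Returning to the sample-size determination mentioned above, one way to make it concrete is the Krejcie and Morgan (1970) formula cited in this section; the sketch below assumes the conventional values behind their table (a chi-square of 3.841 for one degree of freedom at the .05 level, P = .50 and a 5% margin of error).

```python
# Krejcie & Morgan (1970) sample-size formula with the conventional settings:
# chi-square for 1 degree of freedom at the .05 level, P = .50, d = .05.
def required_sample_size(population_size, chi_sq=3.841, p=0.5, d=0.05):
    numerator = chi_sq * population_size * p * (1 - p)
    denominator = d ** 2 * (population_size - 1) + chi_sq * p * (1 - p)
    return round(numerator / denominator)

for n in (100, 500, 1000, 10000):
    print(f"Population {n:>6} -> sample of about {required_sample_size(n)}")
```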

References

Gay, L. R., & Airasian, P. (2003). Educational research: Competencies for analysis and applications (7th ed.). Upper Saddle River, NJ: Pearson Education, Inc.

Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston: Allyn & Bacon.

Johnson, J. T. (2007). Instrument development and validation [Class handout]. Department of Educational Leadership & Research, The University of Southern Mississippi.

Krejcie, R. V., & Morgan, D. W. (1970). Determining sample size for research activities. Educational and Psychological Measurement, 30, 607-610.

Krueger, R. A., & Casey, M. A. (2009). Focus groups: A practical guide for applied research (4th ed.). Thousand Oaks, CA: Sage Publications, Inc.

Kubiszyn, T., & Borich, G. (2007). Educational testing and measurement: Classroom application and practice (8th ed.). Hoboken, NJ: John Wiley & Sons.

Suskie, L. A. (1996). Questionnaire survey research: What works (2nd ed.). Tallahassee, FL: Association for Institutional Research.

What is churn? An overview

Churn is the phenomenon where a customer switches from one service to a competitor's service (Tsai & Chen, 2009:2). There are two main types of churn, namely voluntary churn and involuntary churn. Voluntary churn occurs when the customer initiates the termination of the service. Involuntary churn means the company suspends the customer's service, usually because of non-payment or service abuse.

Companies in various industries have recently started to realise that their client base is their most valuable asset and that retaining existing clients is the best marketing strategy. Numerous studies have confirmed this by showing that it is more profitable to keep existing clients satisfied than to constantly attract new ones (Van Den Poel & Lariviere, 2004:197; Coussement & Van Den Poel, 2008:313).

According to Van Den Poel and Lariviere (2004:197) successful customer retention has more than just financial benefits:

Successful customer retention programs free the organisation to focus on existing customers’ needs and the building of relationships.
It lowers the need to find new customers with uncertain levels of risk.
Long term customers tend to buy more and provide positive advertising through word-of-mouth.
The company has better knowledge of long term customers and they are less expensive with lower uncertainty and risk.
Customers with longer tenures are less likely to be influenced by competitive marketing strategies.
Sales may decrease if customers churn, due to lost opportunities. These customers also need to be replaced, which can cost five to six times more than simply retaining the customer.
1.1. Growth in Fixed-line Markets

According to Agrawal (2009) the high growth phase in the telecommunications market is over. In the future, wealth in the industry will be split between the companies. Revenues (of telecommunication companies) are declining around the world. Figure 2 shows Telkom’s fixed-line customer base and customer growth rate for the previous seven years. The number of lines is used as an estimate for the number of fixed-line customers.

Figure 2-Telkom’s fixed-line annual customer base (Idea adopted from Ahn, Han & Lee (2006:554))

With the lower customer growth worldwide, it is becoming vital to prevent customers from churning.

1.2. Preventing Customer Churn

The two basic approaches to churn management are divided into untargeted and targeted approaches. Untargeted approaches rely on superior products and mass advertising to decrease churn (Neslin, Gupta, Kamakura, Lu & Mason, 2004:3).

Targeted approaches rely on identifying customers who are likely to churn and then customising a service plan or incentive to prevent it from happening. Targeted approaches can be further divided into proactive and reactive approaches.

With a proactive approach the company identifies customers who are likely to churn at a future date. These customers are then targeted with incentives or special programs to attempt to retain them.

In a reactive targeted approach the company waits until the customer cancels the account and then offers the customer an incentive (Neslin et al., 2004:4).

A proactive targeted approach has the advantage of lower incentive costs (because the customer is not “bribed” at the last minute to stay with the company). It also prevents a culture where customers threaten to churn in order to negotiate a better deal with the company (Neslin et al., 2004:4).

The proactive, targeted approach is dependent on a predictive statistical technique to predict churners with a high accuracy. Otherwise the company’s funds may be wasted on unnecessary programs that incorrectly identified customers.

1.3. Main Churn Predictors

According to Chu, Tsai and Ho (2007:704) the main contributors to churn in the telecommunications industry are price, coverage, quality and customer service. Their contributions to churn can be seen in Figure 3.

Figure 3 indicates that the primary reason for churn is price related (47% of the sample). The customer churns because a cheaper service or product is available, through no fault of the company. This means that a perfect retention strategy, based on customer satisfaction, can only prevent 53% of the churners (Chu et al., 2007:704).

1.4. Churn Management Framework

Datta, Masand, Mani and Li (2001:486) proposed a five stage framework for customer churn management (Figure 4).

The first stage is to identify suitable data for the modelling process. The quality of this data is extremely important. Poor data quality can cause large losses in money, time and opportunities (Olson, 2003:1). It is also important to determine if all the available historical data, or only the most recent data, is going to be used.

The second stage addresses the data semantics problem and has a direct link with the first stage: in order to complete the first stage successfully, a complete understanding of the data and of the information carried by the variables is required. Data quality issues are linked to data semantics because they often influence data interpretation directly and frequently lead to data misinterpretation (Dasu & Johnson, 2003:100).

Stage three handles feature selection. Cios, Pedrycz, Swiniarski and Kurgan (2007:207) define feature selection as “a process of finding a subset of features, from the original set of features forming patterns in a given data set…”. It is important to select a sufficient number of diverse features for the modelling phase. Section 5.5.3 discusses some of the most important features found in the literature.

Stage four is the predictive model development stage. There are many alternative methods available. Figure 5 shows the number of times a statistical technique was mentioned in the papers the author read. These methods are discussed in detail in Section 6.

The final stage is the model validation process. The goal of this stage is to ensure that the model delivers accurate predictions.

5.5.1 Stage one – Identify data

Usually a churn indicator flag must be derived in order to define churners. Currently, there is no standard, accepted definition of churn (Attaa, 2009). One popular definition states that a customer is considered churned if the customer has had no active products for three consecutive months (Attaa, 2009; Virgin Media, 2009; Orascom Telecom, 2008). Once a target variable is derived, the set of best features (variables) can be determined.
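A minimal sketch of how such a flag might be derived, assuming a hypothetical table of active-product counts per customer per month and the three-consecutive-months definition quoted above:

```python
# Derive a churn flag: a customer is flagged as churned if there are
# three consecutive months with no active products (hypothetical data).
monthly_active_products = {
    "cust_001": [2, 2, 1, 0, 0, 0],   # inactive for the last three months -> churner
    "cust_002": [1, 0, 0, 1, 1, 1],   # only two consecutive inactive months
    "cust_003": [3, 3, 3, 3, 3, 3],
}

def is_churner(active_counts, window=3):
    consecutive_inactive = 0
    for count in active_counts:
        consecutive_inactive = consecutive_inactive + 1 if count == 0 else 0
        if consecutive_inactive >= window:
            return True
    return False

churn_flags = {cust: int(is_churner(counts))
               for cust, counts in monthly_active_products.items()}
print(churn_flags)   # {'cust_001': 1, 'cust_002': 0, 'cust_003': 0}
```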

5.5.2 Stage two – Data semantics

Data semantics is the process of understanding the context of the data. Certain variables are difficult to interpret and must be studied carefully. It is also important to use consistent data definitions in the database. Datta et al. (2001) claim that this phase is extremely important.

5.5.3 Stage three – Feature selection

Feature selection is another important stage. The variables selected here are used in the modelling stage. It consists of two phases. Firstly, an initial feature subset is determined. Secondly, the subset is evaluated based on a certain criterion.

Ahn et al. (2006:554) describe four main types of determinants in churn. These determinants should be included in the initial feature subset.

Customer dissatisfaction is the first determinant of churn mentioned. It is driven by network and call quality. Service failures have also been identified as "triggers" that accelerate churn. Customers who are unhappy can have an extended negative influence on a company: they can spread negative word-of-mouth and also appeal to third-party consumer affairs bodies (Ahn et al., 2006:555).

Cost of switching is the second main determinant. Customers maintain their relationships with a company based on one of two reasons: they “have to” stay (constraint) or they “want to” stay (loyalty). Companies can use loyalty programs or membership cards to encourage their customers to “want to” stay (Ahn et al., 2006:556).

Service usage is the third main determinant. A customer’s service usage can broadly be described with minutes of use, frequency of use and total number of distinct numbers used. Service usage is one of the most popular predictors in churn models. It is still unclear if the correlation between churn and service usage is positive or negative (Ahn et al., 2006:556).

The final main determinant is customer status. According to Ahn et al. (2006:556), customers seldom churn suddenly from a service provider. Customers are usually suspended for a while due to payment issues, or they decide not to use the service for a while, before they churn.

Wei and Chiu (2002:105) use length of service and payment method as further possible predictors of churn. Customers with a longer service history are less likely to churn. Customers who authorise direct payment from their bank accounts are also expected to be less likely to churn.

Qi, Zhang, Shu, Li and Ge (2004?:2) derived different growth rates and number of abnormal fluctuation variables to model churn. Customers with growing usage are less likely to churn and customers with a high abnormal fluctuation are more likely to churn.
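As an illustration of the kind of derived usage variables Qi et al. describe, the sketch below computes a growth rate and a count of abnormal month-on-month fluctuations from hypothetical monthly minutes of use; the 50% fluctuation threshold is an assumption for the example, not a figure from the paper.

```python
# Derive simple usage features from hypothetical monthly minutes of use:
# a growth rate over the observation window and a count of abnormal
# month-on-month fluctuations (here: relative changes larger than 50%).
monthly_minutes = [320, 300, 310, 150, 290, 120]

def growth_rate(usage):
    return (usage[-1] - usage[0]) / usage[0]

def abnormal_fluctuations(usage, threshold=0.5):
    count = 0
    for previous, current in zip(usage, usage[1:]):
        if previous > 0 and abs(current - previous) / previous > threshold:
            count += 1
    return count

print(f"growth rate: {growth_rate(monthly_minutes):.2%}")
print(f"abnormal fluctuations: {abnormal_fluctuations(monthly_minutes)}")
```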

5.5.4 Stage four – Model development

It is clear from Figure 5 that decision tree models are the most frequently used models. The second most popular technique is logistic regression, followed closely by neural networks and survival analysis. The technique that featured in the least number of papers is discriminant analysis.

Discriminant analysis is a multivariate technique that classifies observations into existing categories. A mathematical function is derived from a set of continuous variables that best discriminates among the set of categories (Meilgaard, Civille & Carr, 1999:323).

According to Cohen and Cohen (2002:485) discriminant analysis makes stronger modelling assumptions than logistic regression. These include that the predictor variables must be multivariate normally distributed and the within-group covariance matrix must be homogeneous. These assumptions are rarely met in practice.

According to Harrell (2001:217), even if these assumptions are met, the results obtained from logistic regression are still as accurate as those obtained from discriminant analysis. Discriminant analysis will, therefore, not be considered.

A neural network is a parallel data processing structure that possesses the ability to learn. The concept is roughly based on the human brain (Hadden, Tiwari, Roy & Ruta, 2006:2). Most neural networks are based on the perceptron architecture where a weighted linear combination of inputs is sent through a nonlinear function.

According to de Waal and du Toit (2006:1), neural networks have been known to offer accurate predictions that are difficult to interpret. Understanding the drivers of churn is one of the main goals of churn modelling and, unfortunately, traditional neural networks provide limited insight into the model.

Yang and Chiu (2007:319) confirm this by stating that neural networks use an internal weight scheme that doesn’t provide any insight into why the solution is valid. It is often called a black-box methodology and neural networks are, therefore, also not considered in this study.

The statistical methodologies used in this study are decision trees, logistic regression and survival analysis. Decision tree modelling is discussed in Section 6.1, logistic regression in Sections 6.2 and 6.3 and survival analysis is discussed in Section 6.4.

5.5.5 Stage five – Validation of results

Each modelling technique has its own, specific validation method. To compare the models, accuracy will be used. However, a high accuracy on the training and validation data sets does not automatically result in accurate predictions on the population dataset. It is important to take the impact of oversampling into account. Section 5.6 discusses oversampling and the adjustments that need to be made.

5.6 Adjustments for Target Level Imbalances

From Telkom’s data it is clear that churn is a rare event of great interest and great value (Gupta, Hanssens, Hardie, Kahn, Kumar, Lin & Sriram, 2006:152).

If the event is rare, using a sample with the same proportion of events and non-events as the population is not ideal. Assume a decision tree is developed from such a sample and the event rate (x%) is very low. A prediction model could obtain a high accuracy (1-x%) by simply assigning all the cases to the majority level (e.g. predict all customers are non-churners) (Wei & Chiu, 2002:106). A sample with more balanced levels of the target is required.

Basic sampling methods to decrease the level of class imbalances include under-sampling and over-sampling. Under-sampling eliminates some of the majority-class cases by randomly selecting a lower percentage of them for the sample. Over-sampling duplicates minority-class cases by including a randomly selected case more than once (Burez & Van Den Poel, 2009:4630).

Under-sampling has the drawback that potentially useful information is unused. Over-sampling has the drawback that it might lead to over-fitting because cases are duplicated. Studies have shown that over-sampling is ineffective at improving the recognition of the minority class (Drummond & Holte, 2003:8). According to Chen, Liaw & Breiman, (2004:2) under-sampling has an edge over over-sampling.
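A minimal sketch of these two sampling schemes, assuming a pandas DataFrame with a hypothetical binary target column named churn (1 = churner):

import pandas as pd

def undersample(df, target="churn", random_state=1):
    # Keep all minority-class cases and a random subset of majority-class cases.
    minority = df[df[target] == 1]
    majority = df[df[target] == 0].sample(n=len(minority), random_state=random_state)
    return pd.concat([minority, majority]).sample(frac=1, random_state=random_state)

def oversample(df, target="churn", random_state=1):
    # Duplicate randomly selected minority-class cases until the levels are balanced.
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    boosted = minority.sample(n=len(majority), replace=True, random_state=random_state)
    return pd.concat([majority, boosted]).sample(frac=1, random_state=random_state)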

However, if the probability of an event (target variable equals one) in the population differs from the probability of an event in the sample, it is necessary to make adjustments for the prior probabilities. Otherwise the probability of the event will be overestimated. This will lead to score graphs and statistics that are inaccurate or misleading (Georges, 2007:456).

Therefore, decision-based statistics based on accuracy (or misclassification) misrepresent the model performance on the population. A model developed on this sample will identify more churners than there actually are (high false alarm rate). Without an adjustment for prior probabilities, the estimates for the event will be overestimated.

According to Potts (2001:72) the accuracy can be adjusted with equation 1, which takes the prior probabilities into account:

adjusted accuracy = (π0/ρ0 × TN + π1/ρ1 × TP) / n        (1)

with:

π0: the population proportion of non-churners
π1: the population proportion of churners
ρ0: the sample proportion of non-churners
ρ1: the sample proportion of churners
TN: the number of true negatives (the number of correctly predicted non-churners)
TP: the number of true positives (the number of correctly predicted churners)
n: the number of instances in the sample
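A minimal sketch of this adjustment, using the symbols defined above; the counts and proportions below are hypothetical:

def adjusted_accuracy(tn, tp, n, pi0, pi1, rho0, rho1):
    # Weight each correctly classified case by the ratio of population to
    # sample class proportion, then divide by the sample size (equation 1).
    return (pi0 / rho0 * tn + pi1 / rho1 * tp) / n

# Hypothetical 50/50 oversampled validation set, 2% churn in the population.
print(adjusted_accuracy(tn=400, tp=350, n=1000, pi0=0.98, pi1=0.02, rho0=0.5, rho1=0.5))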

However, accuracy as a model efficiency measure trained on an under-sampled dataset is dependent on the threshold. This threshold is influenced by the class imbalance between the sample and the population (Burez & Van Den Poel, 2009:4626).

Business decision making in different ways

1.0 Introduction

This project is not done only for the sake of submission; it is also an opportunity to gain knowledge in both practical and theoretical ways. Textbooks and study guides cannot give complete knowledge to any student, and I believe assignments are given so that students can gain extra practical knowledge from the wider world around them. The study of business decision making focuses mainly on different methods of data analysis, how they are useful in a business context, and the presentation of data in an appropriate way to make decisions and predictions. Its purpose is to build a better understanding of different business issues and the ways to tackle them. This report falls within the wide area of business decision making in an organisation. We have discussed representative measures and measures of dispersion, the difference between them, and how they are used to interpret information in a useful manner. We then use graphs to present the data in an accessible form and draw some conclusions for business purposes from those graphs. Finally, we give some solutions for a company which is encountering problems in telecommunications and inventory control. I have discussed the usefulness of an intranet in the process of inventory control to overcome poor inventory management. I have also provided some solutions by comparing two proposals using DCF and IRR techniques and clearly indicated which proposal the company should adopt in order to enhance its inventory control capacity effectively. This report helped me to apply theoretical knowledge to real-world examples, evaluate advantages and disadvantages, and make business decisions.

2.1 Collecting and maintaining medical data and medical records

In modern clinics and hospitals, and in many public health departments, data in each of these categories can be found in the records of individuals who have received services there, but not all the data are in the same file. Administrative and economic data are usually in separate files from clinical data; both are linked by personal identifying information. Behavioural information, such as the fact that an individual did not obtain prescribed medication or failed to keep appointments, can be extracted by linking facts in a clinical record with the records of medications dispensed and/or appointments kept. Records in hospitals and clinics are mostly computer-processed and stored, so it is technically feasible to extract and analyze the relevant information, for instance occupation, diagnosis, and method of payment for the service that was provided, or behavioural information. Such analyses are often conducted for routine or research purposes, although there are some ethical constraints to protect the privacy and preserve the confidentiality of individuals.

Primary sources

Primary data sources are those where you have collected the data yourself rather than relying on someone else's work. For example, a questionnaire created by you and handed out to specific people is a primary source. You can then use the responses to test a hypothesis or explain a situation. Primary sources include:

Statistics
Surveys
Opinion polls
Scientific data
Transcripts
Records of organizations and government agencies

Secondary data

Secondary data are indispensable for most organizational research. Secondary data refer to information gathered by someone other than the researcher conducting the current study. Sources include:

Books
Periodicals
Government publications of economic indicators
Census data
Statistical abstracts
Databases
The media
Annual reports of companies
Case studies
Other archival records

2.2 Data collection methodology and Questionnaire
Records of Births and Deaths

Vital records (certifications of births and deaths) are similarly computer-stored and can be analyzed in many ways. Collection of data for birth and death certificates relies on the fact that recording of both births and deaths is a legal obligation—and individuals have powerful reasons, including financial incentives such as collection of insurance benefits, for completing all the formal procedures for certification of these vital events. The paper records that individuals require for various purposes are collected and collated in regional and national offices, such as the U.S. National Center for Health Statistics, and published in monthly bulletins and annual reports. Birth certificates record details such as full name, birthdate, names and ages of parents, birthplace, and birthweight. These items of information can be used to construct a unique sequence of numbers and alphabet letters to identify each individual with a high degree of precision. Death certificates contain a great deal of valuable information: name at birth as well as at death, age, sex, place of birth as well as death, and cause of death. The personal identifying information can be used to link the death certificate to other health records. The reliability of death certificate data varies according to the cause and place: Deaths in hospitals have usually been preceded by a sufficient opportunity for investigations to yield a reliable diagnosis, but deaths at home may be associated with illnesses that have not been investigated, so they may have only patchy and incomplete old medical records or the family doctor’s working diagnosis, which may be no more than an educated guess. Deaths in other places, such as on the street or at work, are usually investigated by a coroner or medical examiner, so the information is reasonably reliable. Other vital records, for example, marriages and divorces and dissolution of marriages, have less direct utility for health purposes but do shed some light on aspects of social health.

Health Surveys

Unlike births and deaths, health surveys are experienced by only a sample of the people; but if it is a statistically representative sample, inferences about findings can be generalized with some confidence. Survey data may be collected by asking questions either in an oral interview or over the telephone, or by giving the respondents a written questionnaire and collecting their answers. The survey data are collated, checked, edited for consistency, processed and analyzed generally by means of a package computer program. A very wide variety of data can be collected this way, covering details such as past medical events, personal habits, family history, occupation, income, social status, family and other support networks, and so on. In the U.S. National Health and Nutrition Surveys, physical examinations, such as blood pressure measurement, and laboratory tests, such as blood chemistry and counts, are carried out on a subsample.

Records of medical examinations on school children, military recruits, or applicants for employment in many industries are potentially another useful source of data, but these records tend to be scattered over many different sites and it is logistically difficult to collect and collate them centrally.

Health Research Data

The depth, range, and scope of data collected in health are diverse and complex, so they cannot be considered in detail here. Research in fields as diverse as biochemistry, psychology, genetics, and sports physiology has usefully illuminated aspects of population health, but the problems of central collection and collation and of making valid generalizations reduce the usefulness of most data from health-related research for the purpose of delineating aspects of national health.

Unobtrusive Data Sources and Methods of Collection

Unobtrusive methods and indirect methods can be a rich source of information from which it is sometimes possible to make important inferences about the health of the population or samples thereof. Economic statistics such as sales of tobacco and alcohol reveal national consumption patterns; counting cigarette butts in school playgrounds under controlled conditions is an unobtrusive way to get a very rough measure of cigarette consumption by school children. Calls to the police to settle domestic disturbances provide a rough measure of the prevalence of family violence. Traffic crashes involving police reports and/or insurance claims reveal much about aspects of risk-taking behavior, for example, the dangerous practice of using cell phones while driving. These are among many examples of unobtrusive data sources, offered merely to illustrate the potential value of this approach.

The questionnaire contains something in each of the following categories:

Personal identifying data: name, age (birth date), sex, and so on.
Socio-demographic data: sex, age, occupation, place of residence.
Clinical data: medical history, investigations, diagnoses, treatment regimens.
Administrative data: referrals, sites of care.
Economic data: insurance coverage, method of payment.
Behavioral data: adherence to the recommended regimen (or otherwise).
3.0 Data Analysis
Representative Values.

These are also called measures of location or measures of central tendency. They indicate where the centre or most typical value of a data set lies. There are three important measures: mean, median and mode. The mean and median can only be applied to quantitative data, but the mode can be used with either quantitative or qualitative data.

Mean

This is the most commonly used measure and is the average of a data set: the sum of the observations divided by the number of observations.

Advantages:
Objective and easy to calculate
Easy to understand
Calculated from all the data

Disadvantages:
Affected by outlying values
May be some distance from most values

Median

The median of a data set is the number that divides the bottom 50% of the data from the top 50%.

Advantages:
Easy to understand
Gives a value that actually occurred
Not affected by outlying values

Disadvantages:
Does not consider all the data
Can be used only with cardinal data
Not easy to use in further analyses

Mode

The mode of a data set is the value that occurs most frequently (more than once).

Advantages:
An actual value from the data
Not affected by outlying values

Disadvantages:
There can be more than one mode, or none
Does not consider all the data
Cannot be used in further analyses

Comparison of mean, median and mode

For this garage, the representative values are as follows:

Mean: £335
Median: £323
Mode: £430

As we can see, the mean and median do not vary drastically, but the mode, on the other hand, does.

The owner now has to select which of these values to charge as the price.

The mode is very high and does not consider all the values, so if the owner charges £430 it will be expensive and customers may switch to competitors. Therefore, the owner should not choose the mode.

Now the selection is between the mean and the median. Both look reasonable and close to most of the costs in October. The median is usually preferred when the data set has extreme observations; otherwise the mean is the likely choice because it considers all the data.

An overview of the October costs shows no extreme values at all, so the mean would not be unduly affected.

Therefore it is advisable that the owner chooses the mean value of £335.
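As a minimal sketch, the three representative values can be computed with Python's statistics module; the October costs below are illustrative placeholders, not the garage's actual figures:

import statistics

# Hypothetical October job costs for the garage (illustrative values only)
costs = [210, 245, 280, 310, 323, 335, 360, 395, 430, 430, 468]

print("mean:", statistics.mean(costs))      # the average of the data set
print("median:", statistics.median(costs))  # splits the bottom 50% from the top 50%
print("mode:", statistics.mode(costs))      # the most frequently occurring value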

Measures of Dispersion

Representative measures only indicate the location of a set of data, and two data sets can have the same mean, median and mode. In that case we cannot make any decision using representative values alone. To describe the difference we use descriptive measures that indicate the amount of variation, known as measures of dispersion or measures of spread.

This includes the following measurements:

Range: the range is simply the difference between the highest value and the lowest value. It is easy to calculate and understand, but it only considers the largest and smallest values, ignores all the other values and is highly affected by extreme values.
Quartile range: the quartile range is the difference between the 3rd quartile and the 1st quartile. It is also easy to calculate, but it does not consider all the values in a data set, so it is not a good indicator.
Variance and standard deviation: the variance measures how far the observations are from the mean. This is a more important statistic because it considers all the observations and is used in further analysis. The standard deviation is the square root of the variance. Both provide useful information for decision making and for making comparisons.

From the calculations, the range is £284 and the quartile range is £170, but because of their shortcomings we cannot use them to derive further decisions. The variance is 8426.9 and the standard deviation is 91.79, which show that the observations deviate considerably from the mean. Variance and standard deviation are used to compare two data sets, so the owner of this garage can compare these figures with those of a similar garage, or with the November costs, and make decisions such as selecting the price with the smaller variance and standard deviation.
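A minimal sketch of these dispersion measures on the same illustrative cost list (statistics.quantiles requires Python 3.8 or later):

import statistics

# Hypothetical October job costs (same illustrative list as above)
costs = [210, 245, 280, 310, 323, 335, 360, 395, 430, 430, 468]

value_range = max(costs) - min(costs)            # range
q1, q2, q3 = statistics.quantiles(costs, n=4)    # quartiles
iqr = q3 - q1                                    # quartile (interquartile) range
variance = statistics.variance(costs)            # sample variance
std_dev = statistics.stdev(costs)                # sample standard deviation

print(value_range, iqr, variance, std_dev)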

Quartiles and percentiles are also like representative measures. They indicate the proportion of values below a certain value; for example, the 3rd quartile indicates that 75% of the observations are below a certain amount and 25% are above it.

Quartile and percentile values for the garage:

1st quartile: £248.5
2nd quartile: £322.5
3rd quartile: £418.5
75th percentile: £418.5
50th percentile: £322.5
60th percentile: £349.4

From the figures above we can see that only 25% of the values are above £418, so we should not charge a price higher than that; if we did, we would lose many customers. Likewise, only 25% of the observations are below £248.5, so we should select a price between £248 and £418. Earlier we found the mean, £335, which lies between the 2nd quartile and the 60th percentile. So, using the quartiles and percentiles, we can select £335 as the service price. Thus quartiles and percentiles help us in decision making.

The correlation coefficient measures the strength of the linear relationship between two variables. It is denoted by “r” and always lies between -1 and +1. If “r” is close to +1, the two variables have a strong positive relationship. The correlation coefficient also helps in making business decisions.

4.0 Presentation of Information

Tables are good at presenting a lot of information, but it can still be difficult to identify the underlying patterns. Therefore charts and graphs play an important part in presenting data effectively. Graphical methods include scatter graphs, bar charts, line charts, pie charts and histograms.

Pie charts

Pie charts are simple diagrams that give a summary of categorical data; each slice of the circle represents one category. They are very simple and can make an impact, but they can only show small amounts of data; with more data they become complicated and confusing. Pie charts can still be used for comparisons: here we can see that the amount of commission Trevor plc paid is increasing, because 2008 takes a bigger proportion of the circle than 2007, 2006 and 2005, so we can expect the amount to be higher than 2008 in the next year.

Bar Charts

Like pie charts, bar charts show the number of observations in different categories. Each category is represented by a separate bar and the length of the bar is proportional to the number of observations. In contrast to pie charts, larger amounts of data can be plotted in bar charts, and it is easy to make comparisons between different periods and different observations. Here the sales of BMW and Mercedes are increasing continuously while the sales of the other cars fluctuate; we can also see that overall turnover is increasing year by year.

Line Chart

A line chart is another way of presenting data, using lines rather than bars or circles. Line charts are easy to draw, make the underlying trend easy to see and support predictions. An area chart is similar to a line chart but shows the total amount and displays each category as an area, so it can be used both to understand the trend and to make comparisons. The line chart for Trevor plc indicates that, except for Lexus, sales of the other cars are increasing, with Mercedes showing a dramatic increase from 2006 to 2008. Between 2005 and 2006 car sales tended to be steady. Based on this line chart, Trevor plc should mainly focus on BMW and Mercedes to increase its turnover in the forthcoming years. The area chart indicates the same result as the line chart.

Scatter Diagram and the trend line

A scatter diagram is drawn using two variables; here we plot commission against year, with commission on the “y” axis and year on the “x” axis. A scatter diagram shows the relationship between two variables: whether they are positively or negatively correlated and whether the relationship is strong or weak. Commission has a positive relationship with year for Trevor plc and the relationship is strong, because most of the observations lie close to a straight line. We have calculated the correlation coefficient between commission and year as 0.9744, which indicates a strong positive relationship.

Trend lines are used to understand the underlying trend and to make useful forecasts. The trend line for Trevor plc shows an upward trend in commission over the years. We can predict that the commission will be approximately £18,000 in 2009, rising to around £18,500–£19,000 the following year.
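A minimal sketch of the correlation coefficient and trend-line forecast; the commission figures below are illustrative placeholders for Trevor plc's actual data:

import numpy as np

# Hypothetical commission figures (the report's actual values are in its tables)
years = np.array([2005, 2006, 2007, 2008])
commission = np.array([12500, 14000, 15800, 17200])

r = np.corrcoef(years, commission)[0, 1]             # correlation coefficient "r"
slope, intercept = np.polyfit(years, commission, 1)  # least-squares trend line

forecast_2009 = slope * 2009 + intercept             # extrapolate the trend one year ahead
print(round(r, 4), round(forecast_2009))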

6.0 Intranet

To: The Board of Directors

From: Management Consultant

Date: 20.12.2009

Subject: Intranet and its evaluation

An intranet is a private network that is contained within an enterprise. It may consist of many interlinked local area networks and typically includes connections through one or more gateway computers to the outside Internet. The main purpose of an intranet is to share company information and computing resources among employees, and to share information between the branches of the same organisation.

Advantages:
Easy access to internal and external information
Improves communication
Increases collaboration and coordination
Supports links with customers and partners
Can capture and share knowledge
Productivity can be increased
Margins of errors will be reduced
High flexibility
It provides with timely and accurate information
It allows communication within the branches of the organisation.
Disadvantages:
Installation and maintenance can be expensive.
This may reduce face to face meetings with clients or business partners.
7.0 Management Information System

A management information system (MIS) is a system that allows managers to make decisions for the successful operation of businesses. Management information systems consist of computer resources, people, and procedures used in the modern business enterprise. MIS also refers to the organization that develops and maintains most or all of the computer systems in the enterprise so that managers can make decisions. The goal of the MIS organization is to deliver information systems to the various levels of managers: the strategic, tactical and operational levels.

Types of Information vary according to the levels of management.

Strategic management will need information for long term planning and corporate strategy. This will be less structured.

Tactical Management needs to take short term decisions to focus on improving profitability and performance.

Operational management needs information on day to day operations of the organisation.

11.0 Conclusion

Finally, I would like to conclude my report on business decision making. I started with the various methods of data collection and the analysis of the data gathered, and prepared a sample questionnaire based on the example used. Then the presentation of data through tables was discussed, followed by the information needed for decision making. Afterwards, I evaluated the advantages and disadvantages of an intranet and its usefulness in controlling inventory, and discussed the various inventory control methods used by organisations. Finally, I drew a conclusion on the investment decision scenario given.

This report made me clearly understand all the subject areas I learnt in the lectures and I found it useful.

‘Big’ Data Science and Scientists

‘BIG’ DATA SCIENCE

If you could take a trip back in time with a time machine and tell people that today a child can interact with others from anywhere and query trillions of pieces of data all over the globe with a simple click on his or her computer, they would have said it was science fiction!

Today more than 2.9 million emails are sent across the internet every second. 375 megabytes of data are consumed by households each day. Google processes 24 petabytes of data per day. That is a lot of data! With each click, like and share, the world’s data pool is expanding faster than we can comprehend. Data is being created every minute of every day without us even noticing it. Businesses today are paying attention to scores of data sources to make crucial decisions about the future. The rise of digital and mobile communication has made the world more connected, networked and traceable, which has resulted in the availability of such large-scale data sets.

So what is this buzzword “Big Data” all about? Big data may be defined as data sets whose size is beyond the ability of typical database software tools to capture, store, manage and process. The definition can differ by sector, depending on what kinds of software tools are commonly available and what sizes of data sets are common in a particular industry.

The explosion in digital data, bandwidth and processing power – combined with new tools for analyzing that data – has sparked massive interest in the emerging field of data science. Big data has now reached every sector of the global economy and has become an integral part of solving the world’s problems. It allows companies to know more about their customers, their products and their own infrastructure. More recently, attention has turned increasingly to the monetization of that data.

According to a McKinsey Global Institute Report[1] in 2011, simply making big data more easily accessible to relevant stakeholders in a timely manner can create enormous value. For example, in the public sector, making relevant data more easily accessible across otherwise separated departments can sharply cut search and processing time. Big data also allows organizations to create highly specific segmentations and to tailor products and services precisely to meet those needs. This approach is widely known in marketing and risk management but can be revolutionary elsewhere.

Big Data is improving transportation and power consumption in cities, making our favorite websites and social networks more efficient, and even preventing suicides. Businesses are collecting more data than they know what to do with. Big data is everywhere; the volume of data produced, saved and mined is startling. Today, companies use data collection and analysis to formulate more cogent business strategies. Manufacturers use data obtained from the use of real products to improve and develop new products and to create innovative after-sale service offerings. This will continue to be an emerging area for all industries. Data has become a competitive advantage and a necessary part of product development.

Companies succeed in the big data era not simply because they have more or better data, but because they have good teams that set clear objectives and define what success looks like by asking the right questions. Big data are also creating new growth opportunities and entirely new categories of companies, such as those that collect and analyze industrial data.

One of the most impressive areas where the concept of big data is taking hold is machine learning. Machine learning can be defined as the study of computer algorithms that improve automatically through experience. Machine learning is a branch of artificial intelligence, which itself is a branch of computer science. Applications range from data mining programs that discover general rules in large data sets to information filtering systems that automatically learn users’ interests.
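A minimal sketch of the idea, assuming scikit-learn is available; the features and labels are purely illustrative:

from sklearn.linear_model import LogisticRegression

# Toy "information filtering" example: predict whether a user likes an article
# from two illustrative features (reading time in minutes, number of shared topics).
X = [[3, 0], [5, 1], [8, 3], [10, 4], [2, 0], [9, 5]]
y = [0, 0, 1, 1, 0, 1]                      # 1 = liked, 0 = not liked (hypothetical labels)

model = LogisticRegression().fit(X, y)      # the algorithm "improves through experience"
print(model.predict([[7, 2]]))              # filter a new, unseen article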

Rising alongside the relatively new technology of big data is the new job title of data scientist. An article by Thomas H. Davenport and D.J. Patil in Harvard Business Review[2] describes ‘Data Scientist’ as the ‘Sexiest Job of the 21st Century’. You have to buy the logic that what makes a career “sexy” is when demand for your skills exceeds supply, allowing you to command a sizable paycheck and options. The Harvard Business Review actually compares these “data scientists” to the quants of the 1980s and 1990s on Wall Street, who pioneered “financial engineering” and algorithmic trading. The need for data experts is growing, and demand is on track to hit unprecedented levels in the next five years.

Who are Data Scientists ?

Data scientists are people who know how to ask the right questions to get the most value out of massive volumes of data. In other words, a data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

Good data scientists will not just address business problems; they will choose the right problems that have the most value to the organization. They combine the analytical capabilities of a scientist or an engineer with the business acumen of the enterprise executive.

Data scientists have changed and keep changing the way things work. They integrate big data technology into both IT departments and business functions. Data scientists must also understand the business applications of big data and how it will affect the business organization, and be able to communicate with IT and business management. The best data scientists are comfortable speaking the language of business and helping companies reformulate their challenges.

Data science, due to its interdisciplinary nature, requires an intersection of hacking skills, math and statistics knowledge, and substantive expertise in a field of science. Hacking skills are necessary for working with the massive amounts of electronic data that must be acquired, cleaned and manipulated. Math and statistics knowledge allows a data scientist to choose appropriate methods and tools in order to extract insight from data. Substantive expertise in a scientific field is crucial for generating motivating questions and hypotheses and for interpreting results. Traditional research lies at the intersection of math and statistics knowledge with substantive expertise in a scientific field. Machine learning stems from combining hacking skills with math and statistics knowledge, but does not require scientific motivation. Science is about discovery and building knowledge, which requires motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. Hacking skills combined with substantive scientific expertise but without rigorous methods can beget incorrect analysis.

A good scientist can understand the current state of a field, pick challenging questions where success will actually lead to useful new knowledge, and push that field further through their work.

How to become a Data Scientist ?

No university programs in India have yet been designed to develop data scientists, so recruiting them requires creativity. You cannot become a big data scientist overnight. Data scientists need to know how to code and should be comfortable with mathematics and statistics. They also need to know machine learning and software engineering, how to organize large data sets, and how to use visualization tools and techniques. Learning data science can be really hard.

Data scientists need to know how to code in SAS, SPSS, Python or R. Statistical Package for the Social Sciences (SPSS), a software package currently developed by IBM, is a widely used program for statistical analysis in social science. The Statistical Analysis System (SAS) software suite, developed by SAS Institute, is mainly used in advanced analytics; SAS is the largest market-share holder for advanced analytics. Python is a high-level programming language that is very commonly used in the data science community. Finally, R is a free software programming language for statistical computing and graphics. R has become a de facto standard among statisticians for developing statistical software and is widely used for statistical software development and data analysis. R is part of the GNU Project, a collaboration that supports open source projects.

A few online courses can help you learn some of the main coding languages. One currently available option is through the popular MOOC website coursera.org: a specialization offered by Johns Hopkins University through Coursera helps you learn R programming, data visualization, machine learning and how to develop data products, and there are a few more Coursera courses that help you learn data science. Udacity is another popular MOOC website that offers courses on data science, machine learning and statistics. CodeAcademy also offers similar courses for learning data science and coding in Python.

When you start operating with data at the scale of the web, the fundamental approach and process of analysis must and will change. Most data scientists are working on problems that cannot be run on a single machine: they have large data sets that require distributed processing. Hadoop is an open-source software framework for storing and large-scale processing of data sets on clusters of commodity hardware. MapReduce is the programming paradigm that allows for massive scalability across the servers in a Hadoop cluster. Apache Spark is Hadoop’s speedy Swiss Army knife: a fast-running data analysis system that provides real-time data processing functions to Hadoop. It is also important that a data scientist is able to work with unstructured data, whether it comes from social media, videos or even audio.
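A minimal sketch of the MapReduce idea in plain Python (Hadoop itself distributes the map and reduce steps across a cluster; here they simply run locally):

from collections import Counter
from functools import reduce

documents = [
    "big data needs distributed processing",
    "hadoop and spark process big data",
]

# Map step: each document is turned into local (word, count) pairs independently,
# which is the part Hadoop would run in parallel across the cluster.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce step: the partial counts are merged into a single result.
total_counts = reduce(lambda a, b: a + b, mapped)
print(total_counts.most_common(3))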

KDnuggets is a popular website among data scientists that mainly focuses on the latest updates and news in the fields of business analytics, data mining and data science. KDnuggets also offers a free Data Mining Course – the teaching modules for a one-semester introductory course on data mining, suitable for advanced undergraduates or first-year graduate students.

Kaggle is a platform for predictive modeling and analytics competitions, on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. Kaggle hosts many data science competitions where you can practice, test your skills with complex, real-world data and tackle actual business problems. Many employers do take Kaggle rankings seriously, as they can be seen as pertinent, hands-on project work. Kaggle aims at making data science a sport.

Finally, to be a data scientist you will need a good understanding of the industry you are working in and of the business problems your company is trying to solve. In terms of data science, being able to identify which problems are crucial for the business to solve is critical, in addition to identifying new ways the business should be leveraging its data.

A study by Burtch Works[3] in April 2014, finds that data scientists earn a median salary that can be up to 40% higher than other Big Data professionals at the same job level. Data scientists have a median of nine years of experience, compared to other Big Data professionals who have a median of 11 years. More than one-third of data scientists are currently in the first five years of their careers. The gaming and technology industries pay higher salaries to data scientists than all other industries.

LinkedIn, a popular business-oriented social networking website, voted “statistical analysis and data mining” the top skill that got people hired in 2014. Data science has a bright future ahead: there will only be more data and more of a need for people who can find meaning and value in that data. Despite the growing opportunity, demand for data scientists has outpaced the supply of talent and will do so for the next five years.

A Study On Business Forecasting Statistics Essay

The aim of this report is to show my understanding of business forecasting using data which was drawn from the UK national statistics. It is a quarterly series of total consumer credit gross lending in the UK from the second quarter 1993 to the second quarter 2009.

The report answers four key questions that are relevant to the coursework.

In this section the data will be examined, looking for seasonal effects, trends and cycles. Each time period represents a single piece of data, which must be split into trend-cycle and seasonal effect. The line graph in Figure 1 identifies a clear upward trend-cycle, which must be removed so that the seasonal effect can be predicted.

Figure 1 displays long-term credit lending in the UK, which has recently been hit by an economic crisis. Figure 2 also provides evidence of a trend, because the ACF values do not come down to zero. Even though the trend is clear in Figures 1 and 2, the seasonal pattern is not. Therefore it is important that the trend-cycle is removed so that the seasonal effect can be estimated clearly. A process called differencing will remove the trend whilst keeping the pattern.

Drawing scatter plots and calculating correlation coefficients on the differenced data will reveal the pattern repeat.

Scatter Plot correlation

The following diagram (Figure 3) represents the correlation between the original credit lending data and four lags (quarters). A strong correlation is indicated by a straight-line relationship.

As depicted in Figure 3, the scatter plot diagrams show that the credit lending data against lag 4 represents the best straight line. Even though the last diagram represents the straightest line, the seasonal pattern is still unclear. Therefore differencing must be used to resolve this issue.

Differencing

Differencing is used to remove a trend-cycle component. The results in Figure 4 display an ACF graph which indicates a four-point pattern repeat. Moreover, Figure 5 shows a line graph of the first difference; the graph displays a four-point repeat but the trend is still clearly apparent. To remove the trend completely, the data must be differenced a second time.

First differencing is a useful tool for removing non-stationarity. However, first differencing does not always eliminate non-stationarity and the data may have to be differenced a second time. In practice it is rarely necessary to go beyond second differencing, because real data generally involve non-stationarity of only the first or second order.

Figures 6 and 7 display the second difference data. Figure 6 displays an ACF graph of the second difference, which reinforces the idea of a four-point repeat. Figure 7 shows that the trend-cycle component has been completely removed and that there is indeed a four-point pattern repeat.
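A minimal sketch of first and second differencing with pandas, on an illustrative series rather than the actual lending data:

import pandas as pd

# Hypothetical quarterly credit lending series (stand-in for the UK national statistics data)
lending = pd.Series([21000, 22300, 23100, 24650, 25200, 26900, 27500, 29100])

first_diff = lending.diff()          # removes a linear trend
second_diff = first_diff.diff()      # removes any remaining (quadratic) trend
print(second_diff.dropna())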

Question 2

Multiple regression involves fitting a linear expression by minimising the sum of squared deviations between the sample data and the fitted model. There are several models that regression can fit. Multiple regression can be implemented using linear and nonlinear regression. The following section explains multiple regression using dummy variables.

Dummy variables are used in a multiple regression to fit trends and pattern repeats in a holistic way. As the credit lending data is seasonal, a common way to handle the seasonality in a regression framework is to use dummy variables. The following section includes dummy variables for the quarters, which are used to check whether there are any quarterly influences on lending. Three new variables can be defined:

Q1 = first quarter
Q2 = second quarter
Q3 = third quarter
Trend and seasonal models using dummy variables

The following equations are used by SPSS to create different outputs. Each model is judged in terms of its adjusted R2.

Linear trend + seasonal model

Data = a + c1 × time + b1 × Q1 + b2 × Q2 + b3 × Q3 + error

Quadratic trend + seasonal model

Data = a + c1 × time + c2 × time² + b1 × Q1 + b2 × Q2 + b3 × Q3 + error

Cubic trend + seasonal model

Data = a + c1 × time + c2 × time² + c3 × time³ + b1 × Q1 + b2 × Q2 + b3 × Q3 + error

Initially, data and time columns were entered to display the trends. The lending data was then regressed against time and the dummy variables. Due to multi-collinearity (i.e. at least one of the variables being completely determined by the others), there was no need for all four quarterly variables, just Q1, Q2 and Q3.
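A minimal sketch of this dummy-variable regression, assuming statsmodels is available; the series is simulated rather than the actual UK lending data:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_quarters = 16
df = pd.DataFrame({
    "lending": 17000 + 760 * np.arange(1, n_quarters + 1) + rng.normal(0, 2000, n_quarters),
    "time": np.arange(1, n_quarters + 1),
    "quarter": [1, 2, 3, 4] * (n_quarters // 4),
})

# Dummy variables for Q1-Q3 only; Q4 is the baseline, avoiding multi-collinearity.
for q in (1, 2, 3):
    df[f"Q{q}"] = (df["quarter"] == q).astype(int)

X = sm.add_constant(df[["time", "Q1", "Q2", "Q3"]])
model = sm.OLS(df["lending"], X).fit()
print(model.summary())   # adjusted R-squared, coefficients and p-values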

Linear regression

Linear regression is used to define a line that comes closest to the original credit lending data. It finds values for the slope and intercept that minimise the sum of the squared vertical distances between the points and the line.

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .971   .943       .939                3236.90933

Figure 8. SPSS output displaying the adjusted coefficient of determination R squared

Coefficients

Model        B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)   17115.816            1149.166                           14.894   .000
time         767.068              26.084       .972                  29.408   .000
Q1           -1627.354            1223.715     -.054                 -1.330   .189
Q2           -838.519             1202.873     -.028                 -.697    .489
Q3           163.782              1223.715     .005                  .134     .894

Figure 9

The adjusted coefficient of determination (adjusted R squared) is 0.939, which is an excellent fit (Figure 8). The coefficient of the variable ‘time’, 767.068, is positive, indicating an upward trend. Not all of the coefficients are significant at the 5% level (0.05), so variables must be removed. Initially, Q3 is removed because it is the least significant variable (Figure 9). Once Q3 is removed, Q2 is the least significant variable, and even with Q3 and Q2 removed, Q1 is still not significant. All the quarterly variables must therefore be removed, leaving time as the only variable, which is significant.

Coefficients

Model        B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)   16582.815            866.879                            19.129   .000
time         765.443              26.000       .970                  29.440   .000

Figure 10

The following table (Table 1) compares the forecasts with the holdback data, using the coefficients in Figure 10. The following equation is used to calculate the predicted values:

Predicted value = 16582.815 + 765.443 × time

Original Data   Predicted Values
50878.00        60978.51
52199.00        61743.95
50261.00        62509.40
49615.00        63274.84
47995.00        64040.28
45273.00        64805.72
42836.00        65571.17
43321.00        66336.61

Table 1

Suffice it to say, this model is ineffective at predicting future values: while the original holdback data decreases each quarter, the predicted values keep increasing over time, showing no meaningful correspondence.
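A minimal sketch of how the holdback comparison in Table 1 can be summarised with error measures such as the mean absolute error and the root mean squared error:

import numpy as np

# Original holdback data and linear-model predictions copied from Table 1.
actual = np.array([50878, 52199, 50261, 49615, 47995, 45273, 42836, 43321])
predicted = np.array([60978.51, 61743.95, 62509.40, 63274.84,
                      64040.28, 64805.72, 65571.17, 66336.61])

mae = np.mean(np.abs(actual - predicted))            # mean absolute error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))   # root mean squared error
print(round(mae, 2), round(rmse, 2))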

Non-Linear regression

Non-linear regression aims to find a relationship between a response variable and one or more explanatory variables in a non-linear fashion.

Non-Linear model (Quadratic)

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .986   .972       .969                2305.35222

Figure 11

Coefficients

Model        B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)   11840.996            1099.980                           10.765   .000
time         1293.642             75.681       1.639                 17.093   .000
time²        -9.079               1.265        -.688                 -7.177   .000
Q1           -1618.275            871.540      -.054                 -1.857   .069
Q2           -487.470             858.091      -.017                 -.568    .572
Q3           172.861              871.540      .006                  .198     .844

Figure 12

The quadratic model has an R squared of 0.972 and an adjusted R squared of 0.969 (Figure 11), a slight improvement on the linear model (Figure 8). The coefficient of the variable ‘time’, 1293.642, is positive, indicating an upward trend, whereas the coefficient of ‘time²’ is -9.079, which is negative. Together, the positive and negative coefficients indicate a curve in the trend.

Not all the coefficients are significant at the 5% level, so variables must again be removed. Initially, Q3 is removed because it is the least significant variable (Figure 12). Once Q3 is removed, Q2 is still the least significant variable. Once Q2 and Q3 have been removed, Q1 falls below the 5% level, meaning it is significant (Figure 13).

Coefficients

Model        B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)   11698.512            946.957                            12.354   .000
time         1297.080             74.568       1.643                 17.395   .000
time²        -9.143               1.246        -.693                 -7.338   .000
Q1           -1504.980            700.832      -.050                 -2.147   .036

Figure 13

Table 2 compares the forecasts with the holdback data, using the coefficients in Figure 13. The following equation is used to calculate the predicted values:

Predicted value = 11698.512 + 1297.080 × time − 9.143 × time² − 1504.980 × Q1

Original Data   Predicted Values
50878.00        56172.10
52199.00        56399.45
50261.00        55103.53
49615.00        56799.29
47995.00        56971.78
45273.00        57125.98
42836.00        55756.92
43321.00        57379.54

Table 2

Compared with Table 1, Table 2 presents predicted values that are closer to the originals, but they are still not accurate enough.

Non-Linear model (Cubic)

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .997   .993       .992                1151.70013

Figure 14

Coefficients

Model        B (Unstandardized)   Std. Error   Beta (Standardized)   t         Sig.
(Constant)   17430.277            710.197                            24.543    .000
time         186.531              96.802       .236                  1.927     .060
time²        38.217               3.859        2.897                 9.903     .000
time³        -.544                .044         -2.257                -12.424   .000
Q1           -1458.158            435.592      -.048                 -3.348    .002
Q2           -487.470             428.682      -.017                 -1.137    .261
Q3           12.745               435.592      .000                  .029      .977

Figure 15

The adjusted coefficient of determination (adjusted R squared) is 0.992, which is the best fit so far (Figure 14). The coefficients of ‘time’, 186.531, and ‘time²’, 38.217, are positive, indicating an upward trend, while the coefficient of ‘time³’ is -.544, which indicates a curve in the trend. Not all the coefficients are significant at the 5% level, so variables must be removed. Initially, Q3 is removed because it is the least significant variable (Figure 15). Once Q3 is removed, Q2 is still the least significant variable. Once Q3 and Q2 have been removed, Q1 is significant but the ‘time’ variable is not, so it must also be removed.

Coefficients

Model        B (Unstandardized)   Std. Error   Beta (Standardized)   t         Sig.
(Constant)   18354.735            327.059                            56.120    .000
time²        45.502               .956         3.449                 47.572    .000
time³        -.623                .017         -2.586                -35.661   .000
Q1           -1253.682            362.939      -.042                 -3.454    .001

Figure 16

Table 3 compares the forecasts with the holdback data, using the coefficients in Figure 16. The following equation is used to calculate the predicted values:

Predicted value = 18354.735 + 45.502 × time² − 0.623 × time³ − 1253.682 × Q1

Original Data   Predicted Values
50878.00        49868.69
52199.00        48796.08
50261.00        46340.25
49615.00        46258.51
47995.00        44786.08
45273.00        43172.89
42836.00        40161.53
43321.00        39509.31

Table 3

Suffice it to say, the cubic model displays the most accurate predicted values compared with the linear and quadratic models. Table 3 shows that both the original data and the predicted values gradually decrease.

Question 3

Box-Jenkins modelling is used to find a suitable formula so that the residuals are as small as possible and exhibit no pattern. The model is built in a few steps, which may be repeated as necessary, resulting in a specific formula that replicates the patterns in the series as closely as possible and also produces accurate forecasts.

The following section will show a combination of decomposition and Box-Jenkins ARIMA approaches.

For each of the original variables analysed by the procedure, the Seasonal Decomposition procedure creates four new variables for the modelling data:

SAF: Seasonal factors
SAS: Seasonally adjusted series, i.e. de-seasonalised data, representing the original series with seasonal variations removed.
STC: Smoothed trend-cycle component, which is a smoothed version of the seasonally adjusted series that shows both trend and cyclic components.
ERR: The residual component of the series for a particular observation

Autoregressive (AR) models can be effectively coupled with moving average (MA) models to form a general and useful class of time series models called autoregressive moving average (ARMA) models. However, these can only be used when the data is stationary. The class can be extended to non-stationary series by allowing differencing of the data series; the resulting models are called autoregressive integrated moving average (ARIMA) models.

The variable SAS will be used in the ARIMA models because it is the de-seasonalised version of the original credit lending data. Although the data in Figure 19 is de-seasonalised, the trend must still be removed; as mentioned before, the data must be differenced to remove the trend and produce a stationary series.
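A minimal sketch of fitting such a model with statsmodels, using an illustrative stand-in for the seasonally adjusted series (SAS):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Stand-in for the seasonally adjusted lending series produced by the
# Seasonal Decomposition step (illustrative values only).
sas = pd.Series([21000, 22300, 23100, 24650, 25200, 26900, 27500, 29100,
                 30400, 31800, 33100, 34600, 35900, 37400, 38800, 40300])

model = ARIMA(sas, order=(3, 2, 0)).fit()   # p = 3 AR terms, d = 2 differences, q = 0 MA terms
print(model.forecast(steps=8))              # forecasts for the holdback quarters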

Model Statistics

Model: Seasonal adjusted series for creditlending from SEASON, MOD_2, MUL EQU 4-Model_1
Number of Predictors: 0
Model Fit statistics – Stationary R-squared: .485; Normalized BIC: 14.040
Ljung-Box Q(18) – Statistics: 18.693; DF: 15; Sig.: .228
Number of Outliers: 0

Model Statistics

Model: Seasonal adjusted series for creditlending from SEASON, MOD_2, MUL EQU 4-Model_1
Number of Predictors: 0
Model Fit statistics – Stationary R-squared: .476; Normalized BIC: 13.872
Ljung-Box Q(18) – Statistics: 16.572; DF: 17; Sig.: .484
Number of Outliers: 0

ARIMA (3,2,0)

Original Data   Predicted Values
50878.00        50335.29843
52199.00        50252.00595
50261.00        50310.44277
49615.00        49629.75233
47995.00

Application of Regression Analysis

Chapter-3

Methodology

In the application of regression analysis, the data set often contains unusual observations which are either outliers (noise) or influential observations. These observations may have large residuals, affect the estimates of the regression coefficients and the whole regression analysis, and become a source of misleading results and interpretations. Therefore it is very important to examine these suspect observations carefully and decide whether they should be included in or removed from the analysis.

In regression analysis, a basic step is to determine whether one or more observations can influence the results and interpretations of the analysis. If the regression analysis has one independent variable, it is easy to detect unusual observations in the dependent and independent variables using scatter plots, box plots, residual plots and so on. But graphical identification of outliers and/or influential observations is a subjective approach, and it is well known that in the presence of multiple outliers there can be masking or swamping effects. Masking (false negative) occurs when an outlying subset remains undetected due to the presence of another, usually adjacent, subset. Swamping (false positive) occurs when a usual observation is incorrectly identified as an outlier in the presence of another, usually remote, subset of observations.

In the present study, some well-known diagnostics are compared for identifying multiple influential observations. For this purpose, robust regression methods are first used to identify influential observations in Poisson regression. Then, to confirm that the observations identified by the robust regression methods are genuine influential observations, some diagnostic measures based on the single-case deletion approach are considered: the Pearson chi-square, the deviance residual, the hat matrix, the likelihood residual test, Cook’s distance, the difference of fits and the squared difference in beta. However, in the presence of masking and swamping, diagnostics based on single-case deletion fail to identify outliers and influential observations. Therefore, to remove or minimise the masking and swamping phenomena, some group deletion approaches are also taken: the generalized standardized Pearson residual, the generalized difference of fits and the generalized squared difference in beta.

3.2 Diagnostic measures based on single case deletion

This section presents the details of the single-case deletion measures which are used to identify multiple influential observations in the Poisson regression model. These measures are the change in Pearson chi-square, the change in deviance, the hat matrix, the likelihood residual test, Cook’s distance, the difference of fits (DFFITS) and the squared difference in beta (SDFBETA).

Pearson chi-square

To show the amount of change in the Poisson regression estimates that would occur if the kth observation were deleted, the Pearson χ² statistic is proposed for detecting outliers. Such diagnostic statistics examine the effect of deleting a single case on the overall summary measures of fit.

Let χ² denote the Pearson statistic and χ²(-k) the statistic after case k is deleted. Using the one-step linear approximation given by Pregibon (1981), the decrease in the value of the statistic due to deletion of the kth case is

Δχ²(-k) = χ² − χ²(-k),   k = 1, 2, 3, ..., n        (3.1)

χ² is defined as

χ² = Σi (yi − μ̂i)² / μ̂i        (3.2)

and for the kth deleted case

χ²(-k) = Σi≠k (yi − μ̂i(-k))² / μ̂i(-k)        (3.3)

Deviance residual

The one-step linear approximation for the change in deviance when the kth case is deleted is

ΔD(-k) = D − D(-k)        (3.4)

Because the deviance is used to measure the goodness of fit of a model, a substantial decrease in the deviance after the deletion of the kth observation indicates that this observation is a misfit. The deviance of the Poisson regression including the kth observation is

D = 2 Σi [yi ln(yi/μ̂i) − (yi − μ̂i)]        (3.5)

where μ̂i = exp(xiᵀβ̂), and with the kth observation deleted

D(-k) = 2 Σi≠k [yi ln(yi/μ̂i(-k)) − (yi − μ̂i(-k))]        (3.6)

A larger value of ΔD(-k) indicates that the kth value is an outlier.

Hat matrix

The hat matrix is used in residual diagnostics to measure the influence of each observation. The hat values, hii, are the diagonal entries of the hat matrix, which is calculated as

H = V^(1/2) X (XᵀVX)⁻¹ Xᵀ V^(1/2)        (3.7)

where V = diag(var(yi)) and, for the Poisson model, var(yi) = E(yi) = μi.

In the Poisson regression model E(yi) = μi = g⁻¹(xiᵀβ), where g is the link function. With the log link, μi = exp(xiᵀβ), so that

V = diag(μ̂1, ..., μ̂n)        (3.8)

(XᵀVX)⁻¹ is the estimated covariance matrix of β̂, and hii is the ith diagonal element of the hat matrix H. The diagonal elements of the hat matrix, i.e. the leverage values, satisfy

0 ≤ hii ≤ 1   and   Σi hii = k

where k is the number of parameters of the regression model including the intercept term. An observation is said to be influential if hii ≥ ck/n, where c is a suitable constant such as 2 or 3. Using the twice-the-mean rule of thumb suggested by Hoaglin and Welsch (1978), an observation with hii ≥ 2k/n is considered influential.

Likelihood residual test

For the detection of outliers, Williams (1987) introduced the likelihood residual. The squared likelihood residual is a weighted average of the squared standardized deviance and standardized Pearson residuals:

r²Lk = hk r²SPk + (1 − hk) r²SDk        (3.9)

It is approximately equal to the likelihood ratio test statistic for testing whether an observation is an outlier, and is also called the approximate studentized residual. The standardized Pearson residual is defined as

rSPk = rPk / √(1 − hk)        (3.10)

and the standardized deviance residual as

rSDk = rDk / √(1 − hk)        (3.11)

where rDk = sign(yk − μ̂k) √dk is called the deviance residual; it is another popular residual because the sum of squares of these residuals is the deviance statistic.

Because the average value k/n of the hi is small, rLk is much closer to rSDk than to rSPk, and is therefore also approximately normally distributed. An observation is considered to be influential if |rLk| exceeds the corresponding t critical value.

Difference of fits test (DFFITS)

The difference of fits test for Poisson regression is defined as

(DFFITS)i = (μ̂i − μ̂i(-i)) / (σ̂(-i) √hi),   i = 1, 2, 3, ..., n        (3.12)

where μ̂i(-i) and σ̂(-i) are, respectively, the ith fitted response and an estimated standard error when the ith observation is deleted. DFFITS can be expressed in terms of the standardized Pearson residuals and the leverage values as

(DFFITS)i = √(hi / (1 − hi)) · rSPi        (3.13)

An observation is said to be influential if the absolute value of DFFITS exceeds 2√(k/n).

Cook’s distance

Cook (1977) suggested a statistic which measures the change in the parameter estimates caused by deleting each observation, defined as

CDi = (β̂ − β̂(-i))ᵀ (XᵀVX) (β̂ − β̂(-i)) / k        (3.14)

where β̂(-i) is the estimated parameter vector without the ith observation. There is also a relationship between the difference of fits test and Cook’s distance, which can be expressed as

CDi = (DFFITS)i² / k        (3.15)

Using the one-step approximation suggested by Pregibon, Cook’s distance can be expressed as

CDi ≈ hi r²SPi / (k (1 − hi))        (3.16)

An observation with a CD value greater than 1 is treated as influential.
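A minimal sketch of these single-case diagnostics, computed directly from the formulas above for a simulated Poisson data set (statsmodels is assumed for the model fit; the cut-offs follow the rules quoted above):

import numpy as np
import statsmodels.api as sm

# Simulated count data; X includes an intercept column.
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(40, 2)))
y = rng.poisson(lam=np.exp(0.3 + 0.5 * X[:, 1] - 0.4 * X[:, 2]))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = fit.mu                                   # fitted Poisson means
n, k = X.shape

# Hat matrix H = V^(1/2) X (X'VX)^(-1) X' V^(1/2), equation 3.7, with V = diag(mu).
V_half = np.diag(np.sqrt(mu))
H = V_half @ X @ np.linalg.inv(X.T @ np.diag(mu) @ X) @ X.T @ V_half
h = np.diag(H)

r_p = (y - mu) / np.sqrt(mu)                  # Pearson residuals
r_sp = r_p / np.sqrt(1 - h)                   # standardized Pearson residuals (3.10)
dffits = np.sqrt(h / (1 - h)) * r_sp          # DFFITS via r_sp and leverage (3.13)
cook = h * r_sp**2 / (k * (1 - h))            # one-step Cook's distance (3.16)

print("high leverage:", np.where(h > 2 * k / n)[0])
print("large Cook's distance:", np.where(cook > 1)[0])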

Squared Difference in Beta (SDFBETA)

This measure originates from the idea of Cook’s distance (1977) based on single-case deletion diagnostics and modifies DFBETA (Belsley et al., 1980). It is defined as

(SDFBETA)i = 3.17

After some calculation, SDFBETA can be related to DFFITS as:

(SDFBETA)i = 3.18

The ith observation is influential if (SDFBETA)i

3.3 Diagnostic measures based on the group deletion approach

This section includes the details of the group deletion measures which are used to identify multiple influential observations in the Poisson regression model. Multiple influential observations can misfit the data and can create masking or swamping effects. Diagnostics based on group deletion are effective for the identification of multiple influential observations and are free from masking and swamping effects in the data. These measures are the generalized standardized Pearson residual (GSPR), the generalized difference of fits (GDFFITS) and the generalized squared difference in beta (GSDFBETA).

3.3.1 Generalized standardized Pearson residual (GSPR)

Imon and Hadi (2008) introduced GSPR to identify multiple outliers and it is defined as:

i 3.19

= i 3.20

where the quantities in the denominator are, respectively, the diagonal elements of V and of the hat matrix H computed from the remaining group. Observations with |GSPR| > 3 are considered outliers.

3.3.2 Generalized difference of fits (GDFFITS)

The GDFFITS statistic can be expressed in terms of GSPR (generalized standardized Pearson residual) and GWs (generalized weights).

The GWs are denoted by h_{ii}^{(R)} and defined as:

h_{ii}^{(R)} = \hat{v}_i x_i^T (X_R^T \hat{V}_R X_R)^{-1} x_i   for i \in R     3.21

             = \hat{v}_i x_i^T (X_R^T \hat{V}_R X_R)^{-1} x_i   for i \in D     3.22

A value of h_{ii}^{(R)} larger than Median(h^{(R)}) + 3 MAD(h^{(R)}) is considered to be influential, i.e.

h_{ii}^{(R)} > Median(h^{(R)}) + 3 MAD(h^{(R)})

Finally, GDFFITS is defined as

(GDFFITS)_i = \sqrt{h_{ii}^{(R)} / (1 - h_{ii}^{(R)})} \, (GSPR)_i   for i \in R
             = \sqrt{h_{ii}^{(R)} / (1 + h_{ii}^{(R)})} \, (GSPR)_i   for i \in D     3.23

We consider an observation as influential if

|(GDFFITS)_i| \ge 3

3.3.3 Generalized squared difference in Beta (GSDFBETA)

In order to identify multiple outliers in a dataset and to overcome the masking and swamping effects, GSDFBETA is defined as:

(GSDFBETA)_i = (\hat{\beta}_R - \hat{\beta}_{R(i)})^T (X_R^T \hat{V}_R X_R) (\hat{\beta}_R - \hat{\beta}_{R(i)}) / (1 - h_{ii}^{(R)})   for i \in R     3.24

             = (\hat{\beta}_R - \hat{\beta}_{R(i)})^T (X_R^T \hat{V}_R X_R) (\hat{\beta}_R - \hat{\beta}_{R(i)}) / (1 + h_{ii}^{(R)})   for i \in D     3.25

Now GSDFBETA can be re-expressed in terms of GSPR and GWs:

(GSDFBETA)_i = h_{ii}^{(R)} (GSPR)_i^2 / (1 - h_{ii}^{(R)})^2   for i \in R     3.26

             = h_{ii}^{(R)} (GSPR)_i^2 / (1 + h_{ii}^{(R)})^2   for i \in D     3.27

An observation is declared influential if (GSDFBETA)_i exceeds the suggested cut-off value.
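To show how the three group-deletion measures fit together in practice, the following Python sketch fits the Poisson model on the remaining group R only and evaluates GSPR, GDFFITS and GSDFBETA for every observation; the formulas follow the reconstructions given above and the suspected group D is supplied by the analyst, so the snippet is illustrative rather than the thesis's own code:

import numpy as np
import statsmodels.api as sm

def group_deletion_diagnostics(X, y, deleted):
    # X, y: full design matrix (with intercept) and counts; 'deleted' is a
    # boolean mask marking the suspected group D. The model is fitted on R only.
    deleted = np.asarray(deleted, dtype=bool)
    R = ~deleted
    fit_R = sm.GLM(y[R], X[R], family=sm.families.Poisson()).fit()
    mu = np.exp(X @ fit_R.params)                    # R-based fitted means, all cases
    XtVX_inv = np.linalg.inv(X[R].T @ (X[R] * mu[R, None]))
    h = mu * np.einsum('ij,jk,ik->i', X, XtVX_inv, X)   # generalized weights (GWs)

    denom = np.where(R, 1.0 - h, 1.0 + h)            # (1 - h) on R, (1 + h) on D
    gspr = (y - mu) / np.sqrt(mu * denom)            # GSPR, eqs. 3.19-3.20
    gdffits = np.sqrt(h / denom) * gspr              # GDFFITS, eq. 3.23
    gsdfbeta = gdffits**2 / denom                    # GSDFBETA, eqs. 3.26-3.27
    return gspr, gdffits, gsdfbeta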

Analysis of variance models

Abstract: Analysis of variance (ANOVA) models have become a widely used tool and play a fundamental role in much of the application of statistics today. Two-way ANOVA models involving random effects have found widespread application in experimental design in varied fields such as biology, econometrics, quality control, and engineering. The article is a comprehensive presentation of methods and techniques for point estimation, interval estimation, estimation of variance components, and hypothesis tests for the two-way analysis of variance with random effects.

Key words: Analysis of variance; two-way classification; variance components; random effects model

1. Introduction

The random effects model is not fraught with questions about assumptions as is the mixed effects model. Concerns have been expressed over the reasonableness of assuming that the interaction term (ab)_{ij} is tossed into the model independently of a_i and b_j. However, uncorrelatedness, which with normality becomes independence, does seem to emerge from finite sampling models that define the interaction to be a function of the main A and B effects. The problem usually of interest is to estimate the components of variance.

The model (1) is referred to as a cross-classification model. A slightly different and equally important model is the nested model. For this latter model see (5) and the related discussion.

2. Estimation of variance components

The standard method of moments estimators for a balanced design (i.e., n_{ij} = n) are based on the expected mean squares for the sums of squares. The credentials of the estimators (4) are that they are uniform minimum variance unbiased estimators (UMVUE) under normal theory, and uniform minimum variance quadratic unbiased estimators (UMVQUE) in general. They do, however, suffer the embarrassment of sometimes being negative, except for \hat{\sigma}_e^2, which is always positive. The actual maximum likelihood estimators would occur on a boundary rather than being negative. The best course is always to adjust an estimate to zero rather than report a negative value. It should certainly be possible to construct improved estimators along the lines of the Klotz-Milton-Zacks estimators used in the one-way classification. However, the details of these estimators have not been worked out for the two-way classification. Estimating variance components from unbalanced data is not as straightforward as from balanced data. This is so for two reasons. First, several methods of estimation are available (most of which reduce to the analysis of variance method for balanced data), but no one of them has yet been clearly established as superior to the others. Second, all the methods involve relatively cumbersome algebra; discussion of unbalanced data can therefore easily deteriorate into a welter of symbols, a situation we do our best (perhaps not successfully) to minimize here.¹
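For the balanced crossed design, the moment estimators described above can be written down directly from the four mean squares; the short Python sketch below (an illustration under the standard balanced random-effects assumptions, not the article's own code) computes them and sets any negative estimate to zero, as recommended:

import numpy as np

def anova_components_crossed(y):
    # y has shape (I, J, n): I levels of A, J levels of B, n replicates per cell.
    I, J, n = y.shape
    gm = y.mean()
    ybar_i = y.mean(axis=(1, 2))
    ybar_j = y.mean(axis=(0, 2))
    ybar_ij = y.mean(axis=2)

    ms_a = J * n * np.sum((ybar_i - gm) ** 2) / (I - 1)
    ms_b = I * n * np.sum((ybar_j - gm) ** 2) / (J - 1)
    ms_ab = n * np.sum((ybar_ij - ybar_i[:, None] - ybar_j[None, :] + gm) ** 2) / ((I - 1) * (J - 1))
    ms_e = np.sum((y - ybar_ij[:, :, None]) ** 2) / (I * J * (n - 1))

    s2_e = ms_e
    s2_ab = max((ms_ab - ms_e) / n, 0.0)          # negative estimates adjusted to zero
    s2_a = max((ms_a - ms_ab) / (J * n), 0.0)
    s2_b = max((ms_b - ms_ab) / (I * n), 0.0)
    return s2_a, s2_b, s2_ab, s2_e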

On the other hand, extremely unbalanced designs are a horror story. A number of different methods have been proposed for handling them, but all involve extensive algebraic manipulations. The technical detail required to carry out these analyses exceeds the limitations set for this article. On occasion factors A and B are such that it makes no sense to postulate the existence of interactions, so the terms (ab)_{ij} should be dropped from (1). In this case \sigma_{ab}^2 disappears from (3) and the estimators for \sigma_a^2 and \sigma_b^2 are modified accordingly.

¹ Djordjevic V., Lepojevic V., Henderson's approach to Variance Components estimation for unbalanced data, Facta Universitatis, Vol. 2, No. 1, 2004, p. 59.

Another variation on the model (1) gives rise to the nested model. In general, the nested model for components of variance problems occurs more frequently in practice than does the cross-classification model. In the nested model the main effects for one factor, say B, are missing from (1). The reason is that the entities creating the different levels of factor B are not the same for different levels of factor A. For example, the levels (subscript i) of factor A might represent different litters, and the levels (subscript j) of factor B might be different animals, which are a different set for each litter. The additional subscript k might denote repeated measurements on each animal.

To be specific, the formal model for the nested design is:

y_{ijk} = \mu + a_i + b_{ij} + e_{ijk},   i = 1, ..., I,  j = 1, ..., J,  k = 1, ..., n,     (5)

with a_i \sim N(0, \sigma_a^2), b_{ij} \sim N(0, \sigma_b^2), e_{ijk} \sim N(0, \sigma_e^2) and independence between the different lettered variables. It is customary with this model to use the symbol b rather than ab because the interpretation of this term has changed from synergism or interaction to one of a main effect nested inside another main effect. For a balanced design the method of moments estimators are based on the sums of squares:

SS(A) = Jn \sum_i (\bar{y}_{i..} - \bar{y}_{...})^2,   SS(B) = n \sum_i \sum_j (\bar{y}_{ij.} - \bar{y}_{i..})^2,   SS(E) = \sum_i \sum_j \sum_k (y_{ijk} - \bar{y}_{ij.})^2,     (7)

which have degrees of freedom I-1, I(J-1), and IJ(n-1), respectively. The mean squares corresponding to (7) have the expectations:

E[MS(A)] = \sigma_e^2 + n\sigma_b^2 + Jn\sigma_a^2,   E[MS(B)] = \sigma_e^2 + n\sigma_b^2,   E[MS(E)] = \sigma_e^2.     (8)

The increasing tier phenomenon exhibited in (8) holds for nested designs with more than two effects. The only complication arises when one or more of the estimates are negative. This is an indication that the corresponding variance components are zero or negligible. One might want to reset any negative estimates to zero, combine the adjacent sums of squares, and subtract the combined mean squares from the mean squares higher in the tier.
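A parallel sketch for the balanced nested design computes the sums of squares in (7), the mean squares, and the moment estimators implied by (8); again this is an illustration, not the article's own code:

import numpy as np

def anova_components_nested(y):
    # y has shape (I, J, n): J levels of B nested within each level of A,
    # with n repeated measurements per B level.
    I, J, n = y.shape
    gm = y.mean()
    ybar_i = y.mean(axis=(1, 2))
    ybar_ij = y.mean(axis=2)

    ms_a = J * n * np.sum((ybar_i - gm) ** 2) / (I - 1)                   # from SS(A)
    ms_b = n * np.sum((ybar_ij - ybar_i[:, None]) ** 2) / (I * (J - 1))   # from SS(B)
    ms_e = np.sum((y - ybar_ij[:, :, None]) ** 2) / (I * J * (n - 1))     # from SS(E)

    s2_e = ms_e
    s2_b = max((ms_b - ms_e) / n, 0.0)          # reset negative estimates to zero
    s2_a = max((ms_a - ms_b) / (J * n), 0.0)
    return s2_a, s2_b, s2_e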

Extension of these ideas to the unbalanced design does not represent as formidable a task for the nested design as it does for the crossed design. The sums of squares (7), appropriately modified for unbalanced designs, form the basis for the analysis. It is even possible to allow for varying numbers Ji of factor B for different levels of factor A.

3. Tests for variance components

The appropriate test statistics for the various hypotheses of interest can be determined by examining the expected mean squares in the analysis of variance table. However, we encounter the difficulty that, even under the normality assumption, exact F tests may not be available for some of the hypotheses. An analogous F statistic provides a test for H_0: \sigma_b^2 = 0. Under the alternative (non-null) hypotheses, these ratios are distributed as the appropriate ratios of multiplicative constants from (10) times central F random variables. Thus power calculations are made from central F tables as for fixed effects models. The F tests of H_0: \sigma_{ab}^2 = 0 and H_0: \sigma_a^2 = 0 mentioned in the preceding paragraph are uniformly most powerful similar tests.

However, they are not likelihood ratio tests, which are more complicated because of boundaries to the parameter space. Although their general use is not recommended because of their extreme sensitivity to non-normality, confidence intervals can be constructed based on the distribution theory in (10). The complicated method of Bulmer (1957), which is described in Scheffe [11, pp. 27-28], is available. However, the approximate method of Satterthwaite [10, pp. 110-114] may produce just as good results.

The distribution theory for the sums of squares (7) used in conjunction with nested designs is straightforward and simple. To test the hypothesis H_0: \sigma_b^2 = 0 one uses the F ratio MS(B)/MS(E), and to test H_0: \sigma_a^2 = 0 the appropriate ratio is MS(A)/MS(B). In all nested designs the higher line in the tier is always tested against the next lower line. If the conclusion is reached that \sigma_b^2 = 0, then the test of H_0: \sigma_a^2 = 0 could be improved by combining SS(B) and SS(E) to form a denominator sum of squares with I(J-1) + IJ(n-1) degrees of freedom. Under alternative hypotheses these F ratios are distributed as central F ratios multiplied by the appropriate ratio of variances. This can be exploited to produce confidence intervals on some variance ratios. However, one still needs to rely on the approximate Satterthwaite [10, pp. 110-114] approach for constructing intervals on individual components.

4. Estimations of individual effects and overall mean
For the two-way crossed classification with random effects, interest sometimes centers on estimating the individual cell means \mu_{ij}.

The classical approach would be to use the estimates \hat{\mu}_{ij} = \bar{y}_{ij.}. The idea would be to shrink the individual estimates toward the common mean, where the shrinking factor S depends on the sums of squares SS(E), SS(AB), SS(B), and SS(A). Unfortunately, the specific details of the construction of an appropriate S have not been worked out for the two-way classification as they have been for the one-way classification. Alternatively, attention might center on estimating a_1, …, a_I or, equivalently, on the levels of factor B. Again, specific estimators have not been proposed to date for handling this situation.

In the nested design one sometimes wants an estimate and confidence interval for \mu. One typically uses \hat{\mu} = \bar{y}_{...}. In the balanced case this estimator has variance (Jn\sigma_a^2 + n\sigma_b^2 + \sigma_e^2)/(IJn), which can be estimated by MS(A)/(IJn). In the unbalanced case an estimate of the variability of \bar{y}_{...} can be obtained by substituting the estimates \hat{\sigma}_a^2, \hat{\sigma}_b^2 and \hat{\sigma}_e^2 into the expression for the variance of \bar{y}_{...}. Alternative estimators using different weights may be worth considering in the unbalanced case.

5. Conclusion

Analysis of variance (ANOVA) models have become widely used tools and play a fundamental role in much of the application of statistics today. In particular, ANOVA models involving random effects have found widespread application in experimental design in a variety of fields requiring measurements of variance, including agriculture, biology, animal breeding, applied genetics, econometrics, quality control, medicine, engineering, and the social sciences. With a two-way classification there are two distinct factors affecting the observed responses. Each factor is investigated at a variety of different levels in an experiment, and the combinations of the two factors at different levels form a cross-classification. In a two-way classification each factor can be either fixed or random. If both factors are random, the model is called a random effects model.

Various estimators of variance components in the two-way crossed classification random effects model with one observation per cell are compared under the standard assumptions of normality and independence of the random effects. Mean squared error is used as the measure of performance. The estimators being compared are: the minimum variance unbiased, the restricted maximum likelihood, and several modifications of the unbiased and the restricted maximum likelihood estimators.

Rainfall Pattern in Enugu State, Nigeria

CHAPTER ONE

1.0 INTRODUCTION

Enugu State is located in the southeastern part of Nigeria and was created in 1991 from the old Anambra State. The principal cities in the state are Enugu, Agbani, Awgu, Udi, Oji River and Nsukka. The state shares borders with Abia and Imo States to the south, Ebonyi State to the east, Benue State to the northeast, Kogi State to the northwest and Anambra State to the west.

Enugu, the capital city of Enugu State, is approximately a two-and-a-half-hour drive from Port Harcourt, where coal shipments exited Nigeria. The word “Enugu” (from Enu Ugwu) means “the top of the hill”. The first European settlers arrived in the area in 1909, led by a British mining engineer named Albert Kitson. In his quest for silver, he discovered coal in the Udi Ridge. The colonial Governor of Nigeria, Frederick Lugard, took a keen interest in the discovery, and by 1914 the first shipment of coal was made to Britain. As mining activities increased in the area, a permanent cosmopolitan settlement emerged, supported by a railway system. Enugu acquired township status in 1917 and became strategic to British interests.

Foreign businesses began to move into Enugu, the most notable of which were John Holt, Kingsway Stores, British Bank of West Africa and United Africa Company. From Enugu the British administration was able to spread its influence over the southern province of Nigeria. The colonial past of Enugu is today evidenced by the Georgian building types and meandering narrow roads within the residential area originally reserved for the whites, an area which is today called the Government Reserved Area (GRA).

The state government and the local governments are the two levels of government in Enugu State, which has 17 Local Government Areas. Economically, the state is predominantly rural and agrarian, with a substantial proportion of its working population engaged in farming, although trading (18.8%) and services (12.9%) are also important. In the urban areas trading is the dominant occupation, followed by services. A small proportion of the population is also engaged in manufacturing activities, the most pronounced of which are located in Enugu, Oji, Ohebedim and Nsukka. The state boasts a number of markets, especially at each of the divisional headquarters, the most prominent of which is the Ogbete Main Market in the state capital. Electricity supply is relatively stable in Enugu and its environs; the Oji River power station (which used to supply electricity to all of Eastern Nigeria) is located in Enugu State. The state had a population of 3,267,837 people at the census held in 2006 (estimated at over 3.8 million in 2012), and it is home to the Igbo people of southeastern Nigeria.

The average temperature in the city ranges from mild (around 60 degrees Fahrenheit) in the cooler months to hot (upper 80s degrees Fahrenheit) in the warmer months, conditions that are very good for outdoor activities. Enugu has good soil and climatic conditions all year round; it sits at about 223 meters (732 ft) above sea level, and the soil is well drained during the rainy seasons.

The mean temperature in Enugu State in the hottest month of February is about 87.16 °F (30.64 °C), while the lowest temperatures occur in the month of November, reaching 60.54 °F (15.86 °C). The lowest rainfall, about 0.16 cubic centimeters (0.0098 cu in), normally occurs in February, while the highest, about 35.7 cubic centimeters (2.18 cu in), occurs in July.

The differences in altitude and relief create a large variation in climate in the various regions of the country. In places that are characterized as semi-arid zones, the climate shows wide fluctuations from year to year and even within seasons of the year. Semi-arid regions receive very small, irregular, and unreliable rainfall (Workneh, 1987).

The annual cycle of the climatology of rainfall over tropical Africa, and in particular over Nigeria, is strongly determined by the position of the Inter-Tropical Convergence Zone (ITCZ) (Griffiths, 1971). Variations in rainfall pattern throughout the country are the result of differences in elevation and seasonal changes in the atmospheric pressure systems that control the prevailing winds. The climate of Nigeria is characterized by high rainfall variation (Yilma et al., 1994). In Nigeria, several regions receive rainfall throughout the year, but in some regions rainfall is seasonal and low, making irrigation necessary (Alemeraw and Eshetu, 2009). Rainfall is the most critical and key variable in both the atmospheric and hydrological cycles. Rainfall patterns usually have spatial and temporal variability. This variability affects agricultural production, water supply, transportation, the environment and urban planning, and thus the entire economy of a country and the existence of its people.

Rainfall variability is assumed to be the main cause of frequently occurring climate extreme events such as drought and flood. These natural phenomena badly affect agricultural production and hence the economy of the nation. In regions where the year-to-year variability is high, people often suffer great calamities due to floods or droughts. Even though damage due to extremes of rainfall cannot be avoided completely, a forewarning could certainly be useful (Nicholls, 1980). Nigeria is one of the countries whose economy is highly dependent on rain-fed agriculture and which also faces recurring cycles of flood and drought. Current climate variability is already imposing a significant challenge on Nigeria in general and Enugu in particular, by affecting food security, water and energy supply, poverty reduction and sustainable development efforts, as well as by causing natural resource degradation and natural disasters. Recurrent floods in the past caused substantial loss of human life and property in many parts of the country.

Methods of prediction of extreme rainfall events have often been based on studies of the physical effects of rainfall or on statistical studies of rainfall time series. Rainfall forecasting is of relevance to the agricultural sector, since agriculture contributes significantly to the economy of countries like Nigeria. In order to model and predict hydrologic events, one can use stochastic methods such as time series methods. Numerous attempts have been made to predict the behavioral pattern of rainfall using various techniques (Yevjevich, 1972; Dulluer and Kavas, 1978; Tsakiris, 1998). Awareness of the characteristics of the rainfall over an area, such as the source, quantity, variability, distribution and frequency of rainfall, is essential for understanding its utilization and the associated problems. Assessing rainfall variability is practically useful for decision making, risk management and optimum usage of the water resources of a country. Thus, it is important to obtain accurate rainfall forecasts at various geographic levels of Nigeria and to work towards identifying periodicities in order to help policy makers improve their decisions by taking into consideration the available and future water resources. In this study, the univariate Box-Jenkins methodology for building ARIMA models is used to assess the rainfall pattern in Enugu State based on data from the Nigerian Meteorological Agency.
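As a minimal illustration of the Box-Jenkins workflow described here (the file name, column name and the (1,0,1)x(1,1,1,12) order are assumptions for the sketch; in practice the order is chosen from ACF/PACF plots and information criteria), a seasonal ARIMA model for the monthly series could be fitted and used for forecasting as follows:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical file with a monthly date index and a rainfall column in millimeters.
rain = pd.read_csv("enugu_monthly_rainfall.csv",
                   index_col="month", parse_dates=True)["rainfall_mm"]

model = ARIMA(rain, order=(1, 0, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit()
print(fit.summary())            # coefficient estimates, AIC/BIC, residual diagnostics
print(fit.forecast(steps=12))   # rainfall forecast for the next twelve months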

1.1 Weather and Climate

Weather and climate over the earth are not constant with time: they change on different time scales ranging from the geological through the annual to the diurnal; the difference between weather and climate is a matter of time scale. Weather is the condition of the atmosphere over a short period of time, while climate is how the atmosphere behaves over relatively long periods, including seasonal and intra-seasonal time scales. Such variability is an inherent characteristic of the climate. The study of climatic fluctuations involves the description and investigation of the causes and effects of these fluctuations in the past and their statistical interpretation. Much of the work done concerns the variability of two important meteorological parameters: rainfall and temperature. Rainfall is a term used to refer to water falling in drops after condensation of atmospheric vapor. Rainfall is also the resultant product of a series of complex interactions taking place within the earth-atmosphere system. Rainfall is only water that falls from the sky, whereas precipitation is anything wet that falls from the sky, including snow, frozen rain, etc. Water in all its forms and in all its various activities plays a crucial role in sustaining both the climate and life. It is also a major factor in the planning and management of water resource projects and agricultural production. Even though Nigeria enjoys a fairly good amount of rainfall, the wide variability in its distribution with respect to space and time is responsible for the two extreme events (floods and droughts) (Yilma et al., 1994).

1.2 Rainfall Characteristics

Rainfall varies with latitude, elevation, topography, seasons, distance from the sea, and coastal sea-surface temperature. Nigeria enjoys the humid tropical climate type. Because of its location just north of the equator, Nigeria enjoys a truly tropical climate characterized by the hot and wet conditions associated with the movement of the Inter-Tropical Convergence Zone (ITCZ) north and south of the equator.

While there is a general decrease in rainfall in Nigeria, the coastal area is experiencing a slight increase. Apart from the general southward shift in rainfall patterns, the duration has also reduced from 50-360 (1993-2003) to 30-280 (2003-2013) rainy days per year. This has created ecological destabilization and altered the pattern of the vegetation belt, especially in the northern part of the country. The rainfall pattern has also enhanced wind erosion and desertification, soil erosion and coastal flooding in the north, east and coastal areas of Nigeria respectively.

The country experiences consistently high temperatures all year round. Since temperature varies only slightly, rainfall distribution, over space and time, becomes the single most important factor in differentiating the seasons; climatic distributions, however, depend on the two air masses that prevail over the country. Their influences are directly linked to the movement of the ITCZ north and south of the equator. The two air masses are the Tropical Maritime (Tm) and the Tropical Continental (Tc). The former is associated with the moisture-laden south-west winds (south-westerlies) which blow from the Atlantic Ocean, while the latter is associated with the dry and dusty north-east winds (easterlies) which blow from the Sahara Desert.

Conversely, with the movement of the ITCZ into the Northern Hemisphere, the rain-bearing south-westerlies prevail far inland and bring rainfall during the wet season. The implication is that there is a prolonged rainy season in the far south, while the far north undergoes long dry periods annually. Nigeria, therefore, has two major seasons, the lengths of which vary from north to south. The mean annual rainfall along the coast in the south-east is 4,000 mm, while it is 500 mm in the north-east.

Nigeria can, thus be broadly divided into the following climatic regions:

the humid sub-equatorial, in the southern lowlands
the hot tropical continental, in the far north
the moderated sub-temperate in the high plateaus and mountains
the hot, wet tropical, in the hinterland (the middle-belt )

1.3 The main effects of Rainfall

Trends in rainfall extremes have enormous implications. Extreme rainfall events cause significant damage to agriculture, ecology, and infrastructure. They also cause disruption to human activities, injury, and loss of life. Socioeconomic activities, including agriculture, power generation, water supply and human health, are also very sensitive to climate variations. As a result, Nigeria's economy is heavily dependent on rainfall for generating employment, income, and foreign currency. Thus, rainfall is considered the most important climatic element influencing Nigerian agriculture. The severity and frequency of occurrence of extreme rainfall events (meteorological, hydrological, and agricultural) vary for different parts of the country.

Drought: Drought is an insidious hazard of nature. It is often referred to as a “creeping phenomenon” and its impacts vary from region to region. Drought can therefore be difficult for people to understand; it is equally difficult to define, because what may be considered a drought in, say, Bali (six days without rain) would certainly not be considered a drought in Libya (annual rainfall less than 180 mm). Some drought years have coincided with EN events, while others have followed them. According to DDAEPA (2011), the trend of decreasing annual rainfall and increased rainfall variability is contributing to drought conditions in Nigeria. The average annual rainfall patterns of Abuja for the periods 1999 to 2008 and 1984 to 1991 show two important trends. First, annual average rainfall has declined from the mean value by about 8.5% and 10% respectively. Secondly, the variability of rainfall shows an overall increasing trend, suggesting greater rainfall unreliability. These rainfall patterns have led to serious drought/flood episodes throughout the region.

Flood: Floods are among the most frequent and devastating natural disasters in both developed and developing countries (Osti et al., 2008). Between 2000 and 2008 East Africa experienced many episodes of flooding. Almost all of these flood episodes significantly affected large parts of Ethiopia. Ethiopia's topographic characteristics have made the country highly vulnerable to floods and the resulting destruction and damage to life, the economy, livelihoods, infrastructure, services and the health system (FDPPA, 2007). Flooding is common in Ethiopia during the rainy season between June and September, and the major types of flooding the country experiences are flash floods and river floods (FDPPA, 2007).

Like other regions of Nigeria, the issue of flooding continues to be of growing concern in Enugu, especially for people residing in lowlands, along or near flood courses, as well as in villages located at the foot of hills and mountains. Flood disasters are occurring more frequently and are having an ever more dramatic impact on Enugu in terms of the cost in lives, livelihoods and environmental resources. The topography of Enugu State mainly consists of mountains and hills with steep slopes, valleys, and river basins. The catchment characteristics, together with its large area coverage and the torrential rainfall during the short and long rainy seasons, have been the main factors contributing to previous flood events.

Soil Erosion: When soil moves from one location to another, it is referred to as soil erosion. The impact of rainfall striking the surface can cause soil erosion; erosion is a concern for farmers, as their valuable, nutrient-rich topsoil can be washed away by rainfall. It can also weaken structures such as bridges or wash out roads. Vegetation can decrease the amount of soil that is eroded during a rain. Erosion has always been going on and has produced river valleys and shaped hills and mountains. Such erosion is generally slow, but the rate at which soil is eroded can increase rapidly (i.e. to a rate faster than natural weathering of bedrock can produce new soil). This has resulted in a loss of productive soil from crop and grazing land, as well as layers of infertile soil being deposited on formerly fertile crop lands, the formation of gullies, silting of lakes and streams, and land slips.

1.4 Aim and Objectives of the study

The main aim of this study is to analyze the rainfall pattern in Enugu State using appropriate time series methods based on 15 years (January 1999 - December 2013) of data recorded by the Nigerian Meteorological Agency (Enugu State).

Specific Objectives

1. To fit appropriate time series model to the monthly rainfall data.

2. To forecast the rainfall pattern in the study area.

1.5 Data source

The monthly rainfall data in millimeters for the period January 1999 to December 2013, collected from the Nigerian Meteorological Agency (Enugu State), were used in the study. The site was chosen due to the availability of a relatively long series of meteorological data; the data are secondary data.

1.6 Significance of the Study

Knowledge of what happens to the water that reaches the earth's surface will assist the study of many surface and subsurface water problems, for efficient control and management of water resources. For a country like Nigeria, whose welfare depends very much on rain-fed agriculture, quantitative knowledge of the water requirements of a region, the availability of water for plant growth, supplemental irrigation, etc. on a monthly or seasonal basis is an essential requirement for agricultural development. In this regard, increased capacity to manage future climate change and weather extremes can also reduce the magnitude of economic, social and human damage and eventually lead to greater resilience. Assessing seasonal rainfall characteristics based on past records is essential to evaluate the risk of extreme rainfall and to contribute to the development of mitigation strategies. Therefore, reliable rainfall forecasting and assessment of rainfall behavior at station, regional and national levels is very important.