International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 03 | Mar -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1189
EFFICIENT FEATURE SELECTION FOR FAULT DIAGNOSIS OF AEROSPACE
SYSTEM USING SYNTAX AND SEMANTIC ALGORITHM
Meena E1, Revathi B2, Sajanvethakumar F3 (Assistant Professor of Computer Science and Engineering)
1,2,3 Department of Computer Science and Engineering, JEPPIAAR SRR Engineering College, Padur, Chennai 603103
----------------------------------------------------------------------------------------------------------------------------------------------------
ABSTRACT: Every year, aerospace systems accumulate a database of
fault verbatim records. A fault verbatim record describes a fault
in text; one is generated, for example, when an airplane does not
pass the signal code at the correct time at start-up. These records
are high-dimensional and unstructured, and they pose learning
difficulties. The learning difficulty arises because a person with
limited knowledge of English finds the records hard to understand.
The high dimensionality arises because a fault described in three
to four lines takes time to read and to identify. In the proposed
system we introduce bi-level feature extraction-based text mining.
Bi-level refers to the combination of a higher-order and a
lower-order level: fault features are derived at both the syntax
level and the semantic level. The syntax level is used to overcome
the learning difficulties, and the semantic level is used to
convert high-dimensional data to low-dimensional data. The approach
can be used to diagnose problems quickly and rectify them.
1. INTRODUCTION
Text mining is a knowledge-intensive task and is gaining more
and more attention in many industrial fields, for example the
aerospace, automotive, railway, power, medical, biomedicine,
manufacturing, and sales and marketing sectors. In the railway
field, advanced information technologies, such as sensor networks,
RFID techniques, wireless communication, and Internet cloud, are
used to monitor the health of the aerospace systems. In the event
of a malfunction, the diagnostic trouble symptoms are generated
and transmitted to the monitoring center database by wired or
wireless communications. After each diagnosis episode a repair
verbatim is recorded, which consists of a textual description of
the combination of fault symptoms (i.e., fault terms), e.g.,
“Speed Distance Unit (SDU) relevant faults,” a fault symptom,
e.g., “SDU,” failure modes (i.e., fault classes), and finally the
corrective actions, e.g., “replaced SDU,” taken to repair the
faults.
However, the automatic discovery of knowledge from repair
verbatims is a non-trivial exercise, primarily for the following
reasons:
1) High-dimensional data. In maintenance documents, there are
tens of thousands of distinct terms or tokens, or possibly many
more. After elimination of stop words and stemming, the set of
features is still too large for many learning algorithms.
2) Imbalanced fault class distribution. In maintenance documents,
the number of examples in one fault class (i.e., the majority
class) is significantly larger than that of the others (i.e., the
minority classes). Such imbalanced class distributions pose a
serious problem for most classifier learning algorithms, which
assume a relatively balanced distribution.
3) Unsupervised text mining models. They may not produce topics
that conform to the user’s existing knowledge. One key reason is
that the objective functions of topic models, e.g., Latent
Dirichlet Allocation (LDA), often do not correlate well with human
judgments.
This work proposes a bi-level feature extraction-based text
mining method for fault diagnosis to meet the aforementioned
challenges by automatically analyzing the repair verbatim. Our
main idea is to extract fault features at the syntax and semantic
levels separately and then fuse them to achieve the desired
results. Considering that the features extracted at each level
place a different emphasis on a particular aspect of the feature
space and have their own deficiencies, the proposed fusion of
features from the two levels may enhance the accuracy of fault
diagnosis for all fault classes, especially minority ones.
At the syntax level, we propose an improved χ2 statistic (ICHI)
to deal with feature selection on an imbalanced data set. First,
we overcome the negative effect of the imbalanced data set by
adjusting the feature weights of the minority and majority
classes. This makes minority classes relatively distant from the
majority ones. Second, we consider the Hellinger distance as a
decision criterion for feature
selection, which is shown to be imbalance-insensitive. The
proposed ICHI may be regarded as feature selection at the syntax
level because it mainly uses the document-word matrix. At the
semantic level, we borrow ideas from earlier work and propose an
LDA with prior knowledge (abbreviated PLDA) to perform the feature
extraction. By representing documents in topic space rather than
word space, we are able to provide additional features at the
semantic level to complement those extracted at the syntax level.
The integration of prior knowledge with basic LDA is motivated by
the fact that LDA, as an unsupervised model, cannot deal with such
issues as choosing topic counts and reducing the adverse effect of
common words, and so cannot produce topics that conform to a
user’s existing knowledge. Prior knowledge helps us guide topic
mining in basic LDA. Finally, we fuse the features derived from
the syntax level with those from the semantic level by serial
fusion to improve Support Vector Machine (SVM)-based fault
diagnosis for all fault classes, especially minority ones.
2. RELATED WORK
To address the challenges imposed by imbalanced class
distributions, many learning algorithms have been proposed. For
instance, sampling-based methods, e.g., over-sampling and
under-sampling schemes, are the simplest yet effective ones, in
which classes are replicated or cut down to achieve a more
balanced result. Another popular method is the cost-sensitive
learning scheme, which takes the cost matrix into consideration
during model building and generates the model that has the lowest
cost. Margineantu et al. examined various methods for
incorporating cost information into the C4.5 learning algorithm.
Joshi et al. proposed PNrule, a two-phase rule induction
algorithm, to handle the mining of minority classes. Tang et al.
incorporated different rebalance heuristics, including
cost-sensitive learning, over-sampling, and under-sampling, in SVM
modeling and introduced four SVM variations to tackle the
imbalance learning problem. A survey of this subject is available
elsewhere. Mladenic et al. discussed feature selection issues for
imbalanced class distributions; however, that work is restricted
to the Naive Bayesian classifier. Zheng et al. proposed a feature
selection method for imbalanced text documents that adjusts the
combination of positive and negative features in the data. Their
method keeps to the traditional goodness measures of features. Yin
et al. proposed to divide the majority class into relatively
smaller pseudo-subclasses of relatively uniform size to manage the
influence of imbalanced data sets.
In text mining-based feature extraction, statistical and
graphical modeling has received more and more attention and is
considered a popular and efficient tool for mining topics to
reduce dimensionality. For example, LDA was previously used to
construct features for classification, where it usually acts to
reduce the data dimension. However, basic LDA, as an unsupervised
model, cannot perform adequately in a topic mining process. To
solve this problem, Andrzejewski et al. incorporated domain
knowledge by using a Dirichlet Forest prior in LDA. Zhai et al.
proposed probabilistic constraints as a relaxation modification,
i.e., a soft constraint, to the Gibbs sampling equation.
Hospedales proposed a weakly supervised joint topic model that
learned a model for all classes by using a partly shared common
basis. Wang proposed a constrained topic model that adds
constraints to guide the topic mining process, which improved the
accuracy of the mined topics.
3. ICHI-BASED FEATURE SELECTION AT
SYNTAX LEVEL
The basic idea of the proposed ICHI is to make a
minority class far away from the majority one by adjusting
weights of fault terms as shown in Fig. 1. To facilitate
understanding, we first define some notations. Tm is the
set of fault terms of minority fault classes, TM the set of
fault terms of majority fault class and Tc, the intersection
of Tm and TM, the common feature set.
SYNTAX LEVEL ALGORITHM
Data: Dataset S, fault terms T, fault classes F
Result: Feature set F1
Begin
  W ← word segmentation
  M ← word-document matrix
  For wi ∈ W and fj ∈ F do
    R(i,j) ← correlation between fault term and class
  End
  R1 ← normalization of R
  F1(i) ← fault feature set
  For fi, fj ∈ F do
    F2(i,j) ← common fault feature set of fault classes by
              intersection of feature sets
  End
  F2 ← common fault feature set by union
  For fi ∈ F do
    F2(i) ← exclusive feature set by excluding F2
    W1(i) ← weight of F2(i) by inverse probability
  End
  For wk ∈ F2 do
    L(wk) ← Hellinger distance
    F1(i,j) ← common feature set selected by highest k features
              according to Hellinger distance L
  End
End
Fig. 1. Idea of the proposed ICHI
Let the symbol / denote set difference. Tm/Tc and TM/Tc are
related to the minority and majority classes only, respectively,
and are therefore called exclusive fault term sets.
3.1 χ2 Statistics and Hellinger Distance
The χ2 statistic can be used to estimate the lack of
independence between a term t and a class ci, and can be compared
to the χ2 distribution with one degree of freedom to judge
extremeness. It is defined as:
$$\chi^2(t, c_i) = \frac{N\,[P(t, c_i)\,P(\bar{t}, \bar{c}_i) - P(t, \bar{c}_i)\,P(\bar{t}, c_i)]^2}{P(t)\,P(\bar{t})\,P(c_i)\,P(\bar{c}_i)} \tag{1}$$
where N is the total number of documents. $(t, c_i)$ denotes the
presence of term t and its membership in class ci, $(t, \bar{c}_i)$
the presence of t but not its membership in ci, $(\bar{t}, c_i)$
the absence of t but its membership in ci, and
$(\bar{t}, \bar{c}_i)$ the absence of t and its non-membership in
ci. P(·,·) denotes the probability of the presence/absence of
term t and its membership/non-membership in class ci.
The Hellinger distance is a measure of distributional
divergence. Given two discrete probability distributions
P = {p1, p2, ..., pn} and Q = {q1, q2, ..., qn}, their Hellinger
distance is defined as:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} \tag{2}$$
By definition, the Hellinger distance is a metric satisfying the
triangle inequality. The factor √2 in the definition ensures that
H(P, Q) ≤ 1 for all probability distributions.
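As an illustration of Eqs. (1) and (2), the following is a minimal sketch in Python, assuming the probabilities in Eq. (1) are estimated from raw document counts. The function names and the count-based parameterization are assumptions made for this example and are not taken from the paper.

```python
import math


def chi_square(n_tc, n_t, n_c, n):
    """Chi-square statistic of Eq. (1), estimated from document counts.

    n_tc: documents that contain term t and belong to class c
    n_t:  documents that contain term t
    n_c:  documents that belong to class c
    n:    total number of documents
    """
    # Joint probabilities of presence/absence and membership/non-membership.
    p_t_c = n_tc / n
    p_t_notc = (n_t - n_tc) / n
    p_nott_c = (n_c - n_tc) / n
    p_nott_notc = (n - n_t - n_c + n_tc) / n
    p_t, p_c = n_t / n, n_c / n
    denom = p_t * (1 - p_t) * p_c * (1 - p_c)
    if denom == 0:
        return 0.0  # term or class is constant, so no dependence is measurable
    return n * (p_t_c * p_nott_notc - p_t_notc * p_nott_c) ** 2 / denom


def hellinger(p, q):
    """Hellinger distance of Eq. (2) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)
```

Under these assumptions, a term that is independent of the class gives a statistic of 0, and two distributions with disjoint support are at the maximum Hellinger distance of 1.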
3.2 ICHI Based Feature selection at Syntax Level
The main steps of ICHI-based feature selection are summarized
in Algorithm 1 (the syntax level algorithm). When a fault
maintenance document D and a fault term dictionary Ω are provided,
the word set W (i.e., the fault term set) is extracted by word
segmentation.
According to W and the fault classes C, a word-document matrix M
can be generated (lines 1-2). Then we compute the correlations R
between feature terms and fault classes by the χ2 statistic
(lines 3-4). In order to compare the correlations between
different fault terms and different classes, we normalize them as
follows (line 5):
$$\bar{R}(w_i, c_j) = \frac{R(i,j)}{\sum_{i=1}^{m} R(i,j)} \times \frac{R(i,j)}{\sum_{j=1}^{n} R(i,j)} = \frac{R(i,j)^2}{\sum_{i=1}^{m} R(i,j) \times \sum_{j=1}^{n} R(i,j)} \tag{3}$$
where m is the number of fault terms contained in W and n is the
number of fault classes in C. In Eq. (3), the correlation of
feature term wi and fault class cj depends on the correlations
between term wi and all other fault classes besides cj.
Therefore, it is represented precisely by the product of
R(i, j)/∑i=1:m R(i, j) and R(i, j)/∑j=1:n R(i, j). We then select
the highly correlated fault feature set for each fault class by
comparing the correlations with a given threshold (line 6). Next,
lines 7–9 obtain the common fault feature set by intersecting each
pair of fault term sets. At the same time, the exclusive feature
set of each fault class is obtained in line 12. Next, we adjust
their weights according to the probabilities of their
corresponding fault classes (line 13).
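The normalization in Eq. (3) can be sketched as follows. This is a minimal illustration, assuming R is stored as a list of rows with one row per fault term and one column per fault class; the function name is hypothetical.

```python
def normalize_correlations(R):
    """Normalize a term-by-class correlation matrix as in Eq. (3).

    R[i][j] is the non-negative correlation of term i with class j.
    Each entry is squared and divided by the product of its column sum
    (over all terms) and its row sum (over all classes).
    """
    n_terms, n_classes = len(R), len(R[0])
    col_sums = [sum(R[i][j] for i in range(n_terms)) for j in range(n_classes)]
    row_sums = [sum(row) for row in R]
    normalized = []
    for i in range(n_terms):
        row = []
        for j in range(n_classes):
            denom = col_sums[j] * row_sums[i]
            # An all-zero row or column carries no correlation information.
            row.append(R[i][j] ** 2 / denom if denom > 0 else 0.0)
        normalized.append(row)
    return normalized
```

For example, with `R = [[4.0, 0.0], [1.0, 1.0]]` the first term is tied almost exclusively to the first class and keeps a high normalized value (0.8), while the second term, which is spread over both classes, is scaled down (0.1 and 0.5).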
For the common fault term set, we need to evaluate the
distributional discrimination of each feature over the fault
classes by computing its Hellinger distance with respect to these
fault classes using Eq. (2) (line 16). Then we use it to reselect
the common features of each pair of fault classes (line 17).
Finally, we obtain the final common feature set (F’) of the data
set by taking the union of the common feature sets of all pairs
of fault classes (line 19). This completes the selection of fault
term features and yields the feature space Fa as
[(exclusive feature sets, weights), common feature set].
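The split into common and exclusive feature sets described above can be sketched with plain set operations. This is an illustration only: the function name, the dictionary layout, and the choice of 1 / P(class) as the "inverse probability" weight are assumptions, and the Hellinger-based reselection of common features is not shown.

```python
from itertools import combinations


def partition_features(class_features, class_prob):
    """Split per-class feature sets into common and exclusive parts.

    class_features: dict mapping a fault class to its set of selected terms
    class_prob:     dict mapping a fault class to its prior probability
    Returns (common, exclusive, weights).
    """
    # Common set: union of the pairwise intersections of the class sets.
    common = set()
    for a, b in combinations(sorted(class_features), 2):
        common |= class_features[a] & class_features[b]
    # Exclusive sets: what remains in each class after removing common terms.
    exclusive = {c: feats - common for c, feats in class_features.items()}
    # Assumed weighting: rarer classes receive proportionally larger weights.
    weights = {c: 1.0 / class_prob[c] for c in class_features}
    return common, exclusive, weights
```

With a majority class at probability 0.8 and a minority class at 0.2, the exclusive terms of the minority class receive a weight of 5.0 against 1.25 for the majority class, which is one way to push the minority class away from the majority one.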
4. PLDA BASED FEATURE SELECTION AT
SEMANTIC LEVEL
In this section, we first review LDA and then introduce the
extraction of relationship-based prior knowledge. Finally, we
present the proposed PLDA, which incorporates prior knowledge into
LDA to realize feature selection at the semantic level.
SEMANTIC ALGORITHM
Data: Dataset S, fault classes F, topic set K
Result: Correlation Γ(wi,zk)
Begin
  R1 ← normalization of R
  Ξ ← k clusters
  Θ ← degree of correlation
  For wi ∈ W and fj ∈ F do
    If R1(wi,fj) is in the highest or lowest two ranks in Ξ then
      R1(wi,fj) is assigned SR or WR
    Else
      R1(wi,fj) is assigned CR
    End
  End
  Each fault class fi ∈ F is preassigned two corresponding
    topics z2*i, z2*i+1
  Γ(wi,zk) ← initialize correlation between term and topic
    with zeros
  For wi ∈ W and zk ∈ Z do
    If zk ∈ fj then
      Γ(wi,zk) is assigned the value of R1(wi,fj)
    End
  End
End
4.1 LDA
Given D documents expressed over W distinct words and T
topics, LDA outputs the document-topic distribution and the
topic-word distribution, both of which can be obtained with Gibbs
sampling. Its key step is the topic update for each word in each
document according to
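$$P(z_i = j \mid z_{-i}, w, \alpha, \beta) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i}^{(d_i)} + T\alpha} \tag{4}$$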
where zi = j denotes that the ith word in a document is assigned
to topic j, and z−i denotes all topic assignments except that of
the ith word, i.e., the current one. w = {w1, w2, w3, ..., wn},
where each wi belongs to some document. α and β are
hyper-parameters for the document-topic and topic-word Dirichlet
distributions, respectively. $n_{-i,j}^{(w_i)}$ is the total
number of occurrences of word wi assigned to topic j, not
including the current one, and $n_{-i,j}^{(\cdot)}$ is the total
number of words assigned to topic j, not including the current
one. $n_{-i,j}^{(d_i)}$ is the number of words from document di
assigned to topic j, not including the current one, and
$n_{-i}^{(d_i)}$ is the total number of words in document di,
excluding the current one. After M iterations of Gibbs sampling
over all words in all documents, the distributions φ and θ are
finally estimated as follows:
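$$\phi_j^{(w_i)} = \frac{n_j^{(w_i)} + \beta}{\sum_{w=1}^{W} n_j^{(w)} + W\beta} \tag{5}$$
$$\theta_j^{(d_i)} = \frac{n_j^{(d_i)} + \alpha}{\sum_{k=1}^{T} n_k^{(d_i)} + T\alpha} \tag{6}$$
where the counts n are taken over all words after the final
iteration.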
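The update in Eq. (4) and the estimates in Eqs. (5) and (6) correspond to the standard collapsed Gibbs sampler for LDA. The following is a minimal sketch using only the Python standard library, assuming each document is a list of integer word ids; the function name, default hyper-parameters, and fixed seed are choices made for this example, not values from the paper.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (phi, theta)."""
    rng = random.Random(seed)
    topics = list(range(num_topics))
    n_wt = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    n_t = [0] * num_topics                                # words per topic
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    z = []                                                # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, t in zip(doc, z[d]):
            n_wt[w][t] += 1
            n_t[t] += 1
            n_dt[d][t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current token from the counts (the "-i" in Eq. 4).
                n_wt[w][t] -= 1
                n_t[t] -= 1
                n_dt[d][t] -= 1
                # Eq. (4). The document-length denominator is identical for
                # every topic j, so it is dropped before normalization.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in topics
                ]
                t = rng.choices(topics, weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_t[t] += 1
                n_dt[d][t] += 1

    # Eq. (5): topic-word distribution; Eq. (6): document-topic distribution.
    phi = [[(n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
            for w in range(vocab_size)] for j in topics]
    theta = [[(n_dt[d][j] + alpha) / (len(doc) + num_topics * alpha)
              for j in topics] for d, doc in enumerate(docs)]
    return phi, theta
```

Each row of `phi` and each row of `theta` sums to one, which is a quick sanity check that the counts and the smoothing terms are consistent with Eqs. (5) and (6).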
4.2 Extraction of Relationship-Based Knowledge
To facilitate understanding of the extraction of prior
knowledge, we define three types of relationship between fault
terms and fault classes.
Strong Relationship (SR): fault terms strongly relate to a
specific fault class and hardly relate to others. Hence PLDA adds
these features to the specific fault class in topic mining-based
feature selection.
Weak Relationship (WR): fault terms hardly relate to a specific
fault class. These fault terms should not be associated with that
fault class.
Complex Relationship (CR): fault terms strongly relate to more
than one fault class. They should be given comprehensive
consideration in topic mining-based feature selection.
The main steps of prior knowledge extraction are summarized
in the semantic algorithm. As in the syntax algorithm, the
normalized correlations (R) are calculated in line 1. Then R is
clustered into eight clusters Ξ by the K-means clustering method
(line 2). The correlation degree (Θ) between fault terms and fault
classes, i.e., SR, WR, or CR, is then assigned to each pair of
term and fault class (lines 4–8). In this work, every fault class
is preassigned two corresponding topics. For example, topics z2∗i
and z2∗i+1 correspond to fault class ci ∈ C (1 ≤ i ≤ |C|), where
|C| represents the class count. Then the correlation (Γ) between
terms and topics can be obtained (lines 13–15).
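The clustering and labeling steps can be sketched as follows. This is a rough illustration under several assumptions: a simple one-dimensional K-means with deterministic initialization stands in for whatever K-means variant the authors used, the two highest-ranked clusters are read as SR and the two lowest as WR, and all function names are hypothetical.

```python
def kmeans_1d(values, k, iters=100):
    """Plain 1-D K-means; centroids start spread over the sorted values."""
    svals = sorted(values)
    if k == 1:
        centroids = [svals[len(svals) // 2]]
    else:
        centroids = [svals[round(i * (len(svals) - 1) / (k - 1))]
                     for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(g) / len(g) if g else centroids[c]
                   for c, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def relationship_labels(r_bar, k=8, extreme_ranks=2):
    """Label each (term, class) correlation as SR, WR, or CR."""
    values = [v for row in r_bar for v in row]
    centroids = kmeans_1d(values, k)
    # Rank clusters from the lowest centroid (rank 0) to the highest.
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {cluster: r for r, cluster in enumerate(order)}
    labels = []
    for row in r_bar:
        out = []
        for v in row:
            cluster = min(range(k), key=lambda c: abs(v - centroids[c]))
            if rank[cluster] >= k - extreme_ranks:
                out.append("SR")   # among the highest-ranked clusters
            elif rank[cluster] < extreme_ranks:
                out.append("WR")   # among the lowest-ranked clusters
            else:
                out.append("CR")   # everything in between
        labels.append(out)
    return labels
```

The default of eight clusters follows the text; `extreme_ranks=2` follows the "highest or lowest two ranks" condition in the semantic algorithm.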
4.3 Incorporating Prior Knowledge Into LDA
The main idea of incorporating prior knowledge into LDA is
to revise the topic update probabilities by using prior
knowledge. That is, in the topic update process in (4), we
multiply by an additional indicator function δ(wi, zj), which
represents a hard constraint of SR and WR from terms to topics.
The final probability for the topic update is:
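$$P(z_i = j \mid z_{-i}, w, \alpha, \beta) \propto \delta(w_i, z_j) \cdot \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i}^{(d_i)} + T\alpha} \tag{7}$$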
where δ(wi, zj) represents the intervention or help from
pre-existing knowledge of SR and WR, which plays a key role in
this update. In the topic update for each word in each document,
δ(wi, zj) equals Γ(wi, zj). For the complex relationship (CR), the
influence of fault term wi and the fault classes on the topic-word
distribution should all be taken into account. Our basic idea is
to determine the association between wi and Czj, where Czj denotes
the set of fault classes to which topic zj is attached. If their
relevance is higher than a pre-given threshold, Γ(wi, zj) should
be assigned a positive number. Otherwise, Γ(wi, zj) is set to a
negative number. Therefore, (4) is revised as follows:
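$$P(z_i = j \mid z_{-i}, w, \alpha, \beta) \propto \frac{(1 + F_{w_i,z_j})\, n_{-i,j}^{(w_i)} + \beta}{\sum_{w'=1}^{W} (1 + F_{w',z_j})\, n_{-i,j}^{(w')} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i}^{(d_i)} + T\alpha} \tag{8}$$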
where Fwi,zj corresponds to Γ(wi, zj) in the semantic algorithm
and reflects the correlation of fault term wi with topic zj.
Then (8) is used to modify the sampling process for fault data
sets with the CR relationship.
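To make the two revised updates concrete, the sketch below computes the unnormalized topic weights for a single word. It assumes the current token has already been removed from the counts, and it drops the document-length denominator because that factor is the same for every topic. The function names and argument layout are assumptions for this example.

```python
def topic_weights_hard(w, n_wt, n_t, n_dt_d, delta, alpha, beta):
    """Unnormalized topic weights with a hard constraint, as in Eq. (7).

    n_wt[v][j]: count of word v in topic j; n_t[j]: words in topic j
    n_dt_d[j]:  count of topic j in the current document
    delta[v][j]: indicator derived from SR/WR knowledge
    """
    vocab_size = len(n_wt)
    return [
        delta[w][j] * (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
        * (n_dt_d[j] + alpha)
        for j in range(len(n_t))
    ]


def topic_weights_soft(w, n_wt, n_dt_d, F, alpha, beta):
    """Unnormalized topic weights with soft term-topic bias, as in Eq. (8).

    F[v][j] > 0 promotes word v in topic j and F[v][j] < 0 demotes it;
    values should stay above -1 so that every weight remains positive.
    """
    vocab_size = len(n_wt)
    weights = []
    for j in range(len(n_dt_d)):
        numerator = (1 + F[w][j]) * n_wt[w][j] + beta
        denominator = sum((1 + F[v][j]) * n_wt[v][j]
                          for v in range(vocab_size)) + vocab_size * beta
        weights.append(numerator / denominator * (n_dt_d[j] + alpha))
    return weights
```

When every entry of `F` is zero, the soft update reduces to the basic update in Eq. (4), so prior knowledge only changes the sampler where a relationship has actually been asserted.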
5. SERIAL FAULT FEATURE FUSION
The fault features extracted at the syntax level are fused
with those extracted at the semantic level. To facilitate
understanding, we denote the processed fault feature from the
syntax level as Fa = (a1, a2, . . . , aM) and the one from the
semantic level as Fb = (b1, b2, . . . , bN), where M and N are the
dimensions at the syntax and semantic levels, respectively. Here
we adopt a serial fusion method to form a combined feature Fγ. It
is defined by
Fγ = (Fa, θ ∗ Fb)
= (a1,a2,...,aM,θ ∗ b1,θ ∗ b2,...,θ ∗ bN)
(9)
where θ is an adjusting parameter. It can be obtained from the
training set through learning: when the change in accuracy
between two consecutive iterations is less than 0.1, we take the
current value as θ. All serially combined feature vectors form an
(M+N)-dimensional feature space.
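Serial fusion as in Eq. (9) amounts to concatenating the two feature vectors after scaling the semantic part. The sketch below also shows one possible reading of the stopping rule for θ: scanning a list of candidate values and stopping once the accuracy changes by less than the tolerance. The candidate list and the `accuracy_fn` callback are assumptions for this example, since the text does not specify how θ is iterated.

```python
def serial_fusion(fa, fb, theta):
    """Concatenate syntax features with theta-scaled semantic features (Eq. 9)."""
    return list(fa) + [theta * b for b in fb]


def select_theta(candidates, accuracy_fn, tol=0.1):
    """Return the first candidate at which accuracy changes by less than tol.

    candidates:  non-empty list of theta values tried in order (assumed)
    accuracy_fn: callable returning the training accuracy for a given theta
    """
    previous = None
    for theta in candidates:
        accuracy = accuracy_fn(theta)
        if previous is not None and abs(accuracy - previous) < tol:
            return theta
        previous = accuracy
    return candidates[-1]  # no convergence: fall back to the last value tried
```

For instance, `serial_fusion([1, 2], [3, 4], 0.5)` gives a four-dimensional vector whose last two entries are the semantic features scaled by 0.5.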
6. EXPERIMENTAL RESULTS
The experimental results show the following main causes of
accidents:
1. Ground
2. After Take-off
3. Hijack / Bomb
4. Double Engine Failure
5. Landing - Short
6. Landing - Fast
7. Landing - Gear Up
7. CONCLUSION
Text mining of repair verbatim for fault diagnosis of
Aerospace systems poses a big challenge due to
unstructured verbatim, high-dimension data, and
imbalanced fault classes. In this paper, to improve the fault
diagnosis performance, especially on minority fault
classes, we have proposed a bi-level feature extraction-
based text mining method. We first adjust the exclusive
feature weights of various fault classes based on χ2
statistics and their distributions. Then we reselect the
common features according to both relevance and
Hellinger distance. This can be categorized as feature
selection at the syntax level. Next, we extract semantic
features by using a prior LDA model to make up for the
limitation of fault terms derived from the syntax level.
Finally, we fuse fault term sets derived from the syntax
level with those from the semantic level by serial fusion.
REFERENCES
[1] L. Huang and Y. L. Murphey, “Text mining with application to engineering diagnostics,” in Proc. 19th Int. Conf. IEA/AIE, Annecy, France, 2006, pp. 1309–1317.
[2] D. G. Rajpathak, “An ontology based text mining system for knowledge discovery from the diagnosis data in the automotive domain,” Comput. Ind., vol. 64, no. 5, pp. 565–580, Jun. 2013.
[3] J. Silmon and C. Roberts, “Improving switch reliability with innovative condition monitoring techniques,” Proc. IMechE, Part F: J. Rail Rapid Transit, vol. 224, no. 4, pp. 293–302, 2010.
[4] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Jan. 2003.
[5] J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, and D. Blei, “Reading tea leaves: How humans interpret topic models,” Neural Inf. Process. Syst., vol. 22, pp. 288–296, 2009.
[6] D. A. Cieslak and N. V. Chawla, “Learning decision trees for unbalanced data,” in Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases-Part I. Berlin, Germany: Springer-Verlag, 2008, pp. 241–256.
[7] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Trans. Commun. Technol., vol. 15, no. 1, pp. 52–60, Feb. 1967.

More Related Content

What's hot (20)

PDF
Text Segmentation for Online Subjective Examination using Machine Learning
IRJET Journal
 
DOC
0
butest
 
PDF
Database Design and the ER Model, Indexing and Hashing
Prabu U
 
PDF
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
csandit
 
PDF
Paper id 25201435
IJRAT
 
PDF
IRJET- Personality Recognition using Multi-Label Classification
IRJET Journal
 
PDF
Using Class Frequency for Improving Centroid-based Text Classification
IDES Editor
 
PDF
03 fauzi indonesian 9456 11nov17 edit septian
IAESIJEECS
 
PDF
Evolutionary Testing Approach for Solving Path- Oriented Multivariate Problems
IDES Editor
 
DOC
Doc format.
butest
 
PDF
Relevance feature discovery for text mining
redpel dot com
 
PDF
MICRE: Microservices In MediCal Research Environments
Martin Chapman
 
PDF
KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER
cscpconf
 
PDF
ENSEMBLE MODEL FOR CHUNKING
ijasuc
 
PDF
Semantic extraction of arabic
csandit
 
PDF
E0322035037
inventionjournals
 
PDF
Phenoflow: A Microservice Architecture for Portable Workflow-based Phenotype ...
Martin Chapman
 
PDF
10.1.1.17.6973
gjramanaa35
 
PDF
Fuzzy Rule Base System for Software Classification
ijcsit
 
PDF
20120140506007
IAEME Publication
 
Text Segmentation for Online Subjective Examination using Machine Learning
IRJET Journal
 
Database Design and the ER Model, Indexing and Hashing
Prabu U
 
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
csandit
 
Paper id 25201435
IJRAT
 
IRJET- Personality Recognition using Multi-Label Classification
IRJET Journal
 
Using Class Frequency for Improving Centroid-based Text Classification
IDES Editor
 
03 fauzi indonesian 9456 11nov17 edit septian
IAESIJEECS
 
Evolutionary Testing Approach for Solving Path- Oriented Multivariate Problems
IDES Editor
 
Doc format.
butest
 
Relevance feature discovery for text mining
redpel dot com
 
MICRE: Microservices In MediCal Research Environments
Martin Chapman
 
KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER
cscpconf
 
ENSEMBLE MODEL FOR CHUNKING
ijasuc
 
Semantic extraction of arabic
csandit
 
E0322035037
inventionjournals
 
Phenoflow: A Microservice Architecture for Portable Workflow-based Phenotype ...
Martin Chapman
 
10.1.1.17.6973
gjramanaa35
 
Fuzzy Rule Base System for Software Classification
ijcsit
 
20120140506007
IAEME Publication
 

Similar to Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syntax and Semantic Algorithm (20)

PDF
The International Journal of Engineering and Science (The IJES)
theijes
 
PDF
Iaetsd an efficient and large data base using subset selection algorithm
Iaetsd Iaetsd
 
PDF
IRJET- Survey of Feature Selection based on Ant Colony
IRJET Journal
 
PDF
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
Waqas Tariq
 
PDF
Fault detection of imbalanced data using incremental clustering
IRJET Journal
 
DOCX
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEEFINALYEARSTUDENTPROJECTS
 
DOCX
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
IEEEMEMTECHSTUDENTSPROJECTS
 
PDF
Feature selection, optimization and clustering strategies of text documents
IJECEIAES
 
DOCX
A fast clustering based feature subset selection algorithm for high-dimension...
IEEEFINALYEARPROJECTS
 
DOCX
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
IEEEGLOBALSOFTTECHNOLOGIES
 
PDF
Choosing allowability boundaries for describing objects in subject areas
IAESIJAI
 
PDF
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
ijdmtaiir
 
PDF
Sensor Fault Detection in IoT System Using Machine Learning
IRJET Journal
 
PDF
Mapping Subsets of Scholarly Information
Paul Houle
 
PDF
Using data mining methods knowledge discovery for text mining
eSAT Publishing House
 
PDF
Using data mining methods knowledge discovery for text mining
eSAT Journals
 
PDF
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
IJCI JOURNAL
 
PDF
DISK FAILURE PREDICTION BASED ON MULTI-LAYER DOMAIN ADAPTIVE LEARNING
IJCI JOURNAL
 
PDF
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
IJCI JOURNAL
 
PDF
Context Driven Technique for Document Classification
IDES Editor
 
The International Journal of Engineering and Science (The IJES)
theijes
 
Iaetsd an efficient and large data base using subset selection algorithm
Iaetsd Iaetsd
 
IRJET- Survey of Feature Selection based on Ant Colony
IRJET Journal
 
Unsupervised Feature Selection Based on the Distribution of Features Attribut...
Waqas Tariq
 
Fault detection of imbalanced data using incremental clustering
IRJET Journal
 
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
IEEEMEMTECHSTUDENTSPROJECTS
 
Feature selection, optimization and clustering strategies of text documents
IJECEIAES
 
A fast clustering based feature subset selection algorithm for high-dimension...
IEEEFINALYEARPROJECTS
 
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
IEEEGLOBALSOFTTECHNOLOGIES
 
Choosing allowability boundaries for describing objects in subject areas
IAESIJAI
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
ijdmtaiir
 
Sensor Fault Detection in IoT System Using Machine Learning
IRJET Journal
 
Mapping Subsets of Scholarly Information
Paul Houle
 
Using data mining methods knowledge discovery for text mining
eSAT Publishing House
 
Using data mining methods knowledge discovery for text mining
eSAT Journals
 
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
IJCI JOURNAL
 
DISK FAILURE PREDICTION BASED ON MULTI-LAYER DOMAIN ADAPTIVE LEARNING
IJCI JOURNAL
 
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
IJCI JOURNAL
 
Context Driven Technique for Document Classification
IDES Editor
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...

Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syntax and Semantic Algorithm

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 03 | Mar-2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1189

EFFICIENT FEATURE SELECTION FOR FAULT DIAGNOSIS OF AEROSPACE SYSTEM USING SYNTAX AND SEMANTIC ALGORITHM
Meena E1, Revathi B2, Sajanvethakumar F3 (Assistant Professor of Computer Science and Engineering)
1,2,3 Department of Computer Science and Engineering, JEPPIAAR SRR Engineering College, Padur, Chennai 603103

ABSTRACT: Every year, aerospace systems accumulate a database of fault verbatim records. A fault verbatim record describes a fault in free text; it is generated when the airplane does not pass the expected signal code at the correct time during start-up. These records are high-dimensional, unstructured, and difficult to learn from. The learning difficulty arises because a reader with limited knowledge of English finds the records hard to understand. The high dimensionality arises because a fault description that runs to three or four lines takes time to read before the fault can be identified. In the proposed system we introduce bi-level feature extraction-based text mining. Bi-level means that a higher-order level and a lower-order level are compared: fault features are derived at both the syntax level and the semantic level. The syntax level is used to overcome the learning difficulties, and the semantic level is used to convert the high-dimensional data to a low-dimensional representation. The method can be used to diagnose problems quickly and rectify them.

1. INTRODUCTION
Text mining is a knowledge-intensive task that is gaining more and more attention in many industrial fields, for example the aerospace, automotive, railway, power, medical, biomedical, manufacturing, and sales and marketing sectors.
As in the railway field, advanced information technologies, such as sensor networks, RFID techniques, wireless communication, and the Internet cloud, are used to monitor the health of aerospace systems. In the event of a malfunction, diagnostic trouble symptoms are generated and transmitted to the monitoring center database by wired or wireless communication. After each diagnosis episode a repair verbatim is recorded. It consists of a textual description of the combination of fault symptoms (i.e., fault terms), e.g., "Speed Distance Unit (SDU) relevant faults," a fault symptom, e.g., "SDU," failure modes (i.e., fault classes), and finally the corrective actions, e.g., "replaced SDU," taken to repair the faults. However, automatically discovering knowledge from the repair verbatim is a non-trivial exercise, primarily for the following reasons:
1) High-dimensional data. Maintenance documents contain tens of thousands or even more distinct terms or tokens. After stop-word elimination and stemming, the feature set is still too large for many learning algorithms.
2) Imbalanced fault class distribution. In maintenance documents, the number of examples in one fault class (the majority class) is considerably larger than that in the others (the minority classes). Such imbalanced class distributions pose a serious problem for most classifier learning algorithms, which assume a relatively balanced distribution.
3) Unsupervised text mining models. They may not produce topics that conform to the user's existing knowledge. One key reason is that the objective functions of topic models, e.g., Latent Dirichlet Allocation (LDA), often do not correlate well with human judgments.
This work proposes bi-level feature extraction-based text mining for fault diagnosis to meet the above challenges by automatically analyzing the repair verbatim.
Our main idea is to extract fault features at the syntax and semantic levels separately and then fuse them to achieve the desired results. Because the features extracted at each level emphasize a different aspect of the feature space, and each level has its own deficiencies, the proposed fusion of the two levels can enhance the precision of fault diagnosis for all fault classes, especially minority ones. At the syntax level, we propose an improved χ2 statistic (ICHI) to handle feature selection on an imbalanced data set. First, we overcome the negative effect of the imbalanced data set by adjusting the feature weights of the minority and majority classes. This makes the minority classes relatively distant from the majority ones. Second, we consider the Hellinger distance as a selection criterion for feature
selection, which is shown to be imbalance-insensitive. The proposed ICHI can be regarded as feature selection at the syntax level because it mainly uses the document-word matrix. At the semantic level, we borrow an idea from earlier work and propose an LDA with prior knowledge (abbreviated PLDA) to perform the feature extraction. By representing documents in topic space rather than word space, we can provide additional features at the semantic level to compensate for those extracted at the syntax level. Prior knowledge is integrated with the basic LDA because LDA, as an unsupervised model, cannot deal with such problems as choosing the number of topics and reducing the adverse effect of common words, and so may not produce topics that conform to a user's existing knowledge. Prior knowledge helps us guide topic mining in the basic LDA. Finally, we fuse the features derived at the syntax level with those from the semantic level by serial fusion to improve Support Vector Machine (SVM)-based fault diagnosis for all fault classes, especially minority ones.

2. RELATED WORK
To manage the challenges imposed by imbalanced class distributions, many learning algorithms have been proposed. For instance, sampling-based methods, e.g., over-sampling and under-sampling schemes, are the simplest yet effective ones; they replicate or cut down classes to achieve a similarly balanced result. Another popular method is the cost-sensitive learning scheme, which takes the cost matrix into consideration during model building and generates the model with the lowest cost. Margineantu et al.
examined various methods for incorporating cost information into the C4.5 learning algorithm. Joshi et al. proposed PNrule, a two-phase rule induction algorithm, to handle the mining of minority classes. Tang et al. incorporated different rebalancing heuristics, including cost-sensitive learning, over-sampling, and under-sampling, into SVM modeling and introduced four SVM variations to tackle the imbalanced learning problem. A survey of this subject is available in the literature. Mladenic et al. discussed feature selection issues for imbalanced class distributions. However, this work is restricted to the Naive Bayes classifier. Also, Zheng et al. proposed a feature selection method for imbalanced text documents that adjusts the combination of positive and negative features in the data. Their method sticks to the conventional goodness measures of features. Yin et al. proposed to divide the majority class into relatively smaller pseudo-subclasses of relatively uniform size to manage the influence of imbalanced data sets.

In text mining-based feature extraction, statistical and graphical modeling has received more and more attention and is regarded as a popular and efficient tool for mining topics to reduce dimensionality. For example, LDA was previously used to construct features for classification, where it usually acts to reduce the data dimension. However, the basic LDA, as an unsupervised model, does not perform adequately in a topic mining process. To solve this problem, Andrzejewski et al. incorporated domain knowledge by using a Dirichlet Forest prior in LDA. Zhai et al. proposed probabilistic constraints, a relaxed modification (i.e., a soft constraint) to the Gibbs sampling equation. Hospedales proposed a weakly supervised joint topic model that learns a model for all the classes by using a partially shared common basis.
Wang proposed a constrained topic model that adds constraints to guide the topic mining process, which improved the accuracy of the mined topics.

3. ICHI-BASED FEATURE SELECTION AT SYNTAX LEVEL
The basic idea of the proposed ICHI is to move a minority class far away from the majority one by adjusting the weights of fault terms, as shown in Fig. 1. To facilitate understanding, we first define some notation. Tm is the set of fault terms of the minority fault classes, TM is the set of fault terms of the majority fault class, and Tc, the intersection of Tm and TM, is the common feature set.

SYNTAX LEVEL ALGORITHM
Data: Dataset S, fault term T, fault class F
Result: Feature set F1
Begin
  W ← word segmentation
  M ← word-document matrix
  For wi ∈ W and fj ∈ F do
    R(i,j) ← correlation between fault term and class
  End
  R1 ← normalization of R
  F1(i) ← fault feature
  For fi, fj ∈ F do
    F2(i,j) ← common fault feature set of fault classes by intersection of feature sets
  End
  F2 ← common fault feature set by union
  For fi ∈ F do
    F2(i) ← exclusive feature set by excluding F2
    W1(i) ← weight of F2(i) by inverse probability
  End
  For wk ∈ F do
    L(wk) ← Hellinger distance
    F1(i,j) ← common feature set selected by the highest k features according to Hellinger distance L
  End
End

Fig. 1. Idea of the proposed ICHI

Let the symbol / denote set difference. Tm/Tc and TM/Tc are related to the minority and majority classes only, respectively, and are therefore called exclusive fault term sets.

3.1 χ2 Statistics and Hellinger Distance
The χ2 statistic can be used to estimate the lack of independence between a term t and a class ci, and can be compared to the χ2 distribution with one degree of freedom to judge extremeness. It is defined as:

$$\chi^2(t, c_i) = \frac{N\,[P(t, c_i)\,P(\bar{t}, \bar{c}_i) - P(t, \bar{c}_i)\,P(\bar{t}, c_i)]^2}{P(t)\,P(\bar{t})\,P(c_i)\,P(\bar{c}_i)} \tag{1}$$

where N is the total number of documents. $(t, c_i)$ denotes the presence of term t and its membership in class ci, $(t, \bar{c}_i)$ the presence of t but not its membership in ci, $(\bar{t}, c_i)$ the absence of t but its membership in ci, and $(\bar{t}, \bar{c}_i)$ the absence of t and its non-membership in ci. P(·,·) denotes the probability of the presence/absence of term t and its membership/non-membership in class ci.

The Hellinger distance is a measure of distributional divergence. Given two discrete probability distributions P = {p1, p2, ..., pn} and Q = {q1, q2, ..., qn}, their Hellinger distance is defined as:

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} \tag{2}$$

By definition, the Hellinger distance is a metric satisfying the triangle inequality. The √2 in the definition ensures that H(P,Q) ≤ 1 for all probability distributions.

3.2 ICHI Based Feature Selection at Syntax Level
The main steps of ICHI-based feature selection are summarized in Algorithm 1 (the syntax level algorithm).
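As a small illustration of Section 3.1, the following sketch evaluates Eq. (1) from a 2×2 table of document counts and Eq. (2) from two discrete distributions. The function names and the count-based parameterization (n11, n10, n01, n00) are assumptions made for this example and do not come from the paper.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of Eq. (1) from document counts (hypothetical helper).

    n11: documents that contain term t and belong to class c
    n10: documents that contain t but do not belong to c
    n01: documents that lack t but belong to c
    n00: documents that lack t and do not belong to c
    """
    n = n11 + n10 + n01 + n00
    p_t = (n11 + n10) / n          # P(t)
    p_c = (n11 + n01) / n          # P(c)
    denom = p_t * (1 - p_t) * p_c * (1 - p_c)
    if denom == 0:
        # Term or class is present in all or none of the documents:
        # the statistic is undefined, so this sketch returns 0.
        return 0.0
    num = (n11 / n) * (n00 / n) - (n10 / n) * (n01 / n)
    return n * num ** 2 / denom


def hellinger(p, q):
    """Hellinger distance of Eq. (2) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)
```

A term that occurs in every document of a class and in no other document gives a χ2 value equal to N, while a term distributed identically across classes gives 0. Two distributions with disjoint support are at Hellinger distance 1, the maximum.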
Once a fault maintenance document D and a fault term dictionary Ω are provided, the word set W (i.e., the fault term set) is extracted by word segmentation. According to W and the fault classes C, a word-document matrix M can be generated (lines 1-2). Then we compute the correlations R between feature terms and fault classes by χ2 statistics (lines 3-4). In order to compare the correlations between different fault terms and different classes, we normalize them as follows (line 5):
$$R(w_i, c_j) = \frac{R(i,j)}{\sum_{i=1}^{m} R(i,j)} \times \frac{R(i,j)}{\sum_{j=1}^{n} R(i,j)} = \frac{R(i,j)^2}{\sum_{i=1}^{m} R(i,j) \times \sum_{j=1}^{n} R(i,j)} \tag{3}$$

where m is the number of fault terms contained in W and n is the number of fault classes in C. In Eq. (3), the correlation of feature term wi and fault class cj depends on the correlations between term wi and all other fault classes besides cj. Therefore, it is represented precisely by the product of $R(i,j)/\sum_{i=1}^{m} R(i,j)$ and $R(i,j)/\sum_{j=1}^{n} R(i,j)$. We then select highly correlated fault feature sets F for each fault class by comparing the correlations with a given threshold (line 6). Next, lines 7–9 obtain the common fault feature set F by intersecting each pair of fault term sets. At the same time, the exclusive feature sets F of each fault class are obtained in line 12. Next, we change their weights according to the probabilities of their corresponding fault classes (line 13). For the common fault term set F, we need to evaluate the distributional discrimination of each feature over the fault classes by computing its Hellinger distance with these fault classes using Eq. (2) (line 16). Then we use it to reselect the common features of each pair of fault classes (line 17). At last, we obtain the final common feature set (F') of the data set by taking the union of the common feature sets of all pairs of fault classes (line 19). Thus, we complete the selection of fault term features and obtain the feature space Fa as [(exclusive feature sets, weights), common feature set] (line 2 ), i.e., [(F, F), Fϖ].

4. PLDA BASED FEATURE SELECTION AT SEMANTIC LEVEL
In this section, we first introduce LDA and then the extraction of relationship-based prior knowledge. Finally, we present the proposed PLDA, which incorporates prior knowledge into LDA to realize feature selection at the semantic level.

SEMANTIC ALGORITHM
Data: Dataset S, fault class F, topic sets K
Result: Correlation Γ(wi, zk)
Begin
  R1 ← normalization of R
  Ξ ← k clusters
  Θ ← degree of correlation
  For wi ∈ W and fj ∈ F do
    If R1(wi,fj) is in the highest or lowest two ranks in Ξ then
      R1(wi,fj) is assigned SR or WR
    Else
      R1(wi,fj) is assigned CR
    End
  End
  Each fault class fi ∈ F is preassigned two corresponding topics z2*i, z2*i+1
  Γ(wi,zk) ← initialize correlation between term and topic with zeros
  For wi ∈ W and zk ∈ Z do
    If zk ∈ fj then
      Γ(wi,zk) is assigned the value of R1(wi,fj)
    End
  End
End

4.1 LDA
Given D documents expressed over W unique words and T topics, LDA outputs the document-topic distribution and the topic-word distribution, both of which can be obtained with Gibbs sampling. Its key step is the topic update for each word in each document according to

$$P(z_i = j \mid z_{-i}, w, \alpha, \beta) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i} + T\alpha} \tag{4}$$

where zi = j denotes the assignment of the ith word in a document to topic j, and z−i denotes all topic assignments other than that of the ith (i.e., the current) word. w = {w1, w2, w3, ..., wn}, where each wi belongs to some document. α and β are hyper-parameters for the document-topic and topic-word Dirichlet distributions, respectively. $n^{(w_i)}_{-i,j}$ is the total number of instances of the same word wi
assigned to topic j, not including the current one, and $n^{(\cdot)}_{-i,j}$ is the total number of words assigned to topic j, not including the current one. $n^{(d_i)}_{-i,j}$ is the number of words from document di assigned to topic j, not including the current one, and $n^{(d_i)}_{-i}$ is the total number of words in document di, excluding the current one. After M iterations of Gibbs sampling over all words in all documents, the distributions φ and θ are finally estimated as follows:

$$\varphi_j(w_i) = \frac{n^{(w_i)}_{j} + \beta}{\sum_{w} n^{(w)}_{j} + W\beta} \tag{5}$$

$$\theta_j(d_i) = \frac{n^{(d_i)}_{j} + \alpha}{\sum_{j'} n^{(d_i)}_{j'} + T\alpha} \tag{6}$$

4.2 Extraction of Relationship-Based Prior Knowledge
To facilitate understanding of the extraction of prior knowledge, we define three types of relationship between fault terms and fault classes.
Strong Relationship (SR): fault terms relate strongly to a specific fault class and hardly relate to others. Hence PLDA adds these features to the specific fault class in topic mining-based feature selection.
Weak Relationship (WR): fault terms hardly relate to a specific fault class. These fault terms should not be associated with that fault class.
Complex Relationship (CR): fault terms relate strongly to more than one fault class. We should give them comprehensive consideration in topic mining-based feature selection.

The main steps of prior knowledge extraction are summarized in the semantic algorithm. As in the syntax algorithm, the normalized correlations (R) are calculated in line 1. Then R is clustered into eight clusters Ξ by the K-means clustering method (line 2). The correlation degree (Θ) between fault terms and fault classes, i.e., SR, WR, or CR, is then assigned to every term-class pair (lines 4–8). In this work, each fault class is preassigned two corresponding topics.
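The labeling step described above can be sketched roughly as follows. This is only one possible reading of the semantic algorithm: it assumes that entries falling in the two highest-valued clusters are labeled SR, entries in the two lowest-valued clusters are labeled WR, and everything else is labeled CR. The helper names (kmeans_1d, relationship_labels, topics_for_class), the quantile-based initialization, and the zero-based class index are all assumptions of this example rather than details given in the paper.

```python
import numpy as np


def kmeans_1d(values, k, iters=100):
    """Plain Lloyd iterations on scalars (hypothetical helper).

    Centers start at evenly spaced quantiles so the result is deterministic.
    """
    x = np.asarray(values, dtype=float)
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = x[labels == c]
            if members.size:            # keep the old center if a cluster is empty
                new_centers[c] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return labels, centers


def relationship_labels(r1, k=8, n_extreme=2):
    """Label each (term, class) entry of the normalized correlation matrix.

    Assumed rule: top n_extreme clusters -> "SR", bottom n_extreme -> "WR",
    all remaining clusters -> "CR".
    """
    r1 = np.asarray(r1, dtype=float)
    labels, centers = kmeans_1d(r1.ravel(), k)
    rank = np.empty(k, dtype=int)
    rank[np.argsort(centers)] = np.arange(k)     # 0 = lowest-valued cluster
    cluster_rank = rank[labels].reshape(r1.shape)
    out = np.full(r1.shape, "CR", dtype=object)
    out[cluster_rank >= k - n_extreme] = "SR"
    out[cluster_rank < n_extreme] = "WR"
    return out


def topics_for_class(i):
    """Indices of the two topics preassigned to fault class i (zero-based here)."""
    return (2 * i, 2 * i + 1)
```

The sketch uses a hand-written one-dimensional K-means to stay self-contained; a library implementation would serve equally well. Note that it labels entries by cluster rank alone and does not check how a term relates to the other classes, which the SR and CR definitions also involve.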
For example, topics z2∗i and z2∗i+1 correspond to fault class ci ∈ C (1 ≤ i ≤ |C|), where |C| is the number of classes. Then the correlation (Γ) between terms and topics can be obtained (lines 13–15).

4.3 Incorporating Prior Knowledge Into LDA
The main idea of incorporating prior knowledge into LDA is to revise the topic update probabilities by using the prior knowledge. That is, in the topic update process in (4), we multiply by an additional indicator function δ(wi, zj), which represents a hard constraint of SR and WR from terms to topics. The final probability for the topic update is:

$$P(z_i = j \mid z_{-i}, w, \alpha, \beta) \propto \delta(w_i, z_j) \cdot \frac{n^{(w_i)}_{-i,j} + \beta}{\sum_{w} \left(n^{(w)}_{-i,j} + \beta\right)} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{\sum_{j'} \left(n^{(d_i)}_{-i,j'} + \alpha\right)} \tag{7}$$

where δ(wi, zj) represents the intervention or help from pre-existing knowledge of SR and WR, which plays a key role in this update. In the topic update for each word in each document, δ(wi, zj) equals Γ(wi, zj). For the complex relationship (CR), the influence of fault term wi and the fault classes on the topic-word distribution should all be taken into account. Our basic idea is to determine the association between wi and Czj, where Czj denotes the set of fault classes to which topic zj is attached. If their relevance is higher than a pre-given threshold, Γ(wi, zj) should be assigned a positive number. Otherwise, Γ(wi, zj) is set to a negative number. Therefore, (4) is revised as follows:

$$P(z_i = j \mid z_{-i}, w, \alpha, \beta) \propto \frac{(1 + F_{w_i, z_j})\, n^{(w_i)}_{-i,j} + \beta}{\sum_{w \in W} (1 + F_{w, z_j})\, n^{(w)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i} + T\alpha} \tag{8}$$

where $F_{w_i, z_j}$ corresponds to Γ(wi, zj) in the semantic algorithm and reflects the correlation of fault term wi with topic zj. Then (8) is used to modify the sampling process for fault data sets with the CR relationship.

5. SERIAL FAULT FEATURE FUSION
The fault features extracted at the syntax level are fused with those at the semantic level. To facilitate understanding, we denote the processed fault feature from the syntax level as Fa = (a1, a2, ..., aM) and the one from
the semantic level as Fb = (b1, b2, ..., bN), where M and N are the dimensions at the syntax and semantic levels, respectively. Here we adopt a serial fusion method to form a combined feature Fγ. It is defined by

$$F_\gamma = (F_a, \theta * F_b) = (a_1, a_2, \ldots, a_M, \theta * b_1, \theta * b_2, \ldots, \theta * b_N) \tag{9}$$

where θ is an adjusting parameter that can be learned from the training set: when the change in accuracy over two consecutive iterations is less than 0.1, we take the current value as θ. All serially combined feature vectors form an (M+N)-dimensional feature space.

6. EXPERIMENTAL RESULTS
The main causes of the accidents are as follows:
1. Ground
2. After Take-off
3. Hijack / Bomb
4. Double Engine Failure
5. Landing - Short
6. Landing - Fast
7. Landing - Gear Up

7. CONCLUSION
Text mining of repair verbatim for fault diagnosis of aerospace systems poses a big challenge due to unstructured verbatim, high-dimensional data, and imbalanced fault classes. In this paper, to improve the fault diagnosis performance, especially on minority fault classes, we have proposed a bi-level feature extraction-based text mining method. We first adjust the exclusive feature weights of the various fault classes based on χ2 statistics and their distributions. Then we reselect the common features according to both relevance and Hellinger distance. This can be categorized as feature selection at the syntax level. Next, we extract semantic features by using a prior LDA model to make up for the limitations of the fault terms derived from the syntax level. Finally, we fuse the fault term sets derived from the syntax level with those from the semantic level by serial fusion.

REFERENCES
[1] L. Huang and Y. L. Murphey, "Text mining with application to engineering diagnostics," in Proc. 19th Int. Conf. IEA/AIE, Annecy, France, 2006, pp. 1309–1317.
[2] D. G. Rajpathak, "An ontology based text mining system for knowledge discovery from the diagnosis data in the automotive domain," Comput. Ind., vol. 64, no. 5, pp. 565–580, Jun. 2013.
[3] J. Silmon and C. Roberts, "Improving switch reliability with innovative condition monitoring techniques," Proc. IMechE, Part F: J. Rail Rapid Transit, vol. 224, no. 4, pp. 293–302, 2010.
[4] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Jan. 2003.
[5] J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, and D. Blei, "Reading tea leaves: How humans interpret topic models," Neural Inf. Process. Syst., vol. 22, pp. 288–296, 2009.
[6] D. A. Cieslak and N. V. Chawla, "Learning decision trees for unbalanced data," in Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Part I. Berlin, Germany: Springer-Verlag, 2008, pp. 241–256.
[7] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. Commun. Technol., vol. 15, no. 1, pp. 52–60, Feb. 1967.